Model parameters: d_model 2048 ffw_size 8192 kv_size 128 n_heads 16 n_layers 28
Megatron-DeepSpeed/pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 28 --hidden-size 2048 --num-attention-heads 16 --kv-channels 128 --ffn-hidden-size 8192 --seq-length 2048 --max-position-embeddings 2048 --micro-batch-size 2 --global-batch-size 256 --train-samples 32_109_839 --vocab-file gpt2/vocab.json --merge-file gpt2/merges.txt --clip-grad 1.0 --kill-switch-path kill-switch-1b5 --bf16 --optimizer adam --adam-beta1 0.9 --adam-beta2 0.999 --adam-eps 1e-8 --lr 2e-4 --min-lr 2e-5 --lr-decay-style cosine --lr-decay-samples 32_109_839 --lr-warmup-samples 321_098 --clip-grad 1.0 --weight-decay 1e-1 --log-interval 10 --save-interval 1000 --eval-interval 1000 --eval-iters 1 --tensorboard-dir tensorboard_1b5 --tensorboard-queue-size 5 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --save checkpoints_1b5 --load checkpoints_1b5 --data-path /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document --data-impl mmap --split 949,50,1 --deepspeed --deepspeed_config ds_configs/2072488.json --zero-stage 0
START 2072488: Fri Nov 25 19:10:01 EET 2022
0:
0:
0: ======================= ROCm System Management Interface =======================
0: ================================= Concise Info =================================
0: GPU  Temp   AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
0: 0    43.0c  94.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
0: 1    50.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
0: 2    40.0c  83.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
0: 3    48.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
0: 4    41.0c  92.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
0: 5    44.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
0: 6    42.0c  89.0W   800Mhz  1600Mhz  0%   auto  560.0W  0%     0%
0: 7    40.0c  N/A     800Mhz  1600Mhz  0%   auto  0.0W    0%     0%
0: ================================================================================
0: =============================
End of ROCm SMI Log ============================== 15: 15: 15: ======================= ROCm System Management Interface ======================= 15: ================================= Concise Info ================================= 15: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 15: 0 42.0c 95.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 15: 1 52.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 15: 2 39.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 15: 3 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 15: 4 46.0c 85.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 15: 5 52.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 15: 6 42.0c 94.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 15: 7 42.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 15: ================================================================================ 15: ============================= End of ROCm SMI Log ============================== 14: 14: 14: ======================= ROCm System Management Interface ======================= 14: ================================= Concise Info ================================= 14: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 14: 0 46.0c 94.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 14: 1 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 14: 2 42.0c 96.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 14: 3 49.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 14: 4 44.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 14: 5 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 14: 6 42.0c 103.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 14: 7 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 14: ================================================================================ 14: ============================= End of ROCm SMI Log ============================== 1: 1: 1: ======================= ROCm System Management Interface ======================= 1: ================================= Concise Info ================================= 1: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 1: 0 42.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 1: 1 46.0c N/A 800Mhz 
1600Mhz 0% auto 0.0W 0% 0% 1: 2 41.0c 94.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 1: 3 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 1: 4 45.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 1: 5 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 1: 6 40.0c 85.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 1: 7 50.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 1: ================================================================================ 1: ============================= End of ROCm SMI Log ============================== 10: 10: 10: ======================= ROCm System Management Interface ======================= 10: ================================= Concise Info ================================= 10: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 10: 0 43.0c 92.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 10: 1 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 10: 2 41.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 10: 3 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 10: 4 44.0c 93.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 10: 5 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 10: 6 40.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 10: 7 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 10: ================================================================================ 10: ============================= End of ROCm SMI Log ============================== 9: 9: 9: ======================= ROCm System Management Interface ======================= 9: ================================= Concise Info ================================= 9: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 9: 0 50.0c 85.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 9: 1 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 9: 2 42.0c 95.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 9: 3 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 9: 4 44.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 9: 5 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 9: 6 44.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 9: 7 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 9: 
================================================================================ 9: ============================= End of ROCm SMI Log ============================== 5: 5: 5: ======================= ROCm System Management Interface ======================= 5: ================================= Concise Info ================================= 5: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 5: 0 46.0c 91.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 5: 1 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 5: 2 38.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 5: 3 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 5: 4 41.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 5: 5 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 5: 6 37.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 5: 7 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 5: ================================================================================ 5: ============================= End of ROCm SMI Log ============================== 2: 2: 2: ======================= ROCm System Management Interface ======================= 2: ================================= Concise Info ================================= 2: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 2: 0 43.0c 95.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 2: 1 39.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 2: 2 39.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 2: 3 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 2: 4 42.0c 93.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 2: 5 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 2: 6 39.0c 86.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 2: 7 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 2: ================================================================================ 2: ============================= End of ROCm SMI Log ============================== 13: 13: 13: ======================= ROCm System Management Interface ======================= 13: ================================= Concise Info ================================= 13: GPU Temp AvgPwr SCLK MCLK Fan Perf 
PwrCap VRAM% GPU% 13: 0 50.0c 84.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 13: 1 49.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 13: 2 45.0c 92.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 13: 3 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 13: 4 45.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 13: 5 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 13: 6 42.0c 92.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 13: 7 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 13: ================================================================================ 13: ============================= End of ROCm SMI Log ============================== 6: 6: 6: ======================= ROCm System Management Interface ======================= 6: ================================= Concise Info ================================= 6: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 6: 0 49.0c 93.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 6: 1 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 6: 2 39.0c 102.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 6: 3 40.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 6: 4 40.0c 92.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 6: 5 49.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 6: 6 43.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 6: 7 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 6: ================================================================================ 6: ============================= End of ROCm SMI Log ============================== 4: 4: 4: ======================= ROCm System Management Interface ======================= 4: ================================= Concise Info ================================= 4: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 4: 0 45.0c 94.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 4: 1 50.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 4: 2 46.0c 98.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 4: 3 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 4: 4 41.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 4: 5 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 4: 6 40.0c 94.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 
4: 7 40.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 4: ================================================================================ 4: ============================= End of ROCm SMI Log ============================== 11: 11: 11: ======================= ROCm System Management Interface ======================= 11: ================================= Concise Info ================================= 11: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 11: 0 44.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 11: 1 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 11: 2 45.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 11: 3 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 11: 4 46.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 11: 5 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 11: 6 42.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 11: 7 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 11: ================================================================================ 11: ============================= End of ROCm SMI Log ============================== 8: 8: 8: ======================= ROCm System Management Interface ======================= 8: ================================= Concise Info ================================= 8: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 8: 0 46.0c 90.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 8: 1 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 8: 2 42.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 8: 3 42.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 8: 4 41.0c 96.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 8: 5 49.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 8: 6 40.0c 96.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 8: 7 42.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 8: ================================================================================ 8: ============================= End of ROCm SMI Log ============================== 12: 12: 12: ======================= ROCm System Management Interface ======================= 12: ================================= Concise Info 
================================= 12: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 12: 0 41.0c 97.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 12: 1 44.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 12: 2 39.0c 94.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 12: 3 41.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 12: 4 43.0c 94.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 12: 5 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 12: 6 45.0c 89.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 12: 7 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 12: ================================================================================ 12: ============================= End of ROCm SMI Log ============================== 7: 7: 7: ======================= ROCm System Management Interface ======================= 7: ================================= Concise Info ================================= 7: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 7: 0 48.0c 97.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 7: 1 50.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 7: 2 45.0c 94.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 7: 3 45.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 7: 4 41.0c 88.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 7: 5 47.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 7: 6 44.0c 95.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 7: 7 48.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 7: ================================================================================ 7: ============================= End of ROCm SMI Log ============================== 3: 3: 3: ======================= ROCm System Management Interface ======================= 3: ================================= Concise Info ================================= 3: GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU% 3: 0 47.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 3: 1 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 3: 2 45.0c 87.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 3: 3 46.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 3: 4 46.0c 95.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 3: 5 44.0c N/A 800Mhz 1600Mhz 
0% auto 0.0W 0% 0% 3: 6 45.0c 84.0W 800Mhz 1600Mhz 0% auto 560.0W 0% 0% 3: 7 43.0c N/A 800Mhz 1600Mhz 0% auto 0.0W 0% 0% 3: ================================================================================ 3: ============================= End of ROCm SMI Log ============================== 10: Launching on nid007227 (10/16), master nid006104 port 9999, GPUs 8, CUDA: True 0: Launching on nid006104 (0/16), master nid006104 port 9999, GPUs 8, CUDA: True 5: Launching on nid006690 (5/16), master nid006104 port 9999, GPUs 8, CUDA: True 9: Launching on nid007219 (9/16), master nid006104 port 9999, GPUs 8, CUDA: True 14: Launching on nid007548 (14/16), master nid006104 port 9999, GPUs 8, CUDA: True 4: Launching on nid006656 (4/16), master nid006104 port 9999, GPUs 8, CUDA: True 2: Launching on nid006244 (2/16), master nid006104 port 9999, GPUs 8, CUDA: True 3: Launching on nid006286 (3/16), master nid006104 port 9999, GPUs 8, CUDA: True 12: Launching on nid007496 (12/16), master nid006104 port 9999, GPUs 8, CUDA: True 8: Launching on nid007208 (8/16), master nid006104 port 9999, GPUs 8, CUDA: True 15: Launching on nid007573 (15/16), master nid006104 port 9999, GPUs 8, CUDA: True 1: Launching on nid006113 (1/16), master nid006104 port 9999, GPUs 8, CUDA: True 7: Launching on nid006703 (7/16), master nid006104 port 9999, GPUs 8, CUDA: True 11: Launching on nid007250 (11/16), master nid006104 port 9999, GPUs 8, CUDA: True 6: Launching on nid006698 (6/16), master nid006104 port 9999, GPUs 8, CUDA: True 13: Launching on nid007503 (13/16), master nid006104 port 9999, GPUs 8, CUDA: True 0: using world size: 128, data-parallel-size: 128, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 0: accumulate and all-reduce gradients in fp32 for bfloat16 data type. 0: using torch.bfloat16 for parameters ... 0: ------------------------ arguments ------------------------ 0: abort_on_unmet_fused_kernel_constraints ......... False 0: accumulate_allreduce_grads_in_fp32 .............. 
True 0: adam_beta1 ...................................... 0.9 0: adam_beta2 ...................................... 0.999 0: adam_eps ........................................ 1e-08 0: adlr_autoresume ................................. False 0: adlr_autoresume_interval ........................ 1000 0: apply_query_key_layer_scaling ................... True 0: apply_residual_connection_post_layernorm ........ False 0: attention_dropout ............................... 0.1 0: attention_softmax_in_fp32 ....................... False 0: bert_binary_head ................................ True 0: bert_load ....................................... None 0: bf16 ............................................ True 0: bias_dropout_fusion ............................. True 0: bias_gelu_fusion ................................ True 0: biencoder_projection_dim ........................ 0 0: biencoder_shared_query_context_model ............ False 0: block_data_path ................................. None 0: checkpoint_activations .......................... False 0: checkpoint_in_cpu ............................... False 0: checkpoint_num_layers ........................... 1 0: clip_grad ....................................... 1.0 0: codecarbon_dir .................................. None 0: consumed_train_samples .......................... 0 0: consumed_train_tokens ........................... 0 0: consumed_valid_samples .......................... 0 0: contigious_checkpointing ........................ False 0: cpu_optimizer ................................... False 0: cpu_torch_adam .................................. False 0: curriculum_learning ............................. False 0: data_impl ....................................... mmap 0: data_parallel_size .............................. 128 0: data_path ....................................... ['/scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document'] 0: dataloader_type ................................. 
single 0: DDP_impl ........................................ local 0: decoder_seq_length .............................. None 0: deepscale ....................................... False 0: deepscale_config ................................ None 0: deepspeed ....................................... True 0: deepspeed_activation_checkpointing .............. False 0: deepspeed_config ................................ ds_configs/2072488.json 0: deepspeed_mpi ................................... False 0: distribute_checkpointed_activations ............. False 0: distributed_backend ............................. nccl 0: embed_layernorm ................................. False 0: embedding_path .................................. None 0: encoder_seq_length .............................. 2048 0: eod_mask_loss ................................... False 0: eval_interval ................................... 1000 0: eval_iters ...................................... 1 0: eval_only ....................................... None 0: evidence_data_path .............................. None 0: exit_duration_in_mins ........................... None 0: exit_interval ................................... None 0: ffn_hidden_size ................................. 8192 0: finetune ........................................ False 0: fp16 ............................................ False 0: fp16_lm_cross_entropy ........................... False 0: fp32_residual_connection ........................ False 0: gigaflos_no_embeds .............................. 0 0: global_batch_size ............................... 256 0: glu_activation .................................. None 0: hidden_dropout .................................. 0.1 0: hidden_size ..................................... 2048 0: hysteresis ...................................... 2 0: ict_head_size ................................... None 0: ict_load ........................................ None 0: img_dim ......................................... 
224 0: indexer_batch_size .............................. 128 0: indexer_log_interval ............................ 1000 0: inference ....................................... False 0: init_method_std ................................. 0.02 0: init_method_xavier_uniform ...................... False 0: initial_loss_scale .............................. 4294967296 0: kill_switch_path ................................ kill-switch-1b5 0: kv_channels ..................................... 128 0: layer_norm_fusion ............................... True 0: layernorm_epsilon ............................... 1e-05 0: lazy_mpu_init ................................... None 0: load ............................................ checkpoints_1b5 0: local_rank ...................................... None 0: log_batch_size_to_tensorboard ................... True 0: log_interval .................................... 10 0: log_learning_rate_to_tensorboard ................ True 0: log_level ....................................... None 0: log_level_replica ............................... None 0: log_loss_scale_to_tensorboard ................... True 0: log_num_zeros_in_grad ........................... False 0: log_params_norm ................................. False 0: log_path ........................................ None 0: log_timers_to_tensorboard ....................... True 0: log_validation_ppl_to_tensorboard ............... True 0: loss_on_targets_only ............................ False 0: loss_scale ...................................... None 0: loss_scale_window ............................... 1000 0: lr .............................................. 0.0002 0: lr_decay_iters .................................. None 0: lr_decay_samples ................................ 32109839 0: lr_decay_style .................................. cosine 0: lr_decay_tokens ................................. None 0: lr_warmup_fraction .............................. 
None 0: lr_warmup_iters ................................. 0 0: lr_warmup_samples ............................... 321098 0: make_vocab_size_divisible_by .................... 128 0: mask_prob ....................................... 0.15 0: masked_softmax_fusion ........................... True 0: max_position_embeddings ......................... 2048 0: mean_noise_span_length .......................... None 0: memory_centric_tiled_linear ..................... False 0: merge_file ...................................... gpt2/merges.txt 0: micro_batch_size ................................ 2 0: min_loss_scale .................................. 1.0 0: min_lr .......................................... 2e-05 0: mmap_warmup ..................................... False 0: no_load_optim ................................... None 0: no_load_rng ..................................... None 0: no_save_optim ................................... None 0: no_save_rng ..................................... None 0: noise_density ................................... None 0: num_attention_heads ............................. 16 0: num_channels .................................... 3 0: num_classes ..................................... 1000 0: num_layers ...................................... 28 0: num_layers_per_virtual_pipeline_stage ........... None 0: num_workers ..................................... 2 0: onnx_safe ....................................... None 0: openai_gelu ..................................... False 0: optimizer ....................................... adam 0: optimizer_fusion ................................ True 0: override_lr_scheduler ........................... False 0: pad_vocab_size_to ............................... None 0: params_dtype .................................... torch.bfloat16 0: partition_activations ........................... False 0: patch_dim ....................................... 16 0: pipeline_model_parallel_size .................... 
1 0: position_embedding_type ......................... PositionEmbeddingType.absolute 0: pp_partition_method ............................. None 0: profile_backward ................................ False 0: query_in_block_prob ............................. 0.1 0: rampup_batch_size ............................... None 0: rank ............................................ 0 0: remote_device ................................... none 0: reset_attention_mask ............................ False 0: reset_position_ids .............................. False 0: retriever_report_topk_accuracies ................ [] 0: retriever_score_scaling ......................... False 0: retriever_seq_length ............................ 256 0: reweight_loss_based_on_position_frequency ....... False 0: sample_rate ..................................... 1.0 0: save ............................................ checkpoints_1b5 0: save_interval ................................... 1000 0: scatter_gather_tensors_in_pipeline .............. True 0: scattered_embeddings ............................ False 0: seed ............................................ 1234 0: seq_length ...................................... 2048 0: sgd_momentum .................................... 0.9 0: short_seq_prob .................................. 0.1 0: skip_train_iteration_range ...................... None 0: split ........................................... 949,50,1 0: split_transformers .............................. False 0: sync_tp_duplicated_parameters ................... False 0: synchronize_each_layer .......................... False 0: tensor_model_parallel_size ...................... 1 0: tensorboard_dir ................................. tensorboard_1b5 0: tensorboard_log_interval ........................ 1 0: tensorboard_queue_size .......................... 5 0: test_weighted_split_names ....................... None 0: test_weighted_split_paths ....................... 
None 0: test_weighted_split_paths_path .................. None 0: test_weighted_split_splits ...................... None 0: test_weighted_split_weights ..................... None 0: tile_factor ..................................... 1 0: titles_data_path ................................ None 0: tokenizer_name_or_path .......................... None 0: tokenizer_type .................................. GPT2BPETokenizer 0: train_iters ..................................... None 0: train_samples ................................... 32109839 0: train_tokens .................................... None 0: train_weighted_split_paths ...................... None 0: train_weighted_split_paths_path ................. None 0: universal_checkpoint ............................ False 0: use_bnb_optimizer ............................... False 0: use_checkpoint_lr_scheduler ..................... False 0: use_contiguous_buffers_in_ddp ................... True 0: use_cpu_initialization .......................... None 0: use_one_sent_docs ............................... False 0: use_pin_memory .................................. False 0: valid_num_workers ............................... 2 0: valid_weighted_split_names ...................... None 0: valid_weighted_split_paths ...................... None 0: valid_weighted_split_paths_path ................. None 0: valid_weighted_split_splits ..................... None 0: valid_weighted_split_weights .................... None 0: virtual_pipeline_model_parallel_size ............ None 0: vocab_extra_ids ................................. 0 0: vocab_file ...................................... gpt2/vocab.json 0: weight_decay .................................... 0.1 0: world_size ...................................... 128 0: zero_allgather_bucket_size ...................... 0.0 0: zero_contigious_gradients ....................... False 0: zero_reduce_bucket_size ......................... 0.0 0: zero_reduce_scatter ............................. 
False 0: zero_stage ...................................... 0 0: -------------------- end of arguments --------------------- 0: setting number of micro-batches to constant 1 0: > building GPT2BPETokenizer tokenizer ... 0: > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304) 0: DeepSpeed general environment info: 0: torch install path ............... ['/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch'] 0: torch version .................... 1.13.0+rocm5.2 0: torch cuda version ............... None 0: torch hip version ................ 5.2.21151-afdc89f8 0: nvcc version ..................... None 0: deepspeed install path ........... ['/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed'] 0: deepspeed info ................... 0.7.5, unknown, unknown 0: deepspeed wheel compiled w. ...... torch 1.13, hip 5.1 0: **** Git info for Megatron: git_hash=unknown git_branch=unknown **** 0: > initializing torch distributed ... 0: [2022-11-25 19:11:33,153] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl 15: > setting tensorboard ... 0: > initializing tensor model parallel with size 1 0: > initializing pipeline model parallel with size 1 0: > setting random seeds to 1234 ... 0: > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234 0: > compiling dataset index builder ... 0: make: Entering directory '/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/data' 0: make: Nothing to be done for 'default'. 0: make: Leaving directory '/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/data' 0: >>> done with dataset index builder. Compilation time: 0.094 seconds 0: > compiling and loading fused kernels ... 
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax.cpp -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_hip.cpp [skipped, already hipified] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_hip.h [skipped, already hipified] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_cuda.cu -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_hip.hip [skipped, already hipified] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h -> 
/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_hip.h [skipped, already hipified] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax_hip.h [skipped, already hipified] 0: Total number of unsupported CUDA function calls: 0 0: 0: 0: Total number of replaced kernel launches: 87 0: [1/1] c++ scaled_upper_triang_masked_softmax_hip.cuda.o scaled_upper_triang_masked_softmax_hip.o -shared -L/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/lib -lc10 -lc10_hip -ltorch_cpu -ltorch_hip -ltorch -ltorch_python -L/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib -lamdhip64 -o scaled_upper_triang_masked_softmax_cuda.so 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax.cpp -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax_hip.cpp [skipped, already hipified] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax_cuda.cu -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax_hip.hip [skipped, already hipified] 0: 
/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_hip.h [skipped, already hipified] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax_hip.h [skipped, already hipified] 0: Total number of unsupported CUDA function calls: 0 0: 0: 0: Total number of replaced kernel launches: 63 0: ninja: no work to do. 
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/layer_norm_cuda.cpp -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/layer_norm_cuda.cpp [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/layer_norm_cuda_kernel.cu -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/layer_norm_hip_kernel.hip [skipped, already hipified] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/type_shim.h [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/compat.h [skipped, no changes] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_upper_triang_masked_softmax_hip.h [skipped, already hipified] 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax.h -> /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/scaled_masked_softmax_hip.h [skipped, already hipified] 0: Total number of unsupported CUDA function calls: 0 0: 0: 0: Total number of replaced kernel launches: 67 0: ninja: no work to do. 0: >>> done with compiling and loading fused kernels. 
Compilation time: 20.853 seconds 0: time to initialize megatron (seconds): -29.430 0: [after megatron is initialized] datetime: 2022-11-25 19:12:02 0: building GPT model ... 0: [2022-11-25 19:12:02,629] [INFO] [utils.py:827:see_memory_usage] Before Building Model 0: [2022-11-25 19:12:02,630] [INFO] [utils.py:828:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB 0: [2022-11-25 19:12:02,630] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 28.51 GB, percent = 5.7% 0: SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None 0: Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=1, model=0): 1, ProcessCoord(pipe=0, data=2, model=0): 2, ProcessCoord(pipe=0, data=3, model=0): 3, ProcessCoord(pipe=0, data=4, model=0): 4, ProcessCoord(pipe=0, data=5, model=0): 5, ProcessCoord(pipe=0, data=6, model=0): 6, ProcessCoord(pipe=0, data=7, model=0): 7, ProcessCoord(pipe=0, data=8, model=0): 8, ProcessCoord(pipe=0, data=9, model=0): 9, ProcessCoord(pipe=0, data=10, model=0): 10, ProcessCoord(pipe=0, data=11, model=0): 11, ProcessCoord(pipe=0, data=12, model=0): 12, ProcessCoord(pipe=0, data=13, model=0): 13, ProcessCoord(pipe=0, data=14, model=0): 14, ProcessCoord(pipe=0, data=15, model=0): 15, ProcessCoord(pipe=0, data=16, model=0): 16, ProcessCoord(pipe=0, data=17, model=0): 17, ProcessCoord(pipe=0, data=18, model=0): 18, ProcessCoord(pipe=0, data=19, model=0): 19, ProcessCoord(pipe=0, data=20, model=0): 20, ProcessCoord(pipe=0, data=21, model=0): 21, ProcessCoord(pipe=0, data=22, model=0): 22, ProcessCoord(pi 0: pe=0, data=23, model=0): 23, ProcessCoord(pipe=0, data=24, model=0): 24, ProcessCoord(pipe=0, data=25, model=0): 25, ProcessCoord(pipe=0, data=26, model=0): 26, ProcessCoord(pipe=0, data=27, model=0): 27, ProcessCoord(pipe=0, data=28, model=0): 28, ProcessCoord(pipe=0, data=29, model=0): 29, ProcessCoord(pipe=0, data=30, model=0): 30, ProcessCoord(pipe=0, data=31, model=0): 31, ProcessCoord(pipe=0, data=32, model=0): 
32, ProcessCoord(pipe=0, data=33, model=0): 33, ProcessCoord(pipe=0, data=34, model=0): 34, ProcessCoord(pipe=0, data=35, model=0): 35, ProcessCoord(pipe=0, data=36, model=0): 36, ProcessCoord(pipe=0, data=37, model=0): 37, ProcessCoord(pipe=0, data=38, model=0): 38, ProcessCoord(pipe=0, data=39, model=0): 39, ProcessCoord(pipe=0, data=40, model=0): 40, ProcessCoord(pipe=0, data=41, model=0): 41, ProcessCoord(pipe=0, data=42, model=0): 42, ProcessCoord(pipe=0, data=43, model=0): 43, ProcessCoord(pipe=0, data=44, model=0): 44, ProcessCoord(pipe=0, data=45, model=0): 45, ProcessCoord(pipe=0, data=4 0: 6, model=0): 46, ProcessCoord(pipe=0, data=47, model=0): 47, ProcessCoord(pipe=0, data=48, model=0): 48, ProcessCoord(pipe=0, data=49, model=0): 49, ProcessCoord(pipe=0, data=50, model=0): 50, ProcessCoord(pipe=0, data=51, model=0): 51, ProcessCoord(pipe=0, data=52, model=0): 52, ProcessCoord(pipe=0, data=53, model=0): 53, ProcessCoord(pipe=0, data=54, model=0): 54, ProcessCoord(pipe=0, data=55, model=0): 55, ProcessCoord(pipe=0, data=56, model=0): 56, ProcessCoord(pipe=0, data=57, model=0): 57, ProcessCoord(pipe=0, data=58, model=0): 58, ProcessCoord(pipe=0, data=59, model=0): 59, ProcessCoord(pipe=0, data=60, model=0): 60, ProcessCoord(pipe=0, data=61, model=0): 61, ProcessCoord(pipe=0, data=62, model=0): 62, ProcessCoord(pipe=0, data=63, model=0): 63, ProcessCoord(pipe=0, data=64, model=0): 64, ProcessCoord(pipe=0, data=65, model=0): 65, ProcessCoord(pipe=0, data=66, model=0): 66, ProcessCoord(pipe=0, data=67, model=0): 67, ProcessCoord(pipe=0, data=68, model=0): 68, ProcessCoord(pipe=0, data=69, model=0): 0: 69, ProcessCoord(pipe=0, data=70, model=0): 70, ProcessCoord(pipe=0, data=71, model=0): 71, ProcessCoord(pipe=0, data=72, model=0): 72, ProcessCoord(pipe=0, data=73, model=0): 73, ProcessCoord(pipe=0, data=74, model=0): 74, ProcessCoord(pipe=0, data=75, model=0): 75, ProcessCoord(pipe=0, data=76, model=0): 76, ProcessCoord(pipe=0, data=77, model=0): 77, 
ProcessCoord(pipe=0, data=78, model=0): 78, ProcessCoord(pipe=0, data=79, model=0): 79, ProcessCoord(pipe=0, data=80, model=0): 80, ProcessCoord(pipe=0, data=81, model=0): 81, ProcessCoord(pipe=0, data=82, model=0): 82, ProcessCoord(pipe=0, data=83, model=0): 83, ProcessCoord(pipe=0, data=84, model=0): 84, ProcessCoord(pipe=0, data=85, model=0): 85, ProcessCoord(pipe=0, data=86, model=0): 86, ProcessCoord(pipe=0, data=87, model=0): 87, ProcessCoord(pipe=0, data=88, model=0): 88, ProcessCoord(pipe=0, data=89, model=0): 89, ProcessCoord(pipe=0, data=90, model=0): 90, ProcessCoord(pipe=0, data=91, model=0): 91, ProcessCoord(pipe=0, data=92, model=0): 92, Process 0: Coord(pipe=0, data=93, model=0): 93, ProcessCoord(pipe=0, data=94, model=0): 94, ProcessCoord(pipe=0, data=95, model=0): 95, ProcessCoord(pipe=0, data=96, model=0): 96, ProcessCoord(pipe=0, data=97, model=0): 97, ProcessCoord(pipe=0, data=98, model=0): 98, ProcessCoord(pipe=0, data=99, model=0): 99, ProcessCoord(pipe=0, data=100, model=0): 100, ProcessCoord(pipe=0, data=101, model=0): 101, ProcessCoord(pipe=0, data=102, model=0): 102, ProcessCoord(pipe=0, data=103, model=0): 103, ProcessCoord(pipe=0, data=104, model=0): 104, ProcessCoord(pipe=0, data=105, model=0): 105, ProcessCoord(pipe=0, data=106, model=0): 106, ProcessCoord(pipe=0, data=107, model=0): 107, ProcessCoord(pipe=0, data=108, model=0): 108, ProcessCoord(pipe=0, data=109, model=0): 109, ProcessCoord(pipe=0, data=110, model=0): 110, ProcessCoord(pipe=0, data=111, model=0): 111, ProcessCoord(pipe=0, data=112, model=0): 112, ProcessCoord(pipe=0, data=113, model=0): 113, ProcessCoord(pipe=0, data=114, model=0): 114, ProcessCoord(pipe=0, data=115, mo 0: del=0): 115, ProcessCoord(pipe=0, data=116, model=0): 116, ProcessCoord(pipe=0, data=117, model=0): 117, ProcessCoord(pipe=0, data=118, model=0): 118, ProcessCoord(pipe=0, data=119, model=0): 119, ProcessCoord(pipe=0, data=120, model=0): 120, ProcessCoord(pipe=0, data=121, model=0): 121, 
ProcessCoord(pipe=0, data=122, model=0): 122, ProcessCoord(pipe=0, data=123, model=0): 123, ProcessCoord(pipe=0, data=124, model=0): 124, ProcessCoord(pipe=0, data=125, model=0): 125, ProcessCoord(pipe=0, data=126, model=0): 126, ProcessCoord(pipe=0, data=127, model=0): 127}
0: [2022-11-25 19:12:06,833] [INFO] [module.py:366:_partition_layers] Partitioning pipeline stages with method type:transformer
0: stage=0 layers=35
0: 0: _to_float16
0: 1: EmbeddingPipe
0: 2:
0: 3: ParallelTransformerLayerPipe
0: 4: ParallelTransformerLayerPipe
0: 5: ParallelTransformerLayerPipe
0: 6: ParallelTransformerLayerPipe
0: 7: ParallelTransformerLayerPipe
0: 8: ParallelTransformerLayerPipe
0: 9: ParallelTransformerLayerPipe
0: 10: ParallelTransformerLayerPipe
0: 11: ParallelTransformerLayerPipe
0: 12: ParallelTransformerLayerPipe
0: 13: ParallelTransformerLayerPipe
0: 14: ParallelTransformerLayerPipe
0: 15: ParallelTransformerLayerPipe
0: 16: ParallelTransformerLayerPipe
0: 17: ParallelTransformerLayerPipe
0: 18: ParallelTransformerLayerPipe
0: 19: ParallelTransformerLayerPipe
0: 20: ParallelTransformerLayerPipe
0: 21: ParallelTransformerLayerPipe
0: 22: ParallelTransformerLayerPipe
0: 23: ParallelTransformerLayerPipe
0: 24: ParallelTransformerLayerPipe
0: 25: ParallelTransformerLayerPipe
0: 26: ParallelTransformerLayerPipe
0: 27: ParallelTransformerLayerPipe
0: 28: ParallelTransformerLayerPipe
0: 29: ParallelTransformerLayerPipe
0: 30: ParallelTransformerLayerPipe
0: 31: undo
0: 32: MixedFusedLayerNorm
0: 33: EmbeddingPipe
0: 34: float16_to_fp32
0: loss: CrossEntropy
0: [2022-11-25 19:12:07,187] [INFO] [utils.py:827:see_memory_usage] After Building Model
0: [2022-11-25 19:12:07,188] [INFO] [utils.py:828:see_memory_usage] MA 2.83 GB Max_MA 2.83 GB CA 2.89 GB Max_CA 3 GB
0: [2022-11-25 19:12:07,188] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 28.55 GB, percent = 5.7%
0: setting training iterations to 125429
0: > learning rate decay style: cosine
0: DeepSpeed is enabled.
0: [2022-11-25 19:12:07,190] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.5, git-hash=unknown, git-branch=unknown
0: [2022-11-25 19:12:23,141] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
0: [2022-11-25 19:12:23,142] [INFO] [logging.py:68:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
0: [2022-11-25 19:12:23,142] [INFO] [logging.py:68:log_dist] [Rank 0] Using client Optimizer as basic optimizer
0: [2022-11-25 19:12:23,155] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
0: [2022-11-25 19:12:23,155] [INFO] [logging.py:68:log_dist] [Rank 0] Creating BF16 optimizer
0: [2022-11-25 19:12:23,206] [INFO] [utils.py:827:see_memory_usage] begin bf16_optimizer
0: [2022-11-25 19:12:23,206] [INFO] [utils.py:828:see_memory_usage] MA 2.83 GB Max_MA 2.84 GB CA 2.91 GB Max_CA 3 GB
0: [2022-11-25 19:12:23,207] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.23 GB, percent = 5.8%
3: ninja: no work to do.
3: Time to load utils op: 0.2253568172454834 seconds
0: ninja: no work to do.
0: Time to load utils op: 0.12015628814697266 seconds 5: Time to load utils op: 0.11010336875915527 seconds 5: Time to load utils op: 0.11010122299194336 secondsTime to load utils op: 0.11010932922363281 seconds 5: 5: Time to load utils op: 0.11013221740722656 secondsTime to load utils op: 0.11013174057006836 seconds 5: 5: Time to load utils op: 0.11013603210449219 seconds 5: Time to load utils op: 0.11014056205749512 secondsTime to load utils op: 0.11014151573181152 seconds 5: 9: Time to load utils op: 0.10982537269592285 secondsTime to load utils op: 0.10995674133300781 seconds 9: 9: Time to load utils op: 0.10993218421936035 seconds 9: Time to load utils op: 0.10994553565979004 seconds 9: Time to load utils op: 0.10966038703918457 seconds 9: Time to load utils op: 0.1100320816040039 seconds 9: Time to load utils op: 0.10994243621826172 seconds 8: Time to load utils op: 0.10945963859558105 seconds 8: Time to load utils op: 0.10946488380432129 seconds 8: Time to load utils op: 0.1094975471496582 secondsTime to load utils op: 0.10945010185241699 seconds 8: 8: Time to load utils op: 0.10950970649719238 secondsTime to load utils op: 0.10951066017150879 seconds 8: 8: Time to load utils op: 0.10952115058898926 seconds 8: Time to load utils op: 0.10953021049499512 seconds 12: Time to load utils op: 0.10868453979492188 seconds 12: Time to load utils op: 0.1088876724243164 seconds 12: Time to load utils op: 0.1088566780090332 seconds 12: Time to load utils op: 0.10883069038391113 seconds 12: Time to load utils op: 0.10850691795349121 secondsTime to load utils op: 0.10855770111083984 seconds 12: Time to load utils op: 0.10852646827697754 seconds 12: 6: Time to load utils op: 0.11107540130615234 seconds 6: Time to load utils op: 0.11107826232910156 secondsTime to load utils op: 0.11108970642089844 seconds 6: 6: Time to load utils op: 0.11109089851379395 seconds 6: Time to load utils op: 0.11109805107116699 seconds 6: Time to load utils op: 0.11113786697387695 secondsTime to 
load utils op: 0.11113691329956055 seconds 6: 6: Time to load utils op: 0.11114907264709473 seconds 13: Time to load utils op: 0.1094517707824707 secondsTime to load utils op: 0.10923171043395996 seconds 13: 13: Time to load utils op: 0.10952353477478027 seconds 13: Time to load utils op: 0.10920000076293945 seconds 13: Time to load utils op: 0.10965633392333984 seconds 13: Time to load utils op: 0.10920023918151855 seconds 13: Time to load utils op: 0.10922384262084961 seconds 10: Time to load utils op: 0.11022067070007324 secondsTime to load utils op: 0.11022591590881348 seconds 10: 10: Time to load utils op: 0.11023187637329102 seconds 10: Time to load utils op: 0.11023283004760742 secondsTime to load utils op: 0.11023688316345215 secondsTime to load utils op: 0.11024117469787598 seconds 10: 10: 10: Time to load utils op: 0.11025357246398926 seconds 10: Time to load utils op: 0.11025691032409668 seconds 15: Time to load utils op: 0.11031031608581543 seconds 15: Time to load utils op: 0.10874176025390625 seconds 15: Time to load utils op: 0.10883951187133789 seconds 15: Time to load utils op: 0.11011123657226562 seconds 15: Time to load utils op: 0.10893011093139648 seconds 15: Time to load utils op: 0.10996103286743164 secondsTime to load utils op: 0.10961580276489258 seconds 15: 15: Time to load utils op: 0.11033082008361816 seconds 11: Time to load utils op: 0.11115813255310059 seconds 11: Time to load utils op: 0.11116337776184082 seconds 11: Time to load utils op: 0.11118459701538086 seconds 11: Time to load utils op: 0.11119890213012695 secondsTime to load utils op: 0.11119604110717773 seconds 11: Time to load utils op: 0.11120223999023438 seconds 11: 11: Time to load utils op: 0.11120891571044922 seconds 11: Time to load utils op: 0.11125755310058594 seconds 14: Time to load utils op: 0.10909414291381836 secondsTime to load utils op: 0.10909175872802734 seconds 14: 14: Time to load utils op: 0.10910367965698242 seconds 14: Time to load utils op: 
0.10910701751708984 seconds 14: Time to load utils op: 0.10910987854003906 secondsTime to load utils op: 0.10910964012145996 seconds 14: Time to load utils op: 0.10911154747009277 seconds 14: 14: Time to load utils op: 0.1091318130493164 seconds 12: Time to load utils op: 0.3038933277130127 seconds 0: Time to load utils op: 0.3042416572570801 seconds 13: Time to load utils op: 0.30399084091186523 seconds 9: Time to load utils op: 0.3045938014984131 seconds 4: Time to load utils op: 0.3113100528717041 seconds 7: Time to load utils op: 0.3112506866455078 seconds 0: Time to load utils op: 0.20254731178283691 seconds 0: Time to load utils op: 0.20247745513916016 seconds 0: Time to load utils op: 0.2026834487915039 seconds 0: Time to load utils op: 0.20315861701965332 secondsTime to load utils op: 0.20326542854309082 seconds 0: 0: Time to load utils op: 0.2026526927947998 seconds 3: Time to load utils op: 0.2034006118774414 seconds 3: Time to load utils op: 0.20268774032592773 seconds 3: Time to load utils op: 0.20251107215881348 seconds 3: Time to load utils op: 0.20253753662109375 seconds 3: Time to load utils op: 0.20206093788146973 seconds 3: Time to load utils op: 0.20169281959533691 seconds 3: Time to load utils op: 0.2017059326171875 seconds 4: Time to load utils op: 0.2037370204925537 seconds 4: Time to load utils op: 0.20375752449035645 seconds 4: Time to load utils op: 0.20391297340393066 seconds 4: Time to load utils op: 0.20383501052856445 seconds 4: Time to load utils op: 0.20311331748962402 seconds 4: Time to load utils op: 0.20324254035949707 seconds 4: Time to load utils op: 0.20323562622070312 seconds 0: [2022-11-25 19:12:23,558] [INFO] [utils.py:827:see_memory_usage] before initializing group 0 0: [2022-11-25 19:12:23,558] [INFO] [utils.py:828:see_memory_usage] MA 2.83 GB Max_MA 2.83 GB CA 2.91 GB Max_CA 3 GB 0: [2022-11-25 19:12:23,559] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.23 GB, percent = 5.8% 7: Time to load utils op: 
0.2028491497039795 seconds 7: Time to load utils op: 0.2028064727783203 seconds 7: Time to load utils op: 0.20258545875549316 seconds 7: Time to load utils op: 0.20258116722106934 seconds 7: Time to load utils op: 0.2038109302520752 secondsTime to load utils op: 0.20361018180847168 seconds 7: 7: Time to load utils op: 0.20352482795715332 seconds 3: Time to load utils op: 0.0005812644958496094 seconds 15: Time to load utils op: 0.0008883476257324219 seconds 15: Time to load utils op: 0.0007770061492919922 seconds 0: Time to load utils op: 0.0007276535034179688 seconds 15: Time to load utils op: 0.0012066364288330078 seconds 15: Time to load utils op: 0.0012123584747314453 seconds 15: Time to load utils op: 0.0012695789337158203 seconds 15: Time to load utils op: 0.0011682510375976562 seconds 15: Time to load utils op: 0.0011706352233886719 seconds 15: Time to load utils op: 0.0012006759643554688 seconds 14: Time to load utils op: 0.0013556480407714844 seconds 14: Time to load utils op: 0.0013077259063720703 seconds 14: Time to load utils op: 0.0014982223510742188 seconds 14: Time to load utils op: 0.0014836788177490234 seconds 14: Time to load utils op: 0.001535177230834961 seconds 14: Time to load utils op: 0.001569986343383789 secondsTime to load utils op: 0.0015687942504882812 seconds 14: 14: Time to load utils op: 0.0016016960144042969 seconds 13: Time to load utils op: 0.0010001659393310547 seconds 13: Time to load utils op: 0.0012502670288085938 seconds 13: Time to load utils op: 0.0014584064483642578 seconds 13: Time to load utils op: 0.0013587474822998047 seconds 13: Time to load utils op: 0.0013344287872314453 seconds 13: Time to load utils op: 0.0014743804931640625 seconds 13: Time to load utils op: 0.001384735107421875 seconds 13: Time to load utils op: 0.0013670921325683594 seconds 11: Time to load utils op: 0.0008182525634765625 seconds 9: Time to load utils op: 0.000461578369140625 seconds 1: Time to load utils op: 0.21238040924072266 seconds 1: Time 
to load utils op: 0.21238970756530762 seconds 9: Time to load utils op: 0.0004162788391113281 seconds 1: Time to load utils op: 0.21243000030517578 seconds 1: Time to load utils op: 0.21244263648986816 secondsTime to load utils op: 0.21243929862976074 seconds 1: 9: Time to load utils op: 0.0004680156707763672 seconds 9: Time to load utils op: 0.0004401206970214844 seconds 1: Time to load utils op: 0.21245360374450684 secondsTime to load utils op: 0.21245050430297852 seconds 1: 1: Time to load utils op: 0.2124638557434082 seconds 9: Time to load utils op: 0.0004100799560546875 secondsTime to load utils op: 0.0004012584686279297 secondsTime to load utils op: 0.00042057037353515625 seconds 9: 9: 9: Time to load utils op: 0.0004131793975830078 seconds 2: Time to load utils op: 0.2121419906616211 seconds 2: Time to load utils op: 0.21214747428894043 seconds 11: Time to load utils op: 0.0011599063873291016 seconds 11: Time to load utils op: 0.0012099742889404297 secondsTime to load utils op: 0.0011661052703857422 seconds 11: 11: Time to load utils op: 0.001138448715209961 seconds 11: Time to load utils op: 0.0011820793151855469 seconds 2: Time to load utils op: 0.21216464042663574 seconds 2: Time to load utils op: 0.2121877670288086 seconds 11: Time to load utils op: 0.0011594295501708984 seconds 2: Time to load utils op: 0.21219587326049805 seconds 11: Time to load utils op: 0.0012218952178955078 seconds 2: Time to load utils op: 0.21220779418945312 seconds 2: Time to load utils op: 0.212202787399292 secondsTime to load utils op: 0.2122032642364502 seconds 2: 5: Time to load utils op: 0.0008730888366699219 seconds 5: Time to load utils op: 0.000997304916381836 seconds 5: Time to load utils op: 0.0011723041534423828 secondsTime to load utils op: 0.0011072158813476562 seconds 5: 5: Time to load utils op: 0.0010929107666015625 seconds 5: Time to load utils op: 0.0011050701141357422 seconds 5: Time to load utils op: 0.0011107921600341797 seconds 5: Time to load utils op: 
0.0011970996856689453 seconds 0: Time to load utils op: 0.000331878662109375 seconds 0: Time to load utils op: 0.0004379749298095703 seconds 0: Time to load utils op: 0.00042176246643066406 seconds 0: Time to load utils op: 0.0004105567932128906 seconds 3: Time to load utils op: 0.00039887428283691406 seconds 3: Time to load utils op: 0.00037217140197753906 seconds 0: Time to load utils op: 0.0004029273986816406 seconds 0: Time to load utils op: 0.0004048347473144531 seconds 3: Time to load utils op: 0.000347137451171875 seconds 3: Time to load utils op: 0.00035953521728515625 seconds 3: Time to load utils op: 0.0003209114074707031 seconds 3: Time to load utils op: 0.0003669261932373047 seconds 3: Time to load utils op: 0.0003883838653564453 seconds 8: Time to load utils op: 0.0007698535919189453 seconds 8: Time to load utils op: 0.0009732246398925781 seconds 8: Time to load utils op: 0.000985860824584961 secondsTime to load utils op: 0.0010449886322021484 seconds 8: 8: Time to load utils op: 0.001184701919555664 seconds 8: Time to load utils op: 0.001085519790649414 seconds 8: Time to load utils op: 0.0010077953338623047 seconds 8: Time to load utils op: 0.0010783672332763672 seconds 6: Time to load utils op: 0.0007038116455078125 seconds 12: Time to load utils op: 0.0007162094116210938 seconds 10: Time to load utils op: 0.0008645057678222656 seconds 6: Time to load utils op: 0.0010890960693359375 secondsTime to load utils op: 0.0009794235229492188 seconds 6: 10: Time to load utils op: 0.0012164115905761719 seconds 10: Time to load utils op: 0.001203775405883789 seconds 12: Time to load utils op: 0.0011212825775146484 seconds 12: Time to load utils op: 0.0011584758758544922 seconds 12: Time to load utils op: 0.0010666847229003906 seconds 10: Time to load utils op: 0.0012700557708740234 seconds 10: Time to load utils op: 0.0011980533599853516 secondsTime to load utils op: 0.0011610984802246094 seconds 12: Time to load utils op: 0.0011131763458251953 secondsTime to 
load utils op: 0.0011065006256103516 seconds 10: 10: Time to load utils op: 0.0012123584747314453 seconds 10: Time to load utils op: 0.0012917518615722656 seconds 6: Time to load utils op: 0.0013170242309570312 secondsTime to load utils op: 0.0012211799621582031 seconds 6: 12: 12: Time to load utils op: 0.0011098384857177734 seconds 6: Time to load utils op: 0.0012331008911132812 seconds 12: Time to load utils op: 0.0011551380157470703 seconds 6: Time to load utils op: 0.0012221336364746094 seconds 6: Time to load utils op: 0.0012710094451904297 seconds 1: Time to load utils op: 0.0008311271667480469 seconds 1: Time to load utils op: 0.0009551048278808594 seconds 1: Time to load utils op: 0.0009050369262695312 seconds 1: Time to load utils op: 0.0010221004486083984 seconds 1: Time to load utils op: 0.0011172294616699219 seconds 1: Time to load utils op: 0.00121307373046875 seconds 1: Time to load utils op: 0.0011184215545654297 seconds 1: Time to load utils op: 0.001131296157836914 seconds 4: Time to load utils op: 0.0005033016204833984 seconds 7: Time to load utils op: 0.0004744529724121094 seconds 7: Time to load utils op: 0.000396728515625 seconds 4: Time to load utils op: 0.0004878044128417969 seconds 4: Time to load utils op: 0.00042057037353515625 seconds 7: Time to load utils op: 0.0005142688751220703 seconds 4: Time to load utils op: 0.0005373954772949219 seconds 7: Time to load utils op: 0.0005588531494140625 seconds 0: [2022-11-25 19:12:23,620] [INFO] [utils.py:827:see_memory_usage] after initializing group 0 7: Time to load utils op: 0.0005357265472412109 secondsTime to load utils op: 0.0005593299865722656 seconds 7: 4: Time to load utils op: 0.0006253719329833984 seconds 4: Time to load utils op: 0.0006394386291503906 seconds 7: Time to load utils op: 0.0006148815155029297 seconds 2: Time to load utils op: 0.0010886192321777344 seconds 4: Time to load utils op: 0.0007174015045166016 seconds 7: Time to load utils op: 0.0006394386291503906 seconds 4: Time 
to load utils op: 0.0007121562957763672 seconds 0: [2022-11-25 19:12:23,620] [INFO] [utils.py:828:see_memory_usage] MA 5.81 GB Max_MA 5.81 GB CA 7.36 GB Max_CA 7 GB 2: Time to load utils op: 0.00136566162109375 seconds 0: [2022-11-25 19:12:23,621] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.38 GB, percent = 5.8% 2: Time to load utils op: 0.0013306140899658203 secondsTime to load utils op: 0.0013582706451416016 seconds 2: 2: Time to load utils op: 0.0014233589172363281 secondsTime to load utils op: 0.0013518333435058594 seconds 2: 2: Time to load utils op: 0.001383066177368164 seconds 2: Time to load utils op: 0.0014488697052001953 seconds 0: [2022-11-25 19:12:23,656] [INFO] [utils.py:827:see_memory_usage] before initializing group 1 0: [2022-11-25 19:12:23,657] [INFO] [utils.py:828:see_memory_usage] MA 5.81 GB Max_MA 5.81 GB CA 7.36 GB Max_CA 7 GB 0: [2022-11-25 19:12:23,657] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.38 GB, percent = 5.8% 0: [2022-11-25 19:12:23,692] [INFO] [utils.py:827:see_memory_usage] after initializing group 1 0: [2022-11-25 19:12:23,692] [INFO] [utils.py:828:see_memory_usage] MA 8.52 GB Max_MA 8.52 GB CA 11.39 GB Max_CA 11 GB 0: [2022-11-25 19:12:23,692] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.38 GB, percent = 5.8% 0: [2022-11-25 19:12:23,725] [INFO] [utils.py:827:see_memory_usage] before initializing group 2 0: [2022-11-25 19:12:23,725] [INFO] [utils.py:828:see_memory_usage] MA 8.52 GB Max_MA 8.52 GB CA 11.39 GB Max_CA 11 GB 0: [2022-11-25 19:12:23,725] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.38 GB, percent = 5.8% 0: [2022-11-25 19:12:23,760] [INFO] [utils.py:827:see_memory_usage] after initializing group 2 0: [2022-11-25 19:12:23,761] [INFO] [utils.py:828:see_memory_usage] MA 8.52 GB Max_MA 8.52 GB CA 11.39 GB Max_CA 11 GB 0: [2022-11-25 19:12:23,761] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.38 GB, 
percent = 5.8% 0: [2022-11-25 19:12:23,791] [INFO] [utils.py:827:see_memory_usage] before initialize_optimizer 0: [2022-11-25 19:12:23,792] [INFO] [utils.py:828:see_memory_usage] MA 8.52 GB Max_MA 8.52 GB CA 11.39 GB Max_CA 11 GB 0: [2022-11-25 19:12:23,792] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.38 GB, percent = 5.8% 0: [2022-11-25 19:12:23,828] [INFO] [utils.py:827:see_memory_usage] end initialize_optimizer 0: [2022-11-25 19:12:23,828] [INFO] [utils.py:828:see_memory_usage] MA 8.61 GB Max_MA 8.61 GB CA 11.39 GB Max_CA 11 GB 0: [2022-11-25 19:12:23,828] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.38 GB, percent = 5.8% 0: [2022-11-25 19:12:23,859] [INFO] [utils.py:827:see_memory_usage] end bf16_optimizer 0: [2022-11-25 19:12:23,860] [INFO] [utils.py:828:see_memory_usage] MA 8.61 GB Max_MA 8.61 GB CA 11.39 GB Max_CA 11 GB 0: [2022-11-25 19:12:23,860] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 29.38 GB, percent = 5.8% 0: [2022-11-25 19:12:23,860] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = FusedAdam 0: [2022-11-25 19:12:23,860] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed using client LR scheduler 0: [2022-11-25 19:12:23,860] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = 0: [2022-11-25 19:12:23,860] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 0: [2022-11-25 19:12:23,861] [INFO] [config.py:1007:print] DeepSpeedEngine configuration: 0: [2022-11-25 19:12:23,861] [INFO] [config.py:1011:print] activation_checkpointing_config { 0: "partition_activations": false, 0: "contiguous_memory_optimization": false, 0: "cpu_checkpointing": false, 0: "number_checkpoints": null, 0: "synchronize_checkpoint_boundary": false, 0: "profile": false 0: } 0: [2022-11-25 19:12:23,861] [INFO] [config.py:1011:print] aio_config ................... 
{'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} 0: [2022-11-25 19:12:23,861] [INFO] [config.py:1011:print] amp_enabled .................. False 0: [2022-11-25 19:12:23,861] [INFO] [config.py:1011:print] amp_params ................... False 0: [2022-11-25 19:12:23,861] [INFO] [config.py:1011:print] autotuning_config ............ { 0: "enabled": false, 0: "start_step": null, 0: "end_step": null, 0: "metric_path": null, 0: "arg_mappings": null, 0: "metric": "throughput", 0: "model_info": null, 0: "results_dir": "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/autotuning_results", 0: "exps_dir": "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/autotuning_exps", 0: "overwrite": true, 0: "fast": true, 0: "start_profile_step": 3, 0: "end_profile_step": 5, 0: "tuner_type": "gridsearch", 0: "tuner_early_stopping": 5, 0: "tuner_num_trials": 50, 0: "model_info_path": null, 0: "mp_size": 1, 0: "max_train_batch_size": null, 0: "min_train_batch_size": 1, 0: "max_train_micro_batch_size_per_gpu": 1.024000e+03, 0: "min_train_micro_batch_size_per_gpu": 1, 0: "num_tuning_micro_batch_sizes": 3 0: } 0: [2022-11-25 19:12:23,861] [INFO] [config.py:1011:print] bfloat16_enabled ............. True 0: [2022-11-25 19:12:23,861] [INFO] [config.py:1011:print] checkpoint_parallel_write_pipeline False 0: [2022-11-25 19:12:23,861] [INFO] [config.py:1011:print] checkpoint_tag_validation_enabled True 0: [2022-11-25 19:12:23,861] [INFO] [config.py:1011:print] checkpoint_tag_validation_fail False 0: [2022-11-25 19:12:23,861] [INFO] [config.py:1011:print] comms_config ................. 0: [2022-11-25 19:12:23,861] [INFO] [config.py:1011:print] communication_data_type ...... None 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] curriculum_enabled ........... False 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] curriculum_params ............ False 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] dataloader_drop_last ......... False 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] disable_allgather ............ False 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] dump_state ................... False 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] dynamic_loss_scale_args ...... None 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] eigenvalue_enabled ........... False 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] eigenvalue_gas_boundary_resolution 1 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] eigenvalue_layer_name ........ 
bert.encoder.layer 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] eigenvalue_layer_num ......... 0 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] eigenvalue_max_iter .......... 100 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] eigenvalue_stability ......... 1e-06 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] eigenvalue_tol ............... 0.01 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] eigenvalue_verbose ........... False 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] elasticity_enabled ........... False 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] flops_profiler_config ........ { 0: "enabled": false, 0: "profile_step": 1, 0: "module_depth": -1, 0: "top_modules": 1, 0: "detailed": true, 0: "output_file": null 0: } 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] fp16_auto_cast ............... None 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] fp16_enabled ................. False 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] fp16_master_weights_and_gradients False 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] global_rank .................. 0 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] gradient_accumulation_steps .. 1 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] gradient_clipping ............ 1.0 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] gradient_predivide_factor .... 1.0 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] initial_dynamic_scale ........ 1 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] load_universal_checkpoint .... False 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] loss_scale ................... 1.0 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] memory_breakdown ............. False 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] monitor_config ............... 
0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] nebula_config ................ { 0: "enabled": false, 0: "persistent_storage_path": null, 0: "persistent_time_interval": 100, 0: "num_of_version_in_retention": 2, 0: "enable_nebula_load": true, 0: "load_path": null 0: } 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] optimizer_legacy_fusion ...... False 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] optimizer_name ............... None 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] optimizer_params ............. None 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] pld_enabled .................. False 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] pld_params ................... False 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] prescale_gradients ........... False 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] scheduler_name ............... None 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] scheduler_params ............. None 0: [2022-11-25 19:12:23,862] [INFO] [config.py:1011:print] sparse_attention ............. None 0: [2022-11-25 19:12:23,863] [INFO] [config.py:1011:print] sparse_gradients_enabled ..... False 0: [2022-11-25 19:12:23,863] [INFO] [config.py:1011:print] steps_per_print .............. 2000 0: [2022-11-25 19:12:23,863] [INFO] [config.py:1011:print] train_batch_size ............. 256 0: [2022-11-25 19:12:23,863] [INFO] [config.py:1011:print] train_micro_batch_size_per_gpu 2 0: [2022-11-25 19:12:23,863] [INFO] [config.py:1011:print] use_node_local_storage ....... False 0: [2022-11-25 19:12:23,863] [INFO] [config.py:1011:print] wall_clock_breakdown ......... False 0: [2022-11-25 19:12:23,863] [INFO] [config.py:1011:print] world_size ................... 
128 0: [2022-11-25 19:12:23,863] [INFO] [config.py:1011:print] zero_allow_untested_optimizer False 0: [2022-11-25 19:12:23,863] [INFO] [config.py:1011:print] zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False 0: [2022-11-25 19:12:23,863] [INFO] [config.py:1011:print] zero_enabled ................. False 0: [2022-11-25 19:12:23,863] [INFO] [config.py:1011:print] zero_optimization_stage ...... 
0 0: [2022-11-25 19:12:23,863] [INFO] [config.py:996:print_user_config] json = { 0: "train_micro_batch_size_per_gpu": 2, 0: "train_batch_size": 256, 0: "gradient_clipping": 1.0, 0: "zero_optimization": { 0: "stage": 0 0: }, 0: "bf16": { 0: "enabled": true 0: }, 0: "steps_per_print": 2.000000e+03, 0: "wall_clock_breakdown": false 0: } 0: Time to load utils op: 0.00040793418884277344 seconds 0: [2022-11-25 19:12:23,863] [INFO] [engine.py:87:__init__] CONFIG: micro_batches=1 micro_batch_size=2 0: [2022-11-25 19:12:23,920] [INFO] [engine.py:145:__init__] RANK=0 STAGE=0 LAYERS=35 [0, 35) STAGE_PARAMS=1517252608 (1517.253M) TOTAL_PARAMS=1517252608 (1517.253M) UNIQUE_PARAMS=1517252608 (1517.253M) 0: [2022-11-25 19:12:23,926] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b5/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 0: WARNING: could not find the metadata file checkpoints_1b5 8: [2022-11-25 19:12:23,926] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b5/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 8: [2022-11-25 19:12:23,926] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b5/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 0: will not load any checkpoints and will start from random 14: [2022-11-25 19:12:23,926] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b5/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
4: [2022-11-25 19:12:23,927] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b5/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 4: [2022-11-25 19:12:23,927] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b5/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 11: [2022-11-25 19:12:23,927] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b5/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 1: [2022-11-25 19:12:23,927] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b5/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 11: [2022-11-25 19:12:23,927] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b5/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 6: [2022-11-25 19:12:23,927] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b5/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 6: [2022-11-25 19:12:23,927] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b5/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
13: [2022-11-25 19:12:23,927] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b5/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 13: [2022-11-25 19:12:23,927] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b5/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 3: [2022-11-25 19:12:23,927] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b5/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 3: [2022-11-25 19:12:23,927] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b5/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 2: [2022-11-25 19:12:23,927] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b5/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 2: [2022-11-25 19:12:23,927] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b5/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 5: [2022-11-25 19:12:23,927] [WARNING] [engine.py:2581:load_checkpoint] Unable to find latest file at checkpoints_1b5/latest, if trying to load latest checkpoint please ensure this file exists or pass an explicit checkpoint tag when loading a checkpoint. 
15: time (ms) | load-checkpoint: 8.14 0: estimated model parameters: 1.517252608 0: estimated model parameters without embeddings: 1.410035712 0: [after model, optimizer, and learning rate scheduler are built] datetime: 2022-11-25 19:12:24 0: > building train, validation, and test datasets ... 0: > datasets target sizes (minimum size): 0: train: 32109839 0: validation: 32256 0: test: 256 0: > building train, validation, and test datasets for GPT ... 0: > building dataset index ... 0: reading sizes... 0: reading pointers... 0: reading document index... 0: creating numpy buffer of mmap... 0: creating memory view of numpy buffer...
0: > finished creating indexed dataset in 0.008303 seconds 0: number of documents: 210604984 0: > dataset split: 0: train: 0: document indices in [0, 199864130) total of 199864130 documents 0: validation: 0: document indices in [199864130, 210394379) total of 10530249 documents 0: test: 0: document indices in [210394379, 210604984) total of 210605 documents 0: > WARNING: could not find index map files, building the indices on rank 0 ... 0: > only one epoch required, setting separate_last_epoch to False 0: > elasped time to build and save doc-idx mapping (seconds): 14.646731 0: using: 0: number of documents: 199864130 0: number of epochs: 1 0: sequence length: 2048 0: total number of samples: 173377816 0: > elasped time to build and save sample-idx mapping (seconds): 4.333716 0: > building shuffle index with split [0, 173377816) and [173377816, 173377816) ... 0: > elasped time to build and save shuffle-idx mapping (seconds): 10.433865 0: > loading doc-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_train_indexmap_32109839ns_2048sl_1234s_doc_idx.npy 0: > loading sample-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_train_indexmap_32109839ns_2048sl_1234s_sample_idx.npy 0: > loading shuffle-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_train_indexmap_32109839ns_2048sl_1234s_shuffle_idx.npy 0: loaded indexed file in 0.116 seconds 0: total number of samples: 173377817 0: total number of epochs: 1 0: > WARNING: could not find index map files, building the indices on rank 0 ... 
0: > only one epoch required, setting separate_last_epoch to False 0: > elasped time to build and save doc-idx mapping (seconds): 0.502195 0: using: 0: number of documents: 10530249 0: number of epochs: 1 0: sequence length: 2048 0: total number of samples: 9118344 0: > elasped time to build and save sample-idx mapping (seconds): 0.234318 0: > building shuffle index with split [0, 9118344) and [9118344, 9118344) ... 0: > elasped time to build and save shuffle-idx mapping (seconds): 0.283108 0: > loading doc-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_valid_indexmap_32256ns_2048sl_1234s_doc_idx.npy 0: > loading sample-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_valid_indexmap_32256ns_2048sl_1234s_sample_idx.npy 0: > loading shuffle-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_valid_indexmap_32256ns_2048sl_1234s_shuffle_idx.npy 0: loaded indexed file in 0.045 seconds 0: total number of samples: 9118345 0: total number of epochs: 1 0: > loading doc-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_test_indexmap_256ns_2048sl_1234s_doc_idx.npy 0: > loading sample-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_test_indexmap_256ns_2048sl_1234s_sample_idx.npy 0: > loading shuffle-idx mapping from /scratch/project_462000119/data/pile/megatron_data/meg-gpt2_pile_text_document_test_indexmap_256ns_2048sl_1234s_shuffle_idx.npy 0: loaded indexed file in 0.064 seconds 0: total number of samples: 182928 0: total number of epochs: 1 0: > finished creating GPT datasets ... 0: [after dataloaders are built] datetime: 2022-11-25 19:13:11 0: done with setup ... 0: training ... 
0: Number of parameters: [tensor rank - pipeline rank] w/ and w/o embeddings: 15: time (ms) | model-and-optimizer-setup: 21945.78 | train/valid/test-data-iterators-setup: 45902.29 0: [000-000] 1.5173B / 1.4100B 0: [before the start of training step] datetime: 2022-11-25 19:13:11 0: [Rank 0] (after 10 iterations) memory (MB) | allocated: 12503.4697265625 | max allocated: 39265.31298828125 | reserved: 41632.0 | max reserved: 41632.0 15: iteration 10/ 125429 | consumed samples: 2560 | consumed tokens: 5242880 | elapsed time per iteration (s): 3.00 | learning rate: 1.595E-06 | global batch size: 256 | lm loss: 1.077618E+01 | grad norm: 51.548 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 85.410 | TFLOPs: 14.11 | 15: iteration 20/ 125429 | consumed samples: 5120 | consumed tokens: 10485760 | elapsed time per iteration (s): 1.03 | learning rate: 3.189E-06 | global batch size: 256 | lm loss: 9.059740E+00 | grad norm: 5.782 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.744 | TFLOPs: 41.11 | 15: iteration 30/ 125429 | consumed samples: 7680 | consumed tokens: 15728640 | elapsed time per iteration (s): 1.09 | learning rate: 4.784E-06 | global batch size: 256 | lm loss: 8.482255E+00 | grad norm: 8.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.153 | TFLOPs: 38.70 | 15: iteration 40/ 125429 | consumed samples: 10240 | consumed tokens: 20971520 | elapsed time per iteration (s): 1.09 | learning rate: 6.378E-06 | global batch size: 256 | lm loss: 8.211993E+00 | grad norm: 2.635 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.999 | TFLOPs: 38.67 | 15: iteration 50/ 125429 | consumed samples: 12800 | consumed tokens: 26214400 | elapsed time per iteration (s): 1.04 | learning rate: 7.973E-06 | global batch size: 256 | lm loss: 7.945257E+00 | grad 
norm: 3.083 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.057 | TFLOPs: 40.66 | 15: iteration 60/ 125429 | consumed samples: 15360 | consumed tokens: 31457280 | elapsed time per iteration (s): 1.09 | learning rate: 9.567E-06 | global batch size: 256 | lm loss: 7.741173E+00 | grad norm: 3.894 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.364 | TFLOPs: 38.90 | 15: iteration 70/ 125429 | consumed samples: 17920 | consumed tokens: 36700160 | elapsed time per iteration (s): 1.03 | learning rate: 1.116E-05 | global batch size: 256 | lm loss: 7.504540E+00 | grad norm: 4.321 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.971 | TFLOPs: 41.14 | 15: iteration 80/ 125429 | consumed samples: 20480 | consumed tokens: 41943040 | elapsed time per iteration (s): 1.07 | learning rate: 1.276E-05 | global batch size: 256 | lm loss: 7.255930E+00 | grad norm: 2.660 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.055 | TFLOPs: 39.67 | 15: iteration 90/ 125429 | consumed samples: 23040 | consumed tokens: 47185920 | elapsed time per iteration (s): 1.05 | learning rate: 1.435E-05 | global batch size: 256 | lm loss: 7.095193E+00 | grad norm: 2.264 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.359 | TFLOPs: 40.38 | 15: iteration 100/ 125429 | consumed samples: 25600 | consumed tokens: 52428800 | elapsed time per iteration (s): 1.13 | learning rate: 1.595E-05 | global batch size: 256 | lm loss: 6.902631E+00 | grad norm: 4.451 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.407 | TFLOPs: 37.58 | 15: iteration 110/ 125429 | consumed samples: 28160 | consumed tokens: 57671680 | elapsed time per iteration (s): 1.10 | learning rate: 1.754E-05 | 
global batch size: 256 | lm loss: 6.733528E+00 | grad norm: 2.836 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.641 | TFLOPs: 38.45 | 15: iteration 120/ 125429 | consumed samples: 30720 | consumed tokens: 62914560 | elapsed time per iteration (s): 1.09 | learning rate: 1.913E-05 | global batch size: 256 | lm loss: 6.600061E+00 | grad norm: 2.487 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.912 | TFLOPs: 38.66 | 15: iteration 130/ 125429 | consumed samples: 33280 | consumed tokens: 68157440 | elapsed time per iteration (s): 1.10 | learning rate: 2.073E-05 | global batch size: 256 | lm loss: 6.457391E+00 | grad norm: 3.671 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.702 | TFLOPs: 38.62 | 15: iteration 140/ 125429 | consumed samples: 35840 | consumed tokens: 73400320 | elapsed time per iteration (s): 1.04 | learning rate: 2.232E-05 | global batch size: 256 | lm loss: 6.374030E+00 | grad norm: 2.706 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.770 | TFLOPs: 40.78 | 15: iteration 150/ 125429 | consumed samples: 38400 | consumed tokens: 78643200 | elapsed time per iteration (s): 1.08 | learning rate: 2.392E-05 | global batch size: 256 | lm loss: 6.313078E+00 | grad norm: 2.309 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.417 | TFLOPs: 39.23 | 15: iteration 160/ 125429 | consumed samples: 40960 | consumed tokens: 83886080 | elapsed time per iteration (s): 1.05 | learning rate: 2.551E-05 | global batch size: 256 | lm loss: 6.178280E+00 | grad norm: 2.556 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.696 | TFLOPs: 40.11 | 15: iteration 170/ 125429 | consumed samples: 43520 | consumed tokens: 89128960 | elapsed 
time per iteration (s): 1.04 | learning rate: 2.711E-05 | global batch size: 256 | lm loss: 6.167676E+00 | grad norm: 2.925 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.228 | TFLOPs: 40.53 | 15: iteration 180/ 125429 | consumed samples: 46080 | consumed tokens: 94371840 | elapsed time per iteration (s): 1.06 | learning rate: 2.870E-05 | global batch size: 256 | lm loss: 6.061893E+00 | grad norm: 3.879 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.748 | TFLOPs: 39.95 | 15: iteration 190/ 125429 | consumed samples: 48640 | consumed tokens: 99614720 | elapsed time per iteration (s): 1.06 | learning rate: 3.030E-05 | global batch size: 256 | lm loss: 6.028604E+00 | grad norm: 3.109 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.477 | TFLOPs: 39.74 | 15: iteration 200/ 125429 | consumed samples: 51200 | consumed tokens: 104857600 | elapsed time per iteration (s): 1.09 | learning rate: 3.189E-05 | global batch size: 256 | lm loss: 5.969671E+00 | grad norm: 2.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.504 | TFLOPs: 38.92 | 15: iteration 210/ 125429 | consumed samples: 53760 | consumed tokens: 110100480 | elapsed time per iteration (s): 1.07 | learning rate: 3.349E-05 | global batch size: 256 | lm loss: 5.938224E+00 | grad norm: 3.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.868 | TFLOPs: 39.64 | 15: iteration 220/ 125429 | consumed samples: 56320 | consumed tokens: 115343360 | elapsed time per iteration (s): 1.08 | learning rate: 3.508E-05 | global batch size: 256 | lm loss: 5.863256E+00 | grad norm: 2.643 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.003 | TFLOPs: 39.00 | 15: iteration 230/ 125429 | 
consumed samples: 58880 | consumed tokens: 120586240 | elapsed time per iteration (s): 1.07 | learning rate: 3.667E-05 | global batch size: 256 | lm loss: 5.846325E+00 | grad norm: 3.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.274 | TFLOPs: 39.38 | 15: iteration 240/ 125429 | consumed samples: 61440 | consumed tokens: 125829120 | elapsed time per iteration (s): 1.08 | learning rate: 3.827E-05 | global batch size: 256 | lm loss: 5.824654E+00 | grad norm: 3.041 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.733 | TFLOPs: 39.12 | 15: iteration 250/ 125429 | consumed samples: 64000 | consumed tokens: 131072000 | elapsed time per iteration (s): 1.05 | learning rate: 3.986E-05 | global batch size: 256 | lm loss: 5.758896E+00 | grad norm: 3.534 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.147 | TFLOPs: 40.18 | 15: iteration 260/ 125429 | consumed samples: 66560 | consumed tokens: 136314880 | elapsed time per iteration (s): 1.08 | learning rate: 4.146E-05 | global batch size: 256 | lm loss: 5.734486E+00 | grad norm: 2.272 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.369 | TFLOPs: 39.23 | 15: iteration 270/ 125429 | consumed samples: 69120 | consumed tokens: 141557760 | elapsed time per iteration (s): 1.10 | learning rate: 4.305E-05 | global batch size: 256 | lm loss: 5.662425E+00 | grad norm: 3.362 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.890 | TFLOPs: 38.32 | 15: iteration 280/ 125429 | consumed samples: 71680 | consumed tokens: 146800640 | elapsed time per iteration (s): 1.07 | learning rate: 4.465E-05 | global batch size: 256 | lm loss: 5.644223E+00 | grad norm: 2.744 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 238.875 | TFLOPs: 39.48 | 15: iteration 290/ 125429 | consumed samples: 74240 | consumed tokens: 152043520 | elapsed time per iteration (s): 1.07 | learning rate: 4.624E-05 | global batch size: 256 | lm loss: 5.584693E+00 | grad norm: 2.870 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.623 | TFLOPs: 39.43 | 15: iteration 300/ 125429 | consumed samples: 76800 | consumed tokens: 157286400 | elapsed time per iteration (s): 1.10 | learning rate: 4.784E-05 | global batch size: 256 | lm loss: 5.501810E+00 | grad norm: 3.069 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.789 | TFLOPs: 38.47 | 15: iteration 310/ 125429 | consumed samples: 79360 | consumed tokens: 162529280 | elapsed time per iteration (s): 1.08 | learning rate: 4.943E-05 | global batch size: 256 | lm loss: 5.523584E+00 | grad norm: 2.049 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.238 | TFLOPs: 39.21 | 15: iteration 320/ 125429 | consumed samples: 81920 | consumed tokens: 167772160 | elapsed time per iteration (s): 1.05 | learning rate: 5.102E-05 | global batch size: 256 | lm loss: 5.512503E+00 | grad norm: 2.391 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.306 | TFLOPs: 40.21 | 15: iteration 330/ 125429 | consumed samples: 84480 | consumed tokens: 173015040 | elapsed time per iteration (s): 1.05 | learning rate: 5.262E-05 | global batch size: 256 | lm loss: 5.474942E+00 | grad norm: 2.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.162 | TFLOPs: 40.35 | 15: iteration 340/ 125429 | consumed samples: 87040 | consumed tokens: 178257920 | elapsed time per iteration (s): 1.09 | learning rate: 5.421E-05 | global batch size: 256 | lm loss: 5.441616E+00 | grad norm: 2.150 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.451 | TFLOPs: 38.91 | 15: iteration 350/ 125429 | consumed samples: 89600 | consumed tokens: 183500800 | elapsed time per iteration (s): 1.06 | learning rate: 5.581E-05 | global batch size: 256 | lm loss: 5.392713E+00 | grad norm: 1.785 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.469 | TFLOPs: 39.74 | 15: iteration 360/ 125429 | consumed samples: 92160 | consumed tokens: 188743680 | elapsed time per iteration (s): 1.09 | learning rate: 5.740E-05 | global batch size: 256 | lm loss: 5.352441E+00 | grad norm: 2.071 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.112 | TFLOPs: 38.85 | 15: iteration 370/ 125429 | consumed samples: 94720 | consumed tokens: 193986560 | elapsed time per iteration (s): 1.07 | learning rate: 5.900E-05 | global batch size: 256 | lm loss: 5.286845E+00 | grad norm: 2.266 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.795 | TFLOPs: 39.63 | 15: iteration 380/ 125429 | consumed samples: 97280 | consumed tokens: 199229440 | elapsed time per iteration (s): 1.08 | learning rate: 6.059E-05 | global batch size: 256 | lm loss: 5.294839E+00 | grad norm: 1.967 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.406 | TFLOPs: 39.23 | 15: iteration 390/ 125429 | consumed samples: 99840 | consumed tokens: 204472320 | elapsed time per iteration (s): 1.09 | learning rate: 6.219E-05 | global batch size: 256 | lm loss: 5.248099E+00 | grad norm: 1.695 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.591 | TFLOPs: 38.93 | 15: iteration 400/ 125429 | consumed samples: 102400 | consumed tokens: 209715200 | elapsed time per iteration (s): 1.12 | learning rate: 6.378E-05 | global batch size: 256 | 
lm loss: 5.256598E+00 | grad norm: 1.641 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.057 | TFLOPs: 37.85 | 15: iteration 410/ 125429 | consumed samples: 104960 | consumed tokens: 214958080 | elapsed time per iteration (s): 1.07 | learning rate: 6.538E-05 | global batch size: 256 | lm loss: 5.212085E+00 | grad norm: 1.844 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.872 | TFLOPs: 39.48 | 15: iteration 420/ 125429 | consumed samples: 107520 | consumed tokens: 220200960 | elapsed time per iteration (s): 1.04 | learning rate: 6.697E-05 | global batch size: 256 | lm loss: 5.244009E+00 | grad norm: 2.296 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.795 | TFLOPs: 40.78 | 15: iteration 430/ 125429 | consumed samples: 110080 | consumed tokens: 225443840 | elapsed time per iteration (s): 1.08 | learning rate: 6.856E-05 | global batch size: 256 | lm loss: 5.171355E+00 | grad norm: 1.395 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.170 | TFLOPs: 39.03 | 15: iteration 440/ 125429 | consumed samples: 112640 | consumed tokens: 230686720 | elapsed time per iteration (s): 1.07 | learning rate: 7.016E-05 | global batch size: 256 | lm loss: 5.121366E+00 | grad norm: 1.861 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.039 | TFLOPs: 39.50 | 15: iteration 450/ 125429 | consumed samples: 115200 | consumed tokens: 235929600 | elapsed time per iteration (s): 1.07 | learning rate: 7.175E-05 | global batch size: 256 | lm loss: 5.082193E+00 | grad norm: 1.529 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.969 | TFLOPs: 39.66 | 15: iteration 460/ 125429 | consumed samples: 117760 | consumed tokens: 241172480 | elapsed time per 
iteration (s): 1.12 | learning rate: 7.335E-05 | global batch size: 256 | lm loss: 5.089662E+00 | grad norm: 2.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.671 | TFLOPs: 37.79 | 15: iteration 470/ 125429 | consumed samples: 120320 | consumed tokens: 246415360 | elapsed time per iteration (s): 1.06 | learning rate: 7.494E-05 | global batch size: 256 | lm loss: 5.062902E+00 | grad norm: 1.871 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.239 | TFLOPs: 40.03 | 15: iteration 480/ 125429 | consumed samples: 122880 | consumed tokens: 251658240 | elapsed time per iteration (s): 1.08 | learning rate: 7.654E-05 | global batch size: 256 | lm loss: 5.042426E+00 | grad norm: 1.770 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.364 | TFLOPs: 39.06 | 15: iteration 490/ 125429 | consumed samples: 125440 | consumed tokens: 256901120 | elapsed time per iteration (s): 1.07 | learning rate: 7.813E-05 | global batch size: 256 | lm loss: 4.986177E+00 | grad norm: 2.304 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.036 | TFLOPs: 39.50 | 15: iteration 500/ 125429 | consumed samples: 128000 | consumed tokens: 262144000 | elapsed time per iteration (s): 1.06 | learning rate: 7.973E-05 | global batch size: 256 | lm loss: 4.973347E+00 | grad norm: 1.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.270 | TFLOPs: 39.87 | 15: iteration 510/ 125429 | consumed samples: 130560 | consumed tokens: 267386880 | elapsed time per iteration (s): 1.07 | learning rate: 8.132E-05 | global batch size: 256 | lm loss: 4.946632E+00 | grad norm: 1.308 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.713 | TFLOPs: 39.61 | 15: iteration 520/ 125429 | 
consumed samples: 133120 | consumed tokens: 272629760 | elapsed time per iteration (s): 1.08 | learning rate: 8.292E-05 | global batch size: 256 | lm loss: 4.910819E+00 | grad norm: 1.224 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.192 | TFLOPs: 39.03 | 15: iteration 530/ 125429 | consumed samples: 135680 | consumed tokens: 277872640 | elapsed time per iteration (s): 1.08 | learning rate: 8.451E-05 | global batch size: 256 | lm loss: 4.854698E+00 | grad norm: 1.309 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.372 | TFLOPs: 39.23 | 15: iteration 540/ 125429 | consumed samples: 138240 | consumed tokens: 283115520 | elapsed time per iteration (s): 1.06 | learning rate: 8.610E-05 | global batch size: 256 | lm loss: 4.900752E+00 | grad norm: 1.515 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.197 | TFLOPs: 39.86 | 15: iteration 550/ 125429 | consumed samples: 140800 | consumed tokens: 288358400 | elapsed time per iteration (s): 1.07 | learning rate: 8.770E-05 | global batch size: 256 | lm loss: 4.876602E+00 | grad norm: 1.447 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.762 | TFLOPs: 39.62 | 15: iteration 560/ 125429 | consumed samples: 143360 | consumed tokens: 293601280 | elapsed time per iteration (s): 1.06 | learning rate: 8.929E-05 | global batch size: 256 | lm loss: 4.875106E+00 | grad norm: 1.283 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.844 | TFLOPs: 39.80 | 15: iteration 570/ 125429 | consumed samples: 145920 | consumed tokens: 298844160 | elapsed time per iteration (s): 1.09 | learning rate: 9.089E-05 | global batch size: 256 | lm loss: 4.821296E+00 | grad norm: 1.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 235.801 | TFLOPs: 38.97 | 15: iteration 580/ 125429 | consumed samples: 148480 | consumed tokens: 304087040 | elapsed time per iteration (s): 1.03 | learning rate: 9.248E-05 | global batch size: 256 | lm loss: 4.771640E+00 | grad norm: 1.354 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.359 | TFLOPs: 41.04 | 15: iteration 590/ 125429 | consumed samples: 151040 | consumed tokens: 309329920 | elapsed time per iteration (s): 1.04 | learning rate: 9.408E-05 | global batch size: 256 | lm loss: 4.721441E+00 | grad norm: 1.342 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.795 | TFLOPs: 40.62 | 15: iteration 600/ 125429 | consumed samples: 153600 | consumed tokens: 314572800 | elapsed time per iteration (s): 1.04 | learning rate: 9.567E-05 | global batch size: 256 | lm loss: 4.698285E+00 | grad norm: 1.265 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.184 | TFLOPs: 40.52 | 15: iteration 610/ 125429 | consumed samples: 156160 | consumed tokens: 319815680 | elapsed time per iteration (s): 1.10 | learning rate: 9.727E-05 | global batch size: 256 | lm loss: 4.679316E+00 | grad norm: 1.303 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.565 | TFLOPs: 38.60 | 15: iteration 620/ 125429 | consumed samples: 158720 | consumed tokens: 325058560 | elapsed time per iteration (s): 1.06 | learning rate: 9.886E-05 | global batch size: 256 | lm loss: 4.623370E+00 | grad norm: 1.228 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.903 | TFLOPs: 39.98 | 15: iteration 630/ 125429 | consumed samples: 161280 | consumed tokens: 330301440 | elapsed time per iteration (s): 1.07 | learning rate: 1.005E-04 | global batch size: 256 | lm loss: 4.689996E+00 | grad norm: 1.279 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.624 | TFLOPs: 39.43 | 15: iteration 640/ 125429 | consumed samples: 163840 | consumed tokens: 335544320 | elapsed time per iteration (s): 1.04 | learning rate: 1.020E-04 | global batch size: 256 | lm loss: 4.594407E+00 | grad norm: 1.093 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.480 | TFLOPs: 40.57 | 15: iteration 650/ 125429 | consumed samples: 166400 | consumed tokens: 340787200 | elapsed time per iteration (s): 1.10 | learning rate: 1.036E-04 | global batch size: 256 | lm loss: 4.603118E+00 | grad norm: 1.063 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.181 | TFLOPs: 38.53 | 15: iteration 660/ 125429 | consumed samples: 168960 | consumed tokens: 346030080 | elapsed time per iteration (s): 1.03 | learning rate: 1.052E-04 | global batch size: 256 | lm loss: 4.533239E+00 | grad norm: 1.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.330 | TFLOPs: 41.04 | 15: iteration 670/ 125429 | consumed samples: 171520 | consumed tokens: 351272960 | elapsed time per iteration (s): 1.07 | learning rate: 1.068E-04 | global batch size: 256 | lm loss: 4.472744E+00 | grad norm: 1.471 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.721 | TFLOPs: 39.45 | 15: iteration 680/ 125429 | consumed samples: 174080 | consumed tokens: 356515840 | elapsed time per iteration (s): 1.11 | learning rate: 1.084E-04 | global batch size: 256 | lm loss: 4.466088E+00 | grad norm: 1.104 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.665 | TFLOPs: 38.28 | 15: iteration 690/ 125429 | consumed samples: 176640 | consumed tokens: 361758720 | elapsed time per iteration (s): 1.06 | learning rate: 1.100E-04 | global 
batch size: 256 | lm loss: 4.437840E+00 | grad norm: 1.081 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.501 | TFLOPs: 39.74 | 15: iteration 700/ 125429 | consumed samples: 179200 | consumed tokens: 367001600 | elapsed time per iteration (s): 1.06 | learning rate: 1.116E-04 | global batch size: 256 | lm loss: 4.408956E+00 | grad norm: 1.266 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.934 | TFLOPs: 39.82 | 15: iteration 710/ 125429 | consumed samples: 181760 | consumed tokens: 372244480 | elapsed time per iteration (s): 1.06 | learning rate: 1.132E-04 | global batch size: 256 | lm loss: 4.366637E+00 | grad norm: 1.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.391 | TFLOPs: 39.73 | 15: iteration 720/ 125429 | consumed samples: 184320 | consumed tokens: 377487360 | elapsed time per iteration (s): 1.06 | learning rate: 1.148E-04 | global batch size: 256 | lm loss: 4.366453E+00 | grad norm: 1.108 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.097 | TFLOPs: 39.84 | 15: iteration 730/ 125429 | consumed samples: 186880 | consumed tokens: 382730240 | elapsed time per iteration (s): 1.04 | learning rate: 1.164E-04 | global batch size: 256 | lm loss: 4.298931E+00 | grad norm: 0.919 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.759 | TFLOPs: 40.61 | 15: iteration 740/ 125429 | consumed samples: 189440 | consumed tokens: 387973120 | elapsed time per iteration (s): 1.06 | learning rate: 1.180E-04 | global batch size: 256 | lm loss: 4.234118E+00 | grad norm: 1.233 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.175 | TFLOPs: 39.86 | 15: iteration 750/ 125429 | consumed samples: 192000 | consumed tokens: 393216000 | 
elapsed time per iteration (s): 1.79 | learning rate: 1.196E-04 | global batch size: 256 | lm loss: 4.157243E+00 | grad norm: 1.286 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 142.982 | TFLOPs: 23.63 | 15: iteration 760/ 125429 | consumed samples: 194560 | consumed tokens: 398458880 | elapsed time per iteration (s): 1.07 | learning rate: 1.212E-04 | global batch size: 256 | lm loss: 4.155796E+00 | grad norm: 0.962 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.477 | TFLOPs: 39.41 | 15: iteration 770/ 125429 | consumed samples: 197120 | consumed tokens: 403701760 | elapsed time per iteration (s): 1.06 | learning rate: 1.228E-04 | global batch size: 256 | lm loss: 4.127178E+00 | grad norm: 1.027 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.134 | TFLOPs: 39.85 | 15: iteration 780/ 125429 | consumed samples: 199680 | consumed tokens: 408944640 | elapsed time per iteration (s): 1.08 | learning rate: 1.244E-04 | global batch size: 256 | lm loss: 4.029021E+00 | grad norm: 1.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.394 | TFLOPs: 39.23 | 15: iteration 790/ 125429 | consumed samples: 202240 | consumed tokens: 414187520 | elapsed time per iteration (s): 1.06 | learning rate: 1.260E-04 | global batch size: 256 | lm loss: 3.975605E+00 | grad norm: 1.070 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.232 | TFLOPs: 40.03 | 15: iteration 800/ 125429 | consumed samples: 204800 | consumed tokens: 419430400 | elapsed time per iteration (s): 1.08 | learning rate: 1.276E-04 | global batch size: 256 | lm loss: 3.930988E+00 | grad norm: 1.060 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.646 | TFLOPs: 39.11 | 15: iteration 
810/ 125429 | consumed samples: 207360 | consumed tokens: 424673280 | elapsed time per iteration (s): 1.05 | learning rate: 1.292E-04 | global batch size: 256 | lm loss: 3.880657E+00 | grad norm: 1.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.328 | TFLOPs: 40.38 | 15: iteration 820/ 125429 | consumed samples: 209920 | consumed tokens: 429916160 | elapsed time per iteration (s): 1.09 | learning rate: 1.308E-04 | global batch size: 256 | lm loss: 3.869842E+00 | grad norm: 1.005 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.861 | TFLOPs: 38.81 | 15: iteration 830/ 125429 | consumed samples: 212480 | consumed tokens: 435159040 | elapsed time per iteration (s): 1.04 | learning rate: 1.323E-04 | global batch size: 256 | lm loss: 3.871001E+00 | grad norm: 1.073 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.217 | TFLOPs: 40.52 | 15: iteration 840/ 125429 | consumed samples: 215040 | consumed tokens: 440401920 | elapsed time per iteration (s): 1.03 | learning rate: 1.339E-04 | global batch size: 256 | lm loss: 3.775337E+00 | grad norm: 0.857 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.605 | TFLOPs: 40.92 | 15: iteration 850/ 125429 | consumed samples: 217600 | consumed tokens: 445644800 | elapsed time per iteration (s): 1.06 | learning rate: 1.355E-04 | global batch size: 256 | lm loss: 3.745628E+00 | grad norm: 0.914 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.571 | TFLOPs: 40.09 | 15: iteration 860/ 125429 | consumed samples: 220160 | consumed tokens: 450887680 | elapsed time per iteration (s): 1.04 | learning rate: 1.371E-04 | global batch size: 256 | lm loss: 3.728184E+00 | grad norm: 1.062 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 245.636 | TFLOPs: 40.59 | 15: iteration 870/ 125429 | consumed samples: 222720 | consumed tokens: 456130560 | elapsed time per iteration (s): 1.06 | learning rate: 1.387E-04 | global batch size: 256 | lm loss: 3.735841E+00 | grad norm: 0.910 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.798 | TFLOPs: 39.96 | 15: iteration 880/ 125429 | consumed samples: 225280 | consumed tokens: 461373440 | elapsed time per iteration (s): 1.06 | learning rate: 1.403E-04 | global batch size: 256 | lm loss: 3.729334E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.389 | TFLOPs: 40.06 | 15: iteration 890/ 125429 | consumed samples: 227840 | consumed tokens: 466616320 | elapsed time per iteration (s): 1.07 | learning rate: 1.419E-04 | global batch size: 256 | lm loss: 3.657848E+00 | grad norm: 0.872 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.713 | TFLOPs: 39.45 | 15: iteration 900/ 125429 | consumed samples: 230400 | consumed tokens: 471859200 | elapsed time per iteration (s): 1.04 | learning rate: 1.435E-04 | global batch size: 256 | lm loss: 3.634362E+00 | grad norm: 0.861 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.527 | TFLOPs: 40.58 | 15: iteration 910/ 125429 | consumed samples: 232960 | consumed tokens: 477102080 | elapsed time per iteration (s): 1.06 | learning rate: 1.451E-04 | global batch size: 256 | lm loss: 3.636940E+00 | grad norm: 0.627 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.335 | TFLOPs: 40.05 | 15: iteration 920/ 125429 | consumed samples: 235520 | consumed tokens: 482344960 | elapsed time per iteration (s): 1.06 | learning rate: 1.467E-04 | global batch size: 256 | lm loss: 3.625816E+00 | grad norm: 
0.712 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.989 | TFLOPs: 39.99 | 15: iteration 930/ 125429 | consumed samples: 238080 | consumed tokens: 487587840 | elapsed time per iteration (s): 1.05 | learning rate: 1.483E-04 | global batch size: 256 | lm loss: 3.573460E+00 | grad norm: 0.666 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.978 | TFLOPs: 40.15 | 15: iteration 940/ 125429 | consumed samples: 240640 | consumed tokens: 492830720 | elapsed time per iteration (s): 1.04 | learning rate: 1.499E-04 | global batch size: 256 | lm loss: 3.530424E+00 | grad norm: 0.632 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.372 | TFLOPs: 40.55 | 15: iteration 950/ 125429 | consumed samples: 243200 | consumed tokens: 498073600 | elapsed time per iteration (s): 1.05 | learning rate: 1.515E-04 | global batch size: 256 | lm loss: 3.537252E+00 | grad norm: 0.779 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.914 | TFLOPs: 40.14 | 15: iteration 960/ 125429 | consumed samples: 245760 | consumed tokens: 503316480 | elapsed time per iteration (s): 1.03 | learning rate: 1.531E-04 | global batch size: 256 | lm loss: 3.499709E+00 | grad norm: 0.652 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.081 | TFLOPs: 41.00 | 15: iteration 970/ 125429 | consumed samples: 248320 | consumed tokens: 508559360 | elapsed time per iteration (s): 1.06 | learning rate: 1.547E-04 | global batch size: 256 | lm loss: 3.490598E+00 | grad norm: 0.506 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.898 | TFLOPs: 39.81 | 15: iteration 980/ 125429 | consumed samples: 250880 | consumed tokens: 513802240 | elapsed time per iteration (s): 1.09 | learning rate: 
1.563E-04 | global batch size: 256 | lm loss: 3.439832E+00 | grad norm: 0.529 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.710 | TFLOPs: 38.95 | 15: iteration 990/ 125429 | consumed samples: 253440 | consumed tokens: 519045120 | elapsed time per iteration (s): 1.07 | learning rate: 1.579E-04 | global batch size: 256 | lm loss: 3.445081E+00 | grad norm: 0.637 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.198 | TFLOPs: 39.69 | 15: iteration 1000/ 125429 | consumed samples: 256000 | consumed tokens: 524288000 | elapsed time per iteration (s): 1.06 | learning rate: 1.595E-04 | global batch size: 256 | lm loss: 3.426571E+00 | grad norm: 0.684 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.969 | TFLOPs: 39.99 | 15: ------------------------------------------------------------------------------------------ 15: valid loss at iteration 1000 | lm loss value: 3.378257E+00 | lm loss PPL: 2.931963E+01 | 15: ------------------------------------------------------------------------------------------ 0: saving checkpoint at iteration 1000 to checkpoints_1b5 0: [2022-11-25 19:31:25,817] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step1000 is begin to save! 0: [2022-11-25 19:31:26,126] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_01-model_00-model_states.pt... 0: [2022-11-25 19:31:26,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_01-model_00-model_states.pt. 0: [2022-11-25 19:31:26,484] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_03-model_00-model_states.pt... 0: [2022-11-25 19:31:26,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_03-model_00-model_states.pt. 
0: [2022-11-25 19:31:26,603] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_04-model_00-model_states.pt... 0: [2022-11-25 19:31:26,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_04-model_00-model_states.pt. 0: [2022-11-25 19:31:26,716] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_05-model_00-model_states.pt... 0: [2022-11-25 19:31:26,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_05-model_00-model_states.pt. 0: [2022-11-25 19:31:26,822] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_06-model_00-model_states.pt... 0: [2022-11-25 19:31:26,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_06-model_00-model_states.pt. 0: [2022-11-25 19:31:26,938] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_07-model_00-model_states.pt... 0: [2022-11-25 19:31:27,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_07-model_00-model_states.pt. 0: [2022-11-25 19:31:27,043] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_08-model_00-model_states.pt... 0: [2022-11-25 19:31:27,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_08-model_00-model_states.pt. 0: [2022-11-25 19:31:27,191] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_09-model_00-model_states.pt... 0: [2022-11-25 19:31:27,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_09-model_00-model_states.pt. 
0: [2022-11-25 19:31:27,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_10-model_00-model_states.pt... 0: [2022-11-25 19:31:27,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_10-model_00-model_states.pt. 0: [2022-11-25 19:31:27,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_11-model_00-model_states.pt... 0: [2022-11-25 19:31:27,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_11-model_00-model_states.pt. 0: [2022-11-25 19:31:27,652] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_12-model_00-model_states.pt... 0: [2022-11-25 19:31:27,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_12-model_00-model_states.pt. 0: [2022-11-25 19:31:27,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_13-model_00-model_states.pt... 0: [2022-11-25 19:31:27,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_13-model_00-model_states.pt. 0: [2022-11-25 19:31:27,906] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_14-model_00-model_states.pt... 0: [2022-11-25 19:31:28,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_14-model_00-model_states.pt. 0: [2022-11-25 19:31:28,014] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_15-model_00-model_states.pt... 0: [2022-11-25 19:31:28,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_15-model_00-model_states.pt. 
0: [2022-11-25 19:31:28,121] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_16-model_00-model_states.pt... 0: [2022-11-25 19:31:28,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_16-model_00-model_states.pt. 0: [2022-11-25 19:31:28,253] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_17-model_00-model_states.pt... 0: [2022-11-25 19:31:28,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_17-model_00-model_states.pt. 0: [2022-11-25 19:31:28,370] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_18-model_00-model_states.pt... 0: [2022-11-25 19:31:28,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_18-model_00-model_states.pt. 0: [2022-11-25 19:31:28,474] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_19-model_00-model_states.pt... 0: [2022-11-25 19:31:28,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_19-model_00-model_states.pt. 0: [2022-11-25 19:31:28,581] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_20-model_00-model_states.pt... 0: [2022-11-25 19:31:28,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_20-model_00-model_states.pt. 0: [2022-11-25 19:31:28,689] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_21-model_00-model_states.pt... 0: [2022-11-25 19:31:28,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_21-model_00-model_states.pt. 
0: [2022-11-25 19:31:28,795] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_22-model_00-model_states.pt... 0: [2022-11-25 19:31:28,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_22-model_00-model_states.pt. 0: [2022-11-25 19:31:28,904] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_23-model_00-model_states.pt... 0: [2022-11-25 19:31:29,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_23-model_00-model_states.pt. 0: [2022-11-25 19:31:29,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_24-model_00-model_states.pt... 0: [2022-11-25 19:31:29,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_24-model_00-model_states.pt. 0: [2022-11-25 19:31:29,144] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_25-model_00-model_states.pt... 0: [2022-11-25 19:31:29,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_25-model_00-model_states.pt. 0: [2022-11-25 19:31:29,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_26-model_00-model_states.pt... 0: [2022-11-25 19:31:29,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_26-model_00-model_states.pt. 0: [2022-11-25 19:31:29,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_27-model_00-model_states.pt... 0: [2022-11-25 19:31:29,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_27-model_00-model_states.pt. 
0: [2022-11-25 19:31:29,523] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_28-model_00-model_states.pt... 0: [2022-11-25 19:31:29,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_28-model_00-model_states.pt. 0: [2022-11-25 19:31:29,630] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_29-model_00-model_states.pt... 0: [2022-11-25 19:31:29,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_29-model_00-model_states.pt. 0: [2022-11-25 19:31:29,737] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_30-model_00-model_states.pt... 0: [2022-11-25 19:31:29,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_30-model_00-model_states.pt. 0: [2022-11-25 19:31:29,844] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/layer_32-model_00-model_states.pt... 0: [2022-11-25 19:31:29,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/layer_32-model_00-model_states.pt. 0: [2022-11-25 19:31:29,850] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step1000/mp_rank_00_model_states.pt 0: [2022-11-25 19:31:29,850] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/mp_rank_00_model_states.pt... 0: [2022-11-25 19:31:29,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/mp_rank_00_model_states.pt. 0: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
0: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
12: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
14: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
13: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
8: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
5: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
13: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
12: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
1: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 5: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
11: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 1: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 19:31:29,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step1000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 0: [2022-11-25 19:31:30,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:31:30,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 19:31:30,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 0: [2022-11-25 19:31:30,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:31:30,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 19:31:30,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
0: [2022-11-25 19:31:30,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:31:30,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 19:31:30,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 0: [2022-11-25 19:31:30,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:31:30,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 19:31:30,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 15: [2022-11-25 19:31:30,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:31:30,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:31:30,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:31:30,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:31:30,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-25 19:31:30,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 19:31:30,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 19:31:30,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 19:31:30,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 19:31:30,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 9: [2022-11-25 19:31:30,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 9: [2022-11-25 19:31:30,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 9: [2022-11-25 19:31:30,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 9: [2022-11-25 19:31:30,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:31:30,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 19:31:30,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 8: [2022-11-25 19:31:30,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-25 19:31:30,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:31:30,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:31:30,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 19:31:30,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 19:31:30,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 19:31:30,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 8: [2022-11-25 19:31:30,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 8: [2022-11-25 19:31:30,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 13: [2022-11-25 19:31:30,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:31:30,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:31:30,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-25 19:31:30,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 19:31:30,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 13: [2022-11-25 19:31:30,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 19:31:30,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 19:31:30,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 13: [2022-11-25 19:31:30,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 19:31:30,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:31:30,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 19:31:30,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 10: [2022-11-25 19:31:30,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:31:30,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 19:31:30,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
10: [2022-11-25 19:31:30,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:31:30,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 19:31:30,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 19:31:30,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:31:30,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:31:30,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 19:31:30,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:31:30,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:31:30,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 19:31:30,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 19:31:30,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
4: [2022-11-25 19:31:30,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 19:31:30,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 19:31:30,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 19:31:30,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 19:31:30,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:31:30,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 19:31:30,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 19:31:30,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:31:30,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 19:31:30,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 19:31:30,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-25 19:31:30,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 19:31:30,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 19:31:30,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:31:30,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 19:31:30,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:31:30,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 19:31:30,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 19:31:30,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 19:31:30,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:31:30,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 19:31:30,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 9: [2022-11-25 19:31:30,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-25 19:31:30,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 19:31:30,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 9: [2022-11-25 19:31:30,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:31:30,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 19:31:30,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 9: [2022-11-25 19:31:30,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 19:31:30,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 19:31:30,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 19:31:30,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:31:30,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 19:31:30,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 10: [2022-11-25 19:31:30,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-25 19:31:30,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 19:31:30,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 13: [2022-11-25 19:31:30,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:31:30,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:31:30,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 19:31:30,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 8: [2022-11-25 19:31:30,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 19:31:30,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 8: [2022-11-25 19:31:30,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:31:30,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 19:31:30,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
15: [2022-11-25 19:31:30,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 19:31:30,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 15: [2022-11-25 19:31:30,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:31:30,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:31:30,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:31:30,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 19:31:30,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 19:31:30,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 15: [2022-11-25 19:31:30,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 19:31:30,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 15: [2022-11-25 19:31:30,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 15: [2022-11-25 19:31:30,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-25 19:31:30,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 19:31:30,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 15: [2022-11-25 19:31:30,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:31:30,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 19:31:30,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 15: [2022-11-25 19:31:30,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:31:30,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 19:31:30,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 15: [2022-11-25 19:31:30,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 19:31:30,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 19:31:30,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 10: [2022-11-25 19:31:30,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-25 19:31:30,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 19:31:30,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 10: [2022-11-25 19:31:30,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:31:30,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 19:31:30,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 10: [2022-11-25 19:31:30,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:31:30,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 19:31:30,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 11: [2022-11-25 19:31:30,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:31:30,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:31:30,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:31:30,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-25 19:31:30,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 19:31:30,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 19:31:30,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 19:31:30,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 19:31:30,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 11: [2022-11-25 19:31:30,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 11: [2022-11-25 19:31:30,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 11: [2022-11-25 19:31:30,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 19:31:30,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:31:30,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:31:30,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-25 19:31:30,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 19:31:30,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 19:31:30,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 19:31:30,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 19:31:30,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 19:31:30,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 11: [2022-11-25 19:31:30,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:31:30,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 6: [2022-11-25 19:31:30,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:31:30,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 19:31:30,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 19:31:30,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-25 19:31:30,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 19:31:30,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 19:31:30,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:31:30,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 19:31:30,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 19:31:30,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:31:30,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 0: [2022-11-25 19:31:30,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:31:30,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 19:31:30,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 19:31:30,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 0: [2022-11-25 19:31:30,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-25 19:31:30,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 19:31:30,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 10: [2022-11-25 19:31:30,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:31:30,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 19:31:30,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 19:31:30,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:31:30,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 19:31:30,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 19:31:30,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 19:31:30,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 19:31:30,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 4: [2022-11-25 19:31:30,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-25 19:31:30,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 19:31:30,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 19:31:30,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:31:30,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 19:31:30,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 14: [2022-11-25 19:31:30,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:31:30,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 19:31:30,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 14: [2022-11-25 19:31:30,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:31:30,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 19:31:30,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 14: [2022-11-25 19:31:30,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-25 19:31:30,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 19:31:30,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 14: [2022-11-25 19:31:30,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:31:30,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 19:31:30,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 14: [2022-11-25 19:31:30,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:31:30,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 19:31:30,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 14: [2022-11-25 19:31:30,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:31:30,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 19:31:30,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-25 19:31:30,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 19:31:30,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 19:31:30,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 19:31:30,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 14: [2022-11-25 19:31:30,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 14: [2022-11-25 19:31:30,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 12: [2022-11-25 19:31:30,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:31:30,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 19:31:30,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:31:30,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:31:30,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
12: [2022-11-25 19:31:30,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 19:31:30,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 19:31:30,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 12: [2022-11-25 19:31:30,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 19:31:30,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:31:30,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 19:31:30,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 12: [2022-11-25 19:31:30,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:31:30,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:31:30,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 19:31:30,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 19:31:30,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-25 19:31:30,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 12: [2022-11-25 19:31:30,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 12: [2022-11-25 19:31:30,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 19:31:30,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 19:31:30,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:31:30,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 19:31:30,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 19:31:30,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:31:30,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:31:30,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:31:30,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:31:30,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-25 19:31:30,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:31:30,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 19:31:30,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 19:31:30,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 19:31:30,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 19:31:30,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 19:31:30,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 19:31:30,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 19:31:30,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 19:31:30,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 19:31:30,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 19:31:30,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
5: [2022-11-25 19:31:30,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 19:31:30,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:31:30,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:31:30,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:31:30,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:31:30,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 19:31:30,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 19:31:30,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:31:30,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 19:31:30,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 19:31:30,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-25 19:31:30,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:31:30,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 19:31:30,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 19:31:30,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 19:31:30,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 19:31:30,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 19:31:30,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 19:31:30,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 19:31:30,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 19:31:30,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 7: [2022-11-25 19:31:30,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 11: [2022-11-25 19:31:30,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 11: [2022-11-25 19:31:30,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-25 19:31:30,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 19:31:30,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 11: [2022-11-25 19:31:30,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:31:30,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 19:31:30,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 11: [2022-11-25 19:31:30,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 19:31:30,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 19:31:30,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 19:31:30,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:31:30,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 19:31:30,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 19:31:30,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
7: [2022-11-25 19:31:30,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 19:31:30,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 19:31:30,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 8: [2022-11-25 19:31:30,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:31:30,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:31:30,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 19:31:30,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 19:31:30,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 19:31:30,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 19:31:30,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:31:30,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 19:31:30,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
13: [2022-11-25 19:31:30,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 19:31:30,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 13: [2022-11-25 19:31:30,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:31:30,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 19:31:30,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 12: [2022-11-25 19:31:30,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 19:31:30,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 19:31:30,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 5: [2022-11-25 19:31:30,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 19:31:30,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 19:31:30,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 19:31:30,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-25 19:31:30,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 19:31:30,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 19:31:30,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:31:30,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 19:31:30,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 8: [2022-11-25 19:31:30,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:31:30,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 19:31:30,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 0: [2022-11-25 19:31:30,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:31:30,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:31:30,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 19:31:30,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
3: [2022-11-25 19:31:30,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:31:30,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 19:31:30,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 19:31:30,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:31:30,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 19:31:30,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 19:31:30,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:31:30,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 19:31:30,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 6: [2022-11-25 19:31:30,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 19:31:30,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 19:31:30,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
13: [2022-11-25 19:31:30,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:31:30,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 19:31:30,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 13: [2022-11-25 19:31:30,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 19:31:30,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 19:31:30,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 19:31:30,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:31:30,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 19:31:30,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 19:31:30,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 12: [2022-11-25 19:31:30,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-25 19:31:30,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 19:31:30,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 2: [2022-11-25 19:31:30,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 19:31:30,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 19:31:30,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 19:31:30,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 19:31:30,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 19:31:30,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:31:30,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 19:31:30,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 19:31:30,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-25 19:31:30,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 19:31:30,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 0: [2022-11-25 19:31:30,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 19:31:30,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 3: [2022-11-25 19:31:30,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 19:31:30,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 19:31:30,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 1: [2022-11-25 19:31:30,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:31:30,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 19:31:30,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 19:31:30,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 19:31:30,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 
1: [2022-11-25 19:31:30,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 8: [2022-11-25 19:31:30,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 19:31:30,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step1000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 19:31:30,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step1000 is ready now! 0: successfully saved checkpoint at iteration 1000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 4772.90 15: iteration 1010/ 125429 | consumed samples: 258560 | consumed tokens: 529530880 | elapsed time per iteration (s): 1.57 | learning rate: 1.610E-04 | global batch size: 256 | lm loss: 3.449520E+00 | grad norm: 0.578 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 162.694 | TFLOPs: 26.89 | 15: iteration 1020/ 125429 | consumed samples: 261120 | consumed tokens: 534773760 | elapsed time per iteration (s): 1.03 | learning rate: 1.626E-04 | global batch size: 256 | lm loss: 3.402298E+00 | grad norm: 0.547 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.187 | TFLOPs: 41.18 | 15: iteration 1030/ 125429 | consumed samples: 263680 | consumed tokens: 540016640 | elapsed time per iteration (s): 1.04 | learning rate: 1.642E-04 | global batch size: 256 | lm loss: 3.389581E+00 | grad norm: 0.559 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.163 | TFLOPs: 40.52 | 15: iteration 1040/ 125429 | consumed samples: 266240 | consumed tokens: 545259520 | elapsed time per iteration (s): 1.05 | learning rate: 1.658E-04 | global batch size: 256 | lm loss: 3.433636E+00 | grad norm: 0.661 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.357 | TFLOPs: 40.38 | 15: iteration 1050/ 125429 | consumed samples: 268800 | consumed tokens: 550502400 | elapsed time per iteration (s): 1.06 | learning rate: 1.674E-04 | global batch size: 256 | lm loss: 3.408562E+00 | grad norm: 0.598 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.297 | TFLOPs: 39.88 | 15: iteration 1060/ 125429 | consumed samples: 271360 | consumed tokens: 555745280 | elapsed time per iteration (s): 1.04 | learning rate: 1.690E-04 | global batch size: 256 | lm loss: 3.371330E+00 | grad norm: 0.789 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.878 | TFLOPs: 40.63 | 15: iteration 1070/ 125429 | consumed samples: 273920 | consumed tokens: 560988160 | elapsed time per iteration (s): 1.03 | learning rate: 1.706E-04 | global batch size: 256 | lm loss: 3.362850E+00 | grad norm: 0.597 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.343 | TFLOPs: 41.04 | 15: iteration 1080/ 125429 | consumed samples: 276480 | consumed tokens: 566231040 | elapsed time per iteration (s): 1.03 | learning rate: 1.722E-04 | global batch size: 256 | lm loss: 3.303613E+00 | grad norm: 0.607 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.459 | TFLOPs: 41.22 | 15: iteration 1090/ 125429 | consumed samples: 279040 | consumed tokens: 571473920 | elapsed time per iteration (s): 1.02 | learning rate: 1.738E-04 | global batch size: 256 | lm loss: 3.321942E+00 | grad norm: 0.545 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.874 | TFLOPs: 41.29 | 15: iteration 1100/ 125429 | consumed samples: 281600 | consumed tokens: 576716800 | elapsed time per iteration (s): 1.05 | learning rate: 1.754E-04 | global batch size: 
256 | lm loss: 3.346019E+00 | grad norm: 0.537 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.913 | TFLOPs: 40.47 | 15: iteration 1110/ 125429 | consumed samples: 284160 | consumed tokens: 581959680 | elapsed time per iteration (s): 1.02 | learning rate: 1.770E-04 | global batch size: 256 | lm loss: 3.294529E+00 | grad norm: 0.448 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.439 | TFLOPs: 41.55 | 15: iteration 1120/ 125429 | consumed samples: 286720 | consumed tokens: 587202560 | elapsed time per iteration (s): 1.05 | learning rate: 1.786E-04 | global batch size: 256 | lm loss: 3.315728E+00 | grad norm: 0.657 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.065 | TFLOPs: 40.17 | 15: iteration 1130/ 125429 | consumed samples: 289280 | consumed tokens: 592445440 | elapsed time per iteration (s): 1.02 | learning rate: 1.802E-04 | global batch size: 256 | lm loss: 3.320697E+00 | grad norm: 0.532 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.668 | TFLOPs: 41.42 | 15: iteration 1140/ 125429 | consumed samples: 291840 | consumed tokens: 597688320 | elapsed time per iteration (s): 1.05 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 3.260020E+00 | grad norm: 0.428 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.841 | TFLOPs: 40.13 | 15: iteration 1150/ 125429 | consumed samples: 294400 | consumed tokens: 602931200 | elapsed time per iteration (s): 1.05 | learning rate: 1.834E-04 | global batch size: 256 | lm loss: 3.290100E+00 | grad norm: 0.629 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.185 | TFLOPs: 40.19 | 15: iteration 1160/ 125429 | consumed samples: 296960 | consumed tokens: 608174080 | elapsed 
time per iteration (s): 1.03 | learning rate: 1.850E-04 | global batch size: 256 | lm loss: 3.275365E+00 | grad norm: 0.619 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.117 | TFLOPs: 41.00 | 15: iteration 1170/ 125429 | consumed samples: 299520 | consumed tokens: 613416960 | elapsed time per iteration (s): 1.04 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 3.281885E+00 | grad norm: 0.546 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.690 | TFLOPs: 40.77 | 15: iteration 1180/ 125429 | consumed samples: 302080 | consumed tokens: 618659840 | elapsed time per iteration (s): 1.04 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 3.285094E+00 | grad norm: 0.524 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.502 | TFLOPs: 40.74 | 15: iteration 1190/ 125429 | consumed samples: 304640 | consumed tokens: 623902720 | elapsed time per iteration (s): 1.02 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 3.287851E+00 | grad norm: 0.590 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.254 | TFLOPs: 41.36 | 15: iteration 1200/ 125429 | consumed samples: 307200 | consumed tokens: 629145600 | elapsed time per iteration (s): 1.06 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 3.252607E+00 | grad norm: 0.484 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.439 | TFLOPs: 39.90 | 15: iteration 1210/ 125429 | consumed samples: 309760 | consumed tokens: 634388480 | elapsed time per iteration (s): 1.04 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 3.233026E+00 | grad norm: 0.481 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.538 | TFLOPs: 40.74 | 15: iteration 
1220/ 125429 | consumed samples: 312320 | consumed tokens: 639631360 | elapsed time per iteration (s): 1.04 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 3.167450E+00 | grad norm: 0.510 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.725 | TFLOPs: 40.61 | 15: iteration 1230/ 125429 | consumed samples: 314880 | consumed tokens: 644874240 | elapsed time per iteration (s): 1.06 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 3.218959E+00 | grad norm: 0.505 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.874 | TFLOPs: 39.97 | 15: iteration 1240/ 125429 | consumed samples: 317440 | consumed tokens: 650117120 | elapsed time per iteration (s): 1.04 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 3.188704E+00 | grad norm: 0.500 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.103 | TFLOPs: 40.51 | 15: iteration 1250/ 125429 | consumed samples: 320000 | consumed tokens: 655360000 | elapsed time per iteration (s): 1.02 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 3.194676E+00 | grad norm: 0.531 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.224 | TFLOPs: 41.35 | 15: iteration 1260/ 125429 | consumed samples: 322560 | consumed tokens: 660602880 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.136413E+00 | grad norm: 0.534 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.379 | TFLOPs: 41.21 | 15: iteration 1270/ 125429 | consumed samples: 325120 | consumed tokens: 665845760 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.190918E+00 | grad norm: 0.454 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 245.742 | TFLOPs: 40.61 | 15: iteration 1280/ 125429 | consumed samples: 327680 | consumed tokens: 671088640 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.350917E+00 | grad norm: 2.454 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.961 | TFLOPs: 41.14 | 15: iteration 1290/ 125429 | consumed samples: 330240 | consumed tokens: 676331520 | elapsed time per iteration (s): 1.07 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.244411E+00 | grad norm: 1.414 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.157 | TFLOPs: 39.52 | 15: iteration 1300/ 125429 | consumed samples: 332800 | consumed tokens: 681574400 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.288733E+00 | grad norm: 0.986 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.118 | TFLOPs: 40.51 | 15: iteration 1310/ 125429 | consumed samples: 335360 | consumed tokens: 686817280 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.240121E+00 | grad norm: 0.572 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.838 | TFLOPs: 40.79 | 15: iteration 1320/ 125429 | consumed samples: 337920 | consumed tokens: 692060160 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.183672E+00 | grad norm: 0.448 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.297 | TFLOPs: 41.03 | 15: iteration 1330/ 125429 | consumed samples: 340480 | consumed tokens: 697303040 | elapsed time per iteration (s): 1.09 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.157219E+00 | grad 
norm: 0.384 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.327 | TFLOPs: 38.89 | 15: iteration 1340/ 125429 | consumed samples: 343040 | consumed tokens: 702545920 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.150960E+00 | grad norm: 0.354 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.972 | TFLOPs: 40.48 | 15: iteration 1350/ 125429 | consumed samples: 345600 | consumed tokens: 707788800 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.114101E+00 | grad norm: 0.360 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.347 | TFLOPs: 40.55 | 15: iteration 1360/ 125429 | consumed samples: 348160 | consumed tokens: 713031680 | elapsed time per iteration (s): 1.02 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.075077E+00 | grad norm: 0.486 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.553 | TFLOPs: 41.41 | 15: iteration 1370/ 125429 | consumed samples: 350720 | consumed tokens: 718274560 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.112844E+00 | grad norm: 0.457 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.769 | TFLOPs: 40.45 | 15: iteration 1380/ 125429 | consumed samples: 353280 | consumed tokens: 723517440 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.110508E+00 | grad norm: 0.405 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.689 | TFLOPs: 40.60 | 15: iteration 1390/ 125429 | consumed samples: 355840 | consumed tokens: 728760320 | elapsed time per iteration (s): 1.04 | 
learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.095189E+00 | grad norm: 0.361 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.334 | TFLOPs: 40.71 | 15: iteration 1400/ 125429 | consumed samples: 358400 | consumed tokens: 734003200 | elapsed time per iteration (s): 1.06 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.127606E+00 | grad norm: 0.361 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.493 | TFLOPs: 39.91 | 15: iteration 1410/ 125429 | consumed samples: 360960 | consumed tokens: 739246080 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.091992E+00 | grad norm: 0.401 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.853 | TFLOPs: 40.96 | 15: iteration 1420/ 125429 | consumed samples: 363520 | consumed tokens: 744488960 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.043471E+00 | grad norm: 0.427 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.205 | TFLOPs: 41.02 | 15: iteration 1430/ 125429 | consumed samples: 366080 | consumed tokens: 749731840 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.041308E+00 | grad norm: 0.428 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.343 | TFLOPs: 41.21 | 15: iteration 1440/ 125429 | consumed samples: 368640 | consumed tokens: 754974720 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.073335E+00 | grad norm: 0.416 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.786 | TFLOPs: 40.62 | 15: iteration 1450/ 125429 | consumed samples: 
371200 | consumed tokens: 760217600 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.055293E+00 | grad norm: 0.369 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.529 | TFLOPs: 40.41 | 15: iteration 1460/ 125429 | consumed samples: 373760 | consumed tokens: 765460480 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.065273E+00 | grad norm: 0.405 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.066 | TFLOPs: 41.16 | 15: iteration 1470/ 125429 | consumed samples: 376320 | consumed tokens: 770703360 | elapsed time per iteration (s): 1.08 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.025066E+00 | grad norm: 0.375 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.729 | TFLOPs: 39.29 | 15: iteration 1480/ 125429 | consumed samples: 378880 | consumed tokens: 775946240 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.044939E+00 | grad norm: 0.408 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.400 | TFLOPs: 41.22 | 15: iteration 1490/ 125429 | consumed samples: 381440 | consumed tokens: 781189120 | elapsed time per iteration (s): 1.06 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.058204E+00 | grad norm: 0.427 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.590 | TFLOPs: 40.09 | 15: iteration 1500/ 125429 | consumed samples: 384000 | consumed tokens: 786432000 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.030007E+00 | grad norm: 0.420 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 248.335 | TFLOPs: 41.04 | 15: iteration 1510/ 125429 | consumed samples: 386560 | consumed tokens: 791674880 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.031647E+00 | grad norm: 0.403 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.782 | TFLOPs: 40.29 | 15: iteration 1520/ 125429 | consumed samples: 389120 | consumed tokens: 796917760 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.035237E+00 | grad norm: 0.647 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.401 | TFLOPs: 40.22 | 15: iteration 1530/ 125429 | consumed samples: 391680 | consumed tokens: 802160640 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.028982E+00 | grad norm: 0.384 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.386 | TFLOPs: 40.39 | 15: iteration 1540/ 125429 | consumed samples: 394240 | consumed tokens: 807403520 | elapsed time per iteration (s): 1.06 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.030064E+00 | grad norm: 0.342 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.928 | TFLOPs: 39.82 | 15: iteration 1550/ 125429 | consumed samples: 396800 | consumed tokens: 812646400 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.996938E+00 | grad norm: 0.382 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.951 | TFLOPs: 41.14 | 15: iteration 1560/ 125429 | consumed samples: 399360 | consumed tokens: 817889280 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.976325E+00 | grad norm: 0.357 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.304 | TFLOPs: 40.54 | 15: iteration 1570/ 125429 | consumed samples: 401920 | consumed tokens: 823132160 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.967584E+00 | grad norm: 0.381 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.734 | TFLOPs: 40.61 | 15: iteration 1580/ 125429 | consumed samples: 404480 | consumed tokens: 828375040 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.988083E+00 | grad norm: 0.329 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.217 | TFLOPs: 40.69 | 15: iteration 1590/ 125429 | consumed samples: 407040 | consumed tokens: 833617920 | elapsed time per iteration (s): 1.11 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.989886E+00 | grad norm: 0.323 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.666 | TFLOPs: 38.28 | 15: iteration 1600/ 125429 | consumed samples: 409600 | consumed tokens: 838860800 | elapsed time per iteration (s): 1.09 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.958012E+00 | grad norm: 0.398 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.893 | TFLOPs: 38.98 | 15: iteration 1610/ 125429 | consumed samples: 412160 | consumed tokens: 844103680 | elapsed time per iteration (s): 1.07 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.940788E+00 | grad norm: 0.372 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.301 | TFLOPs: 39.71 | 15: iteration 1620/ 125429 | consumed samples: 414720 | consumed tokens: 849346560 | elapsed time per iteration (s): 1.06 | learning rate: 2.000E-04 | global 
batch size: 256 | lm loss: 2.938926E+00 | grad norm: 0.324 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.344 | TFLOPs: 40.05 | 15: iteration 1630/ 125429 | consumed samples: 417280 | consumed tokens: 854589440 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.950501E+00 | grad norm: 0.347 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.972 | TFLOPs: 40.15 | 15: iteration 1640/ 125429 | consumed samples: 419840 | consumed tokens: 859832320 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.922293E+00 | grad norm: 0.366 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.311 | TFLOPs: 40.70 | 15: iteration 1650/ 125429 | consumed samples: 422400 | consumed tokens: 865075200 | elapsed time per iteration (s): 1.06 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.980951E+00 | grad norm: 0.347 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.994 | TFLOPs: 39.83 | 15: iteration 1660/ 125429 | consumed samples: 424960 | consumed tokens: 870318080 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.908833E+00 | grad norm: 0.352 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.119 | TFLOPs: 40.51 | 15: iteration 1670/ 125429 | consumed samples: 427520 | consumed tokens: 875560960 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.954279E+00 | grad norm: 0.618 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.124 | TFLOPs: 41.00 | 15: iteration 1680/ 125429 | consumed samples: 430080 | consumed tokens: 
880803840 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.938260E+00 | grad norm: 0.392 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.594 | TFLOPs: 40.75 | 15: iteration 1690/ 125429 | consumed samples: 432640 | consumed tokens: 886046720 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.971734E+00 | grad norm: 0.354 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.110 | TFLOPs: 40.51 | 15: iteration 1700/ 125429 | consumed samples: 435200 | consumed tokens: 891289600 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.960511E+00 | grad norm: 0.361 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.348 | TFLOPs: 40.55 | 15: iteration 1710/ 125429 | consumed samples: 437760 | consumed tokens: 896532480 | elapsed time per iteration (s): 1.06 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.927707E+00 | grad norm: 0.355 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.683 | TFLOPs: 39.94 | 15: iteration 1720/ 125429 | consumed samples: 440320 | consumed tokens: 901775360 | elapsed time per iteration (s): 1.09 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.920874E+00 | grad norm: 0.371 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.057 | TFLOPs: 38.85 | 15: iteration 1730/ 125429 | consumed samples: 442880 | consumed tokens: 907018240 | elapsed time per iteration (s): 1.06 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.910635E+00 | grad norm: 0.397 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.800 | TFLOPs: 
39.79 | 15: iteration 1740/ 125429 | consumed samples: 445440 | consumed tokens: 912261120 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.944188E+00 | grad norm: 0.340 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.506 | TFLOPs: 40.41 | 15: iteration 1750/ 125429 | consumed samples: 448000 | consumed tokens: 917504000 | elapsed time per iteration (s): 1.07 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.917636E+00 | grad norm: 0.344 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.562 | TFLOPs: 39.59 | 15: iteration 1760/ 125429 | consumed samples: 450560 | consumed tokens: 922746880 | elapsed time per iteration (s): 162.65 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.925281E+00 | grad norm: 0.333 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 1.574 | TFLOPs: 0.26 | 15: iteration 1770/ 125429 | consumed samples: 453120 | consumed tokens: 927989760 | elapsed time per iteration (s): 27.84 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.897824E+00 | grad norm: 0.363 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 9.195 | TFLOPs: 1.52 | 15: iteration 1780/ 125429 | consumed samples: 455680 | consumed tokens: 933232640 | elapsed time per iteration (s): 1.02 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.885069E+00 | grad norm: 0.369 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.912 | TFLOPs: 41.63 | 15: iteration 1790/ 125429 | consumed samples: 458240 | consumed tokens: 938475520 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.933791E+00 | grad norm: 0.353 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 247.571 | TFLOPs: 40.91 | 15: iteration 1800/ 125429 | consumed samples: 460800 | consumed tokens: 943718400 | elapsed time per iteration (s): 1.09 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.873332E+00 | grad norm: 0.325 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.567 | TFLOPs: 38.93 | 15: iteration 1810/ 125429 | consumed samples: 463360 | consumed tokens: 948961280 | elapsed time per iteration (s): 1.66 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.901317E+00 | grad norm: 0.322 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 153.958 | TFLOPs: 25.44 | 15: iteration 1820/ 125429 | consumed samples: 465920 | consumed tokens: 954204160 | elapsed time per iteration (s): 1.06 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.882446E+00 | grad norm: 0.328 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.455 | TFLOPs: 40.07 | 15: iteration 1830/ 125429 | consumed samples: 468480 | consumed tokens: 959447040 | elapsed time per iteration (s): 1.06 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.872231E+00 | grad norm: 0.615 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.514 | TFLOPs: 39.91 | 15: iteration 1840/ 125429 | consumed samples: 471040 | consumed tokens: 964689920 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.883634E+00 | grad norm: 0.695 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.834 | TFLOPs: 40.63 | 15: iteration 1850/ 125429 | consumed samples: 473600 | consumed tokens: 969932800 | elapsed time per iteration (s): 1.08 | learning rate: 2.000E-04 | global batch size: 256 | 
lm loss: 2.964803E+00 | grad norm: 1.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.566 | TFLOPs: 39.26 | 15: iteration 1860/ 125429 | consumed samples: 476160 | consumed tokens: 975175680 | elapsed time per iteration (s): 1.06 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.057933E+00 | grad norm: 1.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.832 | TFLOPs: 39.96 | 15: iteration 1870/ 125429 | consumed samples: 478720 | consumed tokens: 980418560 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.061565E+00 | grad norm: 1.063 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.968 | TFLOPs: 40.32 | 15: iteration 1880/ 125429 | consumed samples: 481280 | consumed tokens: 985661440 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.053325E+00 | grad norm: 0.734 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.170 | TFLOPs: 40.35 | 15: iteration 1890/ 125429 | consumed samples: 483840 | consumed tokens: 990904320 | elapsed time per iteration (s): 1.08 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.111176E+00 | grad norm: 0.861 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.007 | TFLOPs: 39.17 | 15: iteration 1900/ 125429 | consumed samples: 486400 | consumed tokens: 996147200 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.958642E+00 | grad norm: 0.387 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.550 | TFLOPs: 40.91 | 15: iteration 1910/ 125429 | consumed samples: 488960 | consumed tokens: 1001390080 | elapsed time 
per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.956628E+00 | grad norm: 0.307 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.495 | TFLOPs: 40.90 | 15: iteration 1920/ 125429 | consumed samples: 491520 | consumed tokens: 1006632960 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.892372E+00 | grad norm: 0.297 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.495 | TFLOPs: 40.40 | 15: iteration 1930/ 125429 | consumed samples: 494080 | consumed tokens: 1011875840 | elapsed time per iteration (s): 1.02 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.889192E+00 | grad norm: 0.270 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.844 | TFLOPs: 41.62 | 15: iteration 1940/ 125429 | consumed samples: 496640 | consumed tokens: 1017118720 | elapsed time per iteration (s): 1.08 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.869718E+00 | grad norm: 0.288 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.646 | TFLOPs: 39.27 | 15: iteration 1950/ 125429 | consumed samples: 499200 | consumed tokens: 1022361600 | elapsed time per iteration (s): 1.63 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.842949E+00 | grad norm: 0.292 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 156.916 | TFLOPs: 25.93 | 15: iteration 1960/ 125429 | consumed samples: 501760 | consumed tokens: 1027604480 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.892638E+00 | grad norm: 0.286 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.250 | TFLOPs: 41.19 | 15: iteration 
1970/ 125429 | consumed samples: 504320 | consumed tokens: 1032847360 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.891969E+00 | grad norm: 0.299 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.829 | TFLOPs: 40.79 | 15: iteration 1980/ 125429 | consumed samples: 506880 | consumed tokens: 1038090240 | elapsed time per iteration (s): 5.51 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.857706E+00 | grad norm: 0.287 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 46.496 | TFLOPs: 7.68 | 15: iteration 1990/ 125429 | consumed samples: 509440 | consumed tokens: 1043333120 | elapsed time per iteration (s): 1.02 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.882189E+00 | grad norm: 0.274 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.799 | TFLOPs: 41.28 | 0: [2022-11-25 20:21:17,032] [INFO] [logging.py:68:log_dist] [Rank 0] step=2000, skipped=0, lr=[0.00019998398327650732, 0.00019998398327650732, 0.00019998398327650732], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 2000/ 125429 | consumed samples: 512000 | consumed tokens: 1048576000 | elapsed time per iteration (s): 1.02 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.868216E+00 | grad norm: 0.287 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.389 | TFLOPs: 41.54 | 0: steps: 2000 loss: 2.9428 iter time (s): 2.037 samples/sec: 125.669 15: ------------------------------------------------------------------------------------------ 15: valid loss at iteration 2000 | lm loss value: 2.755565E+00 | lm loss PPL: 1.572992E+01 | 15: ------------------------------------------------------------------------------------------ 0: saving checkpoint at iteration 2000 to checkpoints_1b5 
0: [2022-11-25 20:21:17,420] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step2000 is begin to save! 0: [2022-11-25 20:21:17,428] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_01-model_00-model_states.pt... 0: [2022-11-25 20:21:17,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_01-model_00-model_states.pt. 0: [2022-11-25 20:21:17,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_03-model_00-model_states.pt... 0: [2022-11-25 20:21:17,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_03-model_00-model_states.pt. 0: [2022-11-25 20:21:17,813] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_04-model_00-model_states.pt... 0: [2022-11-25 20:21:17,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_04-model_00-model_states.pt. 0: [2022-11-25 20:21:17,937] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_05-model_00-model_states.pt... 0: [2022-11-25 20:21:18,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_05-model_00-model_states.pt. 0: [2022-11-25 20:21:18,052] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_06-model_00-model_states.pt... 0: [2022-11-25 20:21:18,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_06-model_00-model_states.pt. 0: [2022-11-25 20:21:18,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_07-model_00-model_states.pt... 0: [2022-11-25 20:21:18,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_07-model_00-model_states.pt. 
0: [2022-11-25 20:21:18,279] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_08-model_00-model_states.pt... 0: [2022-11-25 20:21:18,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_08-model_00-model_states.pt. 0: [2022-11-25 20:21:18,391] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_09-model_00-model_states.pt... 0: [2022-11-25 20:21:18,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_09-model_00-model_states.pt. 0: [2022-11-25 20:21:18,503] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_10-model_00-model_states.pt... 0: [2022-11-25 20:21:18,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_10-model_00-model_states.pt. 0: [2022-11-25 20:21:18,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_11-model_00-model_states.pt... 0: [2022-11-25 20:21:18,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_11-model_00-model_states.pt. 0: [2022-11-25 20:21:18,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_12-model_00-model_states.pt... 0: [2022-11-25 20:21:18,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_12-model_00-model_states.pt. 0: [2022-11-25 20:21:18,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_13-model_00-model_states.pt... 0: [2022-11-25 20:21:18,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_13-model_00-model_states.pt. 
0: [2022-11-25 20:21:18,954] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_14-model_00-model_states.pt... 0: [2022-11-25 20:21:19,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_14-model_00-model_states.pt. 0: [2022-11-25 20:21:19,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_15-model_00-model_states.pt... 0: [2022-11-25 20:21:19,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_15-model_00-model_states.pt. 0: [2022-11-25 20:21:19,173] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_16-model_00-model_states.pt... 0: [2022-11-25 20:21:19,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_16-model_00-model_states.pt. 0: [2022-11-25 20:21:19,286] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_17-model_00-model_states.pt... 0: [2022-11-25 20:21:19,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_17-model_00-model_states.pt. 0: [2022-11-25 20:21:19,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_18-model_00-model_states.pt... 0: [2022-11-25 20:21:19,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_18-model_00-model_states.pt. 0: [2022-11-25 20:21:19,502] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_19-model_00-model_states.pt... 0: [2022-11-25 20:21:19,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_19-model_00-model_states.pt. 
0: [2022-11-25 20:21:19,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_20-model_00-model_states.pt... 0: [2022-11-25 20:21:19,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_20-model_00-model_states.pt. 0: [2022-11-25 20:21:19,927] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_21-model_00-model_states.pt... 0: [2022-11-25 20:21:20,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_21-model_00-model_states.pt. 0: [2022-11-25 20:21:20,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_22-model_00-model_states.pt... 0: [2022-11-25 20:21:20,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_22-model_00-model_states.pt. 0: [2022-11-25 20:21:20,334] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_23-model_00-model_states.pt... 0: [2022-11-25 20:21:20,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_23-model_00-model_states.pt. 0: [2022-11-25 20:21:20,557] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_24-model_00-model_states.pt... 0: [2022-11-25 20:21:20,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_24-model_00-model_states.pt. 0: [2022-11-25 20:21:20,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_25-model_00-model_states.pt... 0: [2022-11-25 20:21:20,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_25-model_00-model_states.pt. 
0: [2022-11-25 20:21:20,981] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_26-model_00-model_states.pt... 0: [2022-11-25 20:21:21,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_26-model_00-model_states.pt. 0: [2022-11-25 20:21:21,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_27-model_00-model_states.pt... 0: [2022-11-25 20:21:21,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_27-model_00-model_states.pt. 0: [2022-11-25 20:21:21,326] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_28-model_00-model_states.pt... 0: [2022-11-25 20:21:21,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_28-model_00-model_states.pt. 0: [2022-11-25 20:21:21,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_29-model_00-model_states.pt... 0: [2022-11-25 20:21:21,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_29-model_00-model_states.pt. 0: [2022-11-25 20:21:21,548] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_30-model_00-model_states.pt... 0: [2022-11-25 20:21:21,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_30-model_00-model_states.pt. 0: [2022-11-25 20:21:21,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/layer_32-model_00-model_states.pt... 0: [2022-11-25 20:21:21,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/layer_32-model_00-model_states.pt. 
0: [2022-11-25 20:21:21,664] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step2000/mp_rank_00_model_states.pt 0: [2022-11-25 20:21:21,664] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/mp_rank_00_model_states.pt... 0: [2022-11-25 20:21:21,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/mp_rank_00_model_states.pt. 0: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
1: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 
5: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
14: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
13: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
12: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
5: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:21:21,708] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:21:21,708] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:21:21,708] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:21:21,708] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:21:21,708] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:21:21,708] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
0: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:21:21,708] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:21:21,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:21:21,708] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step2000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:21:21,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-25 20:21:21,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 20:21:21,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 9: [2022-11-25 20:21:21,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:21:21,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 20:21:21,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 9: [2022-11-25 20:21:21,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:21:21,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 20:21:21,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 20:21:21,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:21:21,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 20:21:21,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 10: [2022-11-25 20:21:21,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-25 20:21:21,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 20:21:21,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 10: [2022-11-25 20:21:21,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:21:21,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 20:21:21,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 20:21:21,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:21:21,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 20:21:21,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 20:21:21,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:21:21,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:21:21,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 20:21:21,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
9: [2022-11-25 20:21:21,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:21:21,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 20:21:21,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 14: [2022-11-25 20:21:21,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:21:21,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 20:21:21,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 20:21:21,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:21:21,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 20:21:21,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 20:21:21,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:21:21,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 20:21:21,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
7: [2022-11-25 20:21:21,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:21:21,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 20:21:21,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 20:21:21,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:21:21,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 20:21:21,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 20:21:21,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:21:21,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 20:21:21,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 20:21:21,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:21:21,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 20:21:21,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
6: [2022-11-25 20:21:21,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:21:21,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:21:21,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 20:21:21,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 6: [2022-11-25 20:21:21,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 20:21:21,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 20:21:21,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:21:21,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 20:21:21,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 11: [2022-11-25 20:21:21,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:21:21,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-25 20:21:21,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 20:21:21,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 10: [2022-11-25 20:21:21,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:21:21,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 20:21:21,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 9: [2022-11-25 20:21:21,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:21:21,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 20:21:21,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 10: [2022-11-25 20:21:21,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:21:21,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 20:21:21,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 20:21:21,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
6: [2022-11-25 20:21:21,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:21:21,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 6: [2022-11-25 20:21:21,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 7: [2022-11-25 20:21:21,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 6: [2022-11-25 20:21:21,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 20:21:21,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:21:21,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 20:21:21,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 13: [2022-11-25 20:21:21,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:21:21,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-25 20:21:21,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 11: [2022-11-25 20:21:21,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 20:21:21,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 11: [2022-11-25 20:21:21,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:21:21,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 20:21:21,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 11: [2022-11-25 20:21:21,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:21:21,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 20:21:21,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: [2022-11-25 20:21:21,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 20:21:21,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-25 20:21:21,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 20:21:21,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 10: [2022-11-25 20:21:21,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:21:21,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:21:21,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 20:21:21,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 20:21:21,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 10: [2022-11-25 20:21:21,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 10: [2022-11-25 20:21:21,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:21:21,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 20:21:21,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 10: [2022-11-25 20:21:21,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-25 20:21:21,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 20:21:21,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 13: [2022-11-25 20:21:21,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 8: [2022-11-25 20:21:21,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:21:21,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 13: [2022-11-25 20:21:21,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:21:21,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 8: [2022-11-25 20:21:21,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 13: [2022-11-25 20:21:21,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 8: [2022-11-25 20:21:21,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 13: [2022-11-25 20:21:21,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-25 20:21:21,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 20:21:21,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 13: [2022-11-25 20:21:21,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:21:21,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 20:21:21,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 6: [2022-11-25 20:21:21,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:21:21,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 20:21:21,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 20:21:21,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:21:21,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 20:21:21,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 15: [2022-11-25 20:21:21,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-25 20:21:21,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:21:21,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:21:21,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:21:21,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:21:21,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 20:21:21,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 20:21:21,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:21:21,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 20:21:21,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 20:21:21,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:21:21,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 20:21:21,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
4: [2022-11-25 20:21:21,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:21:21,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 20:21:21,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 20:21:21,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:21:21,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 20:21:21,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 14: [2022-11-25 20:21:21,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:21:21,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 20:21:21,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 7: [2022-11-25 20:21:21,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:21:21,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 20:21:21,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
14: [2022-11-25 20:21:21,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:21:21,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 20:21:21,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 14: [2022-11-25 20:21:21,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:21:21,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 20:21:21,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 15: [2022-11-25 20:21:21,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:21:21,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-25 20:21:21,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 20:21:21,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 20:21:21,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 20:21:21,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 20:21:21,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 20:21:21,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 15: [2022-11-25 20:21:21,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 20:21:21,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 15: [2022-11-25 20:21:21,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 15: [2022-11-25 20:21:21,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 15: [2022-11-25 20:21:21,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 15: [2022-11-25 20:21:21,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
0: [2022-11-25 20:21:21,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 20:21:21,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 8: [2022-11-25 20:21:21,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:21:21,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 20:21:21,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 13: [2022-11-25 20:21:21,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:21:21,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 20:21:21,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 13: [2022-11-25 20:21:21,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:21:21,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 20:21:21,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 20:21:21,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-25 20:21:21,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 20:21:21,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 11: [2022-11-25 20:21:21,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:21:21,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:21:21,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 20:21:21,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 20:21:21,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:21:21,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 20:21:21,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 11: [2022-11-25 20:21:21,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:21:21,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 20:21:21,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
11: [2022-11-25 20:21:21,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:21:21,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 20:21:21,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 20:21:21,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:21:21,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:21:21,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:21:21,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 20:21:21,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 20:21:21,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 20:21:21,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 20:21:21,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 20:21:21,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
6: [2022-11-25 20:21:21,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:21:21,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 20:21:21,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 20:21:21,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 20:21:21,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 14: [2022-11-25 20:21:21,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:21:21,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 20:21:21,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 6: [2022-11-25 20:21:21,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:21:21,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 20:21:21,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 13: [2022-11-25 20:21:21,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
7: [2022-11-25 20:21:21,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:21:21,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 20:21:21,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 6: [2022-11-25 20:21:21,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:21:21,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:21:21,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 20:21:21,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 20:21:21,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 6: [2022-11-25 20:21:21,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 11: [2022-11-25 20:21:21,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:21:21,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-25 20:21:21,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 20:21:21,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 14: [2022-11-25 20:21:21,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:21:21,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 20:21:21,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 13: [2022-11-25 20:21:21,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 8: [2022-11-25 20:21:21,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:21:21,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 11: [2022-11-25 20:21:21,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 20:21:21,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 11: [2022-11-25 20:21:21,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-25 20:21:21,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 20:21:21,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 8: [2022-11-25 20:21:21,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 20:21:21,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 20:21:22,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:21:22,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:21:22,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:21:22,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:21:22,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:21:22,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:21:22,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:21:22,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-25 20:21:22,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 20:21:22,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 20:21:22,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 20:21:22,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 20:21:22,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 20:21:22,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 20:21:22,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 20:21:22,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 20:21:22,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 20:21:22,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 20:21:22,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 20:21:22,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
1: [2022-11-25 20:21:22,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 20:21:22,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 15: [2022-11-25 20:21:22,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:21:22,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 20:21:22,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 8: [2022-11-25 20:21:22,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:21:22,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 20:21:22,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 20:21:22,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:21:22,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 20:21:22,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
13: [2022-11-25 20:21:22,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 20:21:22,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 20:21:22,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:21:22,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:21:22,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 20:21:22,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 20:21:22,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:21:22,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:21:22,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 20:21:22,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 20:21:22,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 20:21:22,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
1: [2022-11-25 20:21:22,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 20:21:22,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 9: [2022-11-25 20:21:22,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:21:22,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 20:21:22,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 8: [2022-11-25 20:21:22,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:21:22,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 20:21:22,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 14: [2022-11-25 20:21:22,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:21:22,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 20:21:22,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 20:21:22,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-25 20:21:22,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 20:21:22,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 20:21:22,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:21:22,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 20:21:22,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 20:21:22,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:21:22,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 20:21:22,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 8: [2022-11-25 20:21:22,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:21:22,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 20:21:22,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 8: [2022-11-25 20:21:22,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-25 20:21:22,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 20:21:22,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 12: [2022-11-25 20:21:22,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:21:22,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:21:22,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:21:22,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:21:22,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:21:22,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:21:22,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:21:22,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-25 20:21:22,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 20:21:22,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 20:21:22,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 20:21:22,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 20:21:22,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 20:21:22,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 20:21:22,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 20:21:22,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 20:21:22,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 12: [2022-11-25 20:21:22,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 12: [2022-11-25 20:21:22,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 12: [2022-11-25 20:21:22,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
12: [2022-11-25 20:21:22,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 12: [2022-11-25 20:21:22,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 12: [2022-11-25 20:21:22,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 12: [2022-11-25 20:21:22,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 4: [2022-11-25 20:21:22,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:21:22,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 20:21:22,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 1: [2022-11-25 20:21:22,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:21:22,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 20:21:22,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 20:21:22,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:21:22,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 20:21:22,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
15: [2022-11-25 20:21:22,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:21:22,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 20:21:22,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 20:21:22,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:21:22,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:21:22,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:21:22,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:21:22,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:21:22,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:21:22,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-25 20:21:22,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 20:21:22,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 20:21:22,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 20:21:22,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 20:21:22,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 20:21:22,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 20:21:22,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 20:21:22,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 20:21:22,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 20:21:22,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 20:21:22,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 20:21:22,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 
2: [2022-11-25 20:21:22,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 20:21:22,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 2: [2022-11-25 20:21:22,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:21:22,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 20:21:22,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 20:21:22,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:21:22,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 20:21:22,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 5: [2022-11-25 20:21:22,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:21:22,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 20:21:22,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 3: [2022-11-25 20:21:22,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-25 20:21:22,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step2000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 20:21:22,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step2000 is ready now! 0: successfully saved checkpoint at iteration 2000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 4786.49 15: iteration 2010/ 125429 | consumed samples: 514560 | consumed tokens: 1053818880 | elapsed time per iteration (s): 1.52 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.876032E+00 | grad norm: 0.300 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 168.042 | TFLOPs: 27.77 | 15: iteration 2020/ 125429 | consumed samples: 517120 | consumed tokens: 1059061760 | elapsed time per iteration (s): 1.02 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.888860E+00 | grad norm: 0.293 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.792 | TFLOPs: 41.45 | 15: iteration 2030/ 125429 | consumed samples: 519680 | consumed tokens: 1064304640 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.812085E+00 | grad norm: 0.308 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.702 | TFLOPs: 40.77 | 15: iteration 2040/ 125429 | consumed samples: 522240 | consumed tokens: 1069547520 | elapsed time per iteration (s): 1.08 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.839830E+00 | grad norm: 0.359 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.671 | TFLOPs: 39.11 | 15: iteration 2050/ 125429 | consumed samples: 524800 | consumed tokens: 1074790400 | elapsed time per iteration (s): 1.06 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 
2.870951E+00 | grad norm: 0.323 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.440 | TFLOPs: 39.73 | 15: iteration 2060/ 125429 | consumed samples: 527360 | consumed tokens: 1080033280 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.839783E+00 | grad norm: 0.297 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.603 | TFLOPs: 40.92 | 15: iteration 2070/ 125429 | consumed samples: 529920 | consumed tokens: 1085276160 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.816309E+00 | grad norm: 0.282 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.880 | TFLOPs: 40.30 | 15: iteration 2080/ 125429 | consumed samples: 532480 | consumed tokens: 1090519040 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.831155E+00 | grad norm: 0.295 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.155 | TFLOPs: 41.01 | 15: iteration 2090/ 125429 | consumed samples: 535040 | consumed tokens: 1095761920 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.821044E+00 | grad norm: 0.287 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.293 | TFLOPs: 40.54 | 15: iteration 2100/ 125429 | consumed samples: 537600 | consumed tokens: 1101004800 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.812637E+00 | grad norm: 1.042 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.463 | TFLOPs: 41.23 | 15: iteration 2110/ 125429 | consumed samples: 540160 | consumed tokens: 1106247680 | elapsed time per 
iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.804312E+00 | grad norm: 0.653 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.984 | TFLOPs: 40.16 | 15: iteration 2120/ 125429 | consumed samples: 542720 | consumed tokens: 1111490560 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.827006E+00 | grad norm: 0.378 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.899 | TFLOPs: 41.13 | 15: iteration 2130/ 125429 | consumed samples: 545280 | consumed tokens: 1116733440 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.866886E+00 | grad norm: 0.595 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.782 | TFLOPs: 40.78 | 15: iteration 2140/ 125429 | consumed samples: 547840 | consumed tokens: 1121976320 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.840657E+00 | grad norm: 0.296 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.468 | TFLOPs: 40.23 | 15: iteration 2150/ 125429 | consumed samples: 550400 | consumed tokens: 1127219200 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.811992E+00 | grad norm: 0.278 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.906 | TFLOPs: 41.13 | 15: iteration 2160/ 125429 | consumed samples: 552960 | consumed tokens: 1132462080 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.806549E+00 | grad norm: 0.275 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.362 | TFLOPs: 41.21 | 15: iteration 2170/ 
125429 | consumed samples: 555520 | consumed tokens: 1137704960 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.818031E+00 | grad norm: 0.285 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.687 | TFLOPs: 40.77 | 15: iteration 2180/ 125429 | consumed samples: 558080 | consumed tokens: 1142947840 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.823749E+00 | grad norm: 0.296 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.992 | TFLOPs: 40.82 | 15: iteration 2190/ 125429 | consumed samples: 560640 | consumed tokens: 1148190720 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.799142E+00 | grad norm: 0.283 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.099 | TFLOPs: 40.34 | 15: iteration 2200/ 125429 | consumed samples: 563200 | consumed tokens: 1153433600 | elapsed time per iteration (s): 1.07 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.822140E+00 | grad norm: 0.287 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.677 | TFLOPs: 39.44 | 15: iteration 2210/ 125429 | consumed samples: 565760 | consumed tokens: 1158676480 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.772568E+00 | grad norm: 0.264 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.283 | TFLOPs: 40.70 | 15: iteration 2220/ 125429 | consumed samples: 568320 | consumed tokens: 1163919360 | elapsed time per iteration (s): 1.08 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.836100E+00 | grad norm: 0.314 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 237.001 | TFLOPs: 39.17 | 15: iteration 2230/ 125429 | consumed samples: 570880 | consumed tokens: 1169162240 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.789077E+00 | grad norm: 0.279 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.151 | TFLOPs: 40.18 | 15: iteration 2240/ 125429 | consumed samples: 573440 | consumed tokens: 1174405120 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.813329E+00 | grad norm: 0.271 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.473 | TFLOPs: 40.90 | 15: iteration 2250/ 125429 | consumed samples: 576000 | consumed tokens: 1179648000 | elapsed time per iteration (s): 1.10 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.773615E+00 | grad norm: 0.293 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.523 | TFLOPs: 38.43 | 15: iteration 2260/ 125429 | consumed samples: 578560 | consumed tokens: 1184890880 | elapsed time per iteration (s): 1.09 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.741712E+00 | grad norm: 0.285 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.152 | TFLOPs: 38.70 | 15: iteration 2270/ 125429 | consumed samples: 581120 | consumed tokens: 1190133760 | elapsed time per iteration (s): 1.08 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.744219E+00 | grad norm: 0.394 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.377 | TFLOPs: 39.23 | 15: iteration 2280/ 125429 | consumed samples: 583680 | consumed tokens: 1195376640 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.767163E+00 | 
grad norm: 0.306 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.081 | TFLOPs: 40.34 | 15: iteration 2290/ 125429 | consumed samples: 586240 | consumed tokens: 1200619520 | elapsed time per iteration (s): 1.07 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.795622E+00 | grad norm: 0.316 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.344 | TFLOPs: 39.55 | 15: iteration 2300/ 125429 | consumed samples: 588800 | consumed tokens: 1205862400 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.769790E+00 | grad norm: 0.292 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.978 | TFLOPs: 40.81 | 15: iteration 2310/ 125429 | consumed samples: 591360 | consumed tokens: 1211105280 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.791859E+00 | grad norm: 0.310 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.136 | TFLOPs: 41.01 | 15: iteration 2320/ 125429 | consumed samples: 593920 | consumed tokens: 1216348160 | elapsed time per iteration (s): 1.07 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.780599E+00 | grad norm: 0.267 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.500 | TFLOPs: 39.58 | 15: iteration 2330/ 125429 | consumed samples: 596480 | consumed tokens: 1221591040 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.710833E+00 | grad norm: 0.271 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.039 | TFLOPs: 40.16 | 15: iteration 2340/ 125429 | consumed samples: 599040 | consumed tokens: 1226833920 | elapsed time per iteration (s): 
1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.723712E+00 | grad norm: 0.310 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.780 | TFLOPs: 40.95 | 15: iteration 2350/ 125429 | consumed samples: 601600 | consumed tokens: 1232076800 | elapsed time per iteration (s): 1.07 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.751819E+00 | grad norm: 0.263 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.957 | TFLOPs: 39.65 | 15: iteration 2360/ 125429 | consumed samples: 604160 | consumed tokens: 1237319680 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.769155E+00 | grad norm: 0.324 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.677 | TFLOPs: 40.43 | 15: iteration 2370/ 125429 | consumed samples: 606720 | consumed tokens: 1242562560 | elapsed time per iteration (s): 1.07 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.782604E+00 | grad norm: 0.507 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.816 | TFLOPs: 39.63 | 15: iteration 2380/ 125429 | consumed samples: 609280 | consumed tokens: 1247805440 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.767471E+00 | grad norm: 0.300 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.202 | TFLOPs: 41.02 | 15: iteration 2390/ 125429 | consumed samples: 611840 | consumed tokens: 1253048320 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.749207E+00 | grad norm: 0.267 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.150 | TFLOPs: 41.01 | 15: iteration 2400/ 125429 | 
consumed samples: 614400 | consumed tokens: 1258291200 | elapsed time per iteration (s): 1.08 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.790803E+00 | grad norm: 0.281 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.239 | TFLOPs: 39.21 | 15: iteration 2410/ 125429 | consumed samples: 616960 | consumed tokens: 1263534080 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.750640E+00 | grad norm: 0.270 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.930 | TFLOPs: 41.14 | 15: iteration 2420/ 125429 | consumed samples: 619520 | consumed tokens: 1268776960 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.742743E+00 | grad norm: 0.261 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.006 | TFLOPs: 41.15 | 15: iteration 2430/ 125429 | consumed samples: 622080 | consumed tokens: 1274019840 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.746911E+00 | grad norm: 0.261 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.788 | TFLOPs: 40.62 | 15: iteration 2440/ 125429 | consumed samples: 624640 | consumed tokens: 1279262720 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.719992E+00 | grad norm: 0.355 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.003 | TFLOPs: 40.32 | 15: iteration 2450/ 125429 | consumed samples: 627200 | consumed tokens: 1284505600 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.767698E+00 | grad norm: 0.984 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 247.383 | TFLOPs: 40.88 | 15: iteration 2460/ 125429 | consumed samples: 629760 | consumed tokens: 1289748480 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.963010E+00 | grad norm: 2.103 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.423 | TFLOPs: 40.23 | 15: iteration 2470/ 125429 | consumed samples: 632320 | consumed tokens: 1294991360 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 3.047968E+00 | grad norm: 1.025 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.762 | TFLOPs: 40.45 | 15: iteration 2480/ 125429 | consumed samples: 634880 | consumed tokens: 1300234240 | elapsed time per iteration (s): 1.07 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.920675E+00 | grad norm: 0.408 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.152 | TFLOPs: 39.69 | 15: iteration 2490/ 125429 | consumed samples: 637440 | consumed tokens: 1305477120 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.813557E+00 | grad norm: 0.303 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.843 | TFLOPs: 40.79 | 15: iteration 2500/ 125429 | consumed samples: 640000 | consumed tokens: 1310720000 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.834762E+00 | grad norm: 0.389 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.480 | TFLOPs: 40.40 | 15: iteration 2510/ 125429 | consumed samples: 642560 | consumed tokens: 1315962880 | elapsed time per iteration (s): 1.06 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.812403E+00 | 
grad norm: 0.287 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.587 | TFLOPs: 39.92 | 15: iteration 2520/ 125429 | consumed samples: 645120 | consumed tokens: 1321205760 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.752736E+00 | grad norm: 0.247 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.761 | TFLOPs: 40.12 | 15: iteration 2530/ 125429 | consumed samples: 647680 | consumed tokens: 1326448640 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.757143E+00 | grad norm: 0.262 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.470 | TFLOPs: 40.90 | 15: iteration 2540/ 125429 | consumed samples: 650240 | consumed tokens: 1331691520 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.743288E+00 | grad norm: 0.311 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.184 | TFLOPs: 40.85 | 15: iteration 2550/ 125429 | consumed samples: 652800 | consumed tokens: 1336934400 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.721430E+00 | grad norm: 0.273 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.190 | TFLOPs: 41.18 | 15: iteration 2560/ 125429 | consumed samples: 655360 | consumed tokens: 1342177280 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.724243E+00 | grad norm: 0.277 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.892 | TFLOPs: 40.14 | 15: iteration 2570/ 125429 | consumed samples: 657920 | consumed tokens: 1347420160 | elapsed time per iteration (s): 
1.02 | learning rate: 2.000E-04 | global batch size: 256 | lm loss: 2.733447E+00 | grad norm: 0.268 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.651 | TFLOPs: 41.42 | 15: iteration 2580/ 125429 | consumed samples: 660480 | consumed tokens: 1352663040 | elapsed time per iteration (s): 1.04 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.734911E+00 | grad norm: 0.263 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.443 | TFLOPs: 40.73 | 15: iteration 2590/ 125429 | consumed samples: 663040 | consumed tokens: 1357905920 | elapsed time per iteration (s): 1.04 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.754182E+00 | grad norm: 0.260 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.802 | TFLOPs: 40.62 | 15: iteration 2600/ 125429 | consumed samples: 665600 | consumed tokens: 1363148800 | elapsed time per iteration (s): 1.04 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.700789E+00 | grad norm: 0.262 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.118 | TFLOPs: 40.67 | 15: iteration 2610/ 125429 | consumed samples: 668160 | consumed tokens: 1368391680 | elapsed time per iteration (s): 1.04 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.735114E+00 | grad norm: 0.254 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.137 | TFLOPs: 40.68 | 15: iteration 2620/ 125429 | consumed samples: 670720 | consumed tokens: 1373634560 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.741293E+00 | grad norm: 0.263 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.885 | TFLOPs: 41.13 | 15: iteration 2630/ 125429 | 
consumed samples: 673280 | consumed tokens: 1378877440 | elapsed time per iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.713603E+00 | grad norm: 0.260 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.622 | TFLOPs: 40.43 | 15: iteration 2640/ 125429 | consumed samples: 675840 | consumed tokens: 1384120320 | elapsed time per iteration (s): 1.04 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.757685E+00 | grad norm: 0.256 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.867 | TFLOPs: 40.80 | 15: iteration 2650/ 125429 | consumed samples: 678400 | consumed tokens: 1389363200 | elapsed time per iteration (s): 1.04 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.729668E+00 | grad norm: 0.309 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.393 | TFLOPs: 40.55 | 15: iteration 2660/ 125429 | consumed samples: 680960 | consumed tokens: 1394606080 | elapsed time per iteration (s): 1.10 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.703207E+00 | grad norm: 0.261 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.804 | TFLOPs: 38.31 | 15: iteration 2670/ 125429 | consumed samples: 683520 | consumed tokens: 1399848960 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.709284E+00 | grad norm: 0.260 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.434 | TFLOPs: 40.89 | 15: iteration 2680/ 125429 | consumed samples: 686080 | consumed tokens: 1405091840 | elapsed time per iteration (s): 1.02 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.672961E+00 | grad norm: 0.263 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 250.288 | TFLOPs: 41.36 | 15: iteration 2690/ 125429 | consumed samples: 688640 | consumed tokens: 1410334720 | elapsed time per iteration (s): 1.07 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.696598E+00 | grad norm: 0.256 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.120 | TFLOPs: 39.68 | 15: iteration 2700/ 125429 | consumed samples: 691200 | consumed tokens: 1415577600 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.707579E+00 | grad norm: 0.260 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.454 | TFLOPs: 40.89 | 15: iteration 2710/ 125429 | consumed samples: 693760 | consumed tokens: 1420820480 | elapsed time per iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.667362E+00 | grad norm: 0.257 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.319 | TFLOPs: 40.38 | 15: iteration 2720/ 125429 | consumed samples: 696320 | consumed tokens: 1426063360 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.724038E+00 | grad norm: 0.251 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.421 | TFLOPs: 41.22 | 15: iteration 2730/ 125429 | consumed samples: 698880 | consumed tokens: 1431306240 | elapsed time per iteration (s): 1.06 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.681181E+00 | grad norm: 0.252 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.280 | TFLOPs: 40.04 | 15: iteration 2740/ 125429 | consumed samples: 701440 | consumed tokens: 1436549120 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.657249E+00 | 
grad norm: 0.253 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.462 | TFLOPs: 41.06 | 15: iteration 2750/ 125429 | consumed samples: 704000 | consumed tokens: 1441792000 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.666069E+00 | grad norm: 0.264 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.887 | TFLOPs: 41.13 | 15: iteration 2760/ 125429 | consumed samples: 706560 | consumed tokens: 1447034880 | elapsed time per iteration (s): 1.06 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.678693E+00 | grad norm: 0.259 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.568 | TFLOPs: 39.76 | 15: iteration 2770/ 125429 | consumed samples: 709120 | consumed tokens: 1452277760 | elapsed time per iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.701950E+00 | grad norm: 0.260 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.934 | TFLOPs: 40.48 | 15: iteration 2780/ 125429 | consumed samples: 711680 | consumed tokens: 1457520640 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.690622E+00 | grad norm: 0.265 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.675 | TFLOPs: 40.93 | 15: iteration 2790/ 125429 | consumed samples: 714240 | consumed tokens: 1462763520 | elapsed time per iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.702385E+00 | grad norm: 0.268 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.142 | TFLOPs: 40.18 | 15: iteration 2800/ 125429 | consumed samples: 716800 | consumed tokens: 1468006400 | elapsed time per iteration (s): 
1.04 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.722875E+00 | grad norm: 0.256 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.829 | TFLOPs: 40.63 | 15: iteration 2810/ 125429 | consumed samples: 719360 | consumed tokens: 1473249280 | elapsed time per iteration (s): 1.06 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.642575E+00 | grad norm: 0.243 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.393 | TFLOPs: 40.06 | 15: iteration 2820/ 125429 | consumed samples: 721920 | consumed tokens: 1478492160 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.713928E+00 | grad norm: 0.247 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.715 | TFLOPs: 41.27 | 15: iteration 2830/ 125429 | consumed samples: 724480 | consumed tokens: 1483735040 | elapsed time per iteration (s): 1.07 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.706139E+00 | grad norm: 0.290 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.045 | TFLOPs: 39.50 | 15: iteration 2840/ 125429 | consumed samples: 727040 | consumed tokens: 1488977920 | elapsed time per iteration (s): 1.06 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.662421E+00 | grad norm: 0.244 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.376 | TFLOPs: 39.89 | 15: iteration 2850/ 125429 | consumed samples: 729600 | consumed tokens: 1494220800 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.684141E+00 | grad norm: 0.253 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.113 | TFLOPs: 41.00 | 15: iteration 2860/ 125429 | 
consumed samples: 732160 | consumed tokens: 1499463680 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.678798E+00 | grad norm: 0.241 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.266 | TFLOPs: 41.19 | 15: iteration 2870/ 125429 | consumed samples: 734720 | consumed tokens: 1504706560 | elapsed time per iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.688556E+00 | grad norm: 0.244 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.955 | TFLOPs: 40.48 | 15: iteration 2880/ 125429 | consumed samples: 737280 | consumed tokens: 1509949440 | elapsed time per iteration (s): 1.12 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.677081E+00 | grad norm: 0.255 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.221 | TFLOPs: 37.72 | 15: iteration 2890/ 125429 | consumed samples: 739840 | consumed tokens: 1515192320 | elapsed time per iteration (s): 2.68 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.659905E+00 | grad norm: 0.250 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 95.568 | TFLOPs: 15.79 | 15: iteration 2900/ 125429 | consumed samples: 742400 | consumed tokens: 1520435200 | elapsed time per iteration (s): 1.04 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.665511E+00 | grad norm: 0.304 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.130 | TFLOPs: 40.84 | 15: iteration 2910/ 125429 | consumed samples: 744960 | consumed tokens: 1525678080 | elapsed time per iteration (s): 1.02 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.676716E+00 | grad norm: 0.269 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 250.494 | TFLOPs: 41.40 | 15: iteration 2920/ 125429 | consumed samples: 747520 | consumed tokens: 1530920960 | elapsed time per iteration (s): 2.29 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.641632E+00 | grad norm: 0.261 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 111.614 | TFLOPs: 18.45 | 15: iteration 2930/ 125429 | consumed samples: 750080 | consumed tokens: 1536163840 | elapsed time per iteration (s): 1.06 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.634275E+00 | grad norm: 0.276 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.384 | TFLOPs: 40.06 | 15: iteration 2940/ 125429 | consumed samples: 752640 | consumed tokens: 1541406720 | elapsed time per iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.641375E+00 | grad norm: 0.255 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.073 | TFLOPs: 40.17 | 15: iteration 2950/ 125429 | consumed samples: 755200 | consumed tokens: 1546649600 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.632627E+00 | grad norm: 0.242 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.919 | TFLOPs: 40.97 | 15: iteration 2960/ 125429 | consumed samples: 757760 | consumed tokens: 1551892480 | elapsed time per iteration (s): 1.04 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.682094E+00 | grad norm: 0.230 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.607 | TFLOPs: 40.59 | 15: iteration 2970/ 125429 | consumed samples: 760320 | consumed tokens: 1557135360 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.688406E+00 | 
grad norm: 0.245 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.753 | TFLOPs: 41.27 | 15: iteration 2980/ 125429 | consumed samples: 762880 | consumed tokens: 1562378240 | elapsed time per iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.662352E+00 | grad norm: 0.249 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.402 | TFLOPs: 40.22 | 15: iteration 2990/ 125429 | consumed samples: 765440 | consumed tokens: 1567621120 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.652304E+00 | grad norm: 0.253 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.818 | TFLOPs: 41.12 | 15: iteration 3000/ 125429 | consumed samples: 768000 | consumed tokens: 1572864000 | elapsed time per iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.654334E+00 | grad norm: 0.265 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.944 | TFLOPs: 40.31 | 15: ------------------------------------------------------------------------------------------ 15: valid loss at iteration 3000 | lm loss value: 2.603257E+00 | lm loss PPL: 1.350767E+01 | 15: ------------------------------------------------------------------------------------------ 0: saving checkpoint at iteration 3000 to checkpoints_1b5 0: [2022-11-25 20:39:17,498] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step3000 is begin to save! 0: [2022-11-25 20:39:17,505] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_01-model_00-model_states.pt... 0: [2022-11-25 20:39:17,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_01-model_00-model_states.pt. 
0: [2022-11-25 20:39:17,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_03-model_00-model_states.pt... 0: [2022-11-25 20:39:17,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_03-model_00-model_states.pt. 0: [2022-11-25 20:39:17,835] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_04-model_00-model_states.pt... 0: [2022-11-25 20:39:17,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_04-model_00-model_states.pt. 0: [2022-11-25 20:39:17,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_05-model_00-model_states.pt... 0: [2022-11-25 20:39:18,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_05-model_00-model_states.pt. 0: [2022-11-25 20:39:18,039] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_06-model_00-model_states.pt... 0: [2022-11-25 20:39:18,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_06-model_00-model_states.pt. 0: [2022-11-25 20:39:18,142] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_07-model_00-model_states.pt... 0: [2022-11-25 20:39:18,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_07-model_00-model_states.pt. 0: [2022-11-25 20:39:18,243] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_08-model_00-model_states.pt... 0: [2022-11-25 20:39:18,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_08-model_00-model_states.pt. 
0: [2022-11-25 20:39:18,346] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_09-model_00-model_states.pt... 0: [2022-11-25 20:39:18,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_09-model_00-model_states.pt. 0: [2022-11-25 20:39:18,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_10-model_00-model_states.pt... 0: [2022-11-25 20:39:18,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_10-model_00-model_states.pt. 0: [2022-11-25 20:39:18,553] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_11-model_00-model_states.pt... 0: [2022-11-25 20:39:18,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_11-model_00-model_states.pt. 0: [2022-11-25 20:39:18,653] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_12-model_00-model_states.pt... 0: [2022-11-25 20:39:18,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_12-model_00-model_states.pt. 0: [2022-11-25 20:39:18,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_13-model_00-model_states.pt... 0: [2022-11-25 20:39:18,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_13-model_00-model_states.pt. 0: [2022-11-25 20:39:18,856] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_14-model_00-model_states.pt... 0: [2022-11-25 20:39:18,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_14-model_00-model_states.pt. 
0: [2022-11-25 20:39:18,955] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_15-model_00-model_states.pt... 0: [2022-11-25 20:39:19,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_15-model_00-model_states.pt. 0: [2022-11-25 20:39:19,057] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_16-model_00-model_states.pt... 0: [2022-11-25 20:39:19,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_16-model_00-model_states.pt. 0: [2022-11-25 20:39:19,156] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_17-model_00-model_states.pt... 0: [2022-11-25 20:39:19,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_17-model_00-model_states.pt. 0: [2022-11-25 20:39:19,256] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_18-model_00-model_states.pt... 0: [2022-11-25 20:39:19,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_18-model_00-model_states.pt. 0: [2022-11-25 20:39:19,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_19-model_00-model_states.pt... 0: [2022-11-25 20:39:19,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_19-model_00-model_states.pt. 0: [2022-11-25 20:39:19,459] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_20-model_00-model_states.pt... 0: [2022-11-25 20:39:19,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_20-model_00-model_states.pt. 
0: [2022-11-25 20:39:19,561] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_21-model_00-model_states.pt... 0: [2022-11-25 20:39:19,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_21-model_00-model_states.pt. 0: [2022-11-25 20:39:19,665] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_22-model_00-model_states.pt... 0: [2022-11-25 20:39:19,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_22-model_00-model_states.pt. 0: [2022-11-25 20:39:19,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_23-model_00-model_states.pt... 0: [2022-11-25 20:39:19,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_23-model_00-model_states.pt. 0: [2022-11-25 20:39:19,864] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_24-model_00-model_states.pt... 0: [2022-11-25 20:39:19,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_24-model_00-model_states.pt. 0: [2022-11-25 20:39:19,967] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_25-model_00-model_states.pt... 0: [2022-11-25 20:39:20,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_25-model_00-model_states.pt. 0: [2022-11-25 20:39:20,070] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_26-model_00-model_states.pt... 0: [2022-11-25 20:39:20,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_26-model_00-model_states.pt. 
0: [2022-11-25 20:39:20,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_27-model_00-model_states.pt... 0: [2022-11-25 20:39:20,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_27-model_00-model_states.pt. 0: [2022-11-25 20:39:20,273] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_28-model_00-model_states.pt... 0: [2022-11-25 20:39:20,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_28-model_00-model_states.pt. 0: [2022-11-25 20:39:20,374] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_29-model_00-model_states.pt... 0: [2022-11-25 20:39:20,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_29-model_00-model_states.pt. 0: [2022-11-25 20:39:20,478] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_30-model_00-model_states.pt... 0: [2022-11-25 20:39:20,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_30-model_00-model_states.pt. 0: [2022-11-25 20:39:20,578] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/layer_32-model_00-model_states.pt... 0: [2022-11-25 20:39:20,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/layer_32-model_00-model_states.pt. 0: [2022-11-25 20:39:20,585] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step3000/mp_rank_00_model_states.pt 0: [2022-11-25 20:39:20,585] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/mp_rank_00_model_states.pt... 
0: [2022-11-25 20:39:20,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/mp_rank_00_model_states.pt. 0: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
5: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
12: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
10: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
9: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
6: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
2: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
8: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
7: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
13: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:39:20,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step3000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:39:20,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:39:20,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-25 20:39:20,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 20:39:20,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 20:39:20,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:39:20,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 20:39:20,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 20:39:20,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:39:20,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 20:39:20,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 20:39:20,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:39:20,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 20:39:20,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 20:39:20,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-25 20:39:20,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 20:39:20,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 20:39:20,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:39:20,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 20:39:20,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 20:39:20,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:39:20,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 20:39:20,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 20:39:20,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:39:20,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 20:39:20,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 20:39:20,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-25 20:39:20,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 20:39:20,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 12: [2022-11-25 20:39:20,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:39:20,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 20:39:20,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:39:20,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 12: [2022-11-25 20:39:20,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 20:39:20,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 15: [2022-11-25 20:39:20,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 20:39:20,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 15: [2022-11-25 20:39:20,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-25 20:39:20,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 20:39:20,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 15: [2022-11-25 20:39:20,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:39:20,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 20:39:20,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 15: [2022-11-25 20:39:20,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:39:20,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 20:39:20,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 15: [2022-11-25 20:39:20,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:39:20,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 20:39:20,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 15: [2022-11-25 20:39:20,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-25 20:39:20,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 20:39:20,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 15: [2022-11-25 20:39:20,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:39:20,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:39:20,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 20:39:20,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 15: [2022-11-25 20:39:20,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 20:39:20,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 12: [2022-11-25 20:39:20,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:39:20,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 20:39:20,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 12: [2022-11-25 20:39:20,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-25 20:39:20,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 20:39:20,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 12: [2022-11-25 20:39:20,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:39:20,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 20:39:20,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 20:39:20,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:39:20,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 20:39:20,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 12: [2022-11-25 20:39:20,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:39:20,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-25 20:39:20,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 20:39:20,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 20:39:20,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 12: [2022-11-25 20:39:20,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 12: [2022-11-25 20:39:20,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:39:20,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 20:39:20,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 20:39:20,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:39:20,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 20:39:20,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 20:39:20,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-25 20:39:20,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 20:39:20,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 20:39:20,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:39:20,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 20:39:20,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 20:39:20,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:39:20,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 20:39:20,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 20:39:20,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:39:20,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 20:39:20,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 20:39:20,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-25 20:39:20,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 20:39:20,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 20:39:20,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:39:20,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:39:20,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 20:39:20,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 14: [2022-11-25 20:39:20,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:39:20,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:39:20,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:39:20,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:39:20,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:39:20,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-25 20:39:20,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 20:39:20,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 20:39:20,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 20:39:20,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 20:39:20,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 20:39:20,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 14: [2022-11-25 20:39:20,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 20:39:20,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 14: [2022-11-25 20:39:20,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 14: [2022-11-25 20:39:20,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 14: [2022-11-25 20:39:20,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 14: [2022-11-25 20:39:20,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
14: [2022-11-25 20:39:20,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:39:20,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 20:39:20,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 20:39:20,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:39:20,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 20:39:20,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 8: [2022-11-25 20:39:20,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:39:20,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 20:39:20,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 8: [2022-11-25 20:39:20,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:39:20,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 20:39:20,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
2: [2022-11-25 20:39:20,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:39:20,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 20:39:20,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 20:39:20,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:39:20,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 20:39:20,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 8: [2022-11-25 20:39:20,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:39:20,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:39:20,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 20:39:20,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 20:39:20,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 8: [2022-11-25 20:39:20,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
0: [2022-11-25 20:39:20,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 20:39:20,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 8: [2022-11-25 20:39:20,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:39:20,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 20:39:20,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 20:39:20,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:39:20,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 20:39:20,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 14: [2022-11-25 20:39:20,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:39:20,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 20:39:20,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 20:39:20,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-25 20:39:20,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 20:39:20,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 10: [2022-11-25 20:39:20,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:39:20,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:39:20,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:39:20,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:39:20,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:39:20,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:39:20,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-25 20:39:20,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 20:39:20,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 20:39:20,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 20:39:20,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 20:39:20,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 20:39:20,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 20:39:20,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 20:39:20,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 10: [2022-11-25 20:39:20,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 10: [2022-11-25 20:39:20,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 10: [2022-11-25 20:39:20,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 10: [2022-11-25 20:39:20,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
10: [2022-11-25 20:39:20,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 10: [2022-11-25 20:39:20,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 20:39:20,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:39:20,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 20:39:20,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 20:39:20,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:39:20,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 13: [2022-11-25 20:39:20,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:39:20,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:39:20,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:39:20,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:39:20,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
13: [2022-11-25 20:39:20,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:39:20,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 20:39:20,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 20:39:20,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 20:39:20,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 20:39:20,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 13: [2022-11-25 20:39:20,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 13: [2022-11-25 20:39:20,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 20:39:20,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 13: [2022-11-25 20:39:20,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 13: [2022-11-25 20:39:20,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 13: [2022-11-25 20:39:20,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-25 20:39:20,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 20:39:20,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 13: [2022-11-25 20:39:20,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:39:20,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 20:39:20,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 13: [2022-11-25 20:39:20,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:39:20,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 20:39:20,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 11: [2022-11-25 20:39:20,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:39:20,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:39:20,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-25 20:39:20,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 20:39:20,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:39:20,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:39:20,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:39:20,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:39:20,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 11: [2022-11-25 20:39:20,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-25 20:39:20,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 20:39:20,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 20:39:20,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 20:39:20,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 20:39:20,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 20:39:20,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 20:39:20,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 11: [2022-11-25 20:39:20,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 20:39:20,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 11: [2022-11-25 20:39:20,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 11: [2022-11-25 20:39:20,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 11: [2022-11-25 20:39:20,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
11: [2022-11-25 20:39:20,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 11: [2022-11-25 20:39:20,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 20:39:20,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:39:20,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 20:39:20,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 8: [2022-11-25 20:39:20,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:39:20,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 20:39:20,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 20:39:20,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:39:20,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 20:39:20,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 20:39:20,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-25 20:39:20,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 20:39:20,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 2: [2022-11-25 20:39:20,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:39:20,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 20:39:20,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: [2022-11-25 20:39:20,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:39:20,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 20:39:20,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 20:39:20,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:39:20,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:39:20,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:39:20,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-25 20:39:20,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:39:20,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:39:20,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 20:39:20,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 20:39:20,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 20:39:20,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 20:39:20,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 20:39:20,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 20:39:20,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 20:39:20,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 20:39:20,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 20:39:20,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
7: [2022-11-25 20:39:20,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 20:39:20,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 20:39:20,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:39:20,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:39:20,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:39:20,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 20:39:20,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 20:39:20,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 20:39:20,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:39:20,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 20:39:20,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 20:39:20,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
6: [2022-11-25 20:39:20,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 20:39:20,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 5: [2022-11-25 20:39:20,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:39:20,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 20:39:20,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 20:39:20,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:39:20,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 20:39:20,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:39:20,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 20:39:20,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 20:39:20,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 20:39:20,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-25 20:39:20,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 20:39:20,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 6: [2022-11-25 20:39:20,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:39:20,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 20:39:20,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 3: [2022-11-25 20:39:20,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:39:20,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 20:39:20,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 20:39:20,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:39:20,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 20:39:20,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 1: [2022-11-25 20:39:21,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-25 20:39:21,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:39:21,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:39:21,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:39:21,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:39:21,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 20:39:21,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:39:21,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 20:39:21,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 20:39:21,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 20:39:21,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 1: [2022-11-25 20:39:21,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 20:39:21,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
1: [2022-11-25 20:39:21,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 1: [2022-11-25 20:39:21,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 20:39:21,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 1: [2022-11-25 20:39:21,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 1: [2022-11-25 20:39:21,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 1: [2022-11-25 20:39:21,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:39:21,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 20:39:21,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 9: [2022-11-25 20:39:21,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:39:21,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:39:21,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:39:21,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:39:21,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-25 20:39:21,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:39:21,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:39:21,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:39:21,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 20:39:21,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 20:39:21,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 20:39:21,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 20:39:21,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 20:39:21,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 20:39:21,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 20:39:21,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 20:39:21,014] [INFO] 
[torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 9: [2022-11-25 20:39:21,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 9: [2022-11-25 20:39:21,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 9: [2022-11-25 20:39:21,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 9: [2022-11-25 20:39:21,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 9: [2022-11-25 20:39:21,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 9: [2022-11-25 20:39:21,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 9: [2022-11-25 20:39:21,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 20:39:20,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:39:20,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:39:20,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:39:20,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:39:20,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:39:20,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-25 20:39:20,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:39:20,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:39:20,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 20:39:20,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 20:39:20,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 20:39:20,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 20:39:20,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 
4: [2022-11-25 20:39:20,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 20:39:20,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 20:39:20,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 20:39:20,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 20:39:20,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 20:39:20,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 20:39:20,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 20:39:20,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 20:39:20,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 20:39:20,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 4: [2022-11-25 20:39:20,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 10: [2022-11-25 20:39:21,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-25 20:39:21,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 20:39:21,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 8: [2022-11-25 20:39:21,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:39:21,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 20:39:21,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 7: [2022-11-25 20:39:21,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:39:21,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 20:39:21,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 1: [2022-11-25 20:39:21,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:39:21,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 20:39:21,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 8: [2022-11-25 20:39:21,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-25 20:39:21,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step3000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 20:39:21,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step3000 is ready now! 0: successfully saved checkpoint at iteration 3000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3605.79 15: iteration 3010/ 125429 | consumed samples: 770560 | consumed tokens: 1578106880 | elapsed time per iteration (s): 1.41 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.634554E+00 | grad norm: 0.257 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 181.524 | TFLOPs: 30.00 | 15: iteration 3020/ 125429 | consumed samples: 773120 | consumed tokens: 1583349760 | elapsed time per iteration (s): 1.06 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.653684E+00 | grad norm: 0.254 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.803 | TFLOPs: 39.96 | 15: iteration 3030/ 125429 | consumed samples: 775680 | consumed tokens: 1588592640 | elapsed time per iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.612203E+00 | grad norm: 0.263 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.581 | TFLOPs: 40.42 | 15: iteration 3040/ 125429 | consumed samples: 778240 | consumed tokens: 1593835520 | elapsed time per iteration (s): 1.02 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.649792E+00 | grad norm: 0.245 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.799 | TFLOPs: 41.28 | 15: iteration 3050/ 125429 | consumed samples: 780800 | consumed tokens: 1599078400 | elapsed time per iteration (s): 1.07 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 
2.653508E+00 | grad norm: 0.246 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.835 | TFLOPs: 39.63 | 15: iteration 3060/ 125429 | consumed samples: 783360 | consumed tokens: 1604321280 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.671455E+00 | grad norm: 0.233 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.517 | TFLOPs: 41.07 | 15: iteration 3070/ 125429 | consumed samples: 785920 | consumed tokens: 1609564160 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.672106E+00 | grad norm: 0.252 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.404 | TFLOPs: 41.22 | 15: iteration 3080/ 125429 | consumed samples: 788480 | consumed tokens: 1614807040 | elapsed time per iteration (s): 1.08 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.643450E+00 | grad norm: 0.252 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.196 | TFLOPs: 39.03 | 15: iteration 3090/ 125429 | consumed samples: 791040 | consumed tokens: 1620049920 | elapsed time per iteration (s): 1.07 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.641613E+00 | grad norm: 0.286 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.236 | TFLOPs: 39.37 | 15: iteration 3100/ 125429 | consumed samples: 793600 | consumed tokens: 1625292800 | elapsed time per iteration (s): 1.09 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.669092E+00 | grad norm: 0.258 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.191 | TFLOPs: 38.87 | 15: iteration 3110/ 125429 | consumed samples: 796160 | consumed tokens: 1630535680 | elapsed time per 
iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.609155E+00 | grad norm: 0.231 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.291 | TFLOPs: 40.21 | 15: iteration 3120/ 125429 | consumed samples: 798720 | consumed tokens: 1635778560 | elapsed time per iteration (s): 1.04 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.648824E+00 | grad norm: 0.241 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.646 | TFLOPs: 40.59 | 15: iteration 3130/ 125429 | consumed samples: 801280 | consumed tokens: 1641021440 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.641558E+00 | grad norm: 0.251 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.827 | TFLOPs: 41.12 | 15: iteration 3140/ 125429 | consumed samples: 803840 | consumed tokens: 1646264320 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.616629E+00 | grad norm: 0.303 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.899 | TFLOPs: 41.13 | 15: iteration 3150/ 125429 | consumed samples: 806400 | consumed tokens: 1651507200 | elapsed time per iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.677077E+00 | grad norm: 0.332 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.953 | TFLOPs: 40.32 | 15: iteration 3160/ 125429 | consumed samples: 808960 | consumed tokens: 1656750080 | elapsed time per iteration (s): 1.04 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.621725E+00 | grad norm: 0.245 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.210 | TFLOPs: 40.52 | 15: iteration 3170/ 
125429 | consumed samples: 811520 | consumed tokens: 1661992960 | elapsed time per iteration (s): 1.04 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.627551E+00 | grad norm: 0.244 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.340 | TFLOPs: 40.71 | 15: iteration 3180/ 125429 | consumed samples: 814080 | consumed tokens: 1667235840 | elapsed time per iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.641497E+00 | grad norm: 0.244 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.651 | TFLOPs: 40.43 | 15: iteration 3190/ 125429 | consumed samples: 816640 | consumed tokens: 1672478720 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.597163E+00 | grad norm: 0.250 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.056 | TFLOPs: 41.16 | 15: iteration 3200/ 125429 | consumed samples: 819200 | consumed tokens: 1677721600 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.601024E+00 | grad norm: 0.227 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.823 | TFLOPs: 40.95 | 15: iteration 3210/ 125429 | consumed samples: 821760 | consumed tokens: 1682964480 | elapsed time per iteration (s): 1.04 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.650046E+00 | grad norm: 0.236 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.988 | TFLOPs: 40.49 | 15: iteration 3220/ 125429 | consumed samples: 824320 | consumed tokens: 1688207360 | elapsed time per iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.610338E+00 | grad norm: 0.252 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 244.097 | TFLOPs: 40.34 | 15: iteration 3230/ 125429 | consumed samples: 826880 | consumed tokens: 1693450240 | elapsed time per iteration (s): 1.04 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.634532E+00 | grad norm: 0.239 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.724 | TFLOPs: 40.61 | 15: iteration 3240/ 125429 | consumed samples: 829440 | consumed tokens: 1698693120 | elapsed time per iteration (s): 1.02 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.609819E+00 | grad norm: 0.242 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.955 | TFLOPs: 41.47 | 15: iteration 3250/ 125429 | consumed samples: 832000 | consumed tokens: 1703936000 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.640059E+00 | grad norm: 1.019 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.657 | TFLOPs: 41.09 | 15: iteration 3260/ 125429 | consumed samples: 834560 | consumed tokens: 1709178880 | elapsed time per iteration (s): 1.06 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.825926E+00 | grad norm: 1.583 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.458 | TFLOPs: 39.90 | 15: iteration 3270/ 125429 | consumed samples: 837120 | consumed tokens: 1714421760 | elapsed time per iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.867954E+00 | grad norm: 1.844 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.166 | TFLOPs: 40.35 | 15: iteration 3280/ 125429 | consumed samples: 839680 | consumed tokens: 1719664640 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.800300E+00 | 
grad norm: 0.459 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.282 | TFLOPs: 41.03 | 15: iteration 3290/ 125429 | consumed samples: 842240 | consumed tokens: 1724907520 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.687621E+00 | grad norm: 0.735 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.583 | TFLOPs: 41.08 | 15: iteration 3300/ 125429 | consumed samples: 844800 | consumed tokens: 1730150400 | elapsed time per iteration (s): 1.08 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.768982E+00 | grad norm: 0.508 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.030 | TFLOPs: 39.17 | 15: iteration 3310/ 125429 | consumed samples: 847360 | consumed tokens: 1735393280 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.698461E+00 | grad norm: 0.300 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.256 | TFLOPs: 41.03 | 15: iteration 3320/ 125429 | consumed samples: 849920 | consumed tokens: 1740636160 | elapsed time per iteration (s): 1.04 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.692356E+00 | grad norm: 0.261 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.359 | TFLOPs: 40.71 | 15: iteration 3330/ 125429 | consumed samples: 852480 | consumed tokens: 1745879040 | elapsed time per iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.687813E+00 | grad norm: 0.248 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.769 | TFLOPs: 40.45 | 15: iteration 3340/ 125429 | consumed samples: 855040 | consumed tokens: 1751121920 | elapsed time per iteration (s): 
1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.670122E+00 | grad norm: 0.235 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.494 | TFLOPs: 41.07 | 15: iteration 3350/ 125429 | consumed samples: 857600 | consumed tokens: 1756364800 | elapsed time per iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.652379E+00 | grad norm: 0.228 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.158 | TFLOPs: 40.35 | 15: iteration 3360/ 125429 | consumed samples: 860160 | consumed tokens: 1761607680 | elapsed time per iteration (s): 1.04 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.638760E+00 | grad norm: 0.246 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.571 | TFLOPs: 40.75 | 15: iteration 3370/ 125429 | consumed samples: 862720 | consumed tokens: 1766850560 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.655571E+00 | grad norm: 0.241 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.925 | TFLOPs: 40.97 | 15: iteration 3380/ 125429 | consumed samples: 865280 | consumed tokens: 1772093440 | elapsed time per iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.634832E+00 | grad norm: 0.239 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.470 | TFLOPs: 40.40 | 15: iteration 3390/ 125429 | consumed samples: 867840 | consumed tokens: 1777336320 | elapsed time per iteration (s): 1.04 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.628115E+00 | grad norm: 0.232 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.406 | TFLOPs: 40.72 | 15: iteration 3400/ 125429 | 
consumed samples: 870400 | consumed tokens: 1782579200 | elapsed time per iteration (s): 1.06 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.600628E+00 | grad norm: 0.243 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.440 | TFLOPs: 39.73 | 15: iteration 3410/ 125429 | consumed samples: 872960 | consumed tokens: 1787822080 | elapsed time per iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.661228E+00 | grad norm: 0.228 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.772 | TFLOPs: 40.45 | 15: iteration 3420/ 125429 | consumed samples: 875520 | consumed tokens: 1793064960 | elapsed time per iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.645332E+00 | grad norm: 0.235 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.322 | TFLOPs: 40.21 | 15: iteration 3430/ 125429 | consumed samples: 878080 | consumed tokens: 1798307840 | elapsed time per iteration (s): 1.04 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.601393E+00 | grad norm: 0.228 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.223 | TFLOPs: 40.52 | 15: iteration 3440/ 125429 | consumed samples: 880640 | consumed tokens: 1803550720 | elapsed time per iteration (s): 1.07 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.572898E+00 | grad norm: 0.236 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.036 | TFLOPs: 39.67 | 15: iteration 3450/ 125429 | consumed samples: 883200 | consumed tokens: 1808793600 | elapsed time per iteration (s): 1.04 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.633377E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 246.668 | TFLOPs: 40.76 | 15: iteration 3460/ 125429 | consumed samples: 885760 | consumed tokens: 1814036480 | elapsed time per iteration (s): 1.09 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.605748E+00 | grad norm: 0.319 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.663 | TFLOPs: 38.78 | 15: iteration 3470/ 125429 | consumed samples: 888320 | consumed tokens: 1819279360 | elapsed time per iteration (s): 1.06 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.635541E+00 | grad norm: 0.237 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.469 | TFLOPs: 40.07 | 15: iteration 3480/ 125429 | consumed samples: 890880 | consumed tokens: 1824522240 | elapsed time per iteration (s): 1.06 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.623159E+00 | grad norm: 0.312 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.392 | TFLOPs: 39.89 | 15: iteration 3490/ 125429 | consumed samples: 893440 | consumed tokens: 1829765120 | elapsed time per iteration (s): 1.10 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.608544E+00 | grad norm: 0.358 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.380 | TFLOPs: 38.40 | 15: iteration 3500/ 125429 | consumed samples: 896000 | consumed tokens: 1835008000 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.608385E+00 | grad norm: 0.268 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.351 | TFLOPs: 41.21 | 15: iteration 3510/ 125429 | consumed samples: 898560 | consumed tokens: 1840250880 | elapsed time per iteration (s): 1.05 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.616933E+00 | 
grad norm: 0.229 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.954 | TFLOPs: 40.15 | 15: iteration 3520/ 125429 | consumed samples: 901120 | consumed tokens: 1845493760 | elapsed time per iteration (s): 1.07 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.591160E+00 | grad norm: 0.222 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.323 | TFLOPs: 39.55 | 15: iteration 3530/ 125429 | consumed samples: 903680 | consumed tokens: 1850736640 | elapsed time per iteration (s): 1.03 | learning rate: 1.999E-04 | global batch size: 256 | lm loss: 2.598941E+00 | grad norm: 0.229 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.405 | TFLOPs: 40.89 | 15: iteration 3540/ 125429 | consumed samples: 906240 | consumed tokens: 1855979520 | elapsed time per iteration (s): 1.06 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.627186E+00 | grad norm: 0.613 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.143 | TFLOPs: 39.85 | 15: iteration 3550/ 125429 | consumed samples: 908800 | consumed tokens: 1861222400 | elapsed time per iteration (s): 1.05 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.698258E+00 | grad norm: 0.539 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.985 | TFLOPs: 40.16 | 15: iteration 3560/ 125429 | consumed samples: 911360 | consumed tokens: 1866465280 | elapsed time per iteration (s): 1.05 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.670584E+00 | grad norm: 0.291 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.819 | TFLOPs: 40.13 | 15: iteration 3570/ 125429 | consumed samples: 913920 | consumed tokens: 1871708160 | elapsed time per iteration (s): 
1.06 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.634072E+00 | grad norm: 0.229 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.107 | TFLOPs: 39.84 | 15: iteration 3580/ 125429 | consumed samples: 916480 | consumed tokens: 1876951040 | elapsed time per iteration (s): 1.07 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.636087E+00 | grad norm: 0.230 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.391 | TFLOPs: 39.56 | 15: iteration 3590/ 125429 | consumed samples: 919040 | consumed tokens: 1882193920 | elapsed time per iteration (s): 1.04 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.585436E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.864 | TFLOPs: 40.63 | 15: iteration 3600/ 125429 | consumed samples: 921600 | consumed tokens: 1887436800 | elapsed time per iteration (s): 1.22 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.598098E+00 | grad norm: 0.220 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 210.470 | TFLOPs: 34.78 | 15: iteration 3610/ 125429 | consumed samples: 924160 | consumed tokens: 1892679680 | elapsed time per iteration (s): 1.02 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.617204E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.918 | TFLOPs: 41.30 | 15: iteration 3620/ 125429 | consumed samples: 926720 | consumed tokens: 1897922560 | elapsed time per iteration (s): 1.03 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.579934E+00 | grad norm: 0.219 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.673 | TFLOPs: 41.10 | 15: iteration 3630/ 125429 | 
consumed samples: 929280 | consumed tokens: 1903165440 | elapsed time per iteration (s): 1.04 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.610367E+00 | grad norm: 0.221 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.487 | TFLOPs: 40.73 | 15: iteration 3640/ 125429 | consumed samples: 931840 | consumed tokens: 1908408320 | elapsed time per iteration (s): 1.06 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.619855E+00 | grad norm: 0.221 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.535 | TFLOPs: 40.08 | 15: iteration 3650/ 125429 | consumed samples: 934400 | consumed tokens: 1913651200 | elapsed time per iteration (s): 1.07 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.576663E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.244 | TFLOPs: 39.70 | 15: iteration 3660/ 125429 | consumed samples: 936960 | consumed tokens: 1918894080 | elapsed time per iteration (s): 1.03 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.581079E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.830 | TFLOPs: 40.96 | 15: iteration 3670/ 125429 | consumed samples: 939520 | consumed tokens: 1924136960 | elapsed time per iteration (s): 1.06 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.576419E+00 | grad norm: 0.237 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.516 | TFLOPs: 40.08 | 15: iteration 3680/ 125429 | consumed samples: 942080 | consumed tokens: 1929379840 | elapsed time per iteration (s): 1.04 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.612349E+00 | grad norm: 0.218 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 245.774 | TFLOPs: 40.62 | 15: iteration 3690/ 125429 | consumed samples: 944640 | consumed tokens: 1934622720 | elapsed time per iteration (s): 1.05 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.607719E+00 | grad norm: 0.222 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.383 | TFLOPs: 40.39 | 15: iteration 3700/ 125429 | consumed samples: 947200 | consumed tokens: 1939865600 | elapsed time per iteration (s): 1.04 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.583607E+00 | grad norm: 0.221 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.148 | TFLOPs: 40.68 | 15: iteration 3710/ 125429 | consumed samples: 949760 | consumed tokens: 1945108480 | elapsed time per iteration (s): 1.03 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.591092E+00 | grad norm: 0.222 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.218 | TFLOPs: 41.19 | 15: iteration 3720/ 125429 | consumed samples: 952320 | consumed tokens: 1950351360 | elapsed time per iteration (s): 1.07 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.605068E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.527 | TFLOPs: 39.58 | 15: iteration 3730/ 125429 | consumed samples: 954880 | consumed tokens: 1955594240 | elapsed time per iteration (s): 1.03 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.593591E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.439 | TFLOPs: 41.22 | 15: iteration 3740/ 125429 | consumed samples: 957440 | consumed tokens: 1960837120 | elapsed time per iteration (s): 1.05 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.575917E+00 | 
grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.931 | TFLOPs: 40.15 | 15: iteration 3750/ 125429 | consumed samples: 960000 | consumed tokens: 1966080000 | elapsed time per iteration (s): 1.05 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.603875E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.307 | TFLOPs: 40.21 | 15: iteration 3760/ 125429 | consumed samples: 962560 | consumed tokens: 1971322880 | elapsed time per iteration (s): 1.04 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.554874E+00 | grad norm: 0.231 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.520 | TFLOPs: 40.74 | 15: iteration 3770/ 125429 | consumed samples: 965120 | consumed tokens: 1976565760 | elapsed time per iteration (s): 1.07 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.580625E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.185 | TFLOPs: 39.69 | 15: iteration 3780/ 125429 | consumed samples: 967680 | consumed tokens: 1981808640 | elapsed time per iteration (s): 1.04 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.576341E+00 | grad norm: 0.222 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.250 | TFLOPs: 40.86 | 15: iteration 3790/ 125429 | consumed samples: 970240 | consumed tokens: 1987051520 | elapsed time per iteration (s): 1.07 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.586015E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.383 | TFLOPs: 39.39 | 15: iteration 3800/ 125429 | consumed samples: 972800 | consumed tokens: 1992294400 | elapsed time per iteration (s): 
1.03 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.586980E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.236 | TFLOPs: 41.02 | 15: iteration 3810/ 125429 | consumed samples: 975360 | consumed tokens: 1997537280 | elapsed time per iteration (s): 1.03 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.573586E+00 | grad norm: 0.235 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.971 | TFLOPs: 41.14 | 15: iteration 3820/ 125429 | consumed samples: 977920 | consumed tokens: 2002780160 | elapsed time per iteration (s): 1.03 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.587531E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.337 | TFLOPs: 41.20 | 15: iteration 3830/ 125429 | consumed samples: 980480 | consumed tokens: 2008023040 | elapsed time per iteration (s): 1.03 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.602803E+00 | grad norm: 0.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.460 | TFLOPs: 41.23 | 15: iteration 3840/ 125429 | consumed samples: 983040 | consumed tokens: 2013265920 | elapsed time per iteration (s): 1.03 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.570495E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.699 | TFLOPs: 41.10 | 15: iteration 3850/ 125429 | consumed samples: 985600 | consumed tokens: 2018508800 | elapsed time per iteration (s): 1.06 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.544162E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.592 | TFLOPs: 39.93 | 15: iteration 3860/ 125429 | 
consumed samples: 988160 | consumed tokens: 2023751680 | elapsed time per iteration (s): 1.03 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.562390E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.848 | TFLOPs: 40.96 | 15: iteration 3870/ 125429 | consumed samples: 990720 | consumed tokens: 2028994560 | elapsed time per iteration (s): 1.06 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.533546E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.086 | TFLOPs: 40.01 | 15: iteration 3880/ 125429 | consumed samples: 993280 | consumed tokens: 2034237440 | elapsed time per iteration (s): 1.04 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.583642E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.308 | TFLOPs: 40.87 | 15: iteration 3890/ 125429 | consumed samples: 995840 | consumed tokens: 2039480320 | elapsed time per iteration (s): 1.03 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.633900E+00 | grad norm: 0.644 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.222 | TFLOPs: 41.02 | 15: iteration 3900/ 125429 | consumed samples: 998400 | consumed tokens: 2044723200 | elapsed time per iteration (s): 1.04 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.569609E+00 | grad norm: 0.224 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.970 | TFLOPs: 40.81 | 15: iteration 3910/ 125429 | consumed samples: 1000960 | consumed tokens: 2049966080 | elapsed time per iteration (s): 1.03 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.578632E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 249.715 | TFLOPs: 41.27 | 15: iteration 3920/ 125429 | consumed samples: 1003520 | consumed tokens: 2055208960 | elapsed time per iteration (s): 1.04 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.567406E+00 | grad norm: 0.324 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.248 | TFLOPs: 40.86 | 15: iteration 3930/ 125429 | consumed samples: 1006080 | consumed tokens: 2060451840 | elapsed time per iteration (s): 1.03 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.610909E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.132 | TFLOPs: 41.01 | 15: iteration 3940/ 125429 | consumed samples: 1008640 | consumed tokens: 2065694720 | elapsed time per iteration (s): 1.03 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.542465E+00 | grad norm: 0.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.522 | TFLOPs: 40.90 | 15: iteration 3950/ 125429 | consumed samples: 1011200 | consumed tokens: 2070937600 | elapsed time per iteration (s): 1.05 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.561906E+00 | grad norm: 0.222 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.809 | TFLOPs: 40.29 | 15: iteration 3960/ 125429 | consumed samples: 1013760 | consumed tokens: 2076180480 | elapsed time per iteration (s): 1.08 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.534074E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.336 | TFLOPs: 39.06 | 15: iteration 3970/ 125429 | consumed samples: 1016320 | consumed tokens: 2081423360 | elapsed time per iteration (s): 1.04 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 
2.592215E+00 | grad norm: 0.230 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.176 | TFLOPs: 40.85 |
15: iteration 3980/ 125429 | consumed samples: 1018880 | consumed tokens: 2086666240 | elapsed time per iteration (s): 1.05 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.567641E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.443 | TFLOPs: 40.40 |
15: iteration 3990/ 125429 | consumed samples: 1021440 | consumed tokens: 2091909120 | elapsed time per iteration (s): 1.06 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.582885E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.290 | TFLOPs: 39.88 |
0: [2022-11-25 20:56:48,748] [INFO] [logging.py:68:log_dist] [Rank 0] step=4000, skipped=0, lr=[0.00019978293964270148, 0.00019978293964270148, 0.00019978293964270148], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
15: iteration 4000/ 125429 | consumed samples: 1024000 | consumed tokens: 2097152000 | elapsed time per iteration (s): 1.04 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.541101E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.625 | TFLOPs: 40.76 |
0: steps: 4000 loss: 2.5048 iter time (s): 1.058 samples/sec: 241.926
15: ------------------------------------------------------------------------------------------
15: valid loss at iteration 4000 | lm loss value: 2.631830E+00 | lm loss PPL: 1.389918E+01 |
15: ------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 4000 to checkpoints_1b5
0: [2022-11-25 20:56:49,095] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step4000 is begin to save!
0: [2022-11-25 20:56:49,103] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_01-model_00-model_states.pt... 0: [2022-11-25 20:56:49,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_01-model_00-model_states.pt. 0: [2022-11-25 20:56:49,328] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_03-model_00-model_states.pt... 0: [2022-11-25 20:56:49,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_03-model_00-model_states.pt. 0: [2022-11-25 20:56:49,436] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_04-model_00-model_states.pt... 0: [2022-11-25 20:56:49,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_04-model_00-model_states.pt. 0: [2022-11-25 20:56:49,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_05-model_00-model_states.pt... 0: [2022-11-25 20:56:49,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_05-model_00-model_states.pt. 0: [2022-11-25 20:56:49,646] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_06-model_00-model_states.pt... 0: [2022-11-25 20:56:49,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_06-model_00-model_states.pt. 0: [2022-11-25 20:56:49,745] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_07-model_00-model_states.pt... 0: [2022-11-25 20:56:49,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_07-model_00-model_states.pt. 
0: [2022-11-25 20:56:49,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_08-model_00-model_states.pt... 0: [2022-11-25 20:56:49,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_08-model_00-model_states.pt. 0: [2022-11-25 20:56:49,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_09-model_00-model_states.pt... 0: [2022-11-25 20:56:50,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_09-model_00-model_states.pt. 0: [2022-11-25 20:56:50,040] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_10-model_00-model_states.pt... 0: [2022-11-25 20:56:50,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_10-model_00-model_states.pt. 0: [2022-11-25 20:56:50,142] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_11-model_00-model_states.pt... 0: [2022-11-25 20:56:50,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_11-model_00-model_states.pt. 0: [2022-11-25 20:56:50,243] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_12-model_00-model_states.pt... 0: [2022-11-25 20:56:50,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_12-model_00-model_states.pt. 0: [2022-11-25 20:56:50,340] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_13-model_00-model_states.pt... 0: [2022-11-25 20:56:50,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_13-model_00-model_states.pt. 
0: [2022-11-25 20:56:50,442] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_14-model_00-model_states.pt... 0: [2022-11-25 20:56:50,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_14-model_00-model_states.pt. 0: [2022-11-25 20:56:50,539] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_15-model_00-model_states.pt... 0: [2022-11-25 20:56:50,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_15-model_00-model_states.pt. 0: [2022-11-25 20:56:50,640] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_16-model_00-model_states.pt... 0: [2022-11-25 20:56:50,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_16-model_00-model_states.pt. 0: [2022-11-25 20:56:50,742] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_17-model_00-model_states.pt... 0: [2022-11-25 20:56:50,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_17-model_00-model_states.pt. 0: [2022-11-25 20:56:50,838] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_18-model_00-model_states.pt... 0: [2022-11-25 20:56:50,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_18-model_00-model_states.pt. 0: [2022-11-25 20:56:50,933] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_19-model_00-model_states.pt... 0: [2022-11-25 20:56:51,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_19-model_00-model_states.pt. 
0: [2022-11-25 20:56:51,037] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_20-model_00-model_states.pt... 0: [2022-11-25 20:56:51,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_20-model_00-model_states.pt. 0: [2022-11-25 20:56:51,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_21-model_00-model_states.pt... 0: [2022-11-25 20:56:51,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_21-model_00-model_states.pt. 0: [2022-11-25 20:56:51,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_22-model_00-model_states.pt... 0: [2022-11-25 20:56:51,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_22-model_00-model_states.pt. 0: [2022-11-25 20:56:51,331] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_23-model_00-model_states.pt... 0: [2022-11-25 20:56:51,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_23-model_00-model_states.pt. 0: [2022-11-25 20:56:51,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_24-model_00-model_states.pt... 0: [2022-11-25 20:56:51,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_24-model_00-model_states.pt. 0: [2022-11-25 20:56:51,531] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_25-model_00-model_states.pt... 0: [2022-11-25 20:56:51,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_25-model_00-model_states.pt. 
0: [2022-11-25 20:56:51,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_26-model_00-model_states.pt... 0: [2022-11-25 20:56:51,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_26-model_00-model_states.pt. 0: [2022-11-25 20:56:51,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_27-model_00-model_states.pt... 0: [2022-11-25 20:56:51,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_27-model_00-model_states.pt. 0: [2022-11-25 20:56:51,829] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_28-model_00-model_states.pt... 0: [2022-11-25 20:56:51,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_28-model_00-model_states.pt. 0: [2022-11-25 20:56:51,927] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_29-model_00-model_states.pt... 0: [2022-11-25 20:56:52,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_29-model_00-model_states.pt. 0: [2022-11-25 20:56:52,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_30-model_00-model_states.pt... 0: [2022-11-25 20:56:52,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_30-model_00-model_states.pt. 0: [2022-11-25 20:56:52,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/layer_32-model_00-model_states.pt... 0: [2022-11-25 20:56:52,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/layer_32-model_00-model_states.pt. 
0: [2022-11-25 20:56:52,129] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step4000/mp_rank_00_model_states.pt 0: [2022-11-25 20:56:52,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/mp_rank_00_model_states.pt... 0: [2022-11-25 20:56:52,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/mp_rank_00_model_states.pt. 0: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
1: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
5: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
9: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
14: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
13: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
12: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
8: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 0: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
15: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
8: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 5: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 20:56:52,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step4000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 1: [2022-11-25 20:56:52,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:56:52,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
1: [2022-11-25 20:56:52,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 13: [2022-11-25 20:56:52,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:56:52,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 1: [2022-11-25 20:56:52,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 13: [2022-11-25 20:56:52,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 8: [2022-11-25 20:56:52,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 13: [2022-11-25 20:56:52,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 10: [2022-11-25 20:56:52,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:56:52,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 20:56:52,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 5: [2022-11-25 20:56:52,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:56:52,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
5: [2022-11-25 20:56:52,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 4: [2022-11-25 20:56:52,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 5: [2022-11-25 20:56:52,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 20:56:52,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 5: [2022-11-25 20:56:52,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:56:52,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 20:56:52,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 11: [2022-11-25 20:56:52,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:56:52,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:56:52,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 20:56:52,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 14: [2022-11-25 20:56:52,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-25 20:56:52,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 20:56:52,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 12: [2022-11-25 20:56:52,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:56:52,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 20:56:52,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 12: [2022-11-25 20:56:52,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:56:52,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 20:56:52,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 5: [2022-11-25 20:56:52,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:56:52,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 20:56:52,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 20:56:52,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-25 20:56:52,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 20:56:52,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 9: [2022-11-25 20:56:52,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:56:52,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 20:56:52,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 20:56:52,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:56:52,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:56:52,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 20:56:52,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 20:56:52,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:56:52,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 20:56:52,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
12: [2022-11-25 20:56:52,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:56:52,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 20:56:52,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 9: [2022-11-25 20:56:52,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:56:52,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 20:56:52,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 15: [2022-11-25 20:56:52,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:56:52,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:56:52,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 20:56:52,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 15: [2022-11-25 20:56:52,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 20:56:52,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
0: [2022-11-25 20:56:52,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:56:52,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:56:52,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 20:56:52,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:56:52,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:56:52,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 15: [2022-11-25 20:56:52,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 20:56:52,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 20:56:52,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 20:56:52,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 10: [2022-11-25 20:56:52,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-25 20:56:52,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 20:56:52,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 11: [2022-11-25 20:56:52,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 20:56:52,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 11: [2022-11-25 20:56:52,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:56:52,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 20:56:52,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 11: [2022-11-25 20:56:52,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:56:52,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 20:56:52,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 20:56:52,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 20:56:52,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
1: [2022-11-25 20:56:52,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:56:52,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 20:56:52,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 8: [2022-11-25 20:56:52,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:56:52,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 20:56:52,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 20:56:52,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:56:52,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 20:56:52,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 20:56:52,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:56:52,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 20:56:52,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
6: [2022-11-25 20:56:52,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:56:52,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 20:56:52,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 20:56:52,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:56:52,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 20:56:52,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 20:56:52,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:56:52,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 20:56:52,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 10: [2022-11-25 20:56:52,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:56:52,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-25 20:56:52,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 20:56:52,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 20:56:52,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 10: [2022-11-25 20:56:52,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 20:56:52,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:56:52,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 20:56:52,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 9: [2022-11-25 20:56:52,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:56:52,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 20:56:52,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 20:56:52,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-25 20:56:52,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 20:56:52,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 9: [2022-11-25 20:56:52,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:56:52,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:56:52,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 20:56:52,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 20:56:52,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:56:52,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 20:56:52,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 12: [2022-11-25 20:56:52,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:56:52,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 20:56:52,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
6: [2022-11-25 20:56:52,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:56:52,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 20:56:52,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 5: [2022-11-25 20:56:52,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:56:52,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:56:52,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 20:56:52,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 20:56:52,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 5: [2022-11-25 20:56:52,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 9: [2022-11-25 20:56:52,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:56:52,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 20:56:52,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
15: [2022-11-25 20:56:52,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:56:52,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 20:56:52,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 12: [2022-11-25 20:56:52,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:56:52,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 20:56:52,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 20:56:52,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:56:52,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 20:56:52,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 14: [2022-11-25 20:56:52,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:56:52,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 20:56:52,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
14: [2022-11-25 20:56:52,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:56:52,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 20:56:52,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 14: [2022-11-25 20:56:52,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:56:52,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 20:56:52,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 20:56:52,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:56:52,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 20:56:52,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 15: [2022-11-25 20:56:52,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:56:52,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-25 20:56:52,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 20:56:52,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 20:56:52,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:56:52,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:56:52,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:56:52,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 20:56:52,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 20:56:52,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 20:56:52,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 20:56:52,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 20:56:52,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 20:56:52,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-25 20:56:52,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:56:52,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 20:56:52,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 20:56:52,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 20:56:52,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 7: [2022-11-25 20:56:52,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 20:56:52,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 20:56:52,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 20:56:52,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:56:52,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 20:56:52,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 5: [2022-11-25 20:56:52,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-25 20:56:52,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:56:52,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 20:56:52,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 5: [2022-11-25 20:56:52,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 0: [2022-11-25 20:56:52,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 5: [2022-11-25 20:56:52,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 20:56:52,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 20:56:52,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 20:56:52,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:56:52,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 20:56:52,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 20:56:52,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-25 20:56:52,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 20:56:52,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 15: [2022-11-25 20:56:52,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 20:56:52,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 15: [2022-11-25 20:56:52,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:56:52,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 20:56:52,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 15: [2022-11-25 20:56:52,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:56:52,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:56:52,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 20:56:52,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 20:56:52,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
15: [2022-11-25 20:56:52,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 9: [2022-11-25 20:56:52,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:56:52,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 20:56:52,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 11: [2022-11-25 20:56:52,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 20:56:52,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 13: [2022-11-25 20:56:52,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:56:52,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 20:56:52,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 9: [2022-11-25 20:56:52,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 20:56:52,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 20:56:52,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
2: [2022-11-25 20:56:52,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:56:52,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 20:56:52,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 15: [2022-11-25 20:56:52,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 20:56:52,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 20:56:52,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 14: [2022-11-25 20:56:52,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:56:52,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 20:56:52,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 13: [2022-11-25 20:56:52,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:56:52,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 8: [2022-11-25 20:56:52,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
13: [2022-11-25 20:56:52,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 8: [2022-11-25 20:56:52,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 20:56:52,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 20:56:52,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:56:52,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 20:56:52,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 20:56:52,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:56:52,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 20:56:52,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 13: [2022-11-25 20:56:52,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:56:52,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 20:56:52,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
3: [2022-11-25 20:56:52,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:56:52,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 20:56:52,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 20:56:52,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:56:52,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 20:56:52,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 13: [2022-11-25 20:56:52,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:56:52,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 0: [2022-11-25 20:56:52,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 13: [2022-11-25 20:56:52,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 20:56:52,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 20:56:52,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 
13: [2022-11-25 20:56:52,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: [2022-11-25 20:56:52,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 12: [2022-11-25 20:56:52,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:56:52,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 20:56:52,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 20:56:52,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:56:52,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 20:56:52,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 20:56:52,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:56:52,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 20:56:52,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 20:56:52,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-25 20:56:52,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 20:56:52,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 10: [2022-11-25 20:56:52,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:56:52,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 20:56:52,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 14: [2022-11-25 20:56:52,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:56:52,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 20:56:52,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 20:56:52,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:56:52,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-25 20:56:52,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 20:56:52,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 20:56:52,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 20:56:52,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 20:56:52,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:56:52,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 20:56:52,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 20:56:52,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 20:56:52,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 20:56:52,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 8: [2022-11-25 20:56:52,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-25 20:56:52,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 20:56:52,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 20:56:52,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:56:52,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 20:56:52,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 20:56:52,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:56:52,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 20:56:52,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 12: [2022-11-25 20:56:52,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 20:56:52,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 20:56:52,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 12: [2022-11-25 20:56:52,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-25 20:56:52,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 20:56:52,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 13: [2022-11-25 20:56:52,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:56:52,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 20:56:52,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 10: [2022-11-25 20:56:52,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 13: [2022-11-25 20:56:52,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:56:52,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 13: [2022-11-25 20:56:52,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 10: [2022-11-25 20:56:52,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 13: [2022-11-25 20:56:52,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 2: [2022-11-25 20:56:52,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-25 20:56:52,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 20:56:52,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 6: [2022-11-25 20:56:52,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 20:56:52,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 20:56:52,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 20:56:52,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:56:52,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 20:56:52,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 20:56:52,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:56:52,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 20:56:52,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 20:56:52,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-25 20:56:52,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 20:56:52,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 11: [2022-11-25 20:56:52,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:56:52,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 20:56:52,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 10: [2022-11-25 20:56:52,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 20:56:52,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 20:56:52,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 11: [2022-11-25 20:56:52,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:56:52,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 20:56:52,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 11: [2022-11-25 20:56:52,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-25 20:56:52,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 20:56:52,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 20:56:52,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:56:52,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 20:56:52,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 8: [2022-11-25 20:56:52,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:56:52,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 20:56:52,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 20:56:52,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 20:56:52,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 20:56:52,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 4: [2022-11-25 20:56:52,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-25 20:56:52,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 20:56:52,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 8: [2022-11-25 20:56:52,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:56:52,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 20:56:52,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 8: [2022-11-25 20:56:52,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:56:52,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 20:56:52,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 11: [2022-11-25 20:56:52,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 20:56:52,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 20:56:52,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 20:56:52,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-25 20:56:52,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 20:56:52,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 8: [2022-11-25 20:56:52,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-25 20:56:52,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 20:56:52,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 14: [2022-11-25 20:56:52,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 20:56:52,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 20:56:52,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 3: [2022-11-25 20:56:52,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 20:56:52,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 20:56:52,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 14: [2022-11-25 20:56:52,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-25 20:56:52,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 20:56:52,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 1: [2022-11-25 20:56:52,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 20:56:52,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step4000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 20:56:52,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now! 0: successfully saved checkpoint at iteration 4000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3484.65 15: iteration 4010/ 125429 | consumed samples: 1026560 | consumed tokens: 2102394880 | elapsed time per iteration (s): 1.48 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.550515E+00 | grad norm: 0.222 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 173.257 | TFLOPs: 28.63 | 15: iteration 4020/ 125429 | consumed samples: 1029120 | consumed tokens: 2107637760 | elapsed time per iteration (s): 1.03 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.581148E+00 | grad norm: 0.225 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.299 | TFLOPs: 41.20 | 15: iteration 4030/ 125429 | consumed samples: 1031680 | consumed tokens: 2112880640 | elapsed time per iteration (s): 1.04 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.549599E+00 | grad norm: 0.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.489 | TFLOPs: 40.73 | 15: iteration 4040/ 125429 | consumed samples: 1034240 | 
consumed tokens: 2118123520 | elapsed time per iteration (s): 1.03 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.522967E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.691 | TFLOPs: 40.93 | 15: iteration 4050/ 125429 | consumed samples: 1036800 | consumed tokens: 2123366400 | elapsed time per iteration (s): 1.09 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.532263E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.310 | TFLOPs: 38.89 | 15: iteration 4060/ 125429 | consumed samples: 1039360 | consumed tokens: 2128609280 | elapsed time per iteration (s): 1.06 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.568609E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.595 | TFLOPs: 40.09 | 15: iteration 4070/ 125429 | consumed samples: 1041920 | consumed tokens: 2133852160 | elapsed time per iteration (s): 1.04 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.578291E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.177 | TFLOPs: 40.85 | 15: iteration 4080/ 125429 | consumed samples: 1044480 | consumed tokens: 2139095040 | elapsed time per iteration (s): 1.05 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.547211E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.072 | TFLOPs: 40.17 | 15: iteration 4090/ 125429 | consumed samples: 1047040 | consumed tokens: 2144337920 | elapsed time per iteration (s): 1.04 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.555684E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 246.447 | TFLOPs: 40.73 | 15: iteration 4100/ 125429 | consumed samples: 1049600 | consumed tokens: 2149580800 | elapsed time per iteration (s): 1.05 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.548930E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.882 | TFLOPs: 40.47 | 15: iteration 4110/ 125429 | consumed samples: 1052160 | consumed tokens: 2154823680 | elapsed time per iteration (s): 1.05 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.513742E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.302 | TFLOPs: 40.21 | 15: iteration 4120/ 125429 | consumed samples: 1054720 | consumed tokens: 2160066560 | elapsed time per iteration (s): 1.03 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.507498E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.324 | TFLOPs: 41.04 | 15: iteration 4130/ 125429 | consumed samples: 1057280 | consumed tokens: 2165309440 | elapsed time per iteration (s): 1.07 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.514402E+00 | grad norm: 0.245 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.504 | TFLOPs: 39.41 | 15: iteration 4140/ 125429 | consumed samples: 1059840 | consumed tokens: 2170552320 | elapsed time per iteration (s): 1.04 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.530593E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.058 | TFLOPs: 40.83 | 15: iteration 4150/ 125429 | consumed samples: 1062400 | consumed tokens: 2175795200 | elapsed time per iteration (s): 1.06 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.542703E+00 | grad norm: 0.214 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.959 | TFLOPs: 39.99 | 15: iteration 4160/ 125429 | consumed samples: 1064960 | consumed tokens: 2181038080 | elapsed time per iteration (s): 1.05 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.541502E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.683 | TFLOPs: 40.44 | 15: iteration 4170/ 125429 | consumed samples: 1067520 | consumed tokens: 2186280960 | elapsed time per iteration (s): 1.06 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.516000E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.393 | TFLOPs: 40.06 | 15: iteration 4180/ 125429 | consumed samples: 1070080 | consumed tokens: 2191523840 | elapsed time per iteration (s): 1.02 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.535572E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.168 | TFLOPs: 41.34 | 15: iteration 4190/ 125429 | consumed samples: 1072640 | consumed tokens: 2196766720 | elapsed time per iteration (s): 1.04 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.516043E+00 | grad norm: 0.402 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.002 | TFLOPs: 40.49 | 15: iteration 4200/ 125429 | consumed samples: 1075200 | consumed tokens: 2202009600 | elapsed time per iteration (s): 1.06 | learning rate: 1.998E-04 | global batch size: 256 | lm loss: 2.558909E+00 | grad norm: 0.225 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.942 | TFLOPs: 39.98 | 15: iteration 4210/ 125429 | consumed samples: 1077760 | consumed tokens: 2207252480 | elapsed time per iteration (s): 1.04 | learning 
rate: 1.997E-04 | global batch size: 256 | lm loss: 2.546985E+00 | grad norm: 0.219 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.324 | TFLOPs: 40.71 | 15: iteration 4220/ 125429 | consumed samples: 1080320 | consumed tokens: 2212495360 | elapsed time per iteration (s): 1.03 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.559942E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.119 | TFLOPs: 41.00 | 15: iteration 4230/ 125429 | consumed samples: 1082880 | consumed tokens: 2217738240 | elapsed time per iteration (s): 1.10 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.535564E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.558 | TFLOPs: 38.60 | 15: iteration 4240/ 125429 | consumed samples: 1085440 | consumed tokens: 2222981120 | elapsed time per iteration (s): 1.06 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.535352E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.337 | TFLOPs: 39.88 | 15: iteration 4250/ 125429 | consumed samples: 1088000 | consumed tokens: 2228224000 | elapsed time per iteration (s): 1.02 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.514529E+00 | grad norm: 0.225 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.211 | TFLOPs: 41.51 | 15: iteration 4260/ 125429 | consumed samples: 1090560 | consumed tokens: 2233466880 | elapsed time per iteration (s): 1.04 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.538732E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.566 | TFLOPs: 40.58 | 15: iteration 4270/ 125429 | consumed samples: 
1093120 | consumed tokens: 2238709760 | elapsed time per iteration (s): 1.03 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.496811E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.403 | TFLOPs: 41.22 | 15: iteration 4280/ 125429 | consumed samples: 1095680 | consumed tokens: 2243952640 | elapsed time per iteration (s): 1.03 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.533432E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.577 | TFLOPs: 40.91 | 15: iteration 4290/ 125429 | consumed samples: 1098240 | consumed tokens: 2249195520 | elapsed time per iteration (s): 1.05 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.516260E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.102 | TFLOPs: 40.34 | 15: iteration 4300/ 125429 | consumed samples: 1100800 | consumed tokens: 2254438400 | elapsed time per iteration (s): 1.09 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.531666E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.874 | TFLOPs: 38.98 | 15: iteration 4310/ 125429 | consumed samples: 1103360 | consumed tokens: 2259681280 | elapsed time per iteration (s): 1.05 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.522408E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.490 | TFLOPs: 40.24 | 15: iteration 4320/ 125429 | consumed samples: 1105920 | consumed tokens: 2264924160 | elapsed time per iteration (s): 1.06 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.518981E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 242.174 | TFLOPs: 40.02 | 15: iteration 4330/ 125429 | consumed samples: 1108480 | consumed tokens: 2270167040 | elapsed time per iteration (s): 1.04 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.494979E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.631 | TFLOPs: 40.59 | 15: iteration 4340/ 125429 | consumed samples: 1111040 | consumed tokens: 2275409920 | elapsed time per iteration (s): 1.05 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.554833E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.420 | TFLOPs: 40.39 | 15: iteration 4350/ 125429 | consumed samples: 1113600 | consumed tokens: 2280652800 | elapsed time per iteration (s): 1.04 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.524725E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.014 | TFLOPs: 40.82 | 15: iteration 4360/ 125429 | consumed samples: 1116160 | consumed tokens: 2285895680 | elapsed time per iteration (s): 1.07 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.517692E+00 | grad norm: 0.218 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.468 | TFLOPs: 39.41 | 15: iteration 4370/ 125429 | consumed samples: 1118720 | consumed tokens: 2291138560 | elapsed time per iteration (s): 1.04 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.551630E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.571 | TFLOPs: 40.75 | 15: iteration 4380/ 125429 | consumed samples: 1121280 | consumed tokens: 2296381440 | elapsed time per iteration (s): 1.07 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.515407E+00 | grad norm: 
0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.922 | TFLOPs: 39.48 | 15: iteration 4390/ 125429 | consumed samples: 1123840 | consumed tokens: 2301624320 | elapsed time per iteration (s): 1.05 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.545401E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.555 | TFLOPs: 40.41 | 15: iteration 4400/ 125429 | consumed samples: 1126400 | consumed tokens: 2306867200 | elapsed time per iteration (s): 1.05 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.524343E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.402 | TFLOPs: 40.39 | 15: iteration 4410/ 125429 | consumed samples: 1128960 | consumed tokens: 2312110080 | elapsed time per iteration (s): 1.06 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.500965E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.141 | TFLOPs: 39.85 | 15: iteration 4420/ 125429 | consumed samples: 1131520 | consumed tokens: 2317352960 | elapsed time per iteration (s): 1.03 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.511635E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.171 | TFLOPs: 41.01 | 15: iteration 4430/ 125429 | consumed samples: 1134080 | consumed tokens: 2322595840 | elapsed time per iteration (s): 1.08 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.489141E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.547 | TFLOPs: 39.26 | 15: iteration 4440/ 125429 | consumed samples: 1136640 | consumed tokens: 2327838720 | elapsed time per iteration (s): 1.09 
| learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.523056E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.572 | TFLOPs: 38.93 | 15: iteration 4450/ 125429 | consumed samples: 1139200 | consumed tokens: 2333081600 | elapsed time per iteration (s): 1.04 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.525904E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.113 | TFLOPs: 40.84 | 15: iteration 4460/ 125429 | consumed samples: 1141760 | consumed tokens: 2338324480 | elapsed time per iteration (s): 1.05 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.465419E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.541 | TFLOPs: 40.25 | 15: iteration 4470/ 125429 | consumed samples: 1144320 | consumed tokens: 2343567360 | elapsed time per iteration (s): 1.02 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.503825E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.298 | TFLOPs: 41.36 | 15: iteration 4480/ 125429 | consumed samples: 1146880 | consumed tokens: 2348810240 | elapsed time per iteration (s): 1.04 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.523105E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.153 | TFLOPs: 40.68 | 15: iteration 4490/ 125429 | consumed samples: 1149440 | consumed tokens: 2354053120 | elapsed time per iteration (s): 1.03 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.477918E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.026 | TFLOPs: 41.15 | 15: iteration 4500/ 125429 | 
consumed samples: 1152000 | consumed tokens: 2359296000 | elapsed time per iteration (s): 1.07 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.508641E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.580 | TFLOPs: 39.59 | 15: iteration 4510/ 125429 | consumed samples: 1154560 | consumed tokens: 2364538880 | elapsed time per iteration (s): 1.04 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.491173E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.307 | TFLOPs: 40.70 | 15: iteration 4520/ 125429 | consumed samples: 1157120 | consumed tokens: 2369781760 | elapsed time per iteration (s): 1.03 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.512050E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.495 | TFLOPs: 41.07 | 15: iteration 4530/ 125429 | consumed samples: 1159680 | consumed tokens: 2375024640 | elapsed time per iteration (s): 1.05 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.538020E+00 | grad norm: 0.234 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.707 | TFLOPs: 40.27 | 15: iteration 4540/ 125429 | consumed samples: 1162240 | consumed tokens: 2380267520 | elapsed time per iteration (s): 1.10 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.517540E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.436 | TFLOPs: 38.41 | 15: iteration 4550/ 125429 | consumed samples: 1164800 | consumed tokens: 2385510400 | elapsed time per iteration (s): 1.03 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.491823E+00 | grad norm: 0.212 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 248.813 | TFLOPs: 41.12 | 15: iteration 4560/ 125429 | consumed samples: 1167360 | consumed tokens: 2390753280 | elapsed time per iteration (s): 1.03 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.507635E+00 | grad norm: 0.221 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.202 | TFLOPs: 41.02 | 15: iteration 4570/ 125429 | consumed samples: 1169920 | consumed tokens: 2395996160 | elapsed time per iteration (s): 1.05 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.495405E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.875 | TFLOPs: 40.30 | 15: iteration 4580/ 125429 | consumed samples: 1172480 | consumed tokens: 2401239040 | elapsed time per iteration (s): 1.04 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.781153E+00 | grad norm: 9.779 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.119 | TFLOPs: 40.84 | 15: iteration 4590/ 125429 | consumed samples: 1175040 | consumed tokens: 2406481920 | elapsed time per iteration (s): 1.05 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.881188E+00 | grad norm: 1.318 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.948 | TFLOPs: 40.31 | 15: iteration 4600/ 125429 | consumed samples: 1177600 | consumed tokens: 2411724800 | elapsed time per iteration (s): 1.05 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.818630E+00 | grad norm: 0.780 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.950 | TFLOPs: 40.48 | 15: iteration 4610/ 125429 | consumed samples: 1180160 | consumed tokens: 2416967680 | elapsed time per iteration (s): 1.04 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 
2.711610E+00 | grad norm: 0.423 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.931 | TFLOPs: 40.81 | 15: iteration 4620/ 125429 | consumed samples: 1182720 | consumed tokens: 2422210560 | elapsed time per iteration (s): 1.04 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.618240E+00 | grad norm: 0.264 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.638 | TFLOPs: 40.59 | 15: iteration 4630/ 125429 | consumed samples: 1185280 | consumed tokens: 2427453440 | elapsed time per iteration (s): 1.03 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.593866E+00 | grad norm: 0.230 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.972 | TFLOPs: 40.98 | 15: iteration 4640/ 125429 | consumed samples: 1187840 | consumed tokens: 2432696320 | elapsed time per iteration (s): 1.03 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.583689E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.123 | TFLOPs: 41.17 | 15: iteration 4650/ 125429 | consumed samples: 1190400 | consumed tokens: 2437939200 | elapsed time per iteration (s): 1.03 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.531022E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.423 | TFLOPs: 40.89 | 15: iteration 4660/ 125429 | consumed samples: 1192960 | consumed tokens: 2443182080 | elapsed time per iteration (s): 1.09 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.521641E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.718 | TFLOPs: 38.95 | 15: iteration 4670/ 125429 | consumed samples: 1195520 | consumed tokens: 2448424960 | elapsed 
time per iteration (s): 1.05 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.501534E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.527 | TFLOPs: 40.24 | 15: iteration 4680/ 125429 | consumed samples: 1198080 | consumed tokens: 2453667840 | elapsed time per iteration (s): 1.04 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.513846E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.254 | TFLOPs: 40.70 | 15: iteration 4690/ 125429 | consumed samples: 1200640 | consumed tokens: 2458910720 | elapsed time per iteration (s): 1.04 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.534582E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.338 | TFLOPs: 40.54 | 15: iteration 4700/ 125429 | consumed samples: 1203200 | consumed tokens: 2464153600 | elapsed time per iteration (s): 1.06 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.515768E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.096 | TFLOPs: 40.01 | 15: iteration 4710/ 125429 | consumed samples: 1205760 | consumed tokens: 2469396480 | elapsed time per iteration (s): 1.07 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.491920E+00 | grad norm: 0.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.724 | TFLOPs: 39.62 | 15: iteration 4720/ 125429 | consumed samples: 1208320 | consumed tokens: 2474639360 | elapsed time per iteration (s): 1.08 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.512574E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.991 | TFLOPs: 39.00 | 15: 
iteration 4730/ 125429 | consumed samples: 1210880 | consumed tokens: 2479882240 | elapsed time per iteration (s): 1.04 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.486798E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.655 | TFLOPs: 40.60 | 15: iteration 4740/ 125429 | consumed samples: 1213440 | consumed tokens: 2485125120 | elapsed time per iteration (s): 1.09 | learning rate: 1.997E-04 | global batch size: 256 | lm loss: 2.507013E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.322 | TFLOPs: 38.89 | 15: iteration 4750/ 125429 | consumed samples: 1216000 | consumed tokens: 2490368000 | elapsed time per iteration (s): 1.05 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.499653E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.978 | TFLOPs: 40.15 | 15: iteration 4760/ 125429 | consumed samples: 1218560 | consumed tokens: 2495610880 | elapsed time per iteration (s): 1.04 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.503885E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.653 | TFLOPs: 40.60 | 15: iteration 4770/ 125429 | consumed samples: 1221120 | consumed tokens: 2500853760 | elapsed time per iteration (s): 1.07 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.529767E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.257 | TFLOPs: 39.54 | 15: iteration 4780/ 125429 | consumed samples: 1223680 | consumed tokens: 2506096640 | elapsed time per iteration (s): 1.04 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.484462E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 245.073 | TFLOPs: 40.50 | 15: iteration 4790/ 125429 | consumed samples: 1226240 | consumed tokens: 2511339520 | elapsed time per iteration (s): 1.04 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.506268E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.428 | TFLOPs: 40.72 | 15: iteration 4800/ 125429 | consumed samples: 1228800 | consumed tokens: 2516582400 | elapsed time per iteration (s): 1.05 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.521037E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.574 | TFLOPs: 40.42 | 15: iteration 4810/ 125429 | consumed samples: 1231360 | consumed tokens: 2521825280 | elapsed time per iteration (s): 1.04 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.505760E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.318 | TFLOPs: 40.54 | 15: iteration 4820/ 125429 | consumed samples: 1233920 | consumed tokens: 2527068160 | elapsed time per iteration (s): 1.03 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.491307E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.379 | TFLOPs: 41.05 | 15: iteration 4830/ 125429 | consumed samples: 1236480 | consumed tokens: 2532311040 | elapsed time per iteration (s): 1.03 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.478669E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.250 | TFLOPs: 41.03 | 15: iteration 4840/ 125429 | consumed samples: 1239040 | consumed tokens: 2537553920 | elapsed time per iteration (s): 1.07 | learning rate: 1.996E-04 | global batch 
size: 256 | lm loss: 2.457905E+00 | grad norm: 0.219 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.461 | TFLOPs: 39.57 | 15: iteration 4850/ 125429 | consumed samples: 1241600 | consumed tokens: 2542796800 | elapsed time per iteration (s): 1.03 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.492313E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.218 | TFLOPs: 41.02 | 15: iteration 4860/ 125429 | consumed samples: 1244160 | consumed tokens: 2548039680 | elapsed time per iteration (s): 1.06 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.522777E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.773 | TFLOPs: 39.95 | 15: iteration 4870/ 125429 | consumed samples: 1246720 | consumed tokens: 2553282560 | elapsed time per iteration (s): 1.04 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.481905E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.142 | TFLOPs: 40.84 | 15: iteration 4880/ 125429 | consumed samples: 1249280 | consumed tokens: 2558525440 | elapsed time per iteration (s): 1.07 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.522367E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.362 | TFLOPs: 39.56 | 15: iteration 4890/ 125429 | consumed samples: 1251840 | consumed tokens: 2563768320 | elapsed time per iteration (s): 1.03 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.498311E+00 | grad norm: 0.217 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.473 | TFLOPs: 41.06 | 15: iteration 4900/ 125429 | consumed samples: 1254400 | consumed tokens: 
2569011200 | elapsed time per iteration (s): 1.03 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.473528E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.984 | TFLOPs: 40.98 | 15: iteration 4910/ 125429 | consumed samples: 1256960 | consumed tokens: 2574254080 | elapsed time per iteration (s): 1.03 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.461399E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.130 | TFLOPs: 41.17 | 15: iteration 4920/ 125429 | consumed samples: 1259520 | consumed tokens: 2579496960 | elapsed time per iteration (s): 1.03 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.505891E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.095 | TFLOPs: 41.16 | 15: iteration 4930/ 125429 | consumed samples: 1262080 | consumed tokens: 2584739840 | elapsed time per iteration (s): 1.07 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.481445E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.908 | TFLOPs: 39.48 | 15: iteration 4940/ 125429 | consumed samples: 1264640 | consumed tokens: 2589982720 | elapsed time per iteration (s): 1.03 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.491524E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.907 | TFLOPs: 41.13 | 15: iteration 4950/ 125429 | consumed samples: 1267200 | consumed tokens: 2595225600 | elapsed time per iteration (s): 1.03 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.511776E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.642 | 
TFLOPs: 41.09 | 15: iteration 4960/ 125429 | consumed samples: 1269760 | consumed tokens: 2600468480 | elapsed time per iteration (s): 1.03 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.480489E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.887 | TFLOPs: 40.97 | 15: iteration 4970/ 125429 | consumed samples: 1272320 | consumed tokens: 2605711360 | elapsed time per iteration (s): 1.04 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.475965E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.103 | TFLOPs: 40.67 | 15: iteration 4980/ 125429 | consumed samples: 1274880 | consumed tokens: 2610954240 | elapsed time per iteration (s): 1.06 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.490342E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.948 | TFLOPs: 39.98 | 15: iteration 4990/ 125429 | consumed samples: 1277440 | consumed tokens: 2616197120 | elapsed time per iteration (s): 1.05 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.468115E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.860 | TFLOPs: 40.47 | 15: iteration 5000/ 125429 | consumed samples: 1280000 | consumed tokens: 2621440000 | elapsed time per iteration (s): 1.05 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.462251E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.694 | TFLOPs: 40.27 | 15: ------------------------------------------------------------------------------------------ 15: valid loss at iteration 5000 | lm loss value: 2.500757E+00 | lm loss PPL: 1.219172E+01 | 15: 
------------------------------------------------------------------------------------------ 0: saving checkpoint at iteration 5000 to checkpoints_1b5 0: [2022-11-25 21:14:21,178] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step5000 is begin to save! 0: [2022-11-25 21:14:21,185] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_01-model_00-model_states.pt... 0: [2022-11-25 21:14:21,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_01-model_00-model_states.pt. 0: [2022-11-25 21:14:21,419] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_03-model_00-model_states.pt... 0: [2022-11-25 21:14:21,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_03-model_00-model_states.pt. 0: [2022-11-25 21:14:21,524] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_04-model_00-model_states.pt... 0: [2022-11-25 21:14:21,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_04-model_00-model_states.pt. 0: [2022-11-25 21:14:21,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_05-model_00-model_states.pt... 0: [2022-11-25 21:14:21,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_05-model_00-model_states.pt. 0: [2022-11-25 21:14:21,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_06-model_00-model_states.pt... 0: [2022-11-25 21:14:21,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_06-model_00-model_states.pt. 0: [2022-11-25 21:14:21,841] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_07-model_00-model_states.pt... 
0: [2022-11-25 21:14:21,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_07-model_00-model_states.pt. 0: [2022-11-25 21:14:21,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_08-model_00-model_states.pt... 0: [2022-11-25 21:14:22,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_08-model_00-model_states.pt. 0: [2022-11-25 21:14:22,055] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_09-model_00-model_states.pt... 0: [2022-11-25 21:14:22,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_09-model_00-model_states.pt. 0: [2022-11-25 21:14:22,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_10-model_00-model_states.pt... 0: [2022-11-25 21:14:22,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_10-model_00-model_states.pt. 0: [2022-11-25 21:14:22,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_11-model_00-model_states.pt... 0: [2022-11-25 21:14:22,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_11-model_00-model_states.pt. 0: [2022-11-25 21:14:22,376] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_12-model_00-model_states.pt... 0: [2022-11-25 21:14:22,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_12-model_00-model_states.pt. 0: [2022-11-25 21:14:22,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_13-model_00-model_states.pt... 
0: [2022-11-25 21:14:22,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_13-model_00-model_states.pt. 0: [2022-11-25 21:14:22,590] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_14-model_00-model_states.pt... 0: [2022-11-25 21:14:22,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_14-model_00-model_states.pt. 0: [2022-11-25 21:14:22,698] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_15-model_00-model_states.pt... 0: [2022-11-25 21:14:22,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_15-model_00-model_states.pt. 0: [2022-11-25 21:14:22,805] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_16-model_00-model_states.pt... 0: [2022-11-25 21:14:22,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_16-model_00-model_states.pt. 0: [2022-11-25 21:14:22,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_17-model_00-model_states.pt... 0: [2022-11-25 21:14:23,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_17-model_00-model_states.pt. 0: [2022-11-25 21:14:23,019] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_18-model_00-model_states.pt... 0: [2022-11-25 21:14:23,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_18-model_00-model_states.pt. 0: [2022-11-25 21:14:23,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_19-model_00-model_states.pt... 
0: [2022-11-25 21:14:23,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_19-model_00-model_states.pt. 0: [2022-11-25 21:14:23,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_20-model_00-model_states.pt... 0: [2022-11-25 21:14:23,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_20-model_00-model_states.pt. 0: [2022-11-25 21:14:23,340] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_21-model_00-model_states.pt... 0: [2022-11-25 21:14:23,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_21-model_00-model_states.pt. 0: [2022-11-25 21:14:23,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_22-model_00-model_states.pt... 0: [2022-11-25 21:14:23,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_22-model_00-model_states.pt. 0: [2022-11-25 21:14:23,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_23-model_00-model_states.pt... 0: [2022-11-25 21:14:23,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_23-model_00-model_states.pt. 0: [2022-11-25 21:14:23,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_24-model_00-model_states.pt... 0: [2022-11-25 21:14:23,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_24-model_00-model_states.pt. 0: [2022-11-25 21:14:23,762] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_25-model_00-model_states.pt... 
0: [2022-11-25 21:14:23,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_25-model_00-model_states.pt. 0: [2022-11-25 21:14:23,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_26-model_00-model_states.pt... 0: [2022-11-25 21:14:23,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_26-model_00-model_states.pt. 0: [2022-11-25 21:14:23,968] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_27-model_00-model_states.pt... 0: [2022-11-25 21:14:24,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_27-model_00-model_states.pt. 0: [2022-11-25 21:14:24,074] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_28-model_00-model_states.pt... 0: [2022-11-25 21:14:24,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_28-model_00-model_states.pt. 0: [2022-11-25 21:14:24,178] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_29-model_00-model_states.pt... 0: [2022-11-25 21:14:24,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_29-model_00-model_states.pt. 0: [2022-11-25 21:14:24,280] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_30-model_00-model_states.pt... 0: [2022-11-25 21:14:24,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_30-model_00-model_states.pt. 0: [2022-11-25 21:14:24,387] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/layer_32-model_00-model_states.pt... 
0: [2022-11-25 21:14:24,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/layer_32-model_00-model_states.pt. 0: [2022-11-25 21:14:24,392] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step5000/mp_rank_00_model_states.pt 0: [2022-11-25 21:14:24,393] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/mp_rank_00_model_states.pt... 0: [2022-11-25 21:14:24,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/mp_rank_00_model_states.pt. 0: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
1: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
15: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
13: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
11: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
1: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
9: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
13: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
10: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:14:24,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step5000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:14:24,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
1: [2022-11-25 21:14:24,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:14:24,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 21:14:24,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 10: [2022-11-25 21:14:24,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:14:24,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 21:14:24,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 12: [2022-11-25 21:14:24,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:14:24,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 21:14:24,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 15: [2022-11-25 21:14:24,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:14:24,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 21:14:24,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
9: [2022-11-25 21:14:24,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:14:24,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 21:14:24,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 11: [2022-11-25 21:14:24,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:14:24,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:14:24,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 12: [2022-11-25 21:14:24,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:14:24,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 7: [2022-11-25 21:14:24,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 12: [2022-11-25 21:14:24,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 4: [2022-11-25 21:14:24,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-25 21:14:24,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 21:14:24,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 4: [2022-11-25 21:14:24,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:14:24,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 21:14:24,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 14: [2022-11-25 21:14:24,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:14:24,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 21:14:24,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 5: [2022-11-25 21:14:24,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:14:24,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:14:24,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 21:14:24,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
0: [2022-11-25 21:14:24,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:14:24,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 21:14:24,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 21:14:24,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:14:24,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 21:14:24,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 5: [2022-11-25 21:14:24,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 21:14:24,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 21:14:24,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:14:24,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 21:14:24,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 14: [2022-11-25 21:14:24,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-25 21:14:24,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 21:14:24,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 9: [2022-11-25 21:14:24,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:14:24,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 21:14:24,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 5: [2022-11-25 21:14:24,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:14:24,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 21:14:24,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 12: [2022-11-25 21:14:24,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:14:24,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 21:14:24,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 9: [2022-11-25 21:14:24,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-25 21:14:24,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 21:14:24,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 21:14:24,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:14:24,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 21:14:24,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 9: [2022-11-25 21:14:24,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:14:24,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 21:14:24,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 4: [2022-11-25 21:14:24,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:14:24,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:14:24,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 21:14:24,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
5: [2022-11-25 21:14:24,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:14:24,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 21:14:24,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 21:14:24,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:14:24,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:14:24,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 21:14:24,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 11: [2022-11-25 21:14:24,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 21:14:24,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 11: [2022-11-25 21:14:24,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:14:24,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 21:14:24,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
5: [2022-11-25 21:14:24,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:14:24,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 21:14:24,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 15: [2022-11-25 21:14:24,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 21:14:24,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 15: [2022-11-25 21:14:24,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:14:24,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 21:14:24,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 15: [2022-11-25 21:14:24,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:14:24,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 21:14:24,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 15: [2022-11-25 21:14:24,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-25 21:14:24,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 0: [2022-11-25 21:14:24,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:14:24,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 15: [2022-11-25 21:14:24,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 0: [2022-11-25 21:14:24,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 15: [2022-11-25 21:14:24,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:14:24,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 21:14:24,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 5: [2022-11-25 21:14:24,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:14:24,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 21:14:24,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:14:24,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
5: [2022-11-25 21:14:24,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 21:14:24,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 4: [2022-11-25 21:14:24,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:14:24,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 21:14:24,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 21:14:24,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:14:24,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:14:24,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:14:24,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:14:24,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-25 21:14:24,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 21:14:24,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 21:14:24,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 21:14:24,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 21:14:24,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 21:14:24,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 21:14:24,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 21:14:24,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 21:14:24,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 21:14:24,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 9: [2022-11-25 21:14:24,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-25 21:14:24,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 21:14:24,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 5: [2022-11-25 21:14:24,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:14:24,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 21:14:24,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 21:14:24,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:14:24,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 21:14:24,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 12: [2022-11-25 21:14:24,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:14:24,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-25 21:14:24,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 21:14:24,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 21:14:24,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 12: [2022-11-25 21:14:24,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 21:14:24,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:14:24,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 21:14:24,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 0: [2022-11-25 21:14:24,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:14:24,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 21:14:24,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 0: [2022-11-25 21:14:24,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-25 21:14:24,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 21:14:24,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 0: [2022-11-25 21:14:24,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:14:24,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:14:24,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 21:14:24,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 21:14:24,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 0: [2022-11-25 21:14:24,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 15: [2022-11-25 21:14:24,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:14:24,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 21:14:24,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 4: [2022-11-25 21:14:24,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-25 21:14:24,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:14:24,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 21:14:24,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 21:14:24,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 4: [2022-11-25 21:14:24,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 12: [2022-11-25 21:14:24,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:14:24,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 21:14:24,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 21:14:24,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:14:24,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 21:14:24,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 5: [2022-11-25 21:14:24,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-25 21:14:24,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 21:14:24,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 21:14:24,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:14:24,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:14:24,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 21:14:24,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 21:14:24,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 6: [2022-11-25 21:14:24,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 10: [2022-11-25 21:14:24,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:14:24,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 21:14:24,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 21:14:24,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-25 21:14:24,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 21:14:24,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 9: [2022-11-25 21:14:24,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:14:24,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 21:14:24,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 14: [2022-11-25 21:14:24,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:14:24,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 21:14:24,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 14: [2022-11-25 21:14:24,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:14:24,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 21:14:24,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 14: [2022-11-25 21:14:24,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-25 21:14:24,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 21:14:24,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 14: [2022-11-25 21:14:24,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:14:24,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 21:14:24,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 13: [2022-11-25 21:14:24,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:14:24,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:14:24,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:14:24,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 21:14:24,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 13: [2022-11-25 21:14:24,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-25 21:14:24,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 21:14:24,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 13: [2022-11-25 21:14:24,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:14:24,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 21:14:24,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 13: [2022-11-25 21:14:24,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:14:24,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 21:14:24,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 13: [2022-11-25 21:14:24,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:14:24,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 21:14:24,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 13: [2022-11-25 21:14:24,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-25 21:14:24,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 21:14:24,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 13: [2022-11-25 21:14:24,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:14:24,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 21:14:24,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 8: [2022-11-25 21:14:24,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:14:24,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 21:14:24,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 15: [2022-11-25 21:14:24,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 21:14:24,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 11: [2022-11-25 21:14:24,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:14:24,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-25 21:14:24,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 21:14:24,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 21:14:24,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 11: [2022-11-25 21:14:24,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 11: [2022-11-25 21:14:24,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:14:24,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 21:14:24,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 11: [2022-11-25 21:14:24,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:14:24,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 21:14:24,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 10: [2022-11-25 21:14:24,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-25 21:14:24,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 21:14:24,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 21:14:24,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 21:14:24,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 21:14:24,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:14:24,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:14:24,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 21:14:24,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 21:14:24,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 21:14:24,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 10: [2022-11-25 21:14:24,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-25 21:14:24,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 21:14:24,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 10: [2022-11-25 21:14:24,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:14:24,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 21:14:24,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 12: [2022-11-25 21:14:24,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:14:24,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 21:14:24,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 10: [2022-11-25 21:14:24,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:14:24,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 21:14:24,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
0: [2022-11-25 21:14:24,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 21:14:24,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 4: [2022-11-25 21:14:24,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:14:24,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:14:24,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 21:14:24,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 21:14:24,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 4: [2022-11-25 21:14:24,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 14: [2022-11-25 21:14:24,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:14:24,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 21:14:24,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 12: [2022-11-25 21:14:24,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-25 21:14:24,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 21:14:24,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 9: [2022-11-25 21:14:24,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:14:24,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 21:14:24,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 9: [2022-11-25 21:14:24,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:14:24,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 21:14:24,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 21:14:24,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:14:24,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 21:14:24,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 21:14:24,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-25 21:14:24,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 21:14:24,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 13: [2022-11-25 21:14:24,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 8: [2022-11-25 21:14:24,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:14:24,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 21:14:24,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:14:24,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 21:14:24,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 8: [2022-11-25 21:14:24,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 21:14:24,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 8: [2022-11-25 21:14:24,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-25 21:14:24,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 21:14:24,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 8: [2022-11-25 21:14:24,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:14:24,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 21:14:24,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 21:14:24,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:14:24,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 21:14:24,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 21:14:24,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:14:24,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 21:14:24,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 10: [2022-11-25 21:14:24,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-25 21:14:24,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 21:14:24,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 10: [2022-11-25 21:14:24,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:14:24,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 21:14:24,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 11: [2022-11-25 21:14:24,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:14:24,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:14:24,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:14:24,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:14:24,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 21:14:24,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 
1: [2022-11-25 21:14:24,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 21:14:24,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 21:14:24,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 21:14:24,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 21:14:24,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:14:24,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 21:14:24,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 8: [2022-11-25 21:14:24,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:14:24,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 21:14:24,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 21:14:24,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-25 21:14:24,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 21:14:24,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 21:14:24,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:14:24,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 21:14:24,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 11: [2022-11-25 21:14:24,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 21:14:24,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 11: [2022-11-25 21:14:24,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:14:24,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 21:14:24,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 21:14:24,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-25 21:14:24,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 21:14:24,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 21:14:24,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:14:24,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:14:24,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 14: [2022-11-25 21:14:24,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 21:14:24,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 21:14:24,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 21:14:24,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:14:24,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 21:14:24,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 7: [2022-11-25 21:14:24,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-25 21:14:24,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 21:14:24,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 21:14:24,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:14:24,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 21:14:24,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 21:14:24,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:14:24,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 21:14:24,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 21:14:24,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:14:24,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-25 21:14:24,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 21:14:24,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 21:14:24,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 3: [2022-11-25 21:14:24,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 2: [2022-11-25 21:14:24,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:14:24,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 21:14:24,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 8: [2022-11-25 21:14:24,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:14:24,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 21:14:24,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 8: [2022-11-25 21:14:24,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-25 21:14:24,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 21:14:24,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 1: [2022-11-25 21:14:24,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:14:24,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step5000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 21:14:24,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now! 0: successfully saved checkpoint at iteration 5000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3637.22 15: iteration 5010/ 125429 | consumed samples: 1282560 | consumed tokens: 2626682880 | elapsed time per iteration (s): 1.43 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.472745E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 179.373 | TFLOPs: 29.64 | 15: iteration 5020/ 125429 | consumed samples: 1285120 | consumed tokens: 2631925760 | elapsed time per iteration (s): 1.04 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.452662E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.998 | TFLOPs: 40.82 | 15: iteration 5030/ 125429 | consumed samples: 1287680 | consumed tokens: 2637168640 | elapsed time per iteration (s): 1.03 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.466067E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.496 | TFLOPs: 41.07 | 15: iteration 5040/ 125429 | consumed samples: 1290240 | consumed 
tokens: 2642411520 | elapsed time per iteration (s): 1.02 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.446763E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.089 | TFLOPs: 41.49 | 15: iteration 5050/ 125429 | consumed samples: 1292800 | consumed tokens: 2647654400 | elapsed time per iteration (s): 1.04 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.442650E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.104 | TFLOPs: 40.67 | 15: iteration 5060/ 125429 | consumed samples: 1295360 | consumed tokens: 2652897280 | elapsed time per iteration (s): 1.05 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.447275E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.620 | TFLOPs: 40.26 | 15: iteration 5070/ 125429 | consumed samples: 1297920 | consumed tokens: 2658140160 | elapsed time per iteration (s): 1.03 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.468184E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.599 | TFLOPs: 40.92 | 15: iteration 5080/ 125429 | consumed samples: 1300480 | consumed tokens: 2663383040 | elapsed time per iteration (s): 1.03 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.490862E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.979 | TFLOPs: 41.15 | 15: iteration 5090/ 125429 | consumed samples: 1303040 | consumed tokens: 2668625920 | elapsed time per iteration (s): 1.05 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.469209E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
242.748 | TFLOPs: 40.12 | 15: iteration 5100/ 125429 | consumed samples: 1305600 | consumed tokens: 2673868800 | elapsed time per iteration (s): 1.04 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.496607E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.998 | TFLOPs: 40.82 | 15: iteration 5110/ 125429 | consumed samples: 1308160 | consumed tokens: 2679111680 | elapsed time per iteration (s): 1.04 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.475186E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.786 | TFLOPs: 40.62 | 15: iteration 5120/ 125429 | consumed samples: 1310720 | consumed tokens: 2684354560 | elapsed time per iteration (s): 1.03 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.483790E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.093 | TFLOPs: 41.16 | 15: iteration 5130/ 125429 | consumed samples: 1313280 | consumed tokens: 2689597440 | elapsed time per iteration (s): 1.05 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.479454E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.694 | TFLOPs: 40.44 | 15: iteration 5140/ 125429 | consumed samples: 1315840 | consumed tokens: 2694840320 | elapsed time per iteration (s): 1.04 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.465687E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.122 | TFLOPs: 40.84 | 15: iteration 5150/ 125429 | consumed samples: 1318400 | consumed tokens: 2700083200 | elapsed time per iteration (s): 1.03 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.479592E+00 | grad norm: 0.199 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.547 | TFLOPs: 41.07 | 15: iteration 5160/ 125429 | consumed samples: 1320960 | consumed tokens: 2705326080 | elapsed time per iteration (s): 1.06 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.480684E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.236 | TFLOPs: 39.87 | 15: iteration 5170/ 125429 | consumed samples: 1323520 | consumed tokens: 2710568960 | elapsed time per iteration (s): 1.03 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.462291E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.816 | TFLOPs: 41.12 | 15: iteration 5180/ 125429 | consumed samples: 1326080 | consumed tokens: 2715811840 | elapsed time per iteration (s): 1.06 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.469377E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.128 | TFLOPs: 40.01 | 15: iteration 5190/ 125429 | consumed samples: 1328640 | consumed tokens: 2721054720 | elapsed time per iteration (s): 1.02 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.449172E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.207 | TFLOPs: 41.35 | 15: iteration 5200/ 125429 | consumed samples: 1331200 | consumed tokens: 2726297600 | elapsed time per iteration (s): 1.03 | learning rate: 1.996E-04 | global batch size: 256 | lm loss: 2.458554E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.478 | TFLOPs: 41.06 | 15: iteration 5210/ 125429 | consumed samples: 1333760 | consumed tokens: 2731540480 | elapsed time per iteration (s): 1.08 | learning rate: 
1.995E-04 | global batch size: 256 | lm loss: 2.439075E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.148 | TFLOPs: 39.19 | 15: iteration 5220/ 125429 | consumed samples: 1336320 | consumed tokens: 2736783360 | elapsed time per iteration (s): 1.05 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.449169E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.070 | TFLOPs: 40.17 | 15: iteration 5230/ 125429 | consumed samples: 1338880 | consumed tokens: 2742026240 | elapsed time per iteration (s): 1.05 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.488368E+00 | grad norm: 0.271 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.544 | TFLOPs: 40.41 | 15: iteration 5240/ 125429 | consumed samples: 1341440 | consumed tokens: 2747269120 | elapsed time per iteration (s): 1.07 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.442923E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.460 | TFLOPs: 39.57 | 15: iteration 5250/ 125429 | consumed samples: 1344000 | consumed tokens: 2752512000 | elapsed time per iteration (s): 1.04 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.467970E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.121 | TFLOPs: 40.67 | 15: iteration 5260/ 125429 | consumed samples: 1346560 | consumed tokens: 2757754880 | elapsed time per iteration (s): 1.04 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.481381E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.842 | TFLOPs: 40.63 | 15: iteration 5270/ 125429 | consumed samples: 
1349120 | consumed tokens: 2762997760 | elapsed time per iteration (s): 1.04 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.423169E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.862 | TFLOPs: 40.63 | 15: iteration 5280/ 125429 | consumed samples: 1351680 | consumed tokens: 2768240640 | elapsed time per iteration (s): 1.02 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.474042E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.821 | TFLOPs: 41.28 | 15: iteration 5290/ 125429 | consumed samples: 1354240 | consumed tokens: 2773483520 | elapsed time per iteration (s): 1.04 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.439718E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.124 | TFLOPs: 40.84 | 15: iteration 5300/ 125429 | consumed samples: 1356800 | consumed tokens: 2778726400 | elapsed time per iteration (s): 1.05 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.443795E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.656 | TFLOPs: 40.10 | 15: iteration 5310/ 125429 | consumed samples: 1359360 | consumed tokens: 2783969280 | elapsed time per iteration (s): 2.28 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.480139E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 112.187 | TFLOPs: 18.54 | 15: iteration 5320/ 125429 | consumed samples: 1361920 | consumed tokens: 2789212160 | elapsed time per iteration (s): 1.03 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.451912E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 249.608 | TFLOPs: 41.25 | 15: iteration 5330/ 125429 | consumed samples: 1364480 | consumed tokens: 2794455040 | elapsed time per iteration (s): 1.05 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.458507E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.649 | TFLOPs: 40.43 | 15: iteration 5340/ 125429 | consumed samples: 1367040 | consumed tokens: 2799697920 | elapsed time per iteration (s): 1.09 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.478226E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.129 | TFLOPs: 38.69 | 15: iteration 5350/ 125429 | consumed samples: 1369600 | consumed tokens: 2804940800 | elapsed time per iteration (s): 1.04 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.431020E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.103 | TFLOPs: 40.51 | 15: iteration 5360/ 125429 | consumed samples: 1372160 | consumed tokens: 2810183680 | elapsed time per iteration (s): 1.05 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.481886E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.462 | TFLOPs: 40.23 | 15: iteration 5370/ 125429 | consumed samples: 1374720 | consumed tokens: 2815426560 | elapsed time per iteration (s): 1.07 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.467318E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.188 | TFLOPs: 39.53 | 15: iteration 5380/ 125429 | consumed samples: 1377280 | consumed tokens: 2820669440 | elapsed time per iteration (s): 1.03 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.427686E+00 | grad norm: 
0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.822 | TFLOPs: 41.12 | 15: iteration 5390/ 125429 | consumed samples: 1379840 | consumed tokens: 2825912320 | elapsed time per iteration (s): 1.08 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.462727E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.307 | TFLOPs: 39.05 | 15: iteration 5400/ 125429 | consumed samples: 1382400 | consumed tokens: 2831155200 | elapsed time per iteration (s): 1.03 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.425937E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.313 | TFLOPs: 41.04 | 15: iteration 5410/ 125429 | consumed samples: 1384960 | consumed tokens: 2836398080 | elapsed time per iteration (s): 1.06 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.458598E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.000 | TFLOPs: 39.99 | 15: iteration 5420/ 125429 | consumed samples: 1387520 | consumed tokens: 2841640960 | elapsed time per iteration (s): 1.27 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.444503E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 202.324 | TFLOPs: 33.44 | 15: iteration 5430/ 125429 | consumed samples: 1390080 | consumed tokens: 2846883840 | elapsed time per iteration (s): 1.05 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.467591E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.454 | TFLOPs: 40.40 | 15: iteration 5440/ 125429 | consumed samples: 1392640 | consumed tokens: 2852126720 | elapsed time per iteration (s): 1.06 
| learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.451554E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.411 | TFLOPs: 40.06 | 15: iteration 5450/ 125429 | consumed samples: 1395200 | consumed tokens: 2857369600 | elapsed time per iteration (s): 1.05 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.455208E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.702 | TFLOPs: 40.44 | 15: iteration 5460/ 125429 | consumed samples: 1397760 | consumed tokens: 2862612480 | elapsed time per iteration (s): 1.05 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.473425E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.132 | TFLOPs: 40.34 | 15: iteration 5470/ 125429 | consumed samples: 1400320 | consumed tokens: 2867855360 | elapsed time per iteration (s): 1.03 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.454737E+00 | grad norm: 0.232 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.295 | TFLOPs: 41.03 | 15: iteration 5480/ 125429 | consumed samples: 1402880 | consumed tokens: 2873098240 | elapsed time per iteration (s): 1.04 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.451764E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.722 | TFLOPs: 40.77 | 15: iteration 5490/ 125429 | consumed samples: 1405440 | consumed tokens: 2878341120 | elapsed time per iteration (s): 1.06 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.425315E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.430 | TFLOPs: 40.06 | 15: iteration 5500/ 125429 | 
consumed samples: 1408000 | consumed tokens: 2883584000 | elapsed time per iteration (s): 1.04 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.481299E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.726 | TFLOPs: 40.77 | 15: iteration 5510/ 125429 | consumed samples: 1410560 | consumed tokens: 2888826880 | elapsed time per iteration (s): 1.07 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.448934E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.389 | TFLOPs: 39.56 | 15: iteration 5520/ 125429 | consumed samples: 1413120 | consumed tokens: 2894069760 | elapsed time per iteration (s): 1.06 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.477837E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.481 | TFLOPs: 40.07 | 15: iteration 5530/ 125429 | consumed samples: 1415680 | consumed tokens: 2899312640 | elapsed time per iteration (s): 1.05 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.446982E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.060 | TFLOPs: 40.33 | 15: iteration 5540/ 125429 | consumed samples: 1418240 | consumed tokens: 2904555520 | elapsed time per iteration (s): 1.04 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.486059E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.813 | TFLOPs: 40.62 | 15: iteration 5550/ 125429 | consumed samples: 1420800 | consumed tokens: 2909798400 | elapsed time per iteration (s): 1.05 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.452393E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 244.547 | TFLOPs: 40.41 | 15: iteration 5560/ 125429 | consumed samples: 1423360 | consumed tokens: 2915041280 | elapsed time per iteration (s): 1.05 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.445771E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.891 | TFLOPs: 40.47 | 15: iteration 5570/ 125429 | consumed samples: 1425920 | consumed tokens: 2920284160 | elapsed time per iteration (s): 1.05 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.470421E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.667 | TFLOPs: 40.27 | 15: iteration 5580/ 125429 | consumed samples: 1428480 | consumed tokens: 2925527040 | elapsed time per iteration (s): 2.73 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.434591E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 93.636 | TFLOPs: 15.47 | 15: iteration 5590/ 125429 | consumed samples: 1431040 | consumed tokens: 2930769920 | elapsed time per iteration (s): 1.06 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.443381E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.353 | TFLOPs: 40.05 | 15: iteration 5600/ 125429 | consumed samples: 1433600 | consumed tokens: 2936012800 | elapsed time per iteration (s): 1.06 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.430129E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.232 | TFLOPs: 39.87 | 15: iteration 5610/ 125429 | consumed samples: 1436160 | consumed tokens: 2941255680 | elapsed time per iteration (s): 1.06 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 
2.481136E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.906 | TFLOPs: 39.81 | 15: iteration 5620/ 125429 | consumed samples: 1438720 | consumed tokens: 2946498560 | elapsed time per iteration (s): 1.04 | learning rate: 1.995E-04 | global batch size: 256 | lm loss: 2.431028E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.626 | TFLOPs: 40.59 | 15: iteration 5630/ 125429 | consumed samples: 1441280 | consumed tokens: 2951741440 | elapsed time per iteration (s): 1.11 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.477543E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.797 | TFLOPs: 37.98 | 15: iteration 5640/ 125429 | consumed samples: 1443840 | consumed tokens: 2956984320 | elapsed time per iteration (s): 1.06 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.401937E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.282 | TFLOPs: 39.87 | 15: iteration 5650/ 125429 | consumed samples: 1446400 | consumed tokens: 2962227200 | elapsed time per iteration (s): 1.03 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.449129E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.569 | TFLOPs: 40.91 | 15: iteration 5660/ 125429 | consumed samples: 1448960 | consumed tokens: 2967470080 | elapsed time per iteration (s): 1.11 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.434481E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.448 | TFLOPs: 38.08 | 15: iteration 5670/ 125429 | consumed samples: 1451520 | consumed tokens: 2972712960 | elapsed 
time per iteration (s): 1.04 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.465878E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.071 | TFLOPs: 40.50 | 15: iteration 5680/ 125429 | consumed samples: 1454080 | consumed tokens: 2977955840 | elapsed time per iteration (s): 1.05 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.449535E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.032 | TFLOPs: 40.16 | 15: iteration 5690/ 125429 | consumed samples: 1456640 | consumed tokens: 2983198720 | elapsed time per iteration (s): 1.05 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.487397E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.862 | TFLOPs: 40.30 | 15: iteration 5700/ 125429 | consumed samples: 1459200 | consumed tokens: 2988441600 | elapsed time per iteration (s): 1.05 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.464088E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.828 | TFLOPs: 40.13 | 15: iteration 5710/ 125429 | consumed samples: 1461760 | consumed tokens: 2993684480 | elapsed time per iteration (s): 1.06 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.435269E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.168 | TFLOPs: 39.85 | 15: iteration 5720/ 125429 | consumed samples: 1464320 | consumed tokens: 2998927360 | elapsed time per iteration (s): 1.10 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.418830E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.061 | TFLOPs: 38.35 | 15: 
iteration 5730/ 125429 | consumed samples: 1466880 | consumed tokens: 3004170240 | elapsed time per iteration (s): 1.03 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.463323E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.649 | TFLOPs: 40.93 | 15: iteration 5740/ 125429 | consumed samples: 1469440 | consumed tokens: 3009413120 | elapsed time per iteration (s): 1.08 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.435069E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.084 | TFLOPs: 39.35 | 15: iteration 5750/ 125429 | consumed samples: 1472000 | consumed tokens: 3014656000 | elapsed time per iteration (s): 1.06 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.394788E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.704 | TFLOPs: 39.94 | 15: iteration 5760/ 125429 | consumed samples: 1474560 | consumed tokens: 3019898880 | elapsed time per iteration (s): 1.06 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.433828E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.228 | TFLOPs: 39.86 | 15: iteration 5770/ 125429 | consumed samples: 1477120 | consumed tokens: 3025141760 | elapsed time per iteration (s): 1.03 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.425764E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.904 | TFLOPs: 40.97 | 15: iteration 5780/ 125429 | consumed samples: 1479680 | consumed tokens: 3030384640 | elapsed time per iteration (s): 1.09 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.428728E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 235.841 | TFLOPs: 38.97 | 15: iteration 5790/ 125429 | consumed samples: 1482240 | consumed tokens: 3035627520 | elapsed time per iteration (s): 1.03 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.437919E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.472 | TFLOPs: 40.90 | 15: iteration 5800/ 125429 | consumed samples: 1484800 | consumed tokens: 3040870400 | elapsed time per iteration (s): 1.05 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.456131E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.883 | TFLOPs: 40.14 | 15: iteration 5810/ 125429 | consumed samples: 1487360 | consumed tokens: 3046113280 | elapsed time per iteration (s): 1.05 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.423825E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.870 | TFLOPs: 40.47 | 15: iteration 5820/ 125429 | consumed samples: 1489920 | consumed tokens: 3051356160 | elapsed time per iteration (s): 1.03 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.457588E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.605 | TFLOPs: 41.08 | 15: iteration 5830/ 125429 | consumed samples: 1492480 | consumed tokens: 3056599040 | elapsed time per iteration (s): 1.02 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.425066E+00 | grad norm: 0.270 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.141 | TFLOPs: 41.34 | 15: iteration 5840/ 125429 | consumed samples: 1495040 | consumed tokens: 3061841920 | elapsed time per iteration (s): 1.05 | learning rate: 1.994E-04 | global batch 
size: 256 | lm loss: 2.441709E+00 | grad norm: 0.206 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.411 | TFLOPs: 40.23 | 15: iteration 5850/ 125429 | consumed samples: 1497600 | consumed tokens: 3067084800 | elapsed time per iteration (s): 1.04 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.447419E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.131 | TFLOPs: 40.51 | 15: iteration 5860/ 125429 | consumed samples: 1500160 | consumed tokens: 3072327680 | elapsed time per iteration (s): 1.07 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.438847E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.757 | TFLOPs: 39.46 | 15: iteration 5870/ 125429 | consumed samples: 1502720 | consumed tokens: 3077570560 | elapsed time per iteration (s): 1.05 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.439300E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.697 | TFLOPs: 40.11 | 15: iteration 5880/ 125429 | consumed samples: 1505280 | consumed tokens: 3082813440 | elapsed time per iteration (s): 1.02 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.435086E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.266 | TFLOPs: 41.52 | 15: iteration 5890/ 125429 | consumed samples: 1507840 | consumed tokens: 3088056320 | elapsed time per iteration (s): 1.07 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.398421E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.274 | TFLOPs: 39.71 | 15: iteration 5900/ 125429 | consumed samples: 1510400 | consumed tokens: 
3093299200 | elapsed time per iteration (s): 1.06 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.410579E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.110 | TFLOPs: 39.85 | 15: iteration 5910/ 125429 | consumed samples: 1512960 | consumed tokens: 3098542080 | elapsed time per iteration (s): 1.12 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.427860E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.449 | TFLOPs: 37.75 | 15: iteration 5920/ 125429 | consumed samples: 1515520 | consumed tokens: 3103784960 | elapsed time per iteration (s): 1.10 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.446522E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.343 | TFLOPs: 38.40 | 15: iteration 5930/ 125429 | consumed samples: 1518080 | consumed tokens: 3109027840 | elapsed time per iteration (s): 1.03 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.425500E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.262 | TFLOPs: 41.03 | 15: iteration 5940/ 125429 | consumed samples: 1520640 | consumed tokens: 3114270720 | elapsed time per iteration (s): 1.07 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.432103E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.163 | TFLOPs: 39.36 | 15: iteration 5950/ 125429 | consumed samples: 1523200 | consumed tokens: 3119513600 | elapsed time per iteration (s): 1.05 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.438182E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.337 | 
TFLOPs: 40.21 | 15: iteration 5960/ 125429 | consumed samples: 1525760 | consumed tokens: 3124756480 | elapsed time per iteration (s): 1.03 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.433656E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.027 | TFLOPs: 41.15 | 15: iteration 5970/ 125429 | consumed samples: 1528320 | consumed tokens: 3129999360 | elapsed time per iteration (s): 1.04 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.472467E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.976 | TFLOPs: 40.81 | 15: iteration 5980/ 125429 | consumed samples: 1530880 | consumed tokens: 3135242240 | elapsed time per iteration (s): 1.03 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.433702E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.300 | TFLOPs: 41.20 | 15: iteration 5990/ 125429 | consumed samples: 1533440 | consumed tokens: 3140485120 | elapsed time per iteration (s): 1.04 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.439772E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.897 | TFLOPs: 40.64 | 0: [2022-11-25 21:32:26,114] [INFO] [logging.py:68:log_dist] [Rank 0] step=6000, skipped=0, lr=[0.0001993520726489958, 0.0001993520726489958, 0.0001993520726489958], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 6000/ 125429 | consumed samples: 1536000 | consumed tokens: 3145728000 | elapsed time per iteration (s): 1.04 | learning rate: 1.994E-04 | global batch size: 256 | lm loss: 2.429167E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.518 | TFLOPs: 40.57 | 0: steps: 6000 loss: 2.4347 
iter time (s): 1.062 samples/sec: 241.066 15: ------------------------------------------------------------------------------------------ 15: valid loss at iteration 6000 | lm loss value: 2.460193E+00 | lm loss PPL: 1.170707E+01 | 15: ------------------------------------------------------------------------------------------ 0: saving checkpoint at iteration 6000 to checkpoints_1b5 0: [2022-11-25 21:32:26,461] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step6000 is begin to save! 0: [2022-11-25 21:32:26,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_01-model_00-model_states.pt... 0: [2022-11-25 21:32:26,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_01-model_00-model_states.pt. 0: [2022-11-25 21:32:26,738] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_03-model_00-model_states.pt... 0: [2022-11-25 21:32:26,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_03-model_00-model_states.pt. 0: [2022-11-25 21:32:26,857] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_04-model_00-model_states.pt... 0: [2022-11-25 21:32:26,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_04-model_00-model_states.pt. 0: [2022-11-25 21:32:26,975] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_05-model_00-model_states.pt... 0: [2022-11-25 21:32:27,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_05-model_00-model_states.pt. 0: [2022-11-25 21:32:27,091] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_06-model_00-model_states.pt... 
0: [2022-11-25 21:32:27,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_06-model_00-model_states.pt. 0: [2022-11-25 21:32:27,207] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_07-model_00-model_states.pt... 0: [2022-11-25 21:32:27,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_07-model_00-model_states.pt. 0: [2022-11-25 21:32:27,326] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_08-model_00-model_states.pt... 0: [2022-11-25 21:32:27,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_08-model_00-model_states.pt. 0: [2022-11-25 21:32:27,442] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_09-model_00-model_states.pt... 0: [2022-11-25 21:32:27,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_09-model_00-model_states.pt. 0: [2022-11-25 21:32:27,561] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_10-model_00-model_states.pt... 0: [2022-11-25 21:32:27,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_10-model_00-model_states.pt. 0: [2022-11-25 21:32:27,678] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_11-model_00-model_states.pt... 0: [2022-11-25 21:32:27,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_11-model_00-model_states.pt. 0: [2022-11-25 21:32:27,794] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_12-model_00-model_states.pt... 
0: [2022-11-25 21:32:27,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_12-model_00-model_states.pt. 0: [2022-11-25 21:32:27,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_13-model_00-model_states.pt... 0: [2022-11-25 21:32:28,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_13-model_00-model_states.pt. 0: [2022-11-25 21:32:28,007] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_14-model_00-model_states.pt... 0: [2022-11-25 21:32:28,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_14-model_00-model_states.pt. 0: [2022-11-25 21:32:28,117] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_15-model_00-model_states.pt... 0: [2022-11-25 21:32:28,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_15-model_00-model_states.pt. 0: [2022-11-25 21:32:28,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_16-model_00-model_states.pt... 0: [2022-11-25 21:32:28,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_16-model_00-model_states.pt. 0: [2022-11-25 21:32:28,335] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_17-model_00-model_states.pt... 0: [2022-11-25 21:32:28,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_17-model_00-model_states.pt. 0: [2022-11-25 21:32:28,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_18-model_00-model_states.pt... 
0: [2022-11-25 21:32:28,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_18-model_00-model_states.pt. 0: [2022-11-25 21:32:28,552] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_19-model_00-model_states.pt... 0: [2022-11-25 21:32:28,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_19-model_00-model_states.pt. 0: [2022-11-25 21:32:28,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_20-model_00-model_states.pt... 0: [2022-11-25 21:32:28,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_20-model_00-model_states.pt. 0: [2022-11-25 21:32:28,760] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_21-model_00-model_states.pt... 0: [2022-11-25 21:32:28,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_21-model_00-model_states.pt. 0: [2022-11-25 21:32:28,864] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_22-model_00-model_states.pt... 0: [2022-11-25 21:32:28,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_22-model_00-model_states.pt. 0: [2022-11-25 21:32:28,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_23-model_00-model_states.pt... 0: [2022-11-25 21:32:29,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_23-model_00-model_states.pt. 0: [2022-11-25 21:32:29,070] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_24-model_00-model_states.pt... 
0: [2022-11-25 21:32:29,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_24-model_00-model_states.pt. 0: [2022-11-25 21:32:29,174] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_25-model_00-model_states.pt... 0: [2022-11-25 21:32:29,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_25-model_00-model_states.pt. 0: [2022-11-25 21:32:29,279] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_26-model_00-model_states.pt... 0: [2022-11-25 21:32:29,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_26-model_00-model_states.pt. 0: [2022-11-25 21:32:29,381] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_27-model_00-model_states.pt... 0: [2022-11-25 21:32:29,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_27-model_00-model_states.pt. 0: [2022-11-25 21:32:29,487] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_28-model_00-model_states.pt... 0: [2022-11-25 21:32:29,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_28-model_00-model_states.pt. 0: [2022-11-25 21:32:29,587] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_29-model_00-model_states.pt... 0: [2022-11-25 21:32:29,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_29-model_00-model_states.pt. 0: [2022-11-25 21:32:29,689] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_30-model_00-model_states.pt... 
0: [2022-11-25 21:32:29,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_30-model_00-model_states.pt. 0: [2022-11-25 21:32:29,794] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/layer_32-model_00-model_states.pt... 0: [2022-11-25 21:32:29,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/layer_32-model_00-model_states.pt. 0: [2022-11-25 21:32:29,797] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step6000/mp_rank_00_model_states.pt 0: [2022-11-25 21:32:29,797] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/mp_rank_00_model_states.pt... 0: [2022-11-25 21:32:29,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/mp_rank_00_model_states.pt. 0: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
0: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
10: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
15: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
13: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
12: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
8: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
13: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-25 21:32:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step6000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:32:30,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:32:30,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 21:32:30,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 10: [2022-11-25 21:32:30,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:32:30,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 21:32:30,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 21:32:30,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:32:30,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 21:32:30,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 14: [2022-11-25 21:32:30,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-25 21:32:30,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 21:32:30,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 9: [2022-11-25 21:32:30,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:32:30,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 21:32:30,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 12: [2022-11-25 21:32:30,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:32:30,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 21:32:30,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 4: [2022-11-25 21:32:30,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:32:30,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 21:32:30,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 12: [2022-11-25 21:32:30,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-25 21:32:30,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 21:32:30,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 21:32:30,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:32:30,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:32:30,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 21:32:30,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 21:32:30,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:32:30,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 21:32:30,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 21:32:30,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:32:30,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 21:32:30,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
13: [2022-11-25 21:32:30,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:32:30,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 21:32:30,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 13: [2022-11-25 21:32:30,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:32:30,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 21:32:30,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 21:32:30,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:32:30,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:32:30,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 21:32:30,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 21:32:30,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 21:32:30,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
10: [2022-11-25 21:32:30,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:32:30,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 21:32:30,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 21:32:30,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:32:30,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 21:32:30,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 10: [2022-11-25 21:32:30,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:32:30,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 21:32:30,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 10: [2022-11-25 21:32:30,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:32:30,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 21:32:30,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
14: [2022-11-25 21:32:30,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:32:30,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 21:32:30,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 14: [2022-11-25 21:32:30,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:32:30,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 21:32:30,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 21:32:30,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:32:30,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:32:30,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:32:30,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 21:32:30,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 21:32:30,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-25 21:32:30,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 21:32:30,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 10: [2022-11-25 21:32:30,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:32:30,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 21:32:30,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 21:32:30,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:32:30,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 21:32:30,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 21:32:30,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:32:30,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 21:32:30,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 9: [2022-11-25 21:32:30,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-25 21:32:30,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 21:32:30,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 9: [2022-11-25 21:32:30,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:32:30,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:32:30,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 21:32:30,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 9: [2022-11-25 21:32:30,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 21:32:30,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 12: [2022-11-25 21:32:30,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:32:30,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 21:32:30,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 21:32:30,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
1: [2022-11-25 21:32:30,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:32:30,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:32:30,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 0: [2022-11-25 21:32:30,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 1: [2022-11-25 21:32:30,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 21:32:30,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 21:32:30,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 21:32:30,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 21:32:30,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:32:30,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 21:32:30,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
11: [2022-11-25 21:32:30,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 21:32:30,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 21:32:30,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 11: [2022-11-25 21:32:30,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 11: [2022-11-25 21:32:30,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:32:30,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 21:32:30,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 11: [2022-11-25 21:32:30,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:32:30,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 21:32:30,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 11: [2022-11-25 21:32:30,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-25 21:32:30,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 21:32:30,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 10: [2022-11-25 21:32:30,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:32:30,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 21:32:30,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 21:32:30,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:32:30,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 21:32:30,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 13: [2022-11-25 21:32:30,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:32:30,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:32:30,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:32:30,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
13: [2022-11-25 21:32:30,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 21:32:30,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 21:32:30,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:32:30,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 8: [2022-11-25 21:32:30,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 21:32:30,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 21:32:30,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 1: [2022-11-25 21:32:30,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 8: [2022-11-25 21:32:30,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 8: [2022-11-25 21:32:30,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 8: [2022-11-25 21:32:30,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 8: [2022-11-25 21:32:30,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-25 21:32:30,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 21:32:30,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 8: [2022-11-25 21:32:30,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:32:30,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 21:32:30,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 11: [2022-11-25 21:32:30,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:32:30,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 21:32:30,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 21:32:30,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:32:30,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-25 21:32:30,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 21:32:30,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 21:32:30,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 21:32:30,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 21:32:30,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:32:30,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 21:32:30,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 21:32:30,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:32:30,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 21:32:30,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 21:32:30,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-25 21:32:30,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 21:32:30,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 10: [2022-11-25 21:32:30,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:32:30,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 21:32:30,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 4: [2022-11-25 21:32:30,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:32:30,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 21:32:30,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 4: [2022-11-25 21:32:30,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:32:30,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 21:32:30,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 9: [2022-11-25 21:32:30,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-25 21:32:30,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 21:32:30,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 14: [2022-11-25 21:32:30,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:32:30,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 21:32:30,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 14: [2022-11-25 21:32:30,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:32:30,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 21:32:30,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 14: [2022-11-25 21:32:30,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:32:30,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 21:32:30,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 14: [2022-11-25 21:32:30,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-25 21:32:30,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 21:32:30,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 21:32:30,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:32:30,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:32:30,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:32:30,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:32:30,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:32:30,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-25 21:32:30,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 21:32:30,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 21:32:30,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 21:32:30,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 21:32:30,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 21:32:30,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 21:32:30,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 21:32:30,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 21:32:30,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 21:32:30,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 21:32:30,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 21:32:30,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
6: [2022-11-25 21:32:30,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:32:30,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 21:32:30,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 21:32:30,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:32:30,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 21:32:30,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 21:32:30,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:32:30,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 21:32:30,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 21:32:30,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:32:30,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 21:32:30,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
5: [2022-11-25 21:32:30,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:32:30,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:32:30,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 21:32:30,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 21:32:30,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 5: [2022-11-25 21:32:30,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 12: [2022-11-25 21:32:30,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:32:30,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 21:32:30,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 4: [2022-11-25 21:32:30,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:32:30,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 21:32:30,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
4: [2022-11-25 21:32:30,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:32:30,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 21:32:30,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 21:32:30,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:32:30,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:32:30,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:32:30,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:32:30,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:32:30,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:32:30,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:32:30,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-25 21:32:30,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 21:32:30,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 21:32:30,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 21:32:30,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 21:32:30,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 21:32:30,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 21:32:30,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 21:32:30,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 21:32:30,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 21:32:30,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 21:32:30,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 21:32:30,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
2: [2022-11-25 21:32:30,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 21:32:30,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 21:32:30,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 2: [2022-11-25 21:32:30,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 1: [2022-11-25 21:32:30,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:32:30,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 21:32:30,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 13: [2022-11-25 21:32:30,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:32:30,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:32:30,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:32:30,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 21:32:30,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 21:32:30,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
13: [2022-11-25 21:32:30,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 13: [2022-11-25 21:32:30,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:32:30,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 21:32:30,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 13: [2022-11-25 21:32:30,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:32:30,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 8: [2022-11-25 21:32:30,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 13: [2022-11-25 21:32:30,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 8: [2022-11-25 21:32:30,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 8: [2022-11-25 21:32:30,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:32:30,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 21:32:30,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
8: [2022-11-25 21:32:30,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:32:30,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 21:32:30,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 10: [2022-11-25 21:32:30,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:32:30,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 21:32:30,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 21:32:30,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:32:30,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 21:32:30,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 21:32:30,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:32:30,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 21:32:30,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
3: [2022-11-25 21:32:30,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:32:30,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 21:32:30,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 21:32:30,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:32:30,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 21:32:30,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 4: [2022-11-25 21:32:30,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:32:30,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 21:32:30,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 12: [2022-11-25 21:32:30,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:32:30,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 21:32:30,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
13: [2022-11-25 21:32:30,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:32:30,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 21:32:30,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 21:32:30,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:32:30,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:32:30,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 21:32:30,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 21:32:30,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 7: [2022-11-25 21:32:30,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 4: [2022-11-25 21:32:30,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:32:30,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 21:32:30,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
4: [2022-11-25 21:32:30,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:32:30,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 21:32:30,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 9: [2022-11-25 21:32:30,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:32:30,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 21:32:30,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:32:30,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 9: [2022-11-25 21:32:30,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 21:32:30,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 6: [2022-11-25 21:32:30,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:32:30,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 21:32:30,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
15: [2022-11-25 21:32:30,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:32:30,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:32:30,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:32:30,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:32:30,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:32:30,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:32:30,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:32:30,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 21:32:30,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 15: [2022-11-25 21:32:30,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-25 21:32:30,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 21:32:30,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 21:32:30,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 21:32:30,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 21:32:30,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 21:32:30,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 21:32:30,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 21:32:30,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 15: [2022-11-25 21:32:30,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 15: [2022-11-25 21:32:30,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 15: [2022-11-25 21:32:30,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 15: [2022-11-25 21:32:30,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
15: [2022-11-25 21:32:30,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 15: [2022-11-25 21:32:30,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 15: [2022-11-25 21:32:30,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:32:30,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 21:32:30,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 0: [2022-11-25 21:32:30,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 21:32:30,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 12: [2022-11-25 21:32:30,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:32:30,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 21:32:30,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 9: [2022-11-25 21:32:30,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:32:30,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 21:32:30,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
12: [2022-11-25 21:32:30,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:32:30,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 21:32:30,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 12: [2022-11-25 21:32:30,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:32:30,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 21:32:30,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 21:32:30,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:32:30,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 21:32:30,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 21:32:30,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:32:30,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 21:32:30,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
3: [2022-11-25 21:32:30,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:32:30,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:32:30,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 21:32:30,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 21:32:30,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 3: [2022-11-25 21:32:30,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 11: [2022-11-25 21:32:30,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:32:30,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 21:32:30,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 11: [2022-11-25 21:32:30,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:32:30,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step6000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 21:32:30,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step6000 is ready now! 
0: successfully saved checkpoint at iteration 6000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3804.58 15: iteration 6010/ 125429 | consumed samples: 1538560 | consumed tokens: 3150970880 | elapsed time per iteration (s): 1.44 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.451389E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 178.242 | TFLOPs: 29.46 | 15: iteration 6020/ 125429 | consumed samples: 1541120 | consumed tokens: 3156213760 | elapsed time per iteration (s): 1.02 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.419643E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.792 | TFLOPs: 41.45 | 15: iteration 6030/ 125429 | consumed samples: 1543680 | consumed tokens: 3161456640 | elapsed time per iteration (s): 1.04 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.411354E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.070 | TFLOPs: 40.66 | 15: iteration 6040/ 125429 | consumed samples: 1546240 | consumed tokens: 3166699520 | elapsed time per iteration (s): 1.07 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.453844E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.149 | TFLOPs: 39.36 | 15: iteration 6050/ 125429 | consumed samples: 1548800 | consumed tokens: 3171942400 | elapsed time per iteration (s): 1.06 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.402465E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.718 | TFLOPs: 39.95 | 15: iteration 6060/ 125429 | consumed samples: 1551360 | consumed tokens: 3177185280 | elapsed time per iteration (s): 1.02 | learning rate: 
1.993E-04 | global batch size: 256 | lm loss: 2.412519E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.659 | TFLOPs: 41.42 | 15: iteration 6070/ 125429 | consumed samples: 1553920 | consumed tokens: 3182428160 | elapsed time per iteration (s): 1.03 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.447490E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.675 | TFLOPs: 41.26 | 15: iteration 6080/ 125429 | consumed samples: 1556480 | consumed tokens: 3187671040 | elapsed time per iteration (s): 1.08 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.438825E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.355 | TFLOPs: 39.22 | 15: iteration 6090/ 125429 | consumed samples: 1559040 | consumed tokens: 3192913920 | elapsed time per iteration (s): 1.08 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.417153E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.281 | TFLOPs: 39.21 | 15: iteration 6100/ 125429 | consumed samples: 1561600 | consumed tokens: 3198156800 | elapsed time per iteration (s): 1.03 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.438690E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.373 | TFLOPs: 41.21 | 15: iteration 6110/ 125429 | consumed samples: 1564160 | consumed tokens: 3203399680 | elapsed time per iteration (s): 1.06 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.408238E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.221 | TFLOPs: 40.03 | 15: iteration 6120/ 125429 | consumed samples: 
1566720 | consumed tokens: 3208642560 | elapsed time per iteration (s): 1.04 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.391344E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.836 | TFLOPs: 40.79 | 15: iteration 6130/ 125429 | consumed samples: 1569280 | consumed tokens: 3213885440 | elapsed time per iteration (s): 1.04 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.425034E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.833 | TFLOPs: 40.79 | 15: iteration 6140/ 125429 | consumed samples: 1571840 | consumed tokens: 3219128320 | elapsed time per iteration (s): 1.04 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.409310E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.128 | TFLOPs: 40.84 | 15: iteration 6150/ 125429 | consumed samples: 1574400 | consumed tokens: 3224371200 | elapsed time per iteration (s): 1.05 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 3.414956E+00 | grad norm: 35.519 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.163 | TFLOPs: 40.35 | 15: iteration 6160/ 125429 | consumed samples: 1576960 | consumed tokens: 3229614080 | elapsed time per iteration (s): 1.04 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 4.747609E+00 | grad norm: 14.705 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.426 | TFLOPs: 40.72 | 15: iteration 6170/ 125429 | consumed samples: 1579520 | consumed tokens: 3234856960 | elapsed time per iteration (s): 1.11 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 4.194714E+00 | grad norm: 4.666 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 230.950 | TFLOPs: 38.17 | 15: iteration 6180/ 125429 | consumed samples: 1582080 | consumed tokens: 3240099840 | elapsed time per iteration (s): 1.05 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 3.253312E+00 | grad norm: 2.617 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.758 | TFLOPs: 40.12 | 15: iteration 6190/ 125429 | consumed samples: 1584640 | consumed tokens: 3245342720 | elapsed time per iteration (s): 1.08 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.939865E+00 | grad norm: 1.064 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.062 | TFLOPs: 39.34 | 15: iteration 6200/ 125429 | consumed samples: 1587200 | consumed tokens: 3250585600 | elapsed time per iteration (s): 1.04 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.746679E+00 | grad norm: 0.436 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.233 | TFLOPs: 40.69 | 15: iteration 6210/ 125429 | consumed samples: 1589760 | consumed tokens: 3255828480 | elapsed time per iteration (s): 1.04 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.637168E+00 | grad norm: 0.395 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.070 | TFLOPs: 40.66 | 15: iteration 6220/ 125429 | consumed samples: 1592320 | consumed tokens: 3261071360 | elapsed time per iteration (s): 1.06 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.553382E+00 | grad norm: 0.219 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.011 | TFLOPs: 39.83 | 15: iteration 6230/ 125429 | consumed samples: 1594880 | consumed tokens: 3266314240 | elapsed time per iteration (s): 1.05 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.533305E+00 | grad norm: 
0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.216 | TFLOPs: 40.36 | 15: iteration 6240/ 125429 | consumed samples: 1597440 | consumed tokens: 3271557120 | elapsed time per iteration (s): 1.05 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.513194E+00 | grad norm: 0.236 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.975 | TFLOPs: 40.32 | 15: iteration 6250/ 125429 | consumed samples: 1600000 | consumed tokens: 3276800000 | elapsed time per iteration (s): 1.04 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.465931E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.465 | TFLOPs: 40.73 | 15: iteration 6260/ 125429 | consumed samples: 1602560 | consumed tokens: 3282042880 | elapsed time per iteration (s): 1.05 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.493166E+00 | grad norm: 0.221 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.778 | TFLOPs: 40.29 | 15: iteration 6270/ 125429 | consumed samples: 1605120 | consumed tokens: 3287285760 | elapsed time per iteration (s): 1.04 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.461895E+00 | grad norm: 0.339 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.140 | TFLOPs: 40.51 | 15: iteration 6280/ 125429 | consumed samples: 1607680 | consumed tokens: 3292528640 | elapsed time per iteration (s): 1.10 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.455425E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.511 | TFLOPs: 38.42 | 15: iteration 6290/ 125429 | consumed samples: 1610240 | consumed tokens: 3297771520 | elapsed time per iteration (s): 1.03 
| learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.442468E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.820 | TFLOPs: 41.12 | 15: iteration 6300/ 125429 | consumed samples: 1612800 | consumed tokens: 3303014400 | elapsed time per iteration (s): 1.04 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.446231E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.613 | TFLOPs: 40.59 | 15: iteration 6310/ 125429 | consumed samples: 1615360 | consumed tokens: 3308257280 | elapsed time per iteration (s): 1.07 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.429249E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.287 | TFLOPs: 39.54 | 15: iteration 6320/ 125429 | consumed samples: 1617920 | consumed tokens: 3313500160 | elapsed time per iteration (s): 1.04 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.449586E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.841 | TFLOPs: 40.79 | 15: iteration 6330/ 125429 | consumed samples: 1620480 | consumed tokens: 3318743040 | elapsed time per iteration (s): 1.06 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.473912E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.634 | TFLOPs: 39.77 | 15: iteration 6340/ 125429 | consumed samples: 1623040 | consumed tokens: 3323985920 | elapsed time per iteration (s): 1.06 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.403574E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.436 | TFLOPs: 39.90 | 15: iteration 6350/ 125429 | 
consumed samples: 1625600 | consumed tokens: 3329228800 | elapsed time per iteration (s): 1.06 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.438043E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.569 | TFLOPs: 39.76 | 15: iteration 6360/ 125429 | consumed samples: 1628160 | consumed tokens: 3334471680 | elapsed time per iteration (s): 1.07 | learning rate: 1.993E-04 | global batch size: 256 | lm loss: 2.432722E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.930 | TFLOPs: 39.65 | 15: iteration 6370/ 125429 | consumed samples: 1630720 | consumed tokens: 3339714560 | elapsed time per iteration (s): 1.07 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.450702E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.487 | TFLOPs: 39.58 | 15: iteration 6380/ 125429 | consumed samples: 1633280 | consumed tokens: 3344957440 | elapsed time per iteration (s): 1.05 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.415333E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.033 | TFLOPs: 40.16 | 15: iteration 6390/ 125429 | consumed samples: 1635840 | consumed tokens: 3350200320 | elapsed time per iteration (s): 1.07 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.425045E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.633 | TFLOPs: 39.44 | 15: iteration 6400/ 125429 | consumed samples: 1638400 | consumed tokens: 3355443200 | elapsed time per iteration (s): 1.04 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.425914E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 245.968 | TFLOPs: 40.65 | 15: iteration 6410/ 125429 | consumed samples: 1640960 | consumed tokens: 3360686080 | elapsed time per iteration (s): 1.05 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.445404E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.575 | TFLOPs: 40.42 | 15: iteration 6420/ 125429 | consumed samples: 1643520 | consumed tokens: 3365928960 | elapsed time per iteration (s): 1.04 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.403704E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.787 | TFLOPs: 40.62 | 15: iteration 6430/ 125429 | consumed samples: 1646080 | consumed tokens: 3371171840 | elapsed time per iteration (s): 1.02 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.423216E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.797 | TFLOPs: 41.28 | 15: iteration 6440/ 125429 | consumed samples: 1648640 | consumed tokens: 3376414720 | elapsed time per iteration (s): 1.03 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.423603E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.157 | TFLOPs: 41.01 | 15: iteration 6450/ 125429 | consumed samples: 1651200 | consumed tokens: 3381657600 | elapsed time per iteration (s): 1.06 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.419278E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.040 | TFLOPs: 40.00 | 15: iteration 6460/ 125429 | consumed samples: 1653760 | consumed tokens: 3386900480 | elapsed time per iteration (s): 1.05 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 
2.384880E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.099 | TFLOPs: 40.17 | 15: iteration 6470/ 125429 | consumed samples: 1656320 | consumed tokens: 3392143360 | elapsed time per iteration (s): 1.04 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.401001E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.743 | TFLOPs: 40.78 | 15: iteration 6480/ 125429 | consumed samples: 1658880 | consumed tokens: 3397386240 | elapsed time per iteration (s): 1.03 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.427460E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.124 | TFLOPs: 41.00 | 15: iteration 6490/ 125429 | consumed samples: 1661440 | consumed tokens: 3402629120 | elapsed time per iteration (s): 1.05 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.409815E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.558 | TFLOPs: 40.42 | 15: iteration 6500/ 125429 | consumed samples: 1664000 | consumed tokens: 3407872000 | elapsed time per iteration (s): 1.05 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.388745E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.038 | TFLOPs: 40.33 | 15: iteration 6510/ 125429 | consumed samples: 1666560 | consumed tokens: 3413114880 | elapsed time per iteration (s): 1.04 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.417161E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.474 | TFLOPs: 40.57 | 15: iteration 6520/ 125429 | consumed samples: 1669120 | consumed tokens: 3418357760 | elapsed 
time per iteration (s): 1.05 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.428632E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.962 | TFLOPs: 40.48 | 15: iteration 6530/ 125429 | consumed samples: 1671680 | consumed tokens: 3423600640 | elapsed time per iteration (s): 1.05 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.394341E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.141 | TFLOPs: 40.18 | 15: iteration 6540/ 125429 | consumed samples: 1674240 | consumed tokens: 3428843520 | elapsed time per iteration (s): 1.04 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.399508E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.550 | TFLOPs: 40.74 | 15: iteration 6550/ 125429 | consumed samples: 1676800 | consumed tokens: 3434086400 | elapsed time per iteration (s): 1.07 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.429918E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.181 | TFLOPs: 39.53 | 15: iteration 6560/ 125429 | consumed samples: 1679360 | consumed tokens: 3439329280 | elapsed time per iteration (s): 1.07 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.416550E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.031 | TFLOPs: 39.50 | 15: iteration 6570/ 125429 | consumed samples: 1681920 | consumed tokens: 3444572160 | elapsed time per iteration (s): 1.02 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.416881E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.484 | TFLOPs: 41.39 | 15: 
iteration 6580/ 125429 | consumed samples: 1684480 | consumed tokens: 3449815040 | elapsed time per iteration (s): 1.02 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.423141E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.559 | TFLOPs: 41.57 | 15: iteration 6590/ 125429 | consumed samples: 1687040 | consumed tokens: 3455057920 | elapsed time per iteration (s): 1.04 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.408696E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.528 | TFLOPs: 40.74 | 15: iteration 6600/ 125429 | consumed samples: 1689600 | consumed tokens: 3460300800 | elapsed time per iteration (s): 1.06 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.411156E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.118 | TFLOPs: 39.85 | 15: iteration 6610/ 125429 | consumed samples: 1692160 | consumed tokens: 3465543680 | elapsed time per iteration (s): 1.07 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.419447E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.758 | TFLOPs: 39.46 | 15: iteration 6620/ 125429 | consumed samples: 1694720 | consumed tokens: 3470786560 | elapsed time per iteration (s): 1.07 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.508820E+00 | grad norm: 2.383 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.846 | TFLOPs: 39.47 | 15: iteration 6630/ 125429 | consumed samples: 1697280 | consumed tokens: 3476029440 | elapsed time per iteration (s): 1.04 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.601259E+00 | grad norm: 0.528 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 245.064 | TFLOPs: 40.50 | 15: iteration 6640/ 125429 | consumed samples: 1699840 | consumed tokens: 3481272320 | elapsed time per iteration (s): 1.07 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.488534E+00 | grad norm: 0.258 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.667 | TFLOPs: 39.61 | 15: iteration 6650/ 125429 | consumed samples: 1702400 | consumed tokens: 3486515200 | elapsed time per iteration (s): 1.05 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.429592E+00 | grad norm: 0.229 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.001 | TFLOPs: 40.16 | 15: iteration 6660/ 125429 | consumed samples: 1704960 | consumed tokens: 3491758080 | elapsed time per iteration (s): 1.16 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.461917E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 220.466 | TFLOPs: 36.43 | 15: iteration 6670/ 125429 | consumed samples: 1707520 | consumed tokens: 3497000960 | elapsed time per iteration (s): 1.03 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.418649E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.851 | TFLOPs: 41.12 | 15: iteration 6680/ 125429 | consumed samples: 1710080 | consumed tokens: 3502243840 | elapsed time per iteration (s): 1.05 | learning rate: 1.992E-04 | global batch size: 256 | lm loss: 2.420950E+00 | grad norm: 0.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.306 | TFLOPs: 40.21 | 15: iteration 6690/ 125429 | consumed samples: 1712640 | consumed tokens: 3507486720 | elapsed time per iteration (s): 1.03 | learning rate: 1.992E-04 | global batch 
size: 256 | lm loss: 2.427708E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.863 | TFLOPs: 41.13 | 15: iteration 6700/ 125429 | consumed samples: 1715200 | consumed tokens: 3512729600 | elapsed time per iteration (s): 1.04 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.430040E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.392 | TFLOPs: 40.55 | 15: iteration 6710/ 125429 | consumed samples: 1717760 | consumed tokens: 3517972480 | elapsed time per iteration (s): 1.04 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.437723E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.239 | TFLOPs: 40.53 | 15: iteration 6720/ 125429 | consumed samples: 1720320 | consumed tokens: 3523215360 | elapsed time per iteration (s): 1.03 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.392889E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.687 | TFLOPs: 41.10 | 15: iteration 6730/ 125429 | consumed samples: 1722880 | consumed tokens: 3528458240 | elapsed time per iteration (s): 1.05 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.435588E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.388 | TFLOPs: 40.39 | 15: iteration 6740/ 125429 | consumed samples: 1725440 | consumed tokens: 3533701120 | elapsed time per iteration (s): 1.03 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.411367E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.470 | TFLOPs: 41.06 | 15: iteration 6750/ 125429 | consumed samples: 1728000 | consumed tokens: 
3538944000 | elapsed time per iteration (s): 1.04 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.386449E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.642 | TFLOPs: 40.59 | 15: iteration 6760/ 125429 | consumed samples: 1730560 | consumed tokens: 3544186880 | elapsed time per iteration (s): 1.06 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.406735E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.757 | TFLOPs: 39.79 | 15: iteration 6770/ 125429 | consumed samples: 1733120 | consumed tokens: 3549429760 | elapsed time per iteration (s): 1.06 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.368293E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.548 | TFLOPs: 40.08 | 15: iteration 6780/ 125429 | consumed samples: 1735680 | consumed tokens: 3554672640 | elapsed time per iteration (s): 1.02 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.381609E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.679 | TFLOPs: 41.43 | 15: iteration 6790/ 125429 | consumed samples: 1738240 | consumed tokens: 3559915520 | elapsed time per iteration (s): 1.05 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.401834E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.754 | TFLOPs: 40.12 | 15: iteration 6800/ 125429 | consumed samples: 1740800 | consumed tokens: 3565158400 | elapsed time per iteration (s): 1.03 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.448440E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.621 | 
TFLOPs: 40.92 | 15: iteration 6810/ 125429 | consumed samples: 1743360 | consumed tokens: 3570401280 | elapsed time per iteration (s): 1.08 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.393954E+00 | grad norm: 0.306 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.209 | TFLOPs: 39.04 | 15: iteration 6820/ 125429 | consumed samples: 1745920 | consumed tokens: 3575644160 | elapsed time per iteration (s): 1.06 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.394045E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.859 | TFLOPs: 39.80 | 15: iteration 6830/ 125429 | consumed samples: 1748480 | consumed tokens: 3580887040 | elapsed time per iteration (s): 1.06 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.443639E+00 | grad norm: 0.437 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.044 | TFLOPs: 40.00 | 15: iteration 6840/ 125429 | consumed samples: 1751040 | consumed tokens: 3586129920 | elapsed time per iteration (s): 1.05 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.423827E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.459 | TFLOPs: 40.23 | 15: iteration 6850/ 125429 | consumed samples: 1753600 | consumed tokens: 3591372800 | elapsed time per iteration (s): 1.04 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.408177E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.805 | TFLOPs: 40.62 | 15: iteration 6860/ 125429 | consumed samples: 1756160 | consumed tokens: 3596615680 | elapsed time per iteration (s): 1.06 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.386339E+00 | grad norm: 0.173 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.046 | TFLOPs: 40.00 | 15: iteration 6870/ 125429 | consumed samples: 1758720 | consumed tokens: 3601858560 | elapsed time per iteration (s): 1.04 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.432049E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.516 | TFLOPs: 40.74 | 15: iteration 6880/ 125429 | consumed samples: 1761280 | consumed tokens: 3607101440 | elapsed time per iteration (s): 1.03 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.418616E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.711 | TFLOPs: 40.94 | 15: iteration 6890/ 125429 | consumed samples: 1763840 | consumed tokens: 3612344320 | elapsed time per iteration (s): 1.08 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.354153E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.526 | TFLOPs: 39.25 | 15: iteration 6900/ 125429 | consumed samples: 1766400 | consumed tokens: 3617587200 | elapsed time per iteration (s): 1.04 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.362012E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.736 | TFLOPs: 40.78 | 15: iteration 6910/ 125429 | consumed samples: 1768960 | consumed tokens: 3622830080 | elapsed time per iteration (s): 1.08 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.395114E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.826 | TFLOPs: 39.30 | 15: iteration 6920/ 125429 | consumed samples: 1771520 | consumed tokens: 3628072960 | elapsed time per iteration (s): 1.06 | learning rate: 
1.991E-04 | global batch size: 256 | lm loss: 2.391079E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.472 | TFLOPs: 40.07 | 15: iteration 6930/ 125429 | consumed samples: 1774080 | consumed tokens: 3633315840 | elapsed time per iteration (s): 1.04 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.385863E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.572 | TFLOPs: 40.58 | 15: iteration 6940/ 125429 | consumed samples: 1776640 | consumed tokens: 3638558720 | elapsed time per iteration (s): 1.03 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.403974E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.620 | TFLOPs: 41.25 | 15: iteration 6950/ 125429 | consumed samples: 1779200 | consumed tokens: 3643801600 | elapsed time per iteration (s): 1.06 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.379018E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.974 | TFLOPs: 39.99 | 15: iteration 6960/ 125429 | consumed samples: 1781760 | consumed tokens: 3649044480 | elapsed time per iteration (s): 1.07 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.418849E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.740 | TFLOPs: 39.45 | 15: iteration 6970/ 125429 | consumed samples: 1784320 | consumed tokens: 3654287360 | elapsed time per iteration (s): 1.04 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.379772E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.257 | TFLOPs: 40.53 | 15: iteration 6980/ 125429 | consumed samples: 
1786880 | consumed tokens: 3659530240 | elapsed time per iteration (s): 1.03 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.359324E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.131 | TFLOPs: 41.17 | 15: iteration 6990/ 125429 | consumed samples: 1789440 | consumed tokens: 3664773120 | elapsed time per iteration (s): 1.08 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.412446E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.141 | TFLOPs: 39.19 | 15: iteration 7000/ 125429 | consumed samples: 1792000 | consumed tokens: 3670016000 | elapsed time per iteration (s): 1.10 | learning rate: 1.991E-04 | global batch size: 256 | lm loss: 2.398097E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.667 | TFLOPs: 38.62 | 15: ------------------------------------------------------------------------------------------ 15: valid loss at iteration 7000 | lm loss value: 2.359874E+00 | lm loss PPL: 1.058962E+01 | 15: ------------------------------------------------------------------------------------------ 0: saving checkpoint at iteration 7000 to checkpoints_1b5 0: [2022-11-25 21:50:01,937] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step7000 is begin to save! 0: [2022-11-25 21:50:01,946] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_01-model_00-model_states.pt... 0: [2022-11-25 21:50:02,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_01-model_00-model_states.pt. 0: [2022-11-25 21:50:02,198] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_03-model_00-model_states.pt... 
0: [2022-11-25 21:50:02,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_03-model_00-model_states.pt. 0: [2022-11-25 21:50:02,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_04-model_00-model_states.pt... 0: [2022-11-25 21:50:02,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_04-model_00-model_states.pt. 0: [2022-11-25 21:50:02,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_05-model_00-model_states.pt... 0: [2022-11-25 21:50:02,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_05-model_00-model_states.pt. 0: [2022-11-25 21:50:02,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_06-model_00-model_states.pt... 0: [2022-11-25 21:50:02,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_06-model_00-model_states.pt. 0: [2022-11-25 21:50:02,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_07-model_00-model_states.pt... 0: [2022-11-25 21:50:02,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_07-model_00-model_states.pt. 0: [2022-11-25 21:50:02,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_08-model_00-model_states.pt... 0: [2022-11-25 21:50:02,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_08-model_00-model_states.pt. 0: [2022-11-25 21:50:02,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_09-model_00-model_states.pt... 
0: [2022-11-25 21:50:02,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_09-model_00-model_states.pt. 0: [2022-11-25 21:50:02,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_10-model_00-model_states.pt... 0: [2022-11-25 21:50:03,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_10-model_00-model_states.pt. 0: [2022-11-25 21:50:03,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_11-model_00-model_states.pt... 0: [2022-11-25 21:50:03,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_11-model_00-model_states.pt. 0: [2022-11-25 21:50:03,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_12-model_00-model_states.pt... 0: [2022-11-25 21:50:03,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_12-model_00-model_states.pt. 0: [2022-11-25 21:50:03,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_13-model_00-model_states.pt... 0: [2022-11-25 21:50:03,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_13-model_00-model_states.pt. 0: [2022-11-25 21:50:03,330] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_14-model_00-model_states.pt... 0: [2022-11-25 21:50:03,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_14-model_00-model_states.pt. 0: [2022-11-25 21:50:03,428] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_15-model_00-model_states.pt... 
0: [2022-11-25 21:50:03,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_15-model_00-model_states.pt. 0: [2022-11-25 21:50:03,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_16-model_00-model_states.pt... 0: [2022-11-25 21:50:03,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_16-model_00-model_states.pt. 0: [2022-11-25 21:50:03,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_17-model_00-model_states.pt... 0: [2022-11-25 21:50:03,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_17-model_00-model_states.pt. 0: [2022-11-25 21:50:03,726] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_18-model_00-model_states.pt... 0: [2022-11-25 21:50:03,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_18-model_00-model_states.pt. 0: [2022-11-25 21:50:03,827] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_19-model_00-model_states.pt... 0: [2022-11-25 21:50:03,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_19-model_00-model_states.pt. 0: [2022-11-25 21:50:03,923] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_20-model_00-model_states.pt... 0: [2022-11-25 21:50:04,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_20-model_00-model_states.pt. 0: [2022-11-25 21:50:04,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_21-model_00-model_states.pt... 
0: [2022-11-25 21:50:04,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_21-model_00-model_states.pt. 0: [2022-11-25 21:50:04,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_22-model_00-model_states.pt... 0: [2022-11-25 21:50:04,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_22-model_00-model_states.pt. 0: [2022-11-25 21:50:04,218] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_23-model_00-model_states.pt... 0: [2022-11-25 21:50:04,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_23-model_00-model_states.pt. 0: [2022-11-25 21:50:04,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_24-model_00-model_states.pt... 0: [2022-11-25 21:50:04,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_24-model_00-model_states.pt. 0: [2022-11-25 21:50:04,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_25-model_00-model_states.pt... 0: [2022-11-25 21:50:04,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_25-model_00-model_states.pt. 0: [2022-11-25 21:50:04,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_26-model_00-model_states.pt... 0: [2022-11-25 21:50:04,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_26-model_00-model_states.pt. 0: [2022-11-25 21:50:04,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_27-model_00-model_states.pt... 
0: [2022-11-25 21:50:04,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_27-model_00-model_states.pt. 0: [2022-11-25 21:50:04,711] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_28-model_00-model_states.pt... 0: [2022-11-25 21:50:04,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_28-model_00-model_states.pt. 0: [2022-11-25 21:50:04,811] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_29-model_00-model_states.pt... 0: [2022-11-25 21:50:04,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_29-model_00-model_states.pt. 0: [2022-11-25 21:50:04,910] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_30-model_00-model_states.pt... 0: [2022-11-25 21:50:05,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_30-model_00-model_states.pt. 0: [2022-11-25 21:50:05,007] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/layer_32-model_00-model_states.pt... 0: [2022-11-25 21:50:05,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/layer_32-model_00-model_states.pt. 0: [2022-11-25 21:50:05,009] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step7000/mp_rank_00_model_states.pt 0: [2022-11-25 21:50:05,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/mp_rank_00_model_states.pt... 0: [2022-11-25 21:50:05,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/mp_rank_00_model_states.pt. 
0: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
5: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
6: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
12: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
8: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
9: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
13: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
8: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 5: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 9: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 8: [2022-11-25 21:50:05,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step7000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 15: [2022-11-25 21:50:05,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:50:05,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:50:05,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-25 21:50:05,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 21:50:05,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 8: [2022-11-25 21:50:05,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 21:50:05,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 21:50:05,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:50:05,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 21:50:05,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 21:50:05,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:50:05,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 21:50:05,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 11: [2022-11-25 21:50:05,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:50:05,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-25 21:50:05,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 21:50:05,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 21:50:05,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:50:05,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 21:50:05,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 21:50:05,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:50:05,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 21:50:05,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 9: [2022-11-25 21:50:05,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:50:05,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 21:50:05,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 21:50:05,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-25 21:50:05,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 21:50:05,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 9: [2022-11-25 21:50:05,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:50:05,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 21:50:05,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 21:50:05,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:50:05,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 21:50:05,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 21:50:05,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:50:05,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 21:50:05,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 21:50:05,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-25 21:50:05,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 6: [2022-11-25 21:50:05,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:50:05,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 21:50:05,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 21:50:05,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 21:50:05,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:50:05,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 21:50:05,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 21:50:05,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:50:05,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 21:50:05,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 21:50:05,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-25 21:50:05,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 3: [2022-11-25 21:50:05,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:50:05,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 21:50:05,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 9: [2022-11-25 21:50:05,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:50:05,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 9: [2022-11-25 21:50:05,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 21:50:05,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 9: [2022-11-25 21:50:05,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:50:05,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 21:50:05,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 21:50:05,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-25 21:50:05,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 21:50:05,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 21:50:05,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:50:05,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 21:50:05,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 21:50:05,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:50:05,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 21:50:05,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 9: [2022-11-25 21:50:05,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:50:05,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 21:50:05,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 21:50:05,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-25 21:50:05,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 21:50:05,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 9: [2022-11-25 21:50:05,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:50:05,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 21:50:05,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 6: [2022-11-25 21:50:05,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:50:05,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 21:50:05,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 21:50:05,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:50:05,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 21:50:05,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 21:50:05,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-25 21:50:05,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 21:50:05,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 8: [2022-11-25 21:50:05,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:50:05,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 21:50:05,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 8: [2022-11-25 21:50:05,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:50:05,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 21:50:05,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 8: [2022-11-25 21:50:05,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:50:05,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 21:50:05,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 8: [2022-11-25 21:50:05,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-25 21:50:05,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 21:50:05,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 21:50:05,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:50:05,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 21:50:05,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 21:50:05,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:50:05,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 21:50:05,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 9: [2022-11-25 21:50:05,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 21:50:05,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 21:50:05,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 21:50:05,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-25 21:50:05,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:50:05,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 21:50:05,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 21:50:05,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 21:50:05,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 21:50:05,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:50:05,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 21:50:05,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 21:50:05,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 21:50:05,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 21:50:05,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
11: [2022-11-25 21:50:05,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 21:50:05,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 11: [2022-11-25 21:50:05,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:50:05,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 21:50:05,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 11: [2022-11-25 21:50:05,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:50:05,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 21:50:05,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:50:05,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 11: [2022-11-25 21:50:05,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 21:50:05,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 11: [2022-11-25 21:50:05,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-25 21:50:05,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 21:50:05,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 11: [2022-11-25 21:50:05,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:50:05,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 21:50:05,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 11: [2022-11-25 21:50:05,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:50:05,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 21:50:05,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 15: [2022-11-25 21:50:05,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 21:50:05,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 15: [2022-11-25 21:50:05,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-25 21:50:05,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 21:50:05,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 15: [2022-11-25 21:50:05,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:50:05,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 21:50:05,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 15: [2022-11-25 21:50:05,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:50:05,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 21:50:05,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 15: [2022-11-25 21:50:05,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:50:05,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 21:50:05,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 15: [2022-11-25 21:50:05,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-25 21:50:05,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 21:50:05,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 15: [2022-11-25 21:50:05,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:50:05,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 21:50:05,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 15: [2022-11-25 21:50:05,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 21:50:05,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 21:50:05,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 12: [2022-11-25 21:50:05,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:50:05,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 21:50:05,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 12: [2022-11-25 21:50:05,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-25 21:50:05,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 21:50:05,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 12: [2022-11-25 21:50:05,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:50:05,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 21:50:05,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 12: [2022-11-25 21:50:05,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:50:05,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 21:50:05,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 21:50:05,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:50:05,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 21:50:05,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 21:50:05,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
13: [2022-11-25 21:50:05,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:50:05,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:50:05,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 21:50:05,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 12: [2022-11-25 21:50:05,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:50:05,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 21:50:05,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 12: [2022-11-25 21:50:05,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:50:05,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 21:50:05,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 12: [2022-11-25 21:50:05,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-25 21:50:05,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 21:50:05,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 5: [2022-11-25 21:50:05,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 21:50:05,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 21:50:05,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 21:50:05,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:50:05,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:50:05,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:50:05,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:50:05,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-25 21:50:05,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 21:50:05,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 21:50:05,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 21:50:05,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 21:50:05,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 21:50:05,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 21:50:05,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 21:50:05,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 21:50:05,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 21:50:05,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 13: [2022-11-25 21:50:05,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 8: [2022-11-25 21:50:05,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
13: [2022-11-25 21:50:05,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 13: [2022-11-25 21:50:05,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:50:05,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 21:50:05,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 13: [2022-11-25 21:50:05,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:50:05,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 21:50:05,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 8: [2022-11-25 21:50:05,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 13: [2022-11-25 21:50:05,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:50:05,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 13: [2022-11-25 21:50:05,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 21:50:05,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
13: [2022-11-25 21:50:05,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:50:05,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:50:05,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 21:50:05,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 21:50:05,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 13: [2022-11-25 21:50:05,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 8: [2022-11-25 21:50:05,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-25 21:50:05,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 21:50:05,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 21:50:05,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:50:05,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-25 21:50:05,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 21:50:05,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 21:50:05,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 21:50:05,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 21:50:05,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:50:05,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:50:05,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:50:05,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 21:50:05,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 21:50:05,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 21:50:05,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 21:50:05,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
7: [2022-11-25 21:50:05,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 12: [2022-11-25 21:50:05,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 21:50:05,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 21:50:05,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 21:50:05,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:50:05,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 21:50:05,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 21:50:05,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:50:05,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 21:50:05,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 21:50:05,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-25 21:50:05,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 21:50:05,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 7: [2022-11-25 21:50:05,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 21:50:05,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 21:50:05,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 13: [2022-11-25 21:50:05,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 21:50:05,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 21:50:05,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 11: [2022-11-25 21:50:05,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 21:50:05,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 21:50:05,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 13: [2022-11-25 21:50:05,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-25 21:50:05,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 21:50:05,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 3: [2022-11-25 21:50:05,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:50:05,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 21:50:05,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 0: [2022-11-25 21:50:05,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 21:50:05,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 21:50:05,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:50:05,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:50:05,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-25 21:50:05,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 21:50:05,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 21:50:05,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 21:50:05,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 21:50:05,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 21:50:05,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 21:50:05,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:50:05,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 21:50:05,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 1: [2022-11-25 21:50:05,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 21:50:05,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 21:50:05,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
3: [2022-11-25 21:50:05,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 21:50:05,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 21:50:05,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 21:50:05,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:50:05,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 21:50:05,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 21:50:05,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:50:05,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 21:50:05,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 21:50:05,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:50:05,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 21:50:05,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
4: [2022-11-25 21:50:05,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:50:05,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:50:05,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:50:05,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:50:05,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:50:05,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 21:50:05,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 21:50:05,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 21:50:05,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 21:50:05,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 21:50:05,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 21:50:05,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
4: [2022-11-25 21:50:05,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 21:50:05,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 21:50:05,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 10: [2022-11-25 21:50:05,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:50:05,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:50:05,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:50:05,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:50:05,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 21:50:05,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 21:50:05,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:50:05,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 21:50:05,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-25 21:50:05,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 21:50:05,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:50:05,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 10: [2022-11-25 21:50:05,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 21:50:05,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 10: [2022-11-25 21:50:05,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 21:50:05,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 10: [2022-11-25 21:50:05,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 10: [2022-11-25 21:50:05,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 21:50:05,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 21:50:05,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 21:50:05,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
10: [2022-11-25 21:50:05,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 10: [2022-11-25 21:50:05,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 10: [2022-11-25 21:50:05,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 21:50:05,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:50:05,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:50:05,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 21:50:05,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 21:50:05,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 21:50:05,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 4: [2022-11-25 21:50:05,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 21:50:05,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 21:50:05,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 21:50:05,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-25 21:50:05,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 21:50:05,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 2: [2022-11-25 21:50:05,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 21:50:05,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 21:50:05,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 14: [2022-11-25 21:50:05,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:50:05,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:50:05,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:50:05,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:50:05,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:50:05,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-25 21:50:05,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 21:50:05,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 21:50:05,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 21:50:05,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 21:50:05,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 21:50:05,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 21:50:05,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 14: [2022-11-25 21:50:05,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 14: [2022-11-25 21:50:05,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 21:50:05,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-25 21:50:05,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 21:50:05,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 14: [2022-11-25 21:50:05,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 14: [2022-11-25 21:50:05,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 14: [2022-11-25 21:50:05,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 14: [2022-11-25 21:50:05,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step7000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 21:50:05,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 14: [2022-11-25 21:50:05,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step7000 is ready now! 
0: successfully saved checkpoint at iteration 7000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3712.99 15: iteration 7010/ 125429 | consumed samples: 1794560 | consumed tokens: 3675258880 | elapsed time per iteration (s): 1.48 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.373415E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 172.623 | TFLOPs: 28.53 | 15: iteration 7020/ 125429 | consumed samples: 1797120 | consumed tokens: 3680501760 | elapsed time per iteration (s): 1.04 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.421165E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.144 | TFLOPs: 40.84 | 15: iteration 7030/ 125429 | consumed samples: 1799680 | consumed tokens: 3685744640 | elapsed time per iteration (s): 1.06 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.366452E+00 | grad norm: 0.226 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.995 | TFLOPs: 39.83 | 15: iteration 7040/ 125429 | consumed samples: 1802240 | consumed tokens: 3690987520 | elapsed time per iteration (s): 1.05 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.410494E+00 | grad norm: 0.196 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.484 | TFLOPs: 40.24 | 15: iteration 7050/ 125429 | consumed samples: 1804800 | consumed tokens: 3696230400 | elapsed time per iteration (s): 1.02 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.408164E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.498 | TFLOPs: 41.40 | 15: iteration 7060/ 125429 | consumed samples: 1807360 | consumed tokens: 3701473280 | elapsed time per iteration (s): 1.06 | learning rate: 
1.990E-04 | global batch size: 256 | lm loss: 2.378199E+00 | grad norm: 0.288 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.707 | TFLOPs: 39.78 | 15: iteration 7070/ 125429 | consumed samples: 1809920 | consumed tokens: 3706716160 | elapsed time per iteration (s): 1.08 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.393827E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.851 | TFLOPs: 39.14 | 15: iteration 7080/ 125429 | consumed samples: 1812480 | consumed tokens: 3711959040 | elapsed time per iteration (s): 1.05 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.374314E+00 | grad norm: 0.214 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.573 | TFLOPs: 40.42 | 15: iteration 7090/ 125429 | consumed samples: 1815040 | consumed tokens: 3717201920 | elapsed time per iteration (s): 1.03 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.398124E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.016 | TFLOPs: 41.15 | 15: iteration 7100/ 125429 | consumed samples: 1817600 | consumed tokens: 3722444800 | elapsed time per iteration (s): 1.05 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.361650E+00 | grad norm: 0.408 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.909 | TFLOPs: 40.14 | 15: iteration 7110/ 125429 | consumed samples: 1820160 | consumed tokens: 3727687680 | elapsed time per iteration (s): 1.05 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.423015E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.799 | TFLOPs: 40.45 | 15: iteration 7120/ 125429 | consumed samples: 
1822720 | consumed tokens: 3732930560 | elapsed time per iteration (s): 1.02 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.389910E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.647 | TFLOPs: 41.42 | 15: iteration 7130/ 125429 | consumed samples: 1825280 | consumed tokens: 3738173440 | elapsed time per iteration (s): 1.04 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.402984E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.767 | TFLOPs: 40.61 | 15: iteration 7140/ 125429 | consumed samples: 1827840 | consumed tokens: 3743416320 | elapsed time per iteration (s): 1.05 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.377572E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.779 | TFLOPs: 40.12 | 15: iteration 7150/ 125429 | consumed samples: 1830400 | consumed tokens: 3748659200 | elapsed time per iteration (s): 1.03 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.386323E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.088 | TFLOPs: 41.16 | 15: iteration 7160/ 125429 | consumed samples: 1832960 | consumed tokens: 3753902080 | elapsed time per iteration (s): 1.03 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.408784E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.116 | TFLOPs: 41.00 | 15: iteration 7170/ 125429 | consumed samples: 1835520 | consumed tokens: 3759144960 | elapsed time per iteration (s): 1.02 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.409486E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 250.142 | TFLOPs: 41.34 | 15: iteration 7180/ 125429 | consumed samples: 1838080 | consumed tokens: 3764387840 | elapsed time per iteration (s): 1.03 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.391104E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.155 | TFLOPs: 41.01 | 15: iteration 7190/ 125429 | consumed samples: 1840640 | consumed tokens: 3769630720 | elapsed time per iteration (s): 1.03 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.363490E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.847 | TFLOPs: 41.12 | 15: iteration 7200/ 125429 | consumed samples: 1843200 | consumed tokens: 3774873600 | elapsed time per iteration (s): 1.07 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.372401E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.207 | TFLOPs: 39.70 | 15: iteration 7210/ 125429 | consumed samples: 1845760 | consumed tokens: 3780116480 | elapsed time per iteration (s): 1.04 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.383696E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.016 | TFLOPs: 40.49 | 15: iteration 7220/ 125429 | consumed samples: 1848320 | consumed tokens: 3785359360 | elapsed time per iteration (s): 1.05 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.380889E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.767 | TFLOPs: 40.12 | 15: iteration 7230/ 125429 | consumed samples: 1850880 | consumed tokens: 3790602240 | elapsed time per iteration (s): 1.04 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.390355E+00 | grad norm: 
0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.894 | TFLOPs: 40.64 | 15: iteration 7240/ 125429 | consumed samples: 1853440 | consumed tokens: 3795845120 | elapsed time per iteration (s): 1.03 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.399291E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.513 | TFLOPs: 41.07 | 15: iteration 7250/ 125429 | consumed samples: 1856000 | consumed tokens: 3801088000 | elapsed time per iteration (s): 1.05 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.410679E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.748 | TFLOPs: 40.28 | 15: iteration 7260/ 125429 | consumed samples: 1858560 | consumed tokens: 3806330880 | elapsed time per iteration (s): 1.05 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.360519E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.422 | TFLOPs: 40.23 | 15: iteration 7270/ 125429 | consumed samples: 1861120 | consumed tokens: 3811573760 | elapsed time per iteration (s): 1.05 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.398779E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.835 | TFLOPs: 40.46 | 15: iteration 7280/ 125429 | consumed samples: 1863680 | consumed tokens: 3816816640 | elapsed time per iteration (s): 1.05 | learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.400861E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.584 | TFLOPs: 40.42 | 15: iteration 7290/ 125429 | consumed samples: 1866240 | consumed tokens: 3822059520 | elapsed time per iteration (s): 1.02 
| learning rate: 1.990E-04 | global batch size: 256 | lm loss: 2.393415E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.416 | TFLOPs: 41.55 | 15: iteration 7300/ 125429 | consumed samples: 1868800 | consumed tokens: 3827302400 | elapsed time per iteration (s): 1.04 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.399833E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.874 | TFLOPs: 40.80 | 15: iteration 7310/ 125429 | consumed samples: 1871360 | consumed tokens: 3832545280 | elapsed time per iteration (s): 1.06 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.388552E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.432 | TFLOPs: 40.06 | 15: iteration 7320/ 125429 | consumed samples: 1873920 | consumed tokens: 3837788160 | elapsed time per iteration (s): 1.06 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.371815E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.465 | TFLOPs: 40.07 | 15: iteration 7330/ 125429 | consumed samples: 1876480 | consumed tokens: 3843031040 | elapsed time per iteration (s): 1.08 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.386697E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.154 | TFLOPs: 39.19 | 15: iteration 7340/ 125429 | consumed samples: 1879040 | consumed tokens: 3848273920 | elapsed time per iteration (s): 1.07 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.357113E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.206 | TFLOPs: 39.70 | 15: iteration 7350/ 125429 | 
consumed samples: 1881600 | consumed tokens: 3853516800 | elapsed time per iteration (s): 1.04 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.377482E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.380 | TFLOPs: 40.72 | 15: iteration 7360/ 125429 | consumed samples: 1884160 | consumed tokens: 3858759680 | elapsed time per iteration (s): 1.05 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.393245E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.173 | TFLOPs: 40.19 | 15: iteration 7370/ 125429 | consumed samples: 1886720 | consumed tokens: 3864002560 | elapsed time per iteration (s): 1.05 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.375520E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.704 | TFLOPs: 40.11 | 15: iteration 7380/ 125429 | consumed samples: 1889280 | consumed tokens: 3869245440 | elapsed time per iteration (s): 1.02 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.380183E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.241 | TFLOPs: 41.35 | 15: iteration 7390/ 125429 | consumed samples: 1891840 | consumed tokens: 3874488320 | elapsed time per iteration (s): 1.03 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.350337E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.890 | TFLOPs: 40.97 | 15: iteration 7400/ 125429 | consumed samples: 1894400 | consumed tokens: 3879731200 | elapsed time per iteration (s): 1.04 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.387435E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 245.376 | TFLOPs: 40.55 | 15: iteration 7410/ 125429 | consumed samples: 1896960 | consumed tokens: 3884974080 | elapsed time per iteration (s): 1.03 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.557683E+00 | grad norm: 5.931 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.739 | TFLOPs: 40.94 | 15: iteration 7420/ 125429 | consumed samples: 1899520 | consumed tokens: 3890216960 | elapsed time per iteration (s): 1.08 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.660803E+00 | grad norm: 1.001 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.920 | TFLOPs: 39.15 | 15: iteration 7430/ 125429 | consumed samples: 1902080 | consumed tokens: 3895459840 | elapsed time per iteration (s): 1.03 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.486649E+00 | grad norm: 0.282 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.015 | TFLOPs: 41.15 | 15: iteration 7440/ 125429 | consumed samples: 1904640 | consumed tokens: 3900702720 | elapsed time per iteration (s): 1.05 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.387427E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.856 | TFLOPs: 40.46 | 15: iteration 7450/ 125429 | consumed samples: 1907200 | consumed tokens: 3905945600 | elapsed time per iteration (s): 1.09 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.410211E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.917 | TFLOPs: 38.82 | 15: iteration 7460/ 125429 | consumed samples: 1909760 | consumed tokens: 3911188480 | elapsed time per iteration (s): 1.04 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 
2.412788E+00 | grad norm: 0.265 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.047 | TFLOPs: 40.66 | 15: iteration 7470/ 125429 | consumed samples: 1912320 | consumed tokens: 3916431360 | elapsed time per iteration (s): 1.05 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.386432E+00 | grad norm: 0.251 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.389 | TFLOPs: 40.22 | 15: iteration 7480/ 125429 | consumed samples: 1914880 | consumed tokens: 3921674240 | elapsed time per iteration (s): 1.04 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.377135E+00 | grad norm: 0.218 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.822 | TFLOPs: 40.62 | 15: iteration 7490/ 125429 | consumed samples: 1917440 | consumed tokens: 3926917120 | elapsed time per iteration (s): 1.04 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.400902E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.837 | TFLOPs: 40.63 | 15: iteration 7500/ 125429 | consumed samples: 1920000 | consumed tokens: 3932160000 | elapsed time per iteration (s): 1.04 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.359070E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.354 | TFLOPs: 40.71 | 15: iteration 7510/ 125429 | consumed samples: 1922560 | consumed tokens: 3937402880 | elapsed time per iteration (s): 1.02 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.366334E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.116 | TFLOPs: 41.50 | 15: iteration 7520/ 125429 | consumed samples: 1925120 | consumed tokens: 3942645760 | elapsed 
time per iteration (s): 1.08 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.407596E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.089 | TFLOPs: 39.35 | 15: iteration 7530/ 125429 | consumed samples: 1927680 | consumed tokens: 3947888640 | elapsed time per iteration (s): 1.04 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.407306E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.876 | TFLOPs: 40.63 | 15: iteration 7540/ 125429 | consumed samples: 1930240 | consumed tokens: 3953131520 | elapsed time per iteration (s): 1.06 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.388222E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.395 | TFLOPs: 39.89 | 15: iteration 7550/ 125429 | consumed samples: 1932800 | consumed tokens: 3958374400 | elapsed time per iteration (s): 1.07 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.364399E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.636 | TFLOPs: 39.60 | 15: iteration 7560/ 125429 | consumed samples: 1935360 | consumed tokens: 3963617280 | elapsed time per iteration (s): 1.03 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.365983E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.569 | TFLOPs: 41.24 | 15: iteration 7570/ 125429 | consumed samples: 1937920 | consumed tokens: 3968860160 | elapsed time per iteration (s): 1.03 | learning rate: 1.989E-04 | global batch size: 256 | lm loss: 2.373061E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.241 | TFLOPs: 41.02 | 15: 
iteration 7580/ 125429 | consumed samples: 1940480 | consumed tokens: 3974103040 | elapsed time per iteration (s): 1.04 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.397164E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.494 | TFLOPs: 40.74 | 15: iteration 7590/ 125429 | consumed samples: 1943040 | consumed tokens: 3979345920 | elapsed time per iteration (s): 1.04 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.400960E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.410 | TFLOPs: 40.72 | 15: iteration 7600/ 125429 | consumed samples: 1945600 | consumed tokens: 3984588800 | elapsed time per iteration (s): 1.05 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.365741E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.696 | TFLOPs: 40.11 | 15: iteration 7610/ 125429 | consumed samples: 1948160 | consumed tokens: 3989831680 | elapsed time per iteration (s): 1.05 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.414824E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.417 | TFLOPs: 40.23 | 15: iteration 7620/ 125429 | consumed samples: 1950720 | consumed tokens: 3995074560 | elapsed time per iteration (s): 1.06 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.340507E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.391 | TFLOPs: 39.89 | 15: iteration 7630/ 125429 | consumed samples: 1953280 | consumed tokens: 4000317440 | elapsed time per iteration (s): 1.07 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.359977E+00 | grad norm: 0.201 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 240.340 | TFLOPs: 39.72 | 15: iteration 7640/ 125429 | consumed samples: 1955840 | consumed tokens: 4005560320 | elapsed time per iteration (s): 1.04 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.386455E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.368 | TFLOPs: 40.71 | 15: iteration 7650/ 125429 | consumed samples: 1958400 | consumed tokens: 4010803200 | elapsed time per iteration (s): 1.08 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.385065E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.777 | TFLOPs: 39.13 | 15: iteration 7660/ 125429 | consumed samples: 1960960 | consumed tokens: 4016046080 | elapsed time per iteration (s): 1.05 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.335602E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.445 | TFLOPs: 40.40 | 15: iteration 7670/ 125429 | consumed samples: 1963520 | consumed tokens: 4021288960 | elapsed time per iteration (s): 1.07 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.372053E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.019 | TFLOPs: 39.66 | 15: iteration 7680/ 125429 | consumed samples: 1966080 | consumed tokens: 4026531840 | elapsed time per iteration (s): 1.04 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.354946E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.087 | TFLOPs: 40.67 | 15: iteration 7690/ 125429 | consumed samples: 1968640 | consumed tokens: 4031774720 | elapsed time per iteration (s): 1.02 | learning rate: 1.988E-04 | global batch 
size: 256 | lm loss: 2.386822E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.998 | TFLOPs: 41.31 | 15: iteration 7700/ 125429 | consumed samples: 1971200 | consumed tokens: 4037017600 | elapsed time per iteration (s): 1.02 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.365483E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.029 | TFLOPs: 41.32 | 15: iteration 7710/ 125429 | consumed samples: 1973760 | consumed tokens: 4042260480 | elapsed time per iteration (s): 1.08 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.362642E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.999 | TFLOPs: 39.33 | 15: iteration 7720/ 125429 | consumed samples: 1976320 | consumed tokens: 4047503360 | elapsed time per iteration (s): 1.05 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.367798E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.163 | TFLOPs: 40.35 | 15: iteration 7730/ 125429 | consumed samples: 1978880 | consumed tokens: 4052746240 | elapsed time per iteration (s): 1.03 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.390860E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.996 | TFLOPs: 41.15 | 15: iteration 7740/ 125429 | consumed samples: 1981440 | consumed tokens: 4057989120 | elapsed time per iteration (s): 1.03 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.371954E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.662 | TFLOPs: 41.09 | 15: iteration 7750/ 125429 | consumed samples: 1984000 | consumed tokens: 
4063232000 | elapsed time per iteration (s): 1.07 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.371114E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.331 | TFLOPs: 39.39 | 15: iteration 7760/ 125429 | consumed samples: 1986560 | consumed tokens: 4068474880 | elapsed time per iteration (s): 1.04 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.378928E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.672 | TFLOPs: 40.76 | 15: iteration 7770/ 125429 | consumed samples: 1989120 | consumed tokens: 4073717760 | elapsed time per iteration (s): 1.07 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.364859E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.200 | TFLOPs: 39.69 | 15: iteration 7780/ 125429 | consumed samples: 1991680 | consumed tokens: 4078960640 | elapsed time per iteration (s): 1.05 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.404631E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.049 | TFLOPs: 40.17 | 15: iteration 7790/ 125429 | consumed samples: 1994240 | consumed tokens: 4084203520 | elapsed time per iteration (s): 1.04 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.358829E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.985 | TFLOPs: 40.49 | 15: iteration 7800/ 125429 | consumed samples: 1996800 | consumed tokens: 4089446400 | elapsed time per iteration (s): 1.06 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.354717E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.843 | 
TFLOPs: 39.97 | 15: iteration 7810/ 125429 | consumed samples: 1999360 | consumed tokens: 4094689280 | elapsed time per iteration (s): 1.05 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.347880E+00 | grad norm: 0.230 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.061 | TFLOPs: 40.17 | 15: iteration 7820/ 125429 | consumed samples: 2001920 | consumed tokens: 4099932160 | elapsed time per iteration (s): 1.02 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.394502E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.685 | TFLOPs: 41.43 | 15: iteration 7830/ 125429 | consumed samples: 2004480 | consumed tokens: 4105175040 | elapsed time per iteration (s): 1.03 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.349662E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.404 | TFLOPs: 41.05 | 15: iteration 7840/ 125429 | consumed samples: 2007040 | consumed tokens: 4110417920 | elapsed time per iteration (s): 1.05 | learning rate: 1.988E-04 | global batch size: 256 | lm loss: 2.345382E+00 | grad norm: 0.260 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.859 | TFLOPs: 40.30 | 15: iteration 7850/ 125429 | consumed samples: 2009600 | consumed tokens: 4115660800 | elapsed time per iteration (s): 1.08 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.384960E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.128 | TFLOPs: 39.02 | 15: iteration 7860/ 125429 | consumed samples: 2012160 | consumed tokens: 4120903680 | elapsed time per iteration (s): 1.04 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.348106E+00 | grad norm: 0.175 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.439 | TFLOPs: 40.73 | 15: iteration 7870/ 125429 | consumed samples: 2014720 | consumed tokens: 4126146560 | elapsed time per iteration (s): 1.05 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.372031E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.632 | TFLOPs: 40.43 | 15: iteration 7880/ 125429 | consumed samples: 2017280 | consumed tokens: 4131389440 | elapsed time per iteration (s): 1.04 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.352029E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.806 | TFLOPs: 40.62 | 15: iteration 7890/ 125429 | consumed samples: 2019840 | consumed tokens: 4136632320 | elapsed time per iteration (s): 1.08 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.356902E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.268 | TFLOPs: 39.21 | 15: iteration 7900/ 125429 | consumed samples: 2022400 | consumed tokens: 4141875200 | elapsed time per iteration (s): 1.06 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.332466E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.445 | TFLOPs: 39.90 | 15: iteration 7910/ 125429 | consumed samples: 2024960 | consumed tokens: 4147118080 | elapsed time per iteration (s): 1.06 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.378997E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.871 | TFLOPs: 39.81 | 15: iteration 7920/ 125429 | consumed samples: 2027520 | consumed tokens: 4152360960 | elapsed time per iteration (s): 1.04 | learning rate: 
1.987E-04 | global batch size: 256 | lm loss: 2.342395E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.119 | TFLOPs: 40.84 | 15: iteration 7930/ 125429 | consumed samples: 2030080 | consumed tokens: 4157603840 | elapsed time per iteration (s): 1.02 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.384918E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.446 | TFLOPs: 41.39 | 15: iteration 7940/ 125429 | consumed samples: 2032640 | consumed tokens: 4162846720 | elapsed time per iteration (s): 1.05 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.327900E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.762 | TFLOPs: 40.12 | 15: iteration 7950/ 125429 | consumed samples: 2035200 | consumed tokens: 4168089600 | elapsed time per iteration (s): 1.03 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.330933E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.242 | TFLOPs: 41.19 | 15: iteration 7960/ 125429 | consumed samples: 2037760 | consumed tokens: 4173332480 | elapsed time per iteration (s): 1.03 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.400970E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.561 | TFLOPs: 41.08 | 15: iteration 7970/ 125429 | consumed samples: 2040320 | consumed tokens: 4178575360 | elapsed time per iteration (s): 1.06 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.365387E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.953 | TFLOPs: 39.98 | 15: iteration 7980/ 125429 | consumed samples: 
2042880 | consumed tokens: 4183818240 | elapsed time per iteration (s): 1.04 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.356128E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.775 | TFLOPs: 40.78 | 15: iteration 7990/ 125429 | consumed samples: 2045440 | consumed tokens: 4189061120 | elapsed time per iteration (s): 1.04 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.389688E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.865 | TFLOPs: 40.63 | 0: [2022-11-25 22:07:32,834] [INFO] [logging.py:68:log_dist] [Rank 0] step=8000, skipped=0, lr=[0.00019869248521428066, 0.00019869248521428066, 0.00019869248521428066], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 8000/ 125429 | consumed samples: 2048000 | consumed tokens: 4194304000 | elapsed time per iteration (s): 1.07 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.355148E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.183 | TFLOPs: 39.53 | 0: steps: 8000 loss: 2.3512 iter time (s): 1.047 samples/sec: 244.564 15: ------------------------------------------------------------------------------------------ 15: valid loss at iteration 8000 | lm loss value: 2.344799E+00 | lm loss PPL: 1.043117E+01 | 15: ------------------------------------------------------------------------------------------ 0: saving checkpoint at iteration 8000 to checkpoints_1b5 0: [2022-11-25 22:07:33,180] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step8000 is begin to save! 0: [2022-11-25 22:07:33,186] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_01-model_00-model_states.pt... 
0: [2022-11-25 22:07:33,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_01-model_00-model_states.pt. 0: [2022-11-25 22:07:33,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_03-model_00-model_states.pt... 0: [2022-11-25 22:07:33,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_03-model_00-model_states.pt. 0: [2022-11-25 22:07:33,519] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_04-model_00-model_states.pt... 0: [2022-11-25 22:07:33,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_04-model_00-model_states.pt. 0: [2022-11-25 22:07:33,630] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_05-model_00-model_states.pt... 0: [2022-11-25 22:07:33,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_05-model_00-model_states.pt. 0: [2022-11-25 22:07:33,735] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_06-model_00-model_states.pt... 0: [2022-11-25 22:07:33,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_06-model_00-model_states.pt. 0: [2022-11-25 22:07:33,841] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_07-model_00-model_states.pt... 0: [2022-11-25 22:07:33,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_07-model_00-model_states.pt. 0: [2022-11-25 22:07:33,945] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_08-model_00-model_states.pt... 
0: [2022-11-25 22:07:34,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_08-model_00-model_states.pt. 0: [2022-11-25 22:07:34,050] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_09-model_00-model_states.pt... 0: [2022-11-25 22:07:34,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_09-model_00-model_states.pt. 0: [2022-11-25 22:07:34,160] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_10-model_00-model_states.pt... 0: [2022-11-25 22:07:34,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_10-model_00-model_states.pt. 0: [2022-11-25 22:07:34,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_11-model_00-model_states.pt... 0: [2022-11-25 22:07:34,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_11-model_00-model_states.pt. 0: [2022-11-25 22:07:34,369] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_12-model_00-model_states.pt... 0: [2022-11-25 22:07:34,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_12-model_00-model_states.pt. 0: [2022-11-25 22:07:34,472] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_13-model_00-model_states.pt... 0: [2022-11-25 22:07:34,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_13-model_00-model_states.pt. 0: [2022-11-25 22:07:34,577] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_14-model_00-model_states.pt... 
0: [2022-11-25 22:07:34,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_14-model_00-model_states.pt. 0: [2022-11-25 22:07:34,681] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_15-model_00-model_states.pt... 0: [2022-11-25 22:07:34,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_15-model_00-model_states.pt. 0: [2022-11-25 22:07:34,788] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_16-model_00-model_states.pt... 0: [2022-11-25 22:07:34,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_16-model_00-model_states.pt. 0: [2022-11-25 22:07:34,896] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_17-model_00-model_states.pt... 0: [2022-11-25 22:07:35,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_17-model_00-model_states.pt. 0: [2022-11-25 22:07:35,001] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_18-model_00-model_states.pt... 0: [2022-11-25 22:07:35,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_18-model_00-model_states.pt. 0: [2022-11-25 22:07:35,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_19-model_00-model_states.pt... 0: [2022-11-25 22:07:35,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_19-model_00-model_states.pt. 0: [2022-11-25 22:07:35,212] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_20-model_00-model_states.pt... 
0: [2022-11-25 22:07:35,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_20-model_00-model_states.pt. 0: [2022-11-25 22:07:35,318] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_21-model_00-model_states.pt... 0: [2022-11-25 22:07:35,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_21-model_00-model_states.pt. 0: [2022-11-25 22:07:35,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_22-model_00-model_states.pt... 0: [2022-11-25 22:07:35,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_22-model_00-model_states.pt. 0: [2022-11-25 22:07:35,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_23-model_00-model_states.pt... 0: [2022-11-25 22:07:35,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_23-model_00-model_states.pt. 0: [2022-11-25 22:07:35,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_24-model_00-model_states.pt... 0: [2022-11-25 22:07:35,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_24-model_00-model_states.pt. 0: [2022-11-25 22:07:35,746] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_25-model_00-model_states.pt... 0: [2022-11-25 22:07:35,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_25-model_00-model_states.pt. 0: [2022-11-25 22:07:35,849] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_26-model_00-model_states.pt... 
0: [2022-11-25 22:07:35,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_26-model_00-model_states.pt. 0: [2022-11-25 22:07:35,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_27-model_00-model_states.pt... 0: [2022-11-25 22:07:36,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_27-model_00-model_states.pt. 0: [2022-11-25 22:07:36,059] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_28-model_00-model_states.pt... 0: [2022-11-25 22:07:36,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_28-model_00-model_states.pt. 0: [2022-11-25 22:07:36,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_29-model_00-model_states.pt... 0: [2022-11-25 22:07:36,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_29-model_00-model_states.pt. 0: [2022-11-25 22:07:36,273] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_30-model_00-model_states.pt... 0: [2022-11-25 22:07:36,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_30-model_00-model_states.pt. 0: [2022-11-25 22:07:36,374] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/layer_32-model_00-model_states.pt... 0: [2022-11-25 22:07:36,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/layer_32-model_00-model_states.pt. 
0: [2022-11-25 22:07:36,380] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step8000/mp_rank_00_model_states.pt 0: [2022-11-25 22:07:36,380] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/mp_rank_00_model_states.pt... 0: [2022-11-25 22:07:36,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/mp_rank_00_model_states.pt. 0: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
12: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
1: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
10: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
15: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
7: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
3: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
11: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
14: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
3: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:07:36,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step8000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:07:36,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-25 22:07:36,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 22:07:36,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 15: [2022-11-25 22:07:36,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:07:36,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 22:07:36,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 22:07:36,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:07:36,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 22:07:36,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 22:07:36,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:07:36,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:07:36,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 22:07:36,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
15: [2022-11-25 22:07:36,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:07:36,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 22:07:36,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 10: [2022-11-25 22:07:36,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:07:36,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 22:07:36,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 10: [2022-11-25 22:07:36,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:07:36,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 22:07:36,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 22:07:36,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:07:36,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 22:07:36,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
3: [2022-11-25 22:07:36,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:07:36,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 22:07:36,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 22:07:36,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:07:36,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 22:07:36,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 15: [2022-11-25 22:07:36,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:07:36,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 22:07:36,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 11: [2022-11-25 22:07:36,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:07:36,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-25 22:07:36,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 22:07:36,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 10: [2022-11-25 22:07:36,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:07:36,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:07:36,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:07:36,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 22:07:36,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 22:07:36,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:07:36,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 22:07:36,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 22:07:36,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-25 22:07:36,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 22:07:36,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 22:07:36,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:07:36,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 22:07:36,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 22:07:36,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:07:36,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 22:07:36,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 22:07:36,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 9: [2022-11-25 22:07:36,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 22:07:36,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 22:07:36,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
5: [2022-11-25 22:07:36,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:07:36,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 22:07:36,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 12: [2022-11-25 22:07:36,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:07:36,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 22:07:36,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 22:07:36,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:07:36,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:07:36,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 5: [2022-11-25 22:07:36,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 1: [2022-11-25 22:07:36,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 22:07:36,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
2: [2022-11-25 22:07:36,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:07:36,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 22:07:36,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 22:07:36,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:07:36,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 22:07:36,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 22:07:36,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:07:36,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 2: [2022-11-25 22:07:36,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:07:36,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 22:07:36,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 15: [2022-11-25 22:07:36,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
2: [2022-11-25 22:07:36,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 15: [2022-11-25 22:07:36,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 22:07:36,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 14: [2022-11-25 22:07:36,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:07:36,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 22:07:36,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 15: [2022-11-25 22:07:36,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:07:36,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 22:07:36,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 22:07:36,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:07:36,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 22:07:36,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
13: [2022-11-25 22:07:36,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:07:36,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:07:36,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 22:07:36,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 13: [2022-11-25 22:07:36,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:07:36,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 22:07:36,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 8: [2022-11-25 22:07:36,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 22:07:36,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 8: [2022-11-25 22:07:36,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:07:36,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 22:07:36,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
7: [2022-11-25 22:07:36,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:07:36,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 22:07:36,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 22:07:36,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:07:36,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 22:07:36,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 22:07:36,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:07:36,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 22:07:36,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 14: [2022-11-25 22:07:36,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:07:36,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 22:07:36,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
0: [2022-11-25 22:07:36,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:07:36,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 22:07:36,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 9: [2022-11-25 22:07:36,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:07:36,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:07:36,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 22:07:36,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 22:07:36,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 9: [2022-11-25 22:07:36,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 22:07:36,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:07:36,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 22:07:36,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
9: [2022-11-25 22:07:36,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:07:36,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:07:36,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 22:07:36,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 0: [2022-11-25 22:07:36,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:07:36,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 9: [2022-11-25 22:07:36,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 22:07:36,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 22:07:36,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 22:07:36,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:07:36,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 22:07:36,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
5: [2022-11-25 22:07:36,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:07:36,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 22:07:36,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 14: [2022-11-25 22:07:36,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:07:36,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 22:07:36,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 22:07:36,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:07:36,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:07:36,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 22:07:36,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 22:07:36,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 22:07:36,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
10: [2022-11-25 22:07:36,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:07:36,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 22:07:36,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 12: [2022-11-25 22:07:36,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:07:36,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 22:07:36,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 22:07:36,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:07:36,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 22:07:36,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 22:07:36,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:07:36,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 22:07:36,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
13: [2022-11-25 22:07:36,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:07:36,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:07:36,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:07:36,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 13: [2022-11-25 22:07:36,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 22:07:36,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 13: [2022-11-25 22:07:36,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:07:36,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 22:07:36,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:07:36,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 13: [2022-11-25 22:07:36,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 12: [2022-11-25 22:07:36,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
13: [2022-11-25 22:07:36,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 8: [2022-11-25 22:07:36,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 22:07:36,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 12: [2022-11-25 22:07:36,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:07:36,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 22:07:36,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 22:07:36,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:07:36,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 22:07:36,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 22:07:36,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:07:36,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 22:07:36,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
5: [2022-11-25 22:07:36,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:07:36,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 22:07:36,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 15: [2022-11-25 22:07:36,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:07:36,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 22:07:36,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 22:07:36,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:07:36,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 22:07:36,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 22:07:36,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:07:36,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 22:07:36,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
2: [2022-11-25 22:07:36,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:07:36,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:07:36,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:07:36,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 22:07:36,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 22:07:36,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 22:07:36,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 22:07:36,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 22:07:36,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 15: [2022-11-25 22:07:36,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:07:36,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 22:07:36,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
3: [2022-11-25 22:07:36,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:07:36,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 22:07:36,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 22:07:36,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:07:36,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:07:36,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 22:07:36,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 22:07:36,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 22:07:36,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 10: [2022-11-25 22:07:36,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:07:36,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 22:07:36,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
9: [2022-11-25 22:07:36,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:07:36,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 22:07:36,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 22:07:36,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:07:36,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 22:07:36,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 22:07:36,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:07:36,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 22:07:36,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:07:36,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 22:07:36,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 4: [2022-11-25 22:07:36,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-25 22:07:36,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 22:07:36,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 22:07:36,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 22:07:36,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:07:36,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 22:07:36,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 2: [2022-11-25 22:07:36,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:07:36,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 22:07:36,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 10: [2022-11-25 22:07:36,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:07:36,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 22:07:36,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
14: [2022-11-25 22:07:36,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:07:36,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 22:07:36,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 22:07:36,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:07:36,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 22:07:36,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 14: [2022-11-25 22:07:36,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:07:36,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:07:36,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 22:07:36,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 22:07:36,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 14: [2022-11-25 22:07:36,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
5: [2022-11-25 22:07:36,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:07:36,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 22:07:36,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 10: [2022-11-25 22:07:36,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:07:36,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 22:07:36,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 14: [2022-11-25 22:07:36,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:07:36,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 22:07:36,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:07:36,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:07:36,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
10: [2022-11-25 22:07:36,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 12: [2022-11-25 22:07:36,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 10: [2022-11-25 22:07:36,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 12: [2022-11-25 22:07:36,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 22:07:36,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:07:36,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 22:07:36,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 13: [2022-11-25 22:07:36,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:07:36,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:07:36,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 22:07:36,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 13: [2022-11-25 22:07:36,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-25 22:07:36,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 22:07:36,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 13: [2022-11-25 22:07:36,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:07:36,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 22:07:36,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 8: [2022-11-25 22:07:36,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 22:07:36,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 8: [2022-11-25 22:07:36,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:07:36,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 22:07:36,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: [2022-11-25 22:07:36,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 22:07:36,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
2: [2022-11-25 22:07:36,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:07:36,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 22:07:36,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 14: [2022-11-25 22:07:36,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 14: [2022-11-25 22:07:36,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 22:07:36,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 22:07:36,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:07:36,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 22:07:36,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 4: [2022-11-25 22:07:36,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:07:36,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 22:07:36,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 
11: [2022-11-25 22:07:36,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 22:07:36,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 11: [2022-11-25 22:07:36,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:07:36,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 22:07:36,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 11: [2022-11-25 22:07:36,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:07:36,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 22:07:36,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:07:36,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 11: [2022-11-25 22:07:36,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 22:07:36,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 11: [2022-11-25 22:07:36,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-25 22:07:36,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 22:07:36,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 11: [2022-11-25 22:07:36,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:07:36,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:07:36,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 22:07:36,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 22:07:36,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 11: [2022-11-25 22:07:36,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 11: [2022-11-25 22:07:36,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:07:36,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 22:07:36,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 12: [2022-11-25 22:07:36,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-25 22:07:36,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 22:07:36,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 22:07:36,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:07:36,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 22:07:36,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 15: [2022-11-25 22:07:36,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:07:36,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 22:07:36,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 6: [2022-11-25 22:07:36,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:07:36,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 22:07:36,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 7: [2022-11-25 22:07:36,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-25 22:07:36,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 22:07:36,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 22:07:36,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:07:36,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 22:07:36,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 22:07:36,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:07:36,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 22:07:36,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 8: [2022-11-25 22:07:36,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:07:36,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 22:07:36,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 22:07:36,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-25 22:07:36,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 22:07:36,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 22:07:36,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:07:36,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 22:07:36,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 22:07:36,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:07:36,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 22:07:36,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 3: [2022-11-25 22:07:36,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:07:36,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 22:07:36,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 5: [2022-11-25 22:07:36,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-25 22:07:36,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 22:07:36,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 22:07:36,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:07:36,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 22:07:36,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 8: [2022-11-25 22:07:36,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:07:36,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:07:36,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 22:07:36,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 22:07:36,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 8: [2022-11-25 22:07:36,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 1: [2022-11-25 22:07:36,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-25 22:07:36,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step8000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 22:07:36,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step8000 is ready now! 0: successfully saved checkpoint at iteration 8000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3652.59 15: iteration 8010/ 125429 | consumed samples: 2050560 | consumed tokens: 4199546880 | elapsed time per iteration (s): 1.45 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.381443E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 176.355 | TFLOPs: 29.14 | 15: iteration 8020/ 125429 | consumed samples: 2053120 | consumed tokens: 4204789760 | elapsed time per iteration (s): 2.96 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.349688E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 86.456 | TFLOPs: 14.29 | 15: iteration 8030/ 125429 | consumed samples: 2055680 | consumed tokens: 4210032640 | elapsed time per iteration (s): 1.04 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.343285E+00 | grad norm: 0.554 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.801 | TFLOPs: 40.79 | 15: iteration 8040/ 125429 | consumed samples: 2058240 | consumed tokens: 4215275520 | elapsed time per iteration (s): 1.03 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.353644E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.894 | TFLOPs: 40.97 | 15: iteration 8050/ 125429 | consumed samples: 2060800 | consumed tokens: 4220518400 | elapsed time per iteration (s): 1.09 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 
2.340998E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.869 | TFLOPs: 38.81 | 15: iteration 8060/ 125429 | consumed samples: 2063360 | consumed tokens: 4225761280 | elapsed time per iteration (s): 1.03 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.383616E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.710 | TFLOPs: 41.10 | 15: iteration 8070/ 125429 | consumed samples: 2065920 | consumed tokens: 4231004160 | elapsed time per iteration (s): 1.05 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.342100E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.601 | TFLOPs: 40.26 | 15: iteration 8080/ 125429 | consumed samples: 2068480 | consumed tokens: 4236247040 | elapsed time per iteration (s): 1.02 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.340658E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.940 | TFLOPs: 41.30 | 15: iteration 8090/ 125429 | consumed samples: 2071040 | consumed tokens: 4241489920 | elapsed time per iteration (s): 1.03 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.333545E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.043 | TFLOPs: 40.99 | 15: iteration 8100/ 125429 | consumed samples: 2073600 | consumed tokens: 4246732800 | elapsed time per iteration (s): 1.05 | learning rate: 1.987E-04 | global batch size: 256 | lm loss: 2.354166E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.639 | TFLOPs: 40.43 | 15: iteration 8110/ 125429 | consumed samples: 2076160 | consumed tokens: 4251975680 | elapsed 
time per iteration (s): 1.04 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.348911E+00 | grad norm: 0.226 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.342 | TFLOPs: 40.71 | 15: iteration 8120/ 125429 | consumed samples: 2078720 | consumed tokens: 4257218560 | elapsed time per iteration (s): 1.02 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.353625E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.203 | TFLOPs: 41.35 | 15: iteration 8130/ 125429 | consumed samples: 2081280 | consumed tokens: 4262461440 | elapsed time per iteration (s): 1.04 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.349817E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.152 | TFLOPs: 40.68 | 15: iteration 8140/ 125429 | consumed samples: 2083840 | consumed tokens: 4267704320 | elapsed time per iteration (s): 1.05 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.359622E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.648 | TFLOPs: 40.43 | 15: iteration 8150/ 125429 | consumed samples: 2086400 | consumed tokens: 4272947200 | elapsed time per iteration (s): 1.03 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.343233E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.748 | TFLOPs: 41.11 | 15: iteration 8160/ 125429 | consumed samples: 2088960 | consumed tokens: 4278190080 | elapsed time per iteration (s): 1.05 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.361814E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.459 | TFLOPs: 40.23 | 15: 
iteration 8170/ 125429 | consumed samples: 2091520 | consumed tokens: 4283432960 | elapsed time per iteration (s): 1.04 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.344003E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.082 | TFLOPs: 40.67 | 15: iteration 8180/ 125429 | consumed samples: 2094080 | consumed tokens: 4288675840 | elapsed time per iteration (s): 1.07 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.339439E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.607 | TFLOPs: 39.43 | 15: iteration 8190/ 125429 | consumed samples: 2096640 | consumed tokens: 4293918720 | elapsed time per iteration (s): 1.06 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.374050E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.082 | TFLOPs: 39.84 | 15: iteration 8200/ 125429 | consumed samples: 2099200 | consumed tokens: 4299161600 | elapsed time per iteration (s): 1.06 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.409006E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.196 | TFLOPs: 39.86 | 15: iteration 8210/ 125429 | consumed samples: 2101760 | consumed tokens: 4304404480 | elapsed time per iteration (s): 1.06 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.364177E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.591 | TFLOPs: 39.92 | 15: iteration 8220/ 125429 | consumed samples: 2104320 | consumed tokens: 4309647360 | elapsed time per iteration (s): 1.05 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.352079E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 243.733 | TFLOPs: 40.28 | 15: iteration 8230/ 125429 | consumed samples: 2106880 | consumed tokens: 4314890240 | elapsed time per iteration (s): 1.03 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.321708E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.771 | TFLOPs: 40.95 | 15: iteration 8240/ 125429 | consumed samples: 2109440 | consumed tokens: 4320133120 | elapsed time per iteration (s): 1.07 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.347837E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.985 | TFLOPs: 39.49 | 15: iteration 8250/ 125429 | consumed samples: 2112000 | consumed tokens: 4325376000 | elapsed time per iteration (s): 1.02 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.375810E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.835 | TFLOPs: 41.45 | 15: iteration 8260/ 125429 | consumed samples: 2114560 | consumed tokens: 4330618880 | elapsed time per iteration (s): 1.02 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.355879E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.066 | TFLOPs: 41.49 | 15: iteration 8270/ 125429 | consumed samples: 2117120 | consumed tokens: 4335861760 | elapsed time per iteration (s): 1.06 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.350747E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.490 | TFLOPs: 39.74 | 15: iteration 8280/ 125429 | consumed samples: 2119680 | consumed tokens: 4341104640 | elapsed time per iteration (s): 1.05 | learning rate: 1.986E-04 | global batch 
size: 256 | lm loss: 2.344653E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.504 | TFLOPs: 40.41 | 15: iteration 8290/ 125429 | consumed samples: 2122240 | consumed tokens: 4346347520 | elapsed time per iteration (s): 1.05 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.390276E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.754 | TFLOPs: 40.28 | 15: iteration 8300/ 125429 | consumed samples: 2124800 | consumed tokens: 4351590400 | elapsed time per iteration (s): 1.03 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.362851E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.751 | TFLOPs: 41.27 | 15: iteration 8310/ 125429 | consumed samples: 2127360 | consumed tokens: 4356833280 | elapsed time per iteration (s): 1.05 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.341362E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.778 | TFLOPs: 40.45 | 15: iteration 8320/ 125429 | consumed samples: 2129920 | consumed tokens: 4362076160 | elapsed time per iteration (s): 1.03 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.383228E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.754 | TFLOPs: 41.11 | 15: iteration 8330/ 125429 | consumed samples: 2132480 | consumed tokens: 4367319040 | elapsed time per iteration (s): 1.03 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.376267E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.838 | TFLOPs: 41.12 | 15: iteration 8340/ 125429 | consumed samples: 2135040 | consumed tokens: 
4372561920 | elapsed time per iteration (s): 1.03 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.376748E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.349 | TFLOPs: 41.04 | 15: iteration 8350/ 125429 | consumed samples: 2137600 | consumed tokens: 4377804800 | elapsed time per iteration (s): 1.05 | learning rate: 1.986E-04 | global batch size: 256 | lm loss: 2.356102E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.522 | TFLOPs: 40.41 | 15: iteration 8360/ 125429 | consumed samples: 2140160 | consumed tokens: 4383047680 | elapsed time per iteration (s): 1.03 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.362097E+00 | grad norm: 0.246 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.019 | TFLOPs: 40.99 | 15: iteration 8370/ 125429 | consumed samples: 2142720 | consumed tokens: 4388290560 | elapsed time per iteration (s): 1.06 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.322451E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.652 | TFLOPs: 39.93 | 15: iteration 8380/ 125429 | consumed samples: 2145280 | consumed tokens: 4393533440 | elapsed time per iteration (s): 1.13 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.349563E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.557 | TFLOPs: 37.28 | 15: iteration 8390/ 125429 | consumed samples: 2147840 | consumed tokens: 4398776320 | elapsed time per iteration (s): 1.04 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.344206E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.118 | 
TFLOPs: 40.51 | 15: iteration 8400/ 125429 | consumed samples: 2150400 | consumed tokens: 4404019200 | elapsed time per iteration (s): 1.06 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.325971E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.503 | TFLOPs: 39.75 | 15: iteration 8410/ 125429 | consumed samples: 2152960 | consumed tokens: 4409262080 | elapsed time per iteration (s): 1.06 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.311617E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.883 | TFLOPs: 39.81 | 15: iteration 8420/ 125429 | consumed samples: 2155520 | consumed tokens: 4414504960 | elapsed time per iteration (s): 1.05 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.333598E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.332 | TFLOPs: 40.38 | 15: iteration 8430/ 125429 | consumed samples: 2158080 | consumed tokens: 4419747840 | elapsed time per iteration (s): 1.05 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.363456E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.531 | TFLOPs: 40.25 | 15: iteration 8440/ 125429 | consumed samples: 2160640 | consumed tokens: 4424990720 | elapsed time per iteration (s): 1.05 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.341754E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.307 | TFLOPs: 40.37 | 15: iteration 8450/ 125429 | consumed samples: 2163200 | consumed tokens: 4430233600 | elapsed time per iteration (s): 1.02 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.352172E+00 | grad norm: 0.278 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.980 | TFLOPs: 41.48 | 15: iteration 8460/ 125429 | consumed samples: 2165760 | consumed tokens: 4435476480 | elapsed time per iteration (s): 1.04 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.373139E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.893 | TFLOPs: 40.80 | 15: iteration 8470/ 125429 | consumed samples: 2168320 | consumed tokens: 4440719360 | elapsed time per iteration (s): 1.08 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.334894E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.757 | TFLOPs: 39.13 | 15: iteration 8480/ 125429 | consumed samples: 2170880 | consumed tokens: 4445962240 | elapsed time per iteration (s): 1.06 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.331434E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.449 | TFLOPs: 40.07 | 15: iteration 8490/ 125429 | consumed samples: 2173440 | consumed tokens: 4451205120 | elapsed time per iteration (s): 1.03 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.356958E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.932 | TFLOPs: 40.97 | 15: iteration 8500/ 125429 | consumed samples: 2176000 | consumed tokens: 4456448000 | elapsed time per iteration (s): 1.05 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.359311E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.915 | TFLOPs: 40.14 | 15: iteration 8510/ 125429 | consumed samples: 2178560 | consumed tokens: 4461690880 | elapsed time per iteration (s): 1.03 | learning rate: 
1.985E-04 | global batch size: 256 | lm loss: 2.298185E+00 | grad norm: 0.220 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.656 | TFLOPs: 40.93 | 15: iteration 8520/ 125429 | consumed samples: 2181120 | consumed tokens: 4466933760 | elapsed time per iteration (s): 1.04 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.322386E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.192 | TFLOPs: 40.85 | 15: iteration 8530/ 125429 | consumed samples: 2183680 | consumed tokens: 4472176640 | elapsed time per iteration (s): 1.07 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.325436E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.541 | TFLOPs: 39.42 | 15: iteration 8540/ 125429 | consumed samples: 2186240 | consumed tokens: 4477419520 | elapsed time per iteration (s): 1.06 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.336279E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.946 | TFLOPs: 39.98 | 15: iteration 8550/ 125429 | consumed samples: 2188800 | consumed tokens: 4482662400 | elapsed time per iteration (s): 1.04 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.333363E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.755 | TFLOPs: 40.78 | 15: iteration 8560/ 125429 | consumed samples: 2191360 | consumed tokens: 4487905280 | elapsed time per iteration (s): 1.03 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.341337E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.738 | TFLOPs: 40.94 | 15: iteration 8570/ 125429 | consumed samples: 
2193920 | consumed tokens: 4493148160 | elapsed time per iteration (s): 1.06 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.337113E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.638 | TFLOPs: 39.93 | 15: iteration 8580/ 125429 | consumed samples: 2196480 | consumed tokens: 4498391040 | elapsed time per iteration (s): 1.08 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.338629E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.853 | TFLOPs: 39.14 | 15: iteration 8590/ 125429 | consumed samples: 2199040 | consumed tokens: 4503633920 | elapsed time per iteration (s): 1.08 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.365676E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.151 | TFLOPs: 39.03 | 15: iteration 8600/ 125429 | consumed samples: 2201600 | consumed tokens: 4508876800 | elapsed time per iteration (s): 1.11 | learning rate: 1.985E-04 | global batch size: 256 | lm loss: 2.353754E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.800 | TFLOPs: 38.14 | 15: iteration 8610/ 125429 | consumed samples: 2204160 | consumed tokens: 4514119680 | elapsed time per iteration (s): 1.04 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.328748E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.591 | TFLOPs: 40.59 | 15: iteration 8620/ 125429 | consumed samples: 2206720 | consumed tokens: 4519362560 | elapsed time per iteration (s): 1.04 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.364904E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 247.283 | TFLOPs: 40.87 | 15: iteration 8630/ 125429 | consumed samples: 2209280 | consumed tokens: 4524605440 | elapsed time per iteration (s): 1.09 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.321790E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.494 | TFLOPs: 38.92 | 15: iteration 8640/ 125429 | consumed samples: 2211840 | consumed tokens: 4529848320 | elapsed time per iteration (s): 1.24 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.313931E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 206.517 | TFLOPs: 34.13 | 15: iteration 8650/ 125429 | consumed samples: 2214400 | consumed tokens: 4535091200 | elapsed time per iteration (s): 1.07 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.306645E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.272 | TFLOPs: 39.54 | 15: iteration 8660/ 125429 | consumed samples: 2216960 | consumed tokens: 4540334080 | elapsed time per iteration (s): 1.05 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.340159E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.487 | TFLOPs: 40.40 | 15: iteration 8670/ 125429 | consumed samples: 2219520 | consumed tokens: 4545576960 | elapsed time per iteration (s): 1.05 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.329381E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.252 | TFLOPs: 40.20 | 15: iteration 8680/ 125429 | consumed samples: 2222080 | consumed tokens: 4550819840 | elapsed time per iteration (s): 1.05 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.329564E+00 | grad norm: 
0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.067 | TFLOPs: 40.33 | 15: iteration 8690/ 125429 | consumed samples: 2224640 | consumed tokens: 4556062720 | elapsed time per iteration (s): 1.11 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.319048E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.780 | TFLOPs: 38.14 | 15: iteration 8700/ 125429 | consumed samples: 2227200 | consumed tokens: 4561305600 | elapsed time per iteration (s): 1.10 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.328130E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.577 | TFLOPs: 38.60 | 15: iteration 8710/ 125429 | consumed samples: 2229760 | consumed tokens: 4566548480 | elapsed time per iteration (s): 1.05 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.312982E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.077 | TFLOPs: 40.34 | 15: iteration 8720/ 125429 | consumed samples: 2232320 | consumed tokens: 4571791360 | elapsed time per iteration (s): 1.04 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.312995E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.465 | TFLOPs: 40.57 | 15: iteration 8730/ 125429 | consumed samples: 2234880 | consumed tokens: 4577034240 | elapsed time per iteration (s): 1.05 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.354786E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.886 | TFLOPs: 40.30 | 15: iteration 8740/ 125429 | consumed samples: 2237440 | consumed tokens: 4582277120 | elapsed time per iteration (s): 1.03 
| learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.301544E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.646 | TFLOPs: 40.93 | 15: iteration 8750/ 125429 | consumed samples: 2240000 | consumed tokens: 4587520000 | elapsed time per iteration (s): 1.03 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.342788E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.110 | TFLOPs: 41.17 | 15: iteration 8760/ 125429 | consumed samples: 2242560 | consumed tokens: 4592762880 | elapsed time per iteration (s): 1.07 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.331400E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.780 | TFLOPs: 39.63 | 15: iteration 8770/ 125429 | consumed samples: 2245120 | consumed tokens: 4598005760 | elapsed time per iteration (s): 1.06 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.304289E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.257 | TFLOPs: 39.87 | 15: iteration 8780/ 125429 | consumed samples: 2247680 | consumed tokens: 4603248640 | elapsed time per iteration (s): 1.07 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.321809E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.210 | TFLOPs: 39.70 | 15: iteration 8790/ 125429 | consumed samples: 2250240 | consumed tokens: 4608491520 | elapsed time per iteration (s): 1.03 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.312299E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.287 | TFLOPs: 41.03 | 15: iteration 8800/ 125429 | 
consumed samples: 2252800 | consumed tokens: 4613734400 | elapsed time per iteration (s): 1.04 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.334965E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.521 | TFLOPs: 40.74 | 15: iteration 8810/ 125429 | consumed samples: 2255360 | consumed tokens: 4618977280 | elapsed time per iteration (s): 1.05 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.315027E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.598 | TFLOPs: 40.42 | 15: iteration 8820/ 125429 | consumed samples: 2257920 | consumed tokens: 4624220160 | elapsed time per iteration (s): 1.04 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.334022E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.907 | TFLOPs: 40.80 | 15: iteration 8830/ 125429 | consumed samples: 2260480 | consumed tokens: 4629463040 | elapsed time per iteration (s): 1.04 | learning rate: 1.984E-04 | global batch size: 256 | lm loss: 2.334705E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.284 | TFLOPs: 40.87 | 15: iteration 8840/ 125429 | consumed samples: 2263040 | consumed tokens: 4634705920 | elapsed time per iteration (s): 1.05 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.340638E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.836 | TFLOPs: 40.30 | 15: iteration 8850/ 125429 | consumed samples: 2265600 | consumed tokens: 4639948800 | elapsed time per iteration (s): 1.08 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.327136E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 236.795 | TFLOPs: 39.13 | 15: iteration 8860/ 125429 | consumed samples: 2268160 | consumed tokens: 4645191680 | elapsed time per iteration (s): 1.07 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.337167E+00 | grad norm: 0.205 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.494 | TFLOPs: 39.58 | 15: iteration 8870/ 125429 | consumed samples: 2270720 | consumed tokens: 4650434560 | elapsed time per iteration (s): 1.16 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.316153E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.436 | TFLOPs: 36.59 | 15: iteration 8880/ 125429 | consumed samples: 2273280 | consumed tokens: 4655677440 | elapsed time per iteration (s): 1.08 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.328076E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.419 | TFLOPs: 39.07 | 15: iteration 8890/ 125429 | consumed samples: 2275840 | consumed tokens: 4660920320 | elapsed time per iteration (s): 1.02 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.304864E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.578 | TFLOPs: 41.41 | 15: iteration 8900/ 125429 | consumed samples: 2278400 | consumed tokens: 4666163200 | elapsed time per iteration (s): 1.08 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.334786E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.954 | TFLOPs: 39.32 | 15: iteration 8910/ 125429 | consumed samples: 2280960 | consumed tokens: 4671406080 | elapsed time per iteration (s): 1.04 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 
2.325448E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.018 | TFLOPs: 40.82 | 15: iteration 8920/ 125429 | consumed samples: 2283520 | consumed tokens: 4676648960 | elapsed time per iteration (s): 1.05 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.306573E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.665 | TFLOPs: 40.10 | 15: iteration 8930/ 125429 | consumed samples: 2286080 | consumed tokens: 4681891840 | elapsed time per iteration (s): 1.04 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.314795E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.289 | TFLOPs: 40.87 | 15: iteration 8940/ 125429 | consumed samples: 2288640 | consumed tokens: 4687134720 | elapsed time per iteration (s): 1.04 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.313657E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.208 | TFLOPs: 40.85 | 15: iteration 8950/ 125429 | consumed samples: 2291200 | consumed tokens: 4692377600 | elapsed time per iteration (s): 1.05 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.323570E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.508 | TFLOPs: 40.41 | 15: iteration 8960/ 125429 | consumed samples: 2293760 | consumed tokens: 4697620480 | elapsed time per iteration (s): 1.03 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.314300E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.408 | TFLOPs: 41.05 | 15: iteration 8970/ 125429 | consumed samples: 2296320 | consumed tokens: 4702863360 | elapsed 
time per iteration (s): 1.02 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.320775E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.150 | TFLOPs: 41.34 |
15: iteration 8980/ 125429 | consumed samples: 2298880 | consumed tokens: 4708106240 | elapsed time per iteration (s): 1.05 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.303098E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.592 | TFLOPs: 40.42 |
15: iteration 8990/ 125429 | consumed samples: 2301440 | consumed tokens: 4713349120 | elapsed time per iteration (s): 1.04 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.309269E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.129 | TFLOPs: 40.67 |
15: iteration 9000/ 125429 | consumed samples: 2304000 | consumed tokens: 4718592000 | elapsed time per iteration (s): 1.05 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.315954E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.878 | TFLOPs: 40.30 |
15: ------------------------------------------------------------------------------------------
15: valid loss at iteration 9000 | lm loss value: 2.221355E+00 | lm loss PPL: 9.219819E+00 |
15: ------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 9000 to checkpoints_1b5
0: [2022-11-25 22:25:29,418] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step9000 is begin to save!
0: [2022-11-25 22:25:29,424] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_01-model_00-model_states.pt...
0: [2022-11-25 22:25:29,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_01-model_00-model_states.pt. 0: [2022-11-25 22:25:29,649] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_03-model_00-model_states.pt... 0: [2022-11-25 22:25:29,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_03-model_00-model_states.pt. 0: [2022-11-25 22:25:29,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_04-model_00-model_states.pt... 0: [2022-11-25 22:25:29,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_04-model_00-model_states.pt. 0: [2022-11-25 22:25:29,861] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_05-model_00-model_states.pt... 0: [2022-11-25 22:25:29,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_05-model_00-model_states.pt. 0: [2022-11-25 22:25:29,965] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_06-model_00-model_states.pt... 0: [2022-11-25 22:25:30,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_06-model_00-model_states.pt. 0: [2022-11-25 22:25:30,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_07-model_00-model_states.pt... 0: [2022-11-25 22:25:30,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_07-model_00-model_states.pt. 0: [2022-11-25 22:25:30,170] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_08-model_00-model_states.pt... 
0: [2022-11-25 22:25:30,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_08-model_00-model_states.pt. 0: [2022-11-25 22:25:30,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_09-model_00-model_states.pt... 0: [2022-11-25 22:25:30,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_09-model_00-model_states.pt. 0: [2022-11-25 22:25:30,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_10-model_00-model_states.pt... 0: [2022-11-25 22:25:30,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_10-model_00-model_states.pt. 0: [2022-11-25 22:25:30,484] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_11-model_00-model_states.pt... 0: [2022-11-25 22:25:30,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_11-model_00-model_states.pt. 0: [2022-11-25 22:25:30,588] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_12-model_00-model_states.pt... 0: [2022-11-25 22:25:30,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_12-model_00-model_states.pt. 0: [2022-11-25 22:25:30,695] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_13-model_00-model_states.pt... 0: [2022-11-25 22:25:30,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_13-model_00-model_states.pt. 0: [2022-11-25 22:25:30,803] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_14-model_00-model_states.pt... 
0: [2022-11-25 22:25:30,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_14-model_00-model_states.pt. 0: [2022-11-25 22:25:30,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_15-model_00-model_states.pt... 0: [2022-11-25 22:25:31,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_15-model_00-model_states.pt. 0: [2022-11-25 22:25:31,019] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_16-model_00-model_states.pt... 0: [2022-11-25 22:25:31,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_16-model_00-model_states.pt. 0: [2022-11-25 22:25:31,124] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_17-model_00-model_states.pt... 0: [2022-11-25 22:25:31,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_17-model_00-model_states.pt. 0: [2022-11-25 22:25:31,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_18-model_00-model_states.pt... 0: [2022-11-25 22:25:31,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_18-model_00-model_states.pt. 0: [2022-11-25 22:25:31,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_19-model_00-model_states.pt... 0: [2022-11-25 22:25:31,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_19-model_00-model_states.pt. 0: [2022-11-25 22:25:31,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_20-model_00-model_states.pt... 
0: [2022-11-25 22:25:31,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_20-model_00-model_states.pt. 0: [2022-11-25 22:25:31,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_21-model_00-model_states.pt... 0: [2022-11-25 22:25:31,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_21-model_00-model_states.pt. 0: [2022-11-25 22:25:31,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_22-model_00-model_states.pt... 0: [2022-11-25 22:25:31,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_22-model_00-model_states.pt. 0: [2022-11-25 22:25:31,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_23-model_00-model_states.pt... 0: [2022-11-25 22:25:31,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_23-model_00-model_states.pt. 0: [2022-11-25 22:25:31,855] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_24-model_00-model_states.pt... 0: [2022-11-25 22:25:31,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_24-model_00-model_states.pt. 0: [2022-11-25 22:25:31,951] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_25-model_00-model_states.pt... 0: [2022-11-25 22:25:32,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_25-model_00-model_states.pt. 0: [2022-11-25 22:25:32,052] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_26-model_00-model_states.pt... 
0: [2022-11-25 22:25:32,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_26-model_00-model_states.pt. 0: [2022-11-25 22:25:32,154] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_27-model_00-model_states.pt... 0: [2022-11-25 22:25:32,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_27-model_00-model_states.pt. 0: [2022-11-25 22:25:32,258] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_28-model_00-model_states.pt... 0: [2022-11-25 22:25:32,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_28-model_00-model_states.pt. 0: [2022-11-25 22:25:32,356] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_29-model_00-model_states.pt... 0: [2022-11-25 22:25:32,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_29-model_00-model_states.pt. 0: [2022-11-25 22:25:32,457] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_30-model_00-model_states.pt... 0: [2022-11-25 22:25:32,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_30-model_00-model_states.pt. 0: [2022-11-25 22:25:32,557] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/layer_32-model_00-model_states.pt... 0: [2022-11-25 22:25:32,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/layer_32-model_00-model_states.pt. 
0: [2022-11-25 22:25:32,562] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step9000/mp_rank_00_model_states.pt 0: [2022-11-25 22:25:32,562] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/mp_rank_00_model_states.pt... 0: [2022-11-25 22:25:32,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/mp_rank_00_model_states.pt. 0: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
8: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
10: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
15: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
9: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
12: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
8: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
9: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:25:32,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step9000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:25:32,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-25 22:25:32,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 22:25:32,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 14: [2022-11-25 22:25:32,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:25:32,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 22:25:32,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 22:25:32,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:25:32,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 22:25:32,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 8: [2022-11-25 22:25:32,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:25:32,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 22:25:32,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 22:25:32,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
0: [2022-11-25 22:25:32,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:25:32,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 22:25:32,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 11: [2022-11-25 22:25:32,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:25:32,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:25:32,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:25:32,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 22:25:32,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 22:25:32,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 10: [2022-11-25 22:25:32,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 14: [2022-11-25 22:25:32,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-25 22:25:32,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 22:25:32,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 22:25:32,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:25:32,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:25:32,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 22:25:32,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 22:25:32,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 22:25:32,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 22:25:32,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:25:32,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 22:25:32,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 15: [2022-11-25 22:25:32,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
14: [2022-11-25 22:25:32,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:25:32,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 22:25:32,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 22:25:32,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:25:32,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 22:25:32,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 22:25:32,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:25:32,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 22:25:32,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 11: [2022-11-25 22:25:32,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 22:25:32,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 11: [2022-11-25 22:25:32,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-25 22:25:32,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 22:25:32,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:25:32,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 11: [2022-11-25 22:25:32,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 22:25:32,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 11: [2022-11-25 22:25:32,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:25:32,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:25:32,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 22:25:32,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 22:25:32,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 11: [2022-11-25 22:25:32,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 14: [2022-11-25 22:25:32,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-25 22:25:32,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 22:25:32,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 9: [2022-11-25 22:25:32,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:25:32,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 22:25:32,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 9: [2022-11-25 22:25:32,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:25:32,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 22:25:32,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 22:25:32,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:25:32,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 22:25:32,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 22:25:32,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-25 22:25:32,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 22:25:32,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 10: [2022-11-25 22:25:32,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:25:32,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 22:25:32,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 9: [2022-11-25 22:25:32,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:25:32,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 22:25:32,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 22:25:32,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:25:32,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 22:25:32,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 22:25:32,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-25 22:25:32,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 22:25:32,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 9: [2022-11-25 22:25:32,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:25:32,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:25:32,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:25:32,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 22:25:32,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 22:25:32,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 4: [2022-11-25 22:25:32,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:25:32,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 9: [2022-11-25 22:25:32,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 9: [2022-11-25 22:25:32,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
4: [2022-11-25 22:25:32,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 22:25:32,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 22:25:32,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:25:32,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 22:25:32,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 22:25:32,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:25:32,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:25:32,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 22:25:32,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 22:25:32,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 22:25:32,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 8: [2022-11-25 22:25:32,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-25 22:25:32,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 22:25:32,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 9: [2022-11-25 22:25:32,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:25:32,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 22:25:32,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 22:25:32,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:25:32,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 22:25:32,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 15: [2022-11-25 22:25:32,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 22:25:32,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 15: [2022-11-25 22:25:32,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-25 22:25:32,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 22:25:32,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 15: [2022-11-25 22:25:32,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:25:32,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 22:25:32,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 15: [2022-11-25 22:25:32,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:25:32,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 22:25:32,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 15: [2022-11-25 22:25:32,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:25:32,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 6: [2022-11-25 22:25:32,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:25:32,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
6: [2022-11-25 22:25:32,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:25:32,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 22:25:32,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 22:25:32,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 22:25:32,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 22:25:32,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:25:32,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 22:25:32,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 22:25:32,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:25:32,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 22:25:32,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 22:25:32,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-25 22:25:32,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 22:25:32,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 22:25:32,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:25:32,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 22:25:32,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 22:25:32,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:25:32,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 22:25:32,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 14: [2022-11-25 22:25:32,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:25:32,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 22:25:32,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 22:25:32,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-25 22:25:32,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 22:25:32,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 22:25:32,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:25:32,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 22:25:32,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 22:25:32,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:25:32,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 22:25:32,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 15: [2022-11-25 22:25:32,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:25:32,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:25:32,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 22:25:32,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
14: [2022-11-25 22:25:32,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 22:25:32,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 15: [2022-11-25 22:25:32,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:25:32,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 22:25:32,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 22:25:32,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:25:32,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 22:25:32,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 22:25:32,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:25:32,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
7: [2022-11-25 22:25:32,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 8: [2022-11-25 22:25:32,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 22:25:32,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 22:25:32,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 22:25:32,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:25:32,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 22:25:32,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 15: [2022-11-25 22:25:32,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:25:32,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 22:25:32,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 22:25:32,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:25:32,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
7: [2022-11-25 22:25:32,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 22:25:32,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 22:25:32,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 22:25:32,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: [2022-11-25 22:25:32,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:25:32,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 22:25:32,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 11: [2022-11-25 22:25:32,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:25:32,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 22:25:32,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 6: [2022-11-25 22:25:32,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-25 22:25:32,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 22:25:32,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 14: [2022-11-25 22:25:32,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:25:32,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 22:25:32,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 22:25:32,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:25:32,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 22:25:32,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 14: [2022-11-25 22:25:32,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:25:32,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:25:32,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
14: [2022-11-25 22:25:32,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 5: [2022-11-25 22:25:32,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 22:25:32,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 14: [2022-11-25 22:25:32,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 22:25:32,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 5: [2022-11-25 22:25:32,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 12: [2022-11-25 22:25:32,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:25:32,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:25:32,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 22:25:32,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 7: [2022-11-25 22:25:32,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 22:25:32,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
4: [2022-11-25 22:25:32,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:25:32,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 22:25:32,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 12: [2022-11-25 22:25:32,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:25:32,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:25:32,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 22:25:32,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 22:25:32,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 12: [2022-11-25 22:25:32,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 22:25:32,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:25:32,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 22:25:32,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
9: [2022-11-25 22:25:32,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:25:32,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 22:25:32,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 22:25:32,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:25:32,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 22:25:32,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 22:25:32,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:25:32,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 22:25:32,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 22:25:32,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:25:32,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 22:25:32,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
4: [2022-11-25 22:25:32,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:25:32,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 22:25:32,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 22:25:32,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:25:32,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 22:25:32,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 22:25:32,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:25:32,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 22:25:32,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 4: [2022-11-25 22:25:32,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:25:32,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 22:25:32,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
0: [2022-11-25 22:25:32,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 22:25:32,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 22:25:32,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:25:32,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 22:25:32,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 12: [2022-11-25 22:25:32,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:25:32,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 22:25:32,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 11: [2022-11-25 22:25:32,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:25:32,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:25:32,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-25 22:25:32,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 22:25:32,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 11: [2022-11-25 22:25:32,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 22:25:32,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 22:25:32,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 11: [2022-11-25 22:25:32,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 22:25:32,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:25:32,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 22:25:32,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 2: [2022-11-25 22:25:32,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:25:32,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 22:25:32,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
3: [2022-11-25 22:25:32,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:25:32,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:25:32,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:25:32,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 22:25:32,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 22:25:32,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 22:25:32,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 3: [2022-11-25 22:25:32,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 3: [2022-11-25 22:25:32,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 3: [2022-11-25 22:25:32,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:25:32,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 22:25:32,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
3: [2022-11-25 22:25:32,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:25:32,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 22:25:32,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 12: [2022-11-25 22:25:32,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:25:32,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 22:25:32,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 12: [2022-11-25 22:25:32,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:25:32,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 22:25:32,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 22:25:32,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:25:32,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 22:25:32,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
3: [2022-11-25 22:25:32,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:25:32,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 22:25:32,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 22:25:32,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:25:32,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 22:25:32,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 10: [2022-11-25 22:25:32,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:25:32,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 22:25:32,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 10: [2022-11-25 22:25:32,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:25:32,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 22:25:32,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
8: [2022-11-25 22:25:32,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:25:32,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 22:25:32,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 13: [2022-11-25 22:25:32,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:25:32,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:25:32,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:25:32,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:25:32,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:25:32,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:25:32,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:25:32,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-25 22:25:32,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 22:25:32,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 22:25:32,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 22:25:32,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 22:25:32,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 13: [2022-11-25 22:25:32,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 13: [2022-11-25 22:25:32,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 22:25:32,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 22:25:32,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 22:25:32,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 22:25:32,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 13: [2022-11-25 22:25:32,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
13: [2022-11-25 22:25:32,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 13: [2022-11-25 22:25:32,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 13: [2022-11-25 22:25:32,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 13: [2022-11-25 22:25:32,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 12: [2022-11-25 22:25:32,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:25:32,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 22:25:32,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 22:25:32,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:25:32,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 22:25:32,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 12: [2022-11-25 22:25:32,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:25:32,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 22:25:32,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
10: [2022-11-25 22:25:32,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:25:32,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 22:25:32,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 8: [2022-11-25 22:25:32,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:25:32,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 22:25:32,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 8: [2022-11-25 22:25:32,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:25:32,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 22:25:32,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 10: [2022-11-25 22:25:32,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:25:32,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 22:25:32,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
8: [2022-11-25 22:25:32,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:25:32,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 22:25:32,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 22:25:32,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:25:32,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 22:25:32,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 1: [2022-11-25 22:25:32,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:25:32,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 22:25:32,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 8: [2022-11-25 22:25:32,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:25:32,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 22:25:32,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 
3: [2022-11-25 22:25:33,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:25:33,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 22:25:33,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 3: [2022-11-25 22:25:33,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:25:33,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step9000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 22:25:33,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step9000 is ready now! 0: successfully saved checkpoint at iteration 9000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3645.93 15: iteration 9010/ 125429 | consumed samples: 2306560 | consumed tokens: 4723834880 | elapsed time per iteration (s): 1.47 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.333499E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 174.539 | TFLOPs: 28.84 | 15: iteration 9020/ 125429 | consumed samples: 2309120 | consumed tokens: 4729077760 | elapsed time per iteration (s): 1.05 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.308587E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.954 | TFLOPs: 40.32 | 15: iteration 9030/ 125429 | consumed samples: 2311680 | consumed tokens: 4734320640 | elapsed time per iteration (s): 1.11 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.338722E+00 | grad norm: 0.178 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.348 | TFLOPs: 38.07 | 15: iteration 9040/ 125429 | consumed samples: 2314240 | consumed tokens: 4739563520 | elapsed time per iteration (s): 1.04 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.328606E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.703 | TFLOPs: 40.77 | 15: iteration 9050/ 125429 | consumed samples: 2316800 | consumed tokens: 4744806400 | elapsed time per iteration (s): 1.05 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.343776E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.415 | TFLOPs: 40.39 | 15: iteration 9060/ 125429 | consumed samples: 2319360 | consumed tokens: 4750049280 | elapsed time per iteration (s): 1.08 | learning rate: 1.983E-04 | global batch size: 256 | lm loss: 2.339422E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.810 | TFLOPs: 39.30 | 15: iteration 9070/ 125429 | consumed samples: 2321920 | consumed tokens: 4755292160 | elapsed time per iteration (s): 1.09 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.324071E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.181 | TFLOPs: 38.70 | 15: iteration 9080/ 125429 | consumed samples: 2324480 | consumed tokens: 4760535040 | elapsed time per iteration (s): 1.03 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.349616E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.771 | TFLOPs: 40.95 | 15: iteration 9090/ 125429 | consumed samples: 2327040 | consumed tokens: 4765777920 | elapsed time per iteration (s): 1.05 | learning rate: 1.982E-04 | 
global batch size: 256 | lm loss: 2.308586E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.539 | TFLOPs: 40.41 | 15: iteration 9100/ 125429 | consumed samples: 2329600 | consumed tokens: 4771020800 | elapsed time per iteration (s): 1.11 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.334917E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.503 | TFLOPs: 38.09 | 15: iteration 9110/ 125429 | consumed samples: 2332160 | consumed tokens: 4776263680 | elapsed time per iteration (s): 1.07 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.302777E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.183 | TFLOPs: 39.53 | 15: iteration 9120/ 125429 | consumed samples: 2334720 | consumed tokens: 4781506560 | elapsed time per iteration (s): 1.04 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.326863E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.189 | TFLOPs: 40.85 | 15: iteration 9130/ 125429 | consumed samples: 2337280 | consumed tokens: 4786749440 | elapsed time per iteration (s): 1.05 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.322400E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.829 | TFLOPs: 40.46 | 15: iteration 9140/ 125429 | consumed samples: 2339840 | consumed tokens: 4791992320 | elapsed time per iteration (s): 1.04 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.315656E+00 | grad norm: 0.405 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.693 | TFLOPs: 40.60 | 15: iteration 9150/ 125429 | consumed samples: 2342400 | 
consumed tokens: 4797235200 | elapsed time per iteration (s): 1.09 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.710644E+00 | grad norm: 3.089 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.530 | TFLOPs: 38.92 | 15: iteration 9160/ 125429 | consumed samples: 2344960 | consumed tokens: 4802478080 | elapsed time per iteration (s): 1.05 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.608184E+00 | grad norm: 0.992 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.257 | TFLOPs: 40.37 | 15: iteration 9170/ 125429 | consumed samples: 2347520 | consumed tokens: 4807720960 | elapsed time per iteration (s): 1.04 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.605718E+00 | grad norm: 0.733 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.985 | TFLOPs: 40.82 | 15: iteration 9180/ 125429 | consumed samples: 2350080 | consumed tokens: 4812963840 | elapsed time per iteration (s): 1.09 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.466282E+00 | grad norm: 0.249 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.681 | TFLOPs: 38.95 | 15: iteration 9190/ 125429 | consumed samples: 2352640 | consumed tokens: 4818206720 | elapsed time per iteration (s): 1.05 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.403138E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.435 | TFLOPs: 40.39 | 15: iteration 9200/ 125429 | consumed samples: 2355200 | consumed tokens: 4823449600 | elapsed time per iteration (s): 1.04 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.365140E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 245.358 | TFLOPs: 40.55 | 15: iteration 9210/ 125429 | consumed samples: 2357760 | consumed tokens: 4828692480 | elapsed time per iteration (s): 1.06 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.348307E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.183 | TFLOPs: 40.02 | 15: iteration 9220/ 125429 | consumed samples: 2360320 | consumed tokens: 4833935360 | elapsed time per iteration (s): 1.03 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.299304E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.867 | TFLOPs: 40.96 | 15: iteration 9230/ 125429 | consumed samples: 2362880 | consumed tokens: 4839178240 | elapsed time per iteration (s): 1.03 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.355075E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.518 | TFLOPs: 40.90 | 15: iteration 9240/ 125429 | consumed samples: 2365440 | consumed tokens: 4844421120 | elapsed time per iteration (s): 1.04 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.337190E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.212 | TFLOPs: 40.52 | 15: iteration 9250/ 125429 | consumed samples: 2368000 | consumed tokens: 4849664000 | elapsed time per iteration (s): 1.05 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.330148E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.429 | TFLOPs: 40.39 | 15: iteration 9260/ 125429 | consumed samples: 2370560 | consumed tokens: 4854906880 | elapsed time per iteration (s): 1.05 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.358901E+00 | grad norm: 0.166 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.985 | TFLOPs: 40.32 | 15: iteration 9270/ 125429 | consumed samples: 2373120 | consumed tokens: 4860149760 | elapsed time per iteration (s): 1.06 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.345176E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.504 | TFLOPs: 39.91 | 15: iteration 9280/ 125429 | consumed samples: 2375680 | consumed tokens: 4865392640 | elapsed time per iteration (s): 1.06 | learning rate: 1.982E-04 | global batch size: 256 | lm loss: 2.344360E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.749 | TFLOPs: 39.79 | 15: iteration 9290/ 125429 | consumed samples: 2378240 | consumed tokens: 4870635520 | elapsed time per iteration (s): 1.06 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.328873E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.537 | TFLOPs: 39.92 | 15: iteration 9300/ 125429 | consumed samples: 2380800 | consumed tokens: 4875878400 | elapsed time per iteration (s): 1.05 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.312926E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.450 | TFLOPs: 40.40 | 15: iteration 9310/ 125429 | consumed samples: 2383360 | consumed tokens: 4881121280 | elapsed time per iteration (s): 1.06 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.329465E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.473 | TFLOPs: 40.07 | 15: iteration 9320/ 125429 | consumed samples: 2385920 | consumed tokens: 4886364160 | elapsed time per iteration (s): 1.07 | learning 
rate: 1.981E-04 | global batch size: 256 | lm loss: 2.326059E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.787 | TFLOPs: 39.63 | 15: iteration 9330/ 125429 | consumed samples: 2388480 | consumed tokens: 4891607040 | elapsed time per iteration (s): 1.03 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.353717E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.549 | TFLOPs: 41.24 | 15: iteration 9340/ 125429 | consumed samples: 2391040 | consumed tokens: 4896849920 | elapsed time per iteration (s): 1.04 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.336661E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.935 | TFLOPs: 40.81 | 15: iteration 9350/ 125429 | consumed samples: 2393600 | consumed tokens: 4902092800 | elapsed time per iteration (s): 1.05 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.337102E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.977 | TFLOPs: 40.32 | 15: iteration 9360/ 125429 | consumed samples: 2396160 | consumed tokens: 4907335680 | elapsed time per iteration (s): 1.03 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.294192E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.707 | TFLOPs: 40.94 | 15: iteration 9370/ 125429 | consumed samples: 2398720 | consumed tokens: 4912578560 | elapsed time per iteration (s): 1.03 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.316108E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.682 | TFLOPs: 41.26 | 15: iteration 9380/ 125429 | consumed samples: 
2401280 | consumed tokens: 4917821440 | elapsed time per iteration (s): 1.02 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.333578E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.376 | TFLOPs: 41.54 | 15: iteration 9390/ 125429 | consumed samples: 2403840 | consumed tokens: 4923064320 | elapsed time per iteration (s): 1.04 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.300900E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.684 | TFLOPs: 40.77 | 15: iteration 9400/ 125429 | consumed samples: 2406400 | consumed tokens: 4928307200 | elapsed time per iteration (s): 1.04 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.326499E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.325 | TFLOPs: 40.87 | 15: iteration 9410/ 125429 | consumed samples: 2408960 | consumed tokens: 4933550080 | elapsed time per iteration (s): 1.08 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.281841E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.158 | TFLOPs: 39.19 | 15: iteration 9420/ 125429 | consumed samples: 2411520 | consumed tokens: 4938792960 | elapsed time per iteration (s): 1.05 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.338991E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.472 | TFLOPs: 40.40 | 15: iteration 9430/ 125429 | consumed samples: 2414080 | consumed tokens: 4944035840 | elapsed time per iteration (s): 1.03 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.306700E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 248.619 | TFLOPs: 41.09 | 15: iteration 9440/ 125429 | consumed samples: 2416640 | consumed tokens: 4949278720 | elapsed time per iteration (s): 1.09 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.329247E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.109 | TFLOPs: 38.69 | 15: iteration 9450/ 125429 | consumed samples: 2419200 | consumed tokens: 4954521600 | elapsed time per iteration (s): 1.05 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.361487E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.080 | TFLOPs: 40.17 | 15: iteration 9460/ 125429 | consumed samples: 2421760 | consumed tokens: 4959764480 | elapsed time per iteration (s): 1.03 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.333030E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.026 | TFLOPs: 41.15 | 15: iteration 9470/ 125429 | consumed samples: 2424320 | consumed tokens: 4965007360 | elapsed time per iteration (s): 1.07 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.337310E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.287 | TFLOPs: 39.71 | 15: iteration 9480/ 125429 | consumed samples: 2426880 | consumed tokens: 4970250240 | elapsed time per iteration (s): 1.04 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.289723E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.680 | TFLOPs: 40.60 | 15: iteration 9490/ 125429 | consumed samples: 2429440 | consumed tokens: 4975493120 | elapsed time per iteration (s): 1.06 | learning rate: 1.981E-04 | global batch size: 256 | lm loss: 2.293320E+00 | grad norm: 
0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.778 | TFLOPs: 39.96 | 15: iteration 9500/ 125429 | consumed samples: 2432000 | consumed tokens: 4980736000 | elapsed time per iteration (s): 1.02 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.322640E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.045 | TFLOPs: 41.32 | 15: iteration 9510/ 125429 | consumed samples: 2434560 | consumed tokens: 4985978880 | elapsed time per iteration (s): 1.06 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.318530E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.335 | TFLOPs: 39.88 | 15: iteration 9520/ 125429 | consumed samples: 2437120 | consumed tokens: 4991221760 | elapsed time per iteration (s): 1.04 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.299805E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.841 | TFLOPs: 40.63 | 15: iteration 9530/ 125429 | consumed samples: 2439680 | consumed tokens: 4996464640 | elapsed time per iteration (s): 1.04 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.335969E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.596 | TFLOPs: 40.75 | 15: iteration 9540/ 125429 | consumed samples: 2442240 | consumed tokens: 5001707520 | elapsed time per iteration (s): 1.06 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.333853E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.409 | TFLOPs: 39.89 | 15: iteration 9550/ 125429 | consumed samples: 2444800 | consumed tokens: 5006950400 | elapsed time per iteration (s): 1.03 
| learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.324942E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.899 | TFLOPs: 40.97 | 15: iteration 9560/ 125429 | consumed samples: 2447360 | consumed tokens: 5012193280 | elapsed time per iteration (s): 1.05 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.320370E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.141 | TFLOPs: 40.35 | 15: iteration 9570/ 125429 | consumed samples: 2449920 | consumed tokens: 5017436160 | elapsed time per iteration (s): 1.03 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.257398E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.181 | TFLOPs: 41.18 | 15: iteration 9580/ 125429 | consumed samples: 2452480 | consumed tokens: 5022679040 | elapsed time per iteration (s): 1.04 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.318854E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.621 | TFLOPs: 40.59 | 15: iteration 9590/ 125429 | consumed samples: 2455040 | consumed tokens: 5027921920 | elapsed time per iteration (s): 1.05 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.289188E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.552 | TFLOPs: 40.41 | 15: iteration 9600/ 125429 | consumed samples: 2457600 | consumed tokens: 5033164800 | elapsed time per iteration (s): 1.03 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.298219E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.412 | TFLOPs: 41.22 | 15: iteration 9610/ 125429 | 
consumed samples: 2460160 | consumed tokens: 5038407680 | elapsed time per iteration (s): 1.04 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.294017E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.639 | TFLOPs: 40.59 | 15: iteration 9620/ 125429 | consumed samples: 2462720 | consumed tokens: 5043650560 | elapsed time per iteration (s): 1.05 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.349439E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.603 | TFLOPs: 40.26 | 15: iteration 9630/ 125429 | consumed samples: 2465280 | consumed tokens: 5048893440 | elapsed time per iteration (s): 1.06 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.324759E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.856 | TFLOPs: 39.97 | 15: iteration 9640/ 125429 | consumed samples: 2467840 | consumed tokens: 5054136320 | elapsed time per iteration (s): 1.03 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.282494E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.213 | TFLOPs: 41.18 | 15: iteration 9650/ 125429 | consumed samples: 2470400 | consumed tokens: 5059379200 | elapsed time per iteration (s): 1.03 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.319426E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.299 | TFLOPs: 41.03 | 15: iteration 9660/ 125429 | consumed samples: 2472960 | consumed tokens: 5064622080 | elapsed time per iteration (s): 1.05 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.321736E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 244.223 | TFLOPs: 40.36 | 15: iteration 9670/ 125429 | consumed samples: 2475520 | consumed tokens: 5069864960 | elapsed time per iteration (s): 1.03 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.292236E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.743 | TFLOPs: 40.94 | 15: iteration 9680/ 125429 | consumed samples: 2478080 | consumed tokens: 5075107840 | elapsed time per iteration (s): 1.05 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.294775E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.353 | TFLOPs: 40.38 | 15: iteration 9690/ 125429 | consumed samples: 2480640 | consumed tokens: 5080350720 | elapsed time per iteration (s): 1.06 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.296801E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.886 | TFLOPs: 39.81 | 15: iteration 9700/ 125429 | consumed samples: 2483200 | consumed tokens: 5085593600 | elapsed time per iteration (s): 1.05 | learning rate: 1.980E-04 | global batch size: 256 | lm loss: 2.305326E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.104 | TFLOPs: 40.17 | 15: iteration 9710/ 125429 | consumed samples: 2485760 | consumed tokens: 5090836480 | elapsed time per iteration (s): 1.05 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.299860E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.994 | TFLOPs: 40.32 | 15: iteration 9720/ 125429 | consumed samples: 2488320 | consumed tokens: 5096079360 | elapsed time per iteration (s): 1.11 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 
2.304501E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.500 | TFLOPs: 38.26 | 15: iteration 9730/ 125429 | consumed samples: 2490880 | consumed tokens: 5101322240 | elapsed time per iteration (s): 1.07 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.350603E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.194 | TFLOPs: 39.53 | 15: iteration 9740/ 125429 | consumed samples: 2493440 | consumed tokens: 5106565120 | elapsed time per iteration (s): 1.03 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.320892E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.749 | TFLOPs: 41.11 | 15: iteration 9750/ 125429 | consumed samples: 2496000 | consumed tokens: 5111808000 | elapsed time per iteration (s): 1.03 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.298744E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.624 | TFLOPs: 41.09 | 15: iteration 9760/ 125429 | consumed samples: 2498560 | consumed tokens: 5117050880 | elapsed time per iteration (s): 1.04 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.291939E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.515 | TFLOPs: 40.57 | 15: iteration 9770/ 125429 | consumed samples: 2501120 | consumed tokens: 5122293760 | elapsed time per iteration (s): 1.04 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.306836E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.141 | TFLOPs: 40.68 | 15: iteration 9780/ 125429 | consumed samples: 2503680 | consumed tokens: 5127536640 | elapsed 
time per iteration (s): 1.04 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.331941E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.522 | TFLOPs: 40.57 | 15: iteration 9790/ 125429 | consumed samples: 2506240 | consumed tokens: 5132779520 | elapsed time per iteration (s): 1.04 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.291431E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.828 | TFLOPs: 40.79 | 15: iteration 9800/ 125429 | consumed samples: 2508800 | consumed tokens: 5138022400 | elapsed time per iteration (s): 1.03 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.319186E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.326 | TFLOPs: 41.04 | 15: iteration 9810/ 125429 | consumed samples: 2511360 | consumed tokens: 5143265280 | elapsed time per iteration (s): 1.05 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.315336E+00 | grad norm: 0.324 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.723 | TFLOPs: 40.11 | 15: iteration 9820/ 125429 | consumed samples: 2513920 | consumed tokens: 5148508160 | elapsed time per iteration (s): 1.04 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.290438E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.072 | TFLOPs: 40.83 | 15: iteration 9830/ 125429 | consumed samples: 2516480 | consumed tokens: 5153751040 | elapsed time per iteration (s): 1.03 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.315734E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.449 | TFLOPs: 41.06 | 15: 
iteration 9840/ 125429 | consumed samples: 2519040 | consumed tokens: 5158993920 | elapsed time per iteration (s): 1.03 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.271445E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.433 | TFLOPs: 41.22 | 15: iteration 9850/ 125429 | consumed samples: 2521600 | consumed tokens: 5164236800 | elapsed time per iteration (s): 1.03 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.312198E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.699 | TFLOPs: 40.93 | 15: iteration 9860/ 125429 | consumed samples: 2524160 | consumed tokens: 5169479680 | elapsed time per iteration (s): 1.03 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.324542E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.526 | TFLOPs: 41.07 | 15: iteration 9870/ 125429 | consumed samples: 2526720 | consumed tokens: 5174722560 | elapsed time per iteration (s): 1.03 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.276197E+00 | grad norm: 0.262 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.023 | TFLOPs: 40.99 | 15: iteration 9880/ 125429 | consumed samples: 2529280 | consumed tokens: 5179965440 | elapsed time per iteration (s): 1.07 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.428371E+00 | grad norm: 3.840 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.243 | TFLOPs: 39.70 | 15: iteration 9890/ 125429 | consumed samples: 2531840 | consumed tokens: 5185208320 | elapsed time per iteration (s): 1.06 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.574827E+00 | grad norm: 0.857 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 242.062 | TFLOPs: 40.00 | 15: iteration 9900/ 125429 | consumed samples: 2534400 | consumed tokens: 5190451200 | elapsed time per iteration (s): 1.03 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.506908E+00 | grad norm: 0.404 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.725 | TFLOPs: 40.94 | 15: iteration 9910/ 125429 | consumed samples: 2536960 | consumed tokens: 5195694080 | elapsed time per iteration (s): 1.03 | learning rate: 1.979E-04 | global batch size: 256 | lm loss: 2.403526E+00 | grad norm: 0.199 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.930 | TFLOPs: 40.97 | 15: iteration 9920/ 125429 | consumed samples: 2539520 | consumed tokens: 5200936960 | elapsed time per iteration (s): 1.03 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.393493E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.602 | TFLOPs: 40.92 | 15: iteration 9930/ 125429 | consumed samples: 2542080 | consumed tokens: 5206179840 | elapsed time per iteration (s): 1.04 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.349631E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.183 | TFLOPs: 40.68 | 15: iteration 9940/ 125429 | consumed samples: 2544640 | consumed tokens: 5211422720 | elapsed time per iteration (s): 1.05 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.322291E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.758 | TFLOPs: 40.12 | 15: iteration 9950/ 125429 | consumed samples: 2547200 | consumed tokens: 5216665600 | elapsed time per iteration (s): 1.08 | learning rate: 1.978E-04 | global batch 
size: 256 | lm loss: 2.314428E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.228 | TFLOPs: 39.20 | 15: iteration 9960/ 125429 | consumed samples: 2549760 | consumed tokens: 5221908480 | elapsed time per iteration (s): 1.05 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.337733E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.238 | TFLOPs: 40.20 | 15: iteration 9970/ 125429 | consumed samples: 2552320 | consumed tokens: 5227151360 | elapsed time per iteration (s): 1.04 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.294594E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.302 | TFLOPs: 40.87 | 15: iteration 9980/ 125429 | consumed samples: 2554880 | consumed tokens: 5232394240 | elapsed time per iteration (s): 1.03 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.340214E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.449 | TFLOPs: 41.22 | 15: iteration 9990/ 125429 | consumed samples: 2557440 | consumed tokens: 5237637120 | elapsed time per iteration (s): 1.03 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.336258E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.406 | TFLOPs: 40.89 | 0: [2022-11-25 22:43:01,105] [INFO] [logging.py:68:log_dist] [Rank 0] step=10000, skipped=0, lr=[0.0001978058657283659, 0.0001978058657283659, 0.0001978058657283659], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 10000/ 125429 | consumed samples: 2560000 | consumed tokens: 5242880000 | elapsed time per iteration (s): 1.04 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.342684E+00 | grad 
norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.926 | TFLOPs: 40.64 |
0: steps: 10000 loss: 2.3076 iter time (s): 1.057 samples/sec: 242.102
15: -------------------------------------------------------------------------------------------
15: valid loss at iteration 10000 | lm loss value: 2.252762E+00 | lm loss PPL: 9.513973E+00 |
15: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 10000 to checkpoints_1b5
0: [2022-11-25 22:43:01,511] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step10000 is begin to save!
0: [2022-11-25 22:43:01,622] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_01-model_00-model_states.pt...
0: [2022-11-25 22:43:01,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_01-model_00-model_states.pt.
0: [2022-11-25 22:43:01,998] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_03-model_00-model_states.pt...
0: [2022-11-25 22:43:02,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_03-model_00-model_states.pt.
0: [2022-11-25 22:43:02,116] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_04-model_00-model_states.pt...
0: [2022-11-25 22:43:02,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_04-model_00-model_states.pt.
0: [2022-11-25 22:43:02,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_05-model_00-model_states.pt...
0: [2022-11-25 22:43:02,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_05-model_00-model_states.pt.
0: [2022-11-25 22:43:02,354] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_06-model_00-model_states.pt... 0: [2022-11-25 22:43:02,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_06-model_00-model_states.pt. 0: [2022-11-25 22:43:02,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_07-model_00-model_states.pt... 0: [2022-11-25 22:43:02,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_07-model_00-model_states.pt. 0: [2022-11-25 22:43:02,588] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_08-model_00-model_states.pt... 0: [2022-11-25 22:43:02,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_08-model_00-model_states.pt. 0: [2022-11-25 22:43:02,701] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_09-model_00-model_states.pt... 0: [2022-11-25 22:43:02,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_09-model_00-model_states.pt. 0: [2022-11-25 22:43:02,815] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_10-model_00-model_states.pt... 0: [2022-11-25 22:43:02,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_10-model_00-model_states.pt. 0: [2022-11-25 22:43:02,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_11-model_00-model_states.pt... 0: [2022-11-25 22:43:03,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_11-model_00-model_states.pt. 
0: [2022-11-25 22:43:03,041] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_12-model_00-model_states.pt... 0: [2022-11-25 22:43:03,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_12-model_00-model_states.pt. 0: [2022-11-25 22:43:03,154] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_13-model_00-model_states.pt... 0: [2022-11-25 22:43:03,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_13-model_00-model_states.pt. 0: [2022-11-25 22:43:03,267] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_14-model_00-model_states.pt... 0: [2022-11-25 22:43:03,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_14-model_00-model_states.pt. 0: [2022-11-25 22:43:03,382] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_15-model_00-model_states.pt... 0: [2022-11-25 22:43:03,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_15-model_00-model_states.pt. 0: [2022-11-25 22:43:03,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_16-model_00-model_states.pt... 0: [2022-11-25 22:43:03,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_16-model_00-model_states.pt. 0: [2022-11-25 22:43:03,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_17-model_00-model_states.pt... 0: [2022-11-25 22:43:03,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_17-model_00-model_states.pt. 
0: [2022-11-25 22:43:03,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_18-model_00-model_states.pt... 0: [2022-11-25 22:43:03,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_18-model_00-model_states.pt. 0: [2022-11-25 22:43:03,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_19-model_00-model_states.pt... 0: [2022-11-25 22:43:03,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_19-model_00-model_states.pt. 0: [2022-11-25 22:43:03,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_20-model_00-model_states.pt... 0: [2022-11-25 22:43:04,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_20-model_00-model_states.pt. 0: [2022-11-25 22:43:04,050] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_21-model_00-model_states.pt... 0: [2022-11-25 22:43:04,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_21-model_00-model_states.pt. 0: [2022-11-25 22:43:04,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_22-model_00-model_states.pt... 0: [2022-11-25 22:43:04,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_22-model_00-model_states.pt. 0: [2022-11-25 22:43:04,271] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_23-model_00-model_states.pt... 0: [2022-11-25 22:43:04,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_23-model_00-model_states.pt. 
0: [2022-11-25 22:43:04,384] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_24-model_00-model_states.pt... 0: [2022-11-25 22:43:04,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_24-model_00-model_states.pt. 0: [2022-11-25 22:43:04,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_25-model_00-model_states.pt... 0: [2022-11-25 22:43:04,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_25-model_00-model_states.pt. 0: [2022-11-25 22:43:04,603] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_26-model_00-model_states.pt... 0: [2022-11-25 22:43:04,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_26-model_00-model_states.pt. 0: [2022-11-25 22:43:04,712] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_27-model_00-model_states.pt... 0: [2022-11-25 22:43:04,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_27-model_00-model_states.pt. 0: [2022-11-25 22:43:04,820] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_28-model_00-model_states.pt... 0: [2022-11-25 22:43:04,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_28-model_00-model_states.pt. 0: [2022-11-25 22:43:04,928] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_29-model_00-model_states.pt... 0: [2022-11-25 22:43:05,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_29-model_00-model_states.pt. 
0: [2022-11-25 22:43:05,038] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_30-model_00-model_states.pt... 0: [2022-11-25 22:43:05,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_30-model_00-model_states.pt. 0: [2022-11-25 22:43:05,150] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/layer_32-model_00-model_states.pt... 0: [2022-11-25 22:43:05,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/layer_32-model_00-model_states.pt. 0: [2022-11-25 22:43:05,155] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step10000/mp_rank_00_model_states.pt 0: [2022-11-25 22:43:05,155] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/mp_rank_00_model_states.pt... 0: [2022-11-25 22:43:05,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/mp_rank_00_model_states.pt. 0: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
10: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
12: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
1: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
5: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
9: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
7: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
6: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
3: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
11: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 0: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 14: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
7: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 5: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
14: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 2: [2022-11-25 22:43:05,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step10000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 12: [2022-11-25 22:43:05,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:43:05,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 22:43:05,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 22:43:05,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:43:05,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 22:43:05,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 22:43:05,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:43:05,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:43:05,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 22:43:05,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
10: [2022-11-25 22:43:05,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:43:05,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 22:43:05,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 12: [2022-11-25 22:43:05,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:43:05,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 22:43:05,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 10: [2022-11-25 22:43:05,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:43:05,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 22:43:05,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 22:43:05,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:43:05,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 22:43:05,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
5: [2022-11-25 22:43:05,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:43:05,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:43:05,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:43:05,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 22:43:05,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 22:43:05,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 22:43:05,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 22:43:05,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 22:43:05,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 12: [2022-11-25 22:43:05,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:43:05,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 22:43:05,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
10: [2022-11-25 22:43:05,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:43:05,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 22:43:05,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 10: [2022-11-25 22:43:05,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:43:05,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 22:43:05,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 12: [2022-11-25 22:43:05,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:43:05,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:43:05,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 22:43:05,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 12: [2022-11-25 22:43:05,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 22:43:05,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
8: [2022-11-25 22:43:05,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:43:05,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 22:43:05,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 8: [2022-11-25 22:43:05,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:43:05,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 22:43:05,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 14: [2022-11-25 22:43:05,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:43:05,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 22:43:05,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 14: [2022-11-25 22:43:05,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:43:05,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 22:43:05,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
14: [2022-11-25 22:43:05,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:43:05,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 22:43:05,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 14: [2022-11-25 22:43:05,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:43:05,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:43:05,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 22:43:05,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 22:43:05,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 14: [2022-11-25 22:43:05,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 14: [2022-11-25 22:43:05,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:43:05,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 22:43:05,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
14: [2022-11-25 22:43:05,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:43:05,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 22:43:05,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 12: [2022-11-25 22:43:05,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:43:05,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 22:43:05,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 22:43:05,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:43:05,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 22:43:05,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 22:43:05,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:43:05,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 22:43:05,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-25 22:43:05,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 22:43:05,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 22:43:05,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:43:05,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:43:05,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 22:43:05,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 22:43:05,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 22:43:05,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 22:43:05,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 22:43:05,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:43:05,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-25 22:43:05,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 22:43:05,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 22:43:05,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:43:05,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 22:43:05,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 22:43:05,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 22:43:05,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 10: [2022-11-25 22:43:05,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:43:05,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 22:43:05,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 9: [2022-11-25 22:43:05,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-25 22:43:05,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 22:43:05,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 9: [2022-11-25 22:43:05,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:43:05,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 22:43:05,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 9: [2022-11-25 22:43:05,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:43:05,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 22:43:05,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 9: [2022-11-25 22:43:05,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:43:05,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 22:43:05,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 9: [2022-11-25 22:43:05,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-25 22:43:05,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 22:43:05,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 8: [2022-11-25 22:43:05,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:43:05,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 22:43:05,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 8: [2022-11-25 22:43:05,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:43:05,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 22:43:05,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 22:43:05,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:43:05,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 22:43:05,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 22:43:05,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-25 22:43:05,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 22:43:05,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 22:43:05,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:43:05,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:43:05,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 22:43:05,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 22:43:05,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:43:05,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 22:43:05,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 5: [2022-11-25 22:43:05,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 22:43:05,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 22:43:05,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
3: [2022-11-25 22:43:05,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:43:05,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:43:05,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 22:43:05,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 22:43:05,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 22:43:05,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 10: [2022-11-25 22:43:05,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:43:05,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 22:43:05,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-25 22:43:05,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 22:43:05,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 22:43:05,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 22:43:05,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 10: [2022-11-25 22:43:05,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 10: [2022-11-25 22:43:05,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 22:43:05,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:43:05,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 22:43:05,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 22:43:05,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:43:05,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 22:43:05,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
7: [2022-11-25 22:43:05,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:43:05,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 22:43:05,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 12: [2022-11-25 22:43:05,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:43:05,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 22:43:05,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 12: [2022-11-25 22:43:05,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 22:43:05,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 22:43:05,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 22:43:05,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:43:05,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-25 22:43:05,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 9: [2022-11-25 22:43:05,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:43:05,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 7: [2022-11-25 22:43:05,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 22:43:05,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 22:43:05,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 9: [2022-11-25 22:43:05,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 22:43:05,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 22:43:05,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 22:43:05,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 7: [2022-11-25 22:43:05,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-25 22:43:05,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 22:43:05,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 8: [2022-11-25 22:43:05,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:43:05,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 22:43:05,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 8: [2022-11-25 22:43:05,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 22:43:05,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 22:43:05,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 22:43:05,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:43:05,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 22:43:05,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
1: [2022-11-25 22:43:05,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 22:43:05,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 22:43:05,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:43:05,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 22:43:05,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 22:43:05,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:43:05,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 22:43:05,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 22:43:05,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:43:05,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:43:05,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 22:43:05,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
3: [2022-11-25 22:43:05,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:43:05,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 0: [2022-11-25 22:43:05,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:43:05,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 22:43:05,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 22:43:05,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 22:43:05,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:43:05,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 22:43:05,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 22:43:05,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:43:05,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 22:43:05,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
0: [2022-11-25 22:43:05,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 22:43:05,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 22:43:05,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 22:43:05,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:43:05,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 22:43:05,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 22:43:05,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 22:43:05,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 22:43:05,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:43:05,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 22:43:05,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 22:43:05,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-25 22:43:05,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 22:43:05,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 14: [2022-11-25 22:43:05,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 22:43:05,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 22:43:05,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 9: [2022-11-25 22:43:05,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:43:05,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 22:43:05,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 9: [2022-11-25 22:43:05,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 22:43:05,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 22:43:05,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 13: [2022-11-25 22:43:05,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-25 22:43:05,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 22:43:05,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 13: [2022-11-25 22:43:05,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:43:05,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:43:05,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 22:43:05,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 22:43:05,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:43:05,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 22:43:05,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 13: [2022-11-25 22:43:05,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 13: [2022-11-25 22:43:05,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 13: [2022-11-25 22:43:05,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-25 22:43:05,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 22:43:05,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 13: [2022-11-25 22:43:05,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:43:05,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:43:05,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 22:43:05,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 22:43:05,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 13: [2022-11-25 22:43:05,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 13: [2022-11-25 22:43:05,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 22:43:05,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 22:43:05,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 22:43:05,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-25 22:43:05,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 22:43:05,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 4: [2022-11-25 22:43:05,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 22:43:05,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 22:43:05,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 15: [2022-11-25 22:43:05,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:43:05,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:43:05,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:43:05,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 22:43:05,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-25 22:43:05,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 22:43:05,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 22:43:05,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 15: [2022-11-25 22:43:05,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 22:43:05,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 15: [2022-11-25 22:43:05,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 15: [2022-11-25 22:43:05,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:43:05,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 15: [2022-11-25 22:43:05,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:43:05,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 22:43:05,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 22:43:05,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-25 22:43:05,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 22:43:05,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 15: [2022-11-25 22:43:05,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 15: [2022-11-25 22:43:05,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 22:43:05,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 22:43:05,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 15: [2022-11-25 22:43:05,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 22:43:05,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:43:05,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 22:43:05,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 22:43:05,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-25 22:43:05,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 22:43:05,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 1: [2022-11-25 22:43:05,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 22:43:05,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 22:43:05,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 3: [2022-11-25 22:43:05,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 22:43:05,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 22:43:05,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 22:43:05,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:43:05,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:43:05,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 22:43:05,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-25 22:43:05,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 22:43:05,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 22:43:05,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 22:43:05,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 22:43:05,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:43:05,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 22:43:05,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 22:43:05,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 22:43:05,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:43:05,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:43:05,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-25 22:43:05,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 22:43:05,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 22:43:05,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 22:43:05,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 22:43:05,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 22:43:05,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 22:43:05,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 22:43:05,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 2: [2022-11-25 22:43:05,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 0: [2022-11-25 22:43:05,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 22:43:05,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 8: [2022-11-25 22:43:05,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-25 22:43:05,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 22:43:05,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 22:43:05,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:43:05,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:43:05,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:43:05,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 22:43:05,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:43:05,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 22:43:05,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 22:43:05,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-25 22:43:05,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 22:43:05,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 22:43:05,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 22:43:05,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 22:43:05,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 22:43:05,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 22:43:05,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 22:43:05,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:43:05,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 22:43:05,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 22:43:05,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-25 22:43:05,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 22:43:05,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 6: [2022-11-25 22:43:05,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 22:43:05,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 22:43:05,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 11: [2022-11-25 22:43:05,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:43:05,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:43:05,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:43:05,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 22:43:05,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 22:43:05,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
11: [2022-11-25 22:43:05,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 22:43:05,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 11: [2022-11-25 22:43:05,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 11: [2022-11-25 22:43:05,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:43:05,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:43:05,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 22:43:05,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 22:43:05,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 11: [2022-11-25 22:43:05,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 11: [2022-11-25 22:43:05,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:43:05,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-25 22:43:05,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 22:43:05,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 22:43:05,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 11: [2022-11-25 22:43:05,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 11: [2022-11-25 22:43:05,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 22:43:05,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step10000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 22:43:05,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step10000 is ready now! 
0: successfully saved checkpoint at iteration 10000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 4267.64 15: iteration 10010/ 125429 | consumed samples: 2562560 | consumed tokens: 5248122880 | elapsed time per iteration (s): 1.51 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.310484E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 169.606 | TFLOPs: 28.03 | 15: iteration 10020/ 125429 | consumed samples: 2565120 | consumed tokens: 5253365760 | elapsed time per iteration (s): 1.04 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.334328E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.334 | TFLOPs: 40.71 | 15: iteration 10030/ 125429 | consumed samples: 2567680 | consumed tokens: 5258608640 | elapsed time per iteration (s): 1.04 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.334648E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.034 | TFLOPs: 40.82 | 15: iteration 10040/ 125429 | consumed samples: 2570240 | consumed tokens: 5263851520 | elapsed time per iteration (s): 1.03 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.332454E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.803 | TFLOPs: 40.95 | 15: iteration 10050/ 125429 | consumed samples: 2572800 | consumed tokens: 5269094400 | elapsed time per iteration (s): 1.03 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.310797E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.778 | TFLOPs: 41.11 | 15: iteration 10060/ 125429 | consumed samples: 2575360 | consumed tokens: 5274337280 | elapsed time per iteration (s): 1.03 | learning rate: 
1.978E-04 | global batch size: 256 | lm loss: 2.314296E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.670 | TFLOPs: 41.26 | 15: iteration 10070/ 125429 | consumed samples: 2577920 | consumed tokens: 5279580160 | elapsed time per iteration (s): 1.04 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.312878E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.253 | TFLOPs: 40.86 | 15: iteration 10080/ 125429 | consumed samples: 2580480 | consumed tokens: 5284823040 | elapsed time per iteration (s): 1.02 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.279253E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.527 | TFLOPs: 41.40 | 15: iteration 10090/ 125429 | consumed samples: 2583040 | consumed tokens: 5290065920 | elapsed time per iteration (s): 1.03 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.294195E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.348 | TFLOPs: 41.21 | 15: iteration 10100/ 125429 | consumed samples: 2585600 | consumed tokens: 5295308800 | elapsed time per iteration (s): 1.06 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.324403E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.862 | TFLOPs: 39.80 | 15: iteration 10110/ 125429 | consumed samples: 2588160 | consumed tokens: 5300551680 | elapsed time per iteration (s): 1.07 | learning rate: 1.978E-04 | global batch size: 256 | lm loss: 2.296793E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.844 | TFLOPs: 39.47 | 15: iteration 10120/ 125429 | consumed samples: 
2590720 | consumed tokens: 5305794560 | elapsed time per iteration (s): 1.04 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.283787E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.887 | TFLOPs: 40.80 | 15: iteration 10130/ 125429 | consumed samples: 2593280 | consumed tokens: 5311037440 | elapsed time per iteration (s): 1.02 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.307284E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.487 | TFLOPs: 41.56 | 15: iteration 10140/ 125429 | consumed samples: 2595840 | consumed tokens: 5316280320 | elapsed time per iteration (s): 1.04 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.319422E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.970 | TFLOPs: 40.81 | 15: iteration 10150/ 125429 | consumed samples: 2598400 | consumed tokens: 5321523200 | elapsed time per iteration (s): 1.08 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.285163E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.333 | TFLOPs: 39.06 | 15: iteration 10160/ 125429 | consumed samples: 2600960 | consumed tokens: 5326766080 | elapsed time per iteration (s): 1.03 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.305692E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.686 | TFLOPs: 40.93 | 15: iteration 10170/ 125429 | consumed samples: 2603520 | consumed tokens: 5332008960 | elapsed time per iteration (s): 1.06 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.325339E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 242.509 | TFLOPs: 40.08 | 15: iteration 10180/ 125429 | consumed samples: 2606080 | consumed tokens: 5337251840 | elapsed time per iteration (s): 1.07 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.297879E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.390 | TFLOPs: 39.40 | 15: iteration 10190/ 125429 | consumed samples: 2608640 | consumed tokens: 5342494720 | elapsed time per iteration (s): 1.07 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.312371E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.190 | TFLOPs: 39.69 | 15: iteration 10200/ 125429 | consumed samples: 2611200 | consumed tokens: 5347737600 | elapsed time per iteration (s): 1.06 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.297820E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.576 | TFLOPs: 39.92 | 15: iteration 10210/ 125429 | consumed samples: 2613760 | consumed tokens: 5352980480 | elapsed time per iteration (s): 1.05 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.301426E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.648 | TFLOPs: 40.43 | 15: iteration 10220/ 125429 | consumed samples: 2616320 | consumed tokens: 5358223360 | elapsed time per iteration (s): 1.06 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.280518E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.737 | TFLOPs: 39.78 | 15: iteration 10230/ 125429 | consumed samples: 2618880 | consumed tokens: 5363466240 | elapsed time per iteration (s): 1.03 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.296683E+00 | grad 
norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.722 | TFLOPs: 41.10 | 15: iteration 10240/ 125429 | consumed samples: 2621440 | consumed tokens: 5368709120 | elapsed time per iteration (s): 1.03 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.314938E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.373 | TFLOPs: 40.88 | 15: iteration 10250/ 125429 | consumed samples: 2624000 | consumed tokens: 5373952000 | elapsed time per iteration (s): 1.06 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.306275E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.533 | TFLOPs: 40.08 | 15: iteration 10260/ 125429 | consumed samples: 2626560 | consumed tokens: 5379194880 | elapsed time per iteration (s): 1.04 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.278708E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.367 | TFLOPs: 40.71 | 15: iteration 10270/ 125429 | consumed samples: 2629120 | consumed tokens: 5384437760 | elapsed time per iteration (s): 1.04 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.304192E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.900 | TFLOPs: 40.64 | 15: iteration 10280/ 125429 | consumed samples: 2631680 | consumed tokens: 5389680640 | elapsed time per iteration (s): 1.05 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.313257E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.803 | TFLOPs: 40.13 | 15: iteration 10290/ 125429 | consumed samples: 2634240 | consumed tokens: 5394923520 | elapsed time per 
iteration (s): 1.04 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.325956E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.589 | TFLOPs: 40.59 | 15: iteration 10300/ 125429 | consumed samples: 2636800 | consumed tokens: 5400166400 | elapsed time per iteration (s): 1.04 | learning rate: 1.977E-04 | global batch size: 256 | lm loss: 2.281271E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.138 | TFLOPs: 40.51 | 15: iteration 10310/ 125429 | consumed samples: 2639360 | consumed tokens: 5405409280 | elapsed time per iteration (s): 1.03 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.311242E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.731 | TFLOPs: 40.94 | 15: iteration 10320/ 125429 | consumed samples: 2641920 | consumed tokens: 5410652160 | elapsed time per iteration (s): 1.04 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.329705E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.254 | TFLOPs: 40.86 | 15: iteration 10330/ 125429 | consumed samples: 2644480 | consumed tokens: 5415895040 | elapsed time per iteration (s): 1.06 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.288211E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.472 | TFLOPs: 40.07 | 15: iteration 10340/ 125429 | consumed samples: 2647040 | consumed tokens: 5421137920 | elapsed time per iteration (s): 1.04 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.329193E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.631 | TFLOPs: 40.59 | 15: 
iteration 10350/ 125429 | consumed samples: 2649600 | consumed tokens: 5426380800 | elapsed time per iteration (s): 1.06 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.300633E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.573 | TFLOPs: 40.09 | 15: iteration 10360/ 125429 | consumed samples: 2652160 | consumed tokens: 5431623680 | elapsed time per iteration (s): 1.05 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.305770E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.687 | TFLOPs: 40.27 | 15: iteration 10370/ 125429 | consumed samples: 2654720 | consumed tokens: 5436866560 | elapsed time per iteration (s): 1.05 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.300572E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.721 | TFLOPs: 40.28 | 15: iteration 10380/ 125429 | consumed samples: 2657280 | consumed tokens: 5442109440 | elapsed time per iteration (s): 1.06 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.295005E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.559 | TFLOPs: 40.08 | 15: iteration 10390/ 125429 | consumed samples: 2659840 | consumed tokens: 5447352320 | elapsed time per iteration (s): 1.04 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.282403E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.339 | TFLOPs: 40.54 | 15: iteration 10400/ 125429 | consumed samples: 2662400 | consumed tokens: 5452595200 | elapsed time per iteration (s): 1.05 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.295027E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 244.418 | TFLOPs: 40.39 | 15: iteration 10410/ 125429 | consumed samples: 2664960 | consumed tokens: 5457838080 | elapsed time per iteration (s): 1.06 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.286945E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.998 | TFLOPs: 39.83 | 15: iteration 10420/ 125429 | consumed samples: 2667520 | consumed tokens: 5463080960 | elapsed time per iteration (s): 1.06 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.309712E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.633 | TFLOPs: 39.93 | 15: iteration 10430/ 125429 | consumed samples: 2670080 | consumed tokens: 5468323840 | elapsed time per iteration (s): 1.03 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.304788E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.546 | TFLOPs: 41.24 | 15: iteration 10440/ 125429 | consumed samples: 2672640 | consumed tokens: 5473566720 | elapsed time per iteration (s): 1.06 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.293982E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.903 | TFLOPs: 39.98 | 15: iteration 10450/ 125429 | consumed samples: 2675200 | consumed tokens: 5478809600 | elapsed time per iteration (s): 1.03 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.281675E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.699 | TFLOPs: 40.93 | 15: iteration 10460/ 125429 | consumed samples: 2677760 | consumed tokens: 5484052480 | elapsed time per iteration (s): 1.06 | learning rate: 1.976E-04 | global 
batch size: 256 | lm loss: 2.289596E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.855 | TFLOPs: 39.97 | 15: iteration 10470/ 125429 | consumed samples: 2680320 | consumed tokens: 5489295360 | elapsed time per iteration (s): 1.05 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.292269E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.900 | TFLOPs: 40.14 | 15: iteration 10480/ 125429 | consumed samples: 2682880 | consumed tokens: 5494538240 | elapsed time per iteration (s): 1.06 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.288896E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.600 | TFLOPs: 39.76 | 15: iteration 10490/ 125429 | consumed samples: 2685440 | consumed tokens: 5499781120 | elapsed time per iteration (s): 1.04 | learning rate: 1.976E-04 | global batch size: 256 | lm loss: 2.283534E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.232 | TFLOPs: 40.86 | 15: iteration 10500/ 125429 | consumed samples: 2688000 | consumed tokens: 5505024000 | elapsed time per iteration (s): 1.02 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.288003E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.612 | TFLOPs: 41.42 | 15: iteration 10510/ 125429 | consumed samples: 2690560 | consumed tokens: 5510266880 | elapsed time per iteration (s): 1.04 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.302928E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.155 | TFLOPs: 40.68 | 15: iteration 10520/ 125429 | consumed samples: 2693120 | consumed 
tokens: 5515509760 | elapsed time per iteration (s): 1.11 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.333159E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.228 | TFLOPs: 38.05 | 15: iteration 10530/ 125429 | consumed samples: 2695680 | consumed tokens: 5520752640 | elapsed time per iteration (s): 1.03 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.290913E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.619 | TFLOPs: 41.25 | 15: iteration 10540/ 125429 | consumed samples: 2698240 | consumed tokens: 5525995520 | elapsed time per iteration (s): 1.05 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.283323E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.806 | TFLOPs: 40.29 | 15: iteration 10550/ 125429 | consumed samples: 2700800 | consumed tokens: 5531238400 | elapsed time per iteration (s): 1.03 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.272769E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.582 | TFLOPs: 41.25 | 15: iteration 10560/ 125429 | consumed samples: 2703360 | consumed tokens: 5536481280 | elapsed time per iteration (s): 1.04 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.279679E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.298 | TFLOPs: 40.70 | 15: iteration 10570/ 125429 | consumed samples: 2705920 | consumed tokens: 5541724160 | elapsed time per iteration (s): 1.09 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.277962E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 234.994 | TFLOPs: 38.83 | 15: iteration 10580/ 125429 | consumed samples: 2708480 | consumed tokens: 5546967040 | elapsed time per iteration (s): 1.03 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.282945E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.373 | TFLOPs: 41.21 | 15: iteration 10590/ 125429 | consumed samples: 2711040 | consumed tokens: 5552209920 | elapsed time per iteration (s): 1.03 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.315152E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.128 | TFLOPs: 41.17 | 15: iteration 10600/ 125429 | consumed samples: 2713600 | consumed tokens: 5557452800 | elapsed time per iteration (s): 1.04 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.280989E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.323 | TFLOPs: 40.87 | 15: iteration 10610/ 125429 | consumed samples: 2716160 | consumed tokens: 5562695680 | elapsed time per iteration (s): 1.02 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.261680E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.119 | TFLOPs: 41.33 | 15: iteration 10620/ 125429 | consumed samples: 2718720 | consumed tokens: 5567938560 | elapsed time per iteration (s): 1.03 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.275843E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.519 | TFLOPs: 41.07 | 15: iteration 10630/ 125429 | consumed samples: 2721280 | consumed tokens: 5573181440 | elapsed time per iteration (s): 1.03 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.260423E+00 | grad norm: 0.178 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.324 | TFLOPs: 41.04 | 15: iteration 10640/ 125429 | consumed samples: 2723840 | consumed tokens: 5578424320 | elapsed time per iteration (s): 1.03 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.290396E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.672 | TFLOPs: 41.09 | 15: iteration 10650/ 125429 | consumed samples: 2726400 | consumed tokens: 5583667200 | elapsed time per iteration (s): 1.04 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.295935E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.323 | TFLOPs: 40.87 | 15: iteration 10660/ 125429 | consumed samples: 2728960 | consumed tokens: 5588910080 | elapsed time per iteration (s): 1.04 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.288028E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.816 | TFLOPs: 40.79 | 15: iteration 10670/ 125429 | consumed samples: 2731520 | consumed tokens: 5594152960 | elapsed time per iteration (s): 1.03 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.306734E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.529 | TFLOPs: 41.24 | 15: iteration 10680/ 125429 | consumed samples: 2734080 | consumed tokens: 5599395840 | elapsed time per iteration (s): 1.02 | learning rate: 1.975E-04 | global batch size: 256 | lm loss: 2.282201E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.128 | TFLOPs: 41.34 | 15: iteration 10690/ 125429 | consumed samples: 2736640 | consumed tokens: 5604638720 | elapsed time per iteration (s): 1.05 
| learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.299765E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.530 | TFLOPs: 40.25 | 15: iteration 10700/ 125429 | consumed samples: 2739200 | consumed tokens: 5609881600 | elapsed time per iteration (s): 1.07 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.283929E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.677 | TFLOPs: 39.44 | 15: iteration 10710/ 125429 | consumed samples: 2741760 | consumed tokens: 5615124480 | elapsed time per iteration (s): 1.04 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.304355E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.802 | TFLOPs: 40.79 | 15: iteration 10720/ 125429 | consumed samples: 2744320 | consumed tokens: 5620367360 | elapsed time per iteration (s): 1.05 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.312317E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.255 | TFLOPs: 40.37 | 15: iteration 10730/ 125429 | consumed samples: 2746880 | consumed tokens: 5625610240 | elapsed time per iteration (s): 1.07 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.280849E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.902 | TFLOPs: 39.65 | 15: iteration 10740/ 125429 | consumed samples: 2749440 | consumed tokens: 5630853120 | elapsed time per iteration (s): 1.03 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.263591E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.594 | TFLOPs: 41.08 | 15: iteration 10750/ 125429 | 
consumed samples: 2752000 | consumed tokens: 5636096000 | elapsed time per iteration (s): 1.05 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.287842E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.873 | TFLOPs: 40.47 | 15: iteration 10760/ 125429 | consumed samples: 2754560 | consumed tokens: 5641338880 | elapsed time per iteration (s): 1.06 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.290960E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.653 | TFLOPs: 40.10 | 15: iteration 10770/ 125429 | consumed samples: 2757120 | consumed tokens: 5646581760 | elapsed time per iteration (s): 1.03 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.254313E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.192 | TFLOPs: 41.02 | 15: iteration 10780/ 125429 | consumed samples: 2759680 | consumed tokens: 5651824640 | elapsed time per iteration (s): 1.03 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.251900E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.495 | TFLOPs: 40.90 | 15: iteration 10790/ 125429 | consumed samples: 2762240 | consumed tokens: 5657067520 | elapsed time per iteration (s): 1.08 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.294524E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.827 | TFLOPs: 39.30 | 15: iteration 10800/ 125429 | consumed samples: 2764800 | consumed tokens: 5662310400 | elapsed time per iteration (s): 1.04 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.258799E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 246.317 | TFLOPs: 40.71 | 15: iteration 10810/ 125429 | consumed samples: 2767360 | consumed tokens: 5667553280 | elapsed time per iteration (s): 1.04 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.268383E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.407 | TFLOPs: 40.56 | 15: iteration 10820/ 125429 | consumed samples: 2769920 | consumed tokens: 5672796160 | elapsed time per iteration (s): 1.05 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.275468E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.032 | TFLOPs: 40.33 | 15: iteration 10830/ 125429 | consumed samples: 2772480 | consumed tokens: 5678039040 | elapsed time per iteration (s): 1.04 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.281262E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.596 | TFLOPs: 40.59 | 15: iteration 10840/ 125429 | consumed samples: 2775040 | consumed tokens: 5683281920 | elapsed time per iteration (s): 1.05 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.276576E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.993 | TFLOPs: 40.32 | 15: iteration 10850/ 125429 | consumed samples: 2777600 | consumed tokens: 5688524800 | elapsed time per iteration (s): 1.02 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 2.244433E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.616 | TFLOPs: 41.42 | 15: iteration 10860/ 125429 | consumed samples: 2780160 | consumed tokens: 5693767680 | elapsed time per iteration (s): 1.03 | learning rate: 1.974E-04 | global batch size: 256 | lm loss: 
2.304177E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.241 | TFLOPs: 41.02 | 15: iteration 10870/ 125429 | consumed samples: 2782720 | consumed tokens: 5699010560 | elapsed time per iteration (s): 1.02 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.285145E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.901 | TFLOPs: 41.46 | 15: iteration 10880/ 125429 | consumed samples: 2785280 | consumed tokens: 5704253440 | elapsed time per iteration (s): 1.06 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.295506E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.556 | TFLOPs: 39.92 | 15: iteration 10890/ 125429 | consumed samples: 2787840 | consumed tokens: 5709496320 | elapsed time per iteration (s): 1.03 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.277800E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.745 | TFLOPs: 41.27 | 15: iteration 10900/ 125429 | consumed samples: 2790400 | consumed tokens: 5714739200 | elapsed time per iteration (s): 1.05 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.273258E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.877 | TFLOPs: 40.14 | 15: iteration 10910/ 125429 | consumed samples: 2792960 | consumed tokens: 5719982080 | elapsed time per iteration (s): 1.06 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.306371E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.176 | TFLOPs: 40.02 | 15: iteration 10920/ 125429 | consumed samples: 2795520 | consumed tokens: 5725224960 | 
elapsed time per iteration (s): 1.03 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.257563E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.162 | TFLOPs: 41.01 | 15: iteration 10930/ 125429 | consumed samples: 2798080 | consumed tokens: 5730467840 | elapsed time per iteration (s): 1.04 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.266023E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.839 | TFLOPs: 40.79 | 15: iteration 10940/ 125429 | consumed samples: 2800640 | consumed tokens: 5735710720 | elapsed time per iteration (s): 1.03 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.239573E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.208 | TFLOPs: 41.02 | 15: iteration 10950/ 125429 | consumed samples: 2803200 | consumed tokens: 5740953600 | elapsed time per iteration (s): 1.05 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.298922E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.109 | TFLOPs: 40.18 | 15: iteration 10960/ 125429 | consumed samples: 2805760 | consumed tokens: 5746196480 | elapsed time per iteration (s): 1.06 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.271173E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.489 | TFLOPs: 40.07 | 15: iteration 10970/ 125429 | consumed samples: 2808320 | consumed tokens: 5751439360 | elapsed time per iteration (s): 1.05 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.276019E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.093 | TFLOPs: 
40.34 | 15: iteration 10980/ 125429 | consumed samples: 2810880 | consumed tokens: 5756682240 | elapsed time per iteration (s): 1.05 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.272712E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.142 | TFLOPs: 40.18 | 15: iteration 10990/ 125429 | consumed samples: 2813440 | consumed tokens: 5761925120 | elapsed time per iteration (s): 1.03 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.284902E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.094 | TFLOPs: 41.00 | 15: iteration 11000/ 125429 | consumed samples: 2816000 | consumed tokens: 5767168000 | elapsed time per iteration (s): 1.05 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.283020E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.153 | TFLOPs: 40.35 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 11000 | lm loss value: 2.217402E+00 | lm loss PPL: 9.183445E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 11000 to checkpoints_1b5 0: [2022-11-25 23:00:30,310] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step11000 is begin to save! 0: [2022-11-25 23:00:30,319] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_01-model_00-model_states.pt... 0: [2022-11-25 23:00:30,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_01-model_00-model_states.pt. 0: [2022-11-25 23:00:30,564] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_03-model_00-model_states.pt... 
0: [2022-11-25 23:00:30,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_03-model_00-model_states.pt. 0: [2022-11-25 23:00:30,678] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_04-model_00-model_states.pt... 0: [2022-11-25 23:00:30,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_04-model_00-model_states.pt. 0: [2022-11-25 23:00:30,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_05-model_00-model_states.pt... 0: [2022-11-25 23:00:30,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_05-model_00-model_states.pt. 0: [2022-11-25 23:00:30,888] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_06-model_00-model_states.pt... 0: [2022-11-25 23:00:30,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_06-model_00-model_states.pt. 0: [2022-11-25 23:00:30,990] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_07-model_00-model_states.pt... 0: [2022-11-25 23:00:31,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_07-model_00-model_states.pt. 0: [2022-11-25 23:00:31,096] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_08-model_00-model_states.pt... 0: [2022-11-25 23:00:31,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_08-model_00-model_states.pt. 0: [2022-11-25 23:00:31,336] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_09-model_00-model_states.pt... 
0: [2022-11-25 23:00:31,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_09-model_00-model_states.pt. 0: [2022-11-25 23:00:31,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_10-model_00-model_states.pt... 0: [2022-11-25 23:00:31,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_10-model_00-model_states.pt. 0: [2022-11-25 23:00:31,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_11-model_00-model_states.pt... 0: [2022-11-25 23:00:31,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_11-model_00-model_states.pt. 0: [2022-11-25 23:00:31,648] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_12-model_00-model_states.pt... 0: [2022-11-25 23:00:31,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_12-model_00-model_states.pt. 0: [2022-11-25 23:00:31,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_13-model_00-model_states.pt... 0: [2022-11-25 23:00:31,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_13-model_00-model_states.pt. 0: [2022-11-25 23:00:31,858] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_14-model_00-model_states.pt... 0: [2022-11-25 23:00:31,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_14-model_00-model_states.pt. 0: [2022-11-25 23:00:31,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_15-model_00-model_states.pt... 
0: [2022-11-25 23:00:32,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_15-model_00-model_states.pt. 0: [2022-11-25 23:00:32,075] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_16-model_00-model_states.pt... 0: [2022-11-25 23:00:32,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_16-model_00-model_states.pt. 0: [2022-11-25 23:00:32,177] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_17-model_00-model_states.pt... 0: [2022-11-25 23:00:32,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_17-model_00-model_states.pt. 0: [2022-11-25 23:00:32,284] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_18-model_00-model_states.pt... 0: [2022-11-25 23:00:32,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_18-model_00-model_states.pt. 0: [2022-11-25 23:00:32,394] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_19-model_00-model_states.pt... 0: [2022-11-25 23:00:32,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_19-model_00-model_states.pt. 0: [2022-11-25 23:00:32,504] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_20-model_00-model_states.pt... 0: [2022-11-25 23:00:32,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_20-model_00-model_states.pt. 0: [2022-11-25 23:00:32,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_21-model_00-model_states.pt... 
0: [2022-11-25 23:00:32,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_21-model_00-model_states.pt. 0: [2022-11-25 23:00:32,723] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_22-model_00-model_states.pt... 0: [2022-11-25 23:00:32,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_22-model_00-model_states.pt. 0: [2022-11-25 23:00:32,829] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_23-model_00-model_states.pt... 0: [2022-11-25 23:00:32,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_23-model_00-model_states.pt. 0: [2022-11-25 23:00:32,939] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_24-model_00-model_states.pt... 0: [2022-11-25 23:00:33,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_24-model_00-model_states.pt. 0: [2022-11-25 23:00:33,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_25-model_00-model_states.pt... 0: [2022-11-25 23:00:33,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_25-model_00-model_states.pt. 0: [2022-11-25 23:00:33,157] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_26-model_00-model_states.pt... 0: [2022-11-25 23:00:33,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_26-model_00-model_states.pt. 0: [2022-11-25 23:00:33,264] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_27-model_00-model_states.pt... 
0: [2022-11-25 23:00:33,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_27-model_00-model_states.pt. 0: [2022-11-25 23:00:33,370] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_28-model_00-model_states.pt... 0: [2022-11-25 23:00:33,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_28-model_00-model_states.pt. 0: [2022-11-25 23:00:33,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_29-model_00-model_states.pt... 0: [2022-11-25 23:00:33,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_29-model_00-model_states.pt. 0: [2022-11-25 23:00:33,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_30-model_00-model_states.pt... 0: [2022-11-25 23:00:33,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_30-model_00-model_states.pt. 0: [2022-11-25 23:00:33,687] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/layer_32-model_00-model_states.pt... 0: [2022-11-25 23:00:33,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/layer_32-model_00-model_states.pt. 0: [2022-11-25 23:00:33,690] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step11000/mp_rank_00_model_states.pt 0: [2022-11-25 23:00:33,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/mp_rank_00_model_states.pt... 0: [2022-11-25 23:00:33,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/mp_rank_00_model_states.pt. 
0: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
8: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
10: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
15: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
14: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
7: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
6: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
2: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
12: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
1: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
4: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
11: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:00:33,734] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step11000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:00:33,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:00:33,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 23:00:33,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 12: [2022-11-25 23:00:33,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-25 23:00:33,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 23:00:33,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 9: [2022-11-25 23:00:33,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:00:33,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 23:00:33,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 12: [2022-11-25 23:00:33,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:00:33,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 23:00:33,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 23:00:33,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:00:33,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:00:33,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 23:00:33,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
0: [2022-11-25 23:00:33,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:00:33,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 15: [2022-11-25 23:00:33,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:00:33,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 11: [2022-11-25 23:00:33,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:00:33,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 23:00:33,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 11: [2022-11-25 23:00:33,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:00:33,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 23:00:33,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 9: [2022-11-25 23:00:33,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:00:33,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
3: [2022-11-25 23:00:33,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:00:33,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 4: [2022-11-25 23:00:33,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 3: [2022-11-25 23:00:33,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 9: [2022-11-25 23:00:33,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 23:00:33,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 23:00:33,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 9: [2022-11-25 23:00:33,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:00:33,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 23:00:33,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 9: [2022-11-25 23:00:33,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-25 23:00:33,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 23:00:33,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 23:00:33,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:00:33,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 23:00:33,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 15: [2022-11-25 23:00:33,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 23:00:33,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 15: [2022-11-25 23:00:33,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:00:33,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 23:00:33,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 15: [2022-11-25 23:00:33,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-25 23:00:33,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 23:00:33,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:00:33,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 15: [2022-11-25 23:00:33,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 23:00:33,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 23:00:33,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:00:33,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 23:00:33,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 6: [2022-11-25 23:00:33,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:00:33,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
6: [2022-11-25 23:00:33,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 3: [2022-11-25 23:00:33,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 6: [2022-11-25 23:00:33,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 23:00:33,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 6: [2022-11-25 23:00:33,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:00:33,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 23:00:33,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 11: [2022-11-25 23:00:33,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:00:33,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:00:33,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 23:00:33,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 23:00:33,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-25 23:00:33,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:00:33,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 23:00:33,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 23:00:33,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 23:00:33,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 12: [2022-11-25 23:00:33,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:00:33,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 23:00:33,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 23:00:33,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:00:33,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 23:00:33,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 9: [2022-11-25 23:00:33,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-25 23:00:33,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 23:00:33,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 15: [2022-11-25 23:00:33,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:00:33,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:00:33,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:00:33,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 23:00:33,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 6: [2022-11-25 23:00:33,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:00:33,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:00:33,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 23:00:33,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 23:00:33,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
6: [2022-11-25 23:00:33,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 12: [2022-11-25 23:00:33,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:00:33,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:00:33,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:00:33,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 23:00:33,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 23:00:33,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 23:00:33,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 0: [2022-11-25 23:00:33,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:00:33,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 5: [2022-11-25 23:00:33,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 23:00:33,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
0: [2022-11-25 23:00:33,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 6: [2022-11-25 23:00:33,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:00:33,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:00:33,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:00:33,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 23:00:33,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 0: [2022-11-25 23:00:33,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:00:33,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 23:00:33,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 6: [2022-11-25 23:00:33,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 23:00:33,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 6: [2022-11-25 23:00:33,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
5: [2022-11-25 23:00:33,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:00:33,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 23:00:33,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 23:00:33,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:00:33,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 0: [2022-11-25 23:00:33,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 23:00:33,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 23:00:33,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:00:33,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 23:00:33,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:00:33,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
4: [2022-11-25 23:00:33,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 23:00:33,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 15: [2022-11-25 23:00:33,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 23:00:33,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 23:00:33,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 15: [2022-11-25 23:00:33,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 15: [2022-11-25 23:00:33,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:00:33,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 23:00:33,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 15: [2022-11-25 23:00:33,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:00:33,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 23:00:33,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
5: [2022-11-25 23:00:33,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:00:33,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:00:33,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 23:00:33,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 23:00:33,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 23:00:33,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 13: [2022-11-25 23:00:33,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:00:33,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:00:33,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 23:00:33,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 13: [2022-11-25 23:00:33,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-25 23:00:33,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 23:00:33,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 8: [2022-11-25 23:00:33,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 13: [2022-11-25 23:00:33,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:00:33,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 13: [2022-11-25 23:00:33,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 8: [2022-11-25 23:00:33,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:00:33,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 8: [2022-11-25 23:00:33,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 13: [2022-11-25 23:00:33,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:00:33,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 13: [2022-11-25 23:00:33,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-25 23:00:33,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 8: [2022-11-25 23:00:33,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:00:33,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 8: [2022-11-25 23:00:33,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 4: [2022-11-25 23:00:33,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:00:33,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 8: [2022-11-25 23:00:33,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 13: [2022-11-25 23:00:33,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 8: [2022-11-25 23:00:33,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:00:33,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
8: [2022-11-25 23:00:33,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 4: [2022-11-25 23:00:33,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 13: [2022-11-25 23:00:33,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 8: [2022-11-25 23:00:33,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 13: [2022-11-25 23:00:33,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 23:00:33,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 9: [2022-11-25 23:00:33,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:00:33,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 23:00:33,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 9: [2022-11-25 23:00:33,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:00:33,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 23:00:33,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
9: [2022-11-25 23:00:33,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:00:33,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 23:00:33,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 23:00:33,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:00:33,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 23:00:33,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 13: [2022-11-25 23:00:33,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:00:33,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 23:00:33,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 12: [2022-11-25 23:00:33,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:00:33,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 23:00:33,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
0: [2022-11-25 23:00:33,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:00:33,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 23:00:33,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 12: [2022-11-25 23:00:33,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:00:33,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 23:00:33,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 12: [2022-11-25 23:00:33,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:00:33,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 23:00:33,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 23:00:33,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:00:33,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:00:33,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-25 23:00:33,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:00:33,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:00:33,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:00:33,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:00:33,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:00:33,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 23:00:33,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 23:00:33,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 23:00:33,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 23:00:33,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
1: [2022-11-25 23:00:33,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 23:00:33,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 23:00:33,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 23:00:33,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 23:00:33,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 23:00:33,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 23:00:33,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 23:00:33,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 23:00:33,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 23:00:33,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 1: [2022-11-25 23:00:33,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 23:00:33,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:00:33,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-25 23:00:33,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:00:33,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:00:33,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 23:00:33,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 23:00:33,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 23:00:33,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 23:00:33,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 23:00:33,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 23:00:33,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 23:00:33,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 23:00:33,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-25 23:00:33,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 23:00:33,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 23:00:33,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:00:33,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 23:00:33,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 11: [2022-11-25 23:00:33,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 23:00:33,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 11: [2022-11-25 23:00:33,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:00:33,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 23:00:33,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 11: [2022-11-25 23:00:33,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-25 23:00:33,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 23:00:33,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 11: [2022-11-25 23:00:33,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:00:33,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 23:00:33,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 11: [2022-11-25 23:00:33,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:00:33,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 23:00:33,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 11: [2022-11-25 23:00:33,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:00:33,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 23:00:33,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 23:00:33,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-25 23:00:33,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:00:33,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:00:33,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:00:33,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:00:33,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 23:00:33,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 23:00:33,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 23:00:33,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 23:00:33,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 23:00:33,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 23:00:33,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 23:00:33,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
7: [2022-11-25 23:00:33,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 23:00:33,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 8: [2022-11-25 23:00:33,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:00:33,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 23:00:33,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 8: [2022-11-25 23:00:33,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:00:33,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 23:00:33,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 13: [2022-11-25 23:00:34,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:00:34,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 23:00:34,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 23:00:34,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-25 23:00:34,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 23:00:34,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 23:00:34,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:00:34,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 23:00:34,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 7: [2022-11-25 23:00:34,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:00:34,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 23:00:34,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 23:00:34,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:00:34,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 23:00:34,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 14: [2022-11-25 23:00:33,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-25 23:00:33,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:00:33,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 23:00:33,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 23:00:33,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 14: [2022-11-25 23:00:33,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 14: [2022-11-25 23:00:33,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:00:33,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 23:00:33,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 14: [2022-11-25 23:00:33,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:00:33,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 23:00:33,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 14: [2022-11-25 23:00:34,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-25 23:00:34,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:00:34,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 23:00:34,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 23:00:34,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 14: [2022-11-25 23:00:34,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 8: [2022-11-25 23:00:34,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:00:34,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 23:00:34,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 14: [2022-11-25 23:00:34,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:00:34,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 23:00:34,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 5: [2022-11-25 23:00:34,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-25 23:00:34,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 23:00:34,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 14: [2022-11-25 23:00:34,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:00:34,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 23:00:34,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 23:00:34,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:00:34,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 23:00:34,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 23:00:34,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:00:34,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 23:00:34,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 4: [2022-11-25 23:00:34,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-25 23:00:34,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 23:00:34,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 10: [2022-11-25 23:00:34,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:00:34,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:00:34,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:00:34,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:00:34,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:00:34,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:00:34,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-25 23:00:34,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 23:00:34,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 23:00:34,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 23:00:34,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 23:00:34,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 23:00:34,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 23:00:34,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 23:00:34,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 10: [2022-11-25 23:00:34,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 10: [2022-11-25 23:00:34,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 10: [2022-11-25 23:00:34,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 10: [2022-11-25 23:00:34,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
10: [2022-11-25 23:00:34,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 10: [2022-11-25 23:00:34,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 10: [2022-11-25 23:00:34,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:00:34,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 23:00:34,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 23:00:34,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:00:34,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 23:00:34,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 23:00:34,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:00:34,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-25 23:00:34,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 23:00:34,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 23:00:34,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 2: [2022-11-25 23:00:34,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 3: [2022-11-25 23:00:34,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:00:34,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 23:00:34,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 0: [2022-11-25 23:00:34,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 23:00:34,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 8: [2022-11-25 23:00:34,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:00:34,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step11000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 23:00:34,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step11000 is ready now! 
0: successfully saved checkpoint at iteration 11000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3855.51 15: iteration 11010/ 125429 | consumed samples: 2818560 | consumed tokens: 5772410880 | elapsed time per iteration (s): 1.46 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.261280E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 174.813 | TFLOPs: 28.89 | 15: iteration 11020/ 125429 | consumed samples: 2821120 | consumed tokens: 5777653760 | elapsed time per iteration (s): 1.03 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.307606E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.110 | TFLOPs: 41.17 | 15: iteration 11030/ 125429 | consumed samples: 2823680 | consumed tokens: 5782896640 | elapsed time per iteration (s): 4.34 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.250588E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 58.949 | TFLOPs: 9.74 | 15: iteration 11040/ 125429 | consumed samples: 2826240 | consumed tokens: 5788139520 | elapsed time per iteration (s): 1.05 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.276278E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.501 | TFLOPs: 40.41 | 15: iteration 11050/ 125429 | consumed samples: 2828800 | consumed tokens: 5793382400 | elapsed time per iteration (s): 1.03 | learning rate: 1.973E-04 | global batch size: 256 | lm loss: 2.292268E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.611 | TFLOPs: 41.25 | 15: iteration 11060/ 125429 | consumed samples: 2831360 | consumed tokens: 5798625280 | elapsed time per iteration (s): 1.59 | learning rate: 
1.972E-04 | global batch size: 256 | lm loss: 2.256680E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 160.593 | TFLOPs: 26.54 | 15: iteration 11070/ 125429 | consumed samples: 2833920 | consumed tokens: 5803868160 | elapsed time per iteration (s): 1.02 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.264861E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.767 | TFLOPs: 41.44 | 15: iteration 11080/ 125429 | consumed samples: 2836480 | consumed tokens: 5809111040 | elapsed time per iteration (s): 1.03 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.283782E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.912 | TFLOPs: 40.97 | 15: iteration 11090/ 125429 | consumed samples: 2839040 | consumed tokens: 5814353920 | elapsed time per iteration (s): 1.02 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.266583E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.763 | TFLOPs: 41.44 | 15: iteration 11100/ 125429 | consumed samples: 2841600 | consumed tokens: 5819596800 | elapsed time per iteration (s): 1.03 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.292185E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.651 | TFLOPs: 41.09 | 15: iteration 11110/ 125429 | consumed samples: 2844160 | consumed tokens: 5824839680 | elapsed time per iteration (s): 1.03 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.299212E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.624 | TFLOPs: 40.92 | 15: iteration 11120/ 125429 | consumed samples: 
2846720 | consumed tokens: 5830082560 | elapsed time per iteration (s): 1.03 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.285417E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.941 | TFLOPs: 40.97 | 15: iteration 11130/ 125429 | consumed samples: 2849280 | consumed tokens: 5835325440 | elapsed time per iteration (s): 1.03 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.249595E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.725 | TFLOPs: 40.94 | 15: iteration 11140/ 125429 | consumed samples: 2851840 | consumed tokens: 5840568320 | elapsed time per iteration (s): 1.05 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.276209E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.674 | TFLOPs: 40.10 | 15: iteration 11150/ 125429 | consumed samples: 2854400 | consumed tokens: 5845811200 | elapsed time per iteration (s): 1.04 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.263814E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.870 | TFLOPs: 40.63 | 15: iteration 11160/ 125429 | consumed samples: 2856960 | consumed tokens: 5851054080 | elapsed time per iteration (s): 1.02 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.284134E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.931 | TFLOPs: 41.47 | 15: iteration 11170/ 125429 | consumed samples: 2859520 | consumed tokens: 5856296960 | elapsed time per iteration (s): 22.84 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.271296E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 11.209 | TFLOPs: 1.85 | 15: iteration 11180/ 125429 | consumed samples: 2862080 | consumed tokens: 5861539840 | elapsed time per iteration (s): 1.03 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.277067E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.993 | TFLOPs: 41.15 | 15: iteration 11190/ 125429 | consumed samples: 2864640 | consumed tokens: 5866782720 | elapsed time per iteration (s): 1.04 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.225908E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.516 | TFLOPs: 40.57 | 15: iteration 11200/ 125429 | consumed samples: 2867200 | consumed tokens: 5872025600 | elapsed time per iteration (s): 1.74 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.313656E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 147.195 | TFLOPs: 24.33 | 15: iteration 11210/ 125429 | consumed samples: 2869760 | consumed tokens: 5877268480 | elapsed time per iteration (s): 1.02 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.276702E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.458 | TFLOPs: 41.39 | 15: iteration 11220/ 125429 | consumed samples: 2872320 | consumed tokens: 5882511360 | elapsed time per iteration (s): 1.05 | learning rate: 1.972E-04 | global batch size: 256 | lm loss: 2.301463E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.860 | TFLOPs: 40.46 | 15: iteration 11230/ 125429 | consumed samples: 2874880 | consumed tokens: 5887754240 | elapsed time per iteration (s): 1.04 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.284776E+00 | grad 
norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.139 | TFLOPs: 40.84 | 15: iteration 11240/ 125429 | consumed samples: 2877440 | consumed tokens: 5892997120 | elapsed time per iteration (s): 1.03 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.292691E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.540 | TFLOPs: 41.24 | 15: iteration 11250/ 125429 | consumed samples: 2880000 | consumed tokens: 5898240000 | elapsed time per iteration (s): 1.03 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.284071E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.247 | TFLOPs: 41.02 | 15: iteration 11260/ 125429 | consumed samples: 2882560 | consumed tokens: 5903482880 | elapsed time per iteration (s): 1.03 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.270610E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.421 | TFLOPs: 41.22 | 15: iteration 11270/ 125429 | consumed samples: 2885120 | consumed tokens: 5908725760 | elapsed time per iteration (s): 1.03 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.273049E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.524 | TFLOPs: 41.24 | 15: iteration 11280/ 125429 | consumed samples: 2887680 | consumed tokens: 5913968640 | elapsed time per iteration (s): 1.05 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.305729E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.842 | TFLOPs: 40.46 | 15: iteration 11290/ 125429 | consumed samples: 2890240 | consumed tokens: 5919211520 | elapsed time per 
iteration (s): 1.10 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.264991E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.027 | TFLOPs: 38.51 | 15: iteration 11300/ 125429 | consumed samples: 2892800 | consumed tokens: 5924454400 | elapsed time per iteration (s): 1.04 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.253623E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.048 | TFLOPs: 40.50 | 15: iteration 11310/ 125429 | consumed samples: 2895360 | consumed tokens: 5929697280 | elapsed time per iteration (s): 1.05 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.264628E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.680 | TFLOPs: 40.27 | 15: iteration 11320/ 125429 | consumed samples: 2897920 | consumed tokens: 5934940160 | elapsed time per iteration (s): 1.03 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.258423E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.513 | TFLOPs: 41.07 | 15: iteration 11330/ 125429 | consumed samples: 2900480 | consumed tokens: 5940183040 | elapsed time per iteration (s): 1.04 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.257221E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.243 | TFLOPs: 40.86 | 15: iteration 11340/ 125429 | consumed samples: 2903040 | consumed tokens: 5945425920 | elapsed time per iteration (s): 1.04 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.276706E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.459 | TFLOPs: 40.73 | 15: 
iteration 11350/ 125429 | consumed samples: 2905600 | consumed tokens: 5950668800 | elapsed time per iteration (s): 1.02 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.282873E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.360 | TFLOPs: 41.37 | 15: iteration 11360/ 125429 | consumed samples: 2908160 | consumed tokens: 5955911680 | elapsed time per iteration (s): 1.05 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.292435E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.913 | TFLOPs: 40.31 | 15: iteration 11370/ 125429 | consumed samples: 2910720 | consumed tokens: 5961154560 | elapsed time per iteration (s): 1.03 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.274682E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.479 | TFLOPs: 41.23 | 15: iteration 11380/ 125429 | consumed samples: 2913280 | consumed tokens: 5966397440 | elapsed time per iteration (s): 1.03 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.270753E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.617 | TFLOPs: 41.09 | 15: iteration 11390/ 125429 | consumed samples: 2915840 | consumed tokens: 5971640320 | elapsed time per iteration (s): 1.04 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.264342E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.034 | TFLOPs: 40.49 | 15: iteration 11400/ 125429 | consumed samples: 2918400 | consumed tokens: 5976883200 | elapsed time per iteration (s): 1.04 | learning rate: 1.971E-04 | global batch size: 256 | lm loss: 2.299629E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 247.319 | TFLOPs: 40.87 | 15: iteration 11410/ 125429 | consumed samples: 2920960 | consumed tokens: 5982126080 | elapsed time per iteration (s): 1.04 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.243320E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.831 | TFLOPs: 40.79 | 15: iteration 11420/ 125429 | consumed samples: 2923520 | consumed tokens: 5987368960 | elapsed time per iteration (s): 1.02 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.241092E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.062 | TFLOPs: 41.32 | 15: iteration 11430/ 125429 | consumed samples: 2926080 | consumed tokens: 5992611840 | elapsed time per iteration (s): 1.03 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.260448E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.437 | TFLOPs: 41.06 | 15: iteration 11440/ 125429 | consumed samples: 2928640 | consumed tokens: 5997854720 | elapsed time per iteration (s): 1.05 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.289920E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.360 | TFLOPs: 40.38 | 15: iteration 11450/ 125429 | consumed samples: 2931200 | consumed tokens: 6003097600 | elapsed time per iteration (s): 1.07 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.251275E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.518 | TFLOPs: 39.42 | 15: iteration 11460/ 125429 | consumed samples: 2933760 | consumed tokens: 6008340480 | elapsed time per iteration (s): 1.05 | learning rate: 1.970E-04 | global 
batch size: 256 | lm loss: 2.266861E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.676 | TFLOPs: 40.27 | 15: iteration 11470/ 125429 | consumed samples: 2936320 | consumed tokens: 6013583360 | elapsed time per iteration (s): 1.04 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.283535E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.322 | TFLOPs: 40.71 | 15: iteration 11480/ 125429 | consumed samples: 2938880 | consumed tokens: 6018826240 | elapsed time per iteration (s): 1.05 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.294036E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.947 | TFLOPs: 40.31 | 15: iteration 11490/ 125429 | consumed samples: 2941440 | consumed tokens: 6024069120 | elapsed time per iteration (s): 1.02 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.277235E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.423 | TFLOPs: 41.38 | 15: iteration 11500/ 125429 | consumed samples: 2944000 | consumed tokens: 6029312000 | elapsed time per iteration (s): 1.06 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.292519E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.943 | TFLOPs: 39.98 | 15: iteration 11510/ 125429 | consumed samples: 2946560 | consumed tokens: 6034554880 | elapsed time per iteration (s): 1.02 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.290351E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.085 | TFLOPs: 41.33 | 15: iteration 11520/ 125429 | consumed samples: 2949120 | consumed 
tokens: 6039797760 | elapsed time per iteration (s): 1.04 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.271013E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.369 | TFLOPs: 40.55 | 15: iteration 11530/ 125429 | consumed samples: 2951680 | consumed tokens: 6045040640 | elapsed time per iteration (s): 1.02 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.243287E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.079 | TFLOPs: 41.49 | 15: iteration 11540/ 125429 | consumed samples: 2954240 | consumed tokens: 6050283520 | elapsed time per iteration (s): 1.03 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.289691E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.712 | TFLOPs: 40.94 | 15: iteration 11550/ 125429 | consumed samples: 2956800 | consumed tokens: 6055526400 | elapsed time per iteration (s): 1.05 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.284460E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.644 | TFLOPs: 40.43 | 15: iteration 11560/ 125429 | consumed samples: 2959360 | consumed tokens: 6060769280 | elapsed time per iteration (s): 1.04 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.284666E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.884 | TFLOPs: 40.80 | 15: iteration 11570/ 125429 | consumed samples: 2961920 | consumed tokens: 6066012160 | elapsed time per iteration (s): 1.02 | learning rate: 1.970E-04 | global batch size: 256 | lm loss: 2.263401E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 249.866 | TFLOPs: 41.29 | 15: iteration 11580/ 125429 | consumed samples: 2964480 | consumed tokens: 6071255040 | elapsed time per iteration (s): 1.04 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.244090E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.555 | TFLOPs: 40.58 | 15: iteration 11590/ 125429 | consumed samples: 2967040 | consumed tokens: 6076497920 | elapsed time per iteration (s): 1.03 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.277653E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.536 | TFLOPs: 41.07 | 15: iteration 11600/ 125429 | consumed samples: 2969600 | consumed tokens: 6081740800 | elapsed time per iteration (s): 1.05 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.278231E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.966 | TFLOPs: 40.32 | 15: iteration 11610/ 125429 | consumed samples: 2972160 | consumed tokens: 6086983680 | elapsed time per iteration (s): 1.03 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.274575E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.212 | TFLOPs: 41.02 | 15: iteration 11620/ 125429 | consumed samples: 2974720 | consumed tokens: 6092226560 | elapsed time per iteration (s): 1.07 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.305117E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.178 | TFLOPs: 39.36 | 15: iteration 11630/ 125429 | consumed samples: 2977280 | consumed tokens: 6097469440 | elapsed time per iteration (s): 1.06 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.258157E+00 | grad norm: 0.160 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.558 | TFLOPs: 40.08 | 15: iteration 11640/ 125429 | consumed samples: 2979840 | consumed tokens: 6102712320 | elapsed time per iteration (s): 1.04 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.288412E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.138 | TFLOPs: 40.68 | 15: iteration 11650/ 125429 | consumed samples: 2982400 | consumed tokens: 6107955200 | elapsed time per iteration (s): 1.05 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.243027E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.652 | TFLOPs: 40.43 | 15: iteration 11660/ 125429 | consumed samples: 2984960 | consumed tokens: 6113198080 | elapsed time per iteration (s): 1.03 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.273991E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.972 | TFLOPs: 40.98 | 15: iteration 11670/ 125429 | consumed samples: 2987520 | consumed tokens: 6118440960 | elapsed time per iteration (s): 1.05 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.249888E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.030 | TFLOPs: 40.16 | 15: iteration 11680/ 125429 | consumed samples: 2990080 | consumed tokens: 6123683840 | elapsed time per iteration (s): 1.08 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.289837E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.936 | TFLOPs: 39.16 | 15: iteration 11690/ 125429 | consumed samples: 2992640 | consumed tokens: 6128926720 | elapsed time per iteration (s): 1.03 
| learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.282959E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.112 | TFLOPs: 41.17 | 15: iteration 11700/ 125429 | consumed samples: 2995200 | consumed tokens: 6134169600 | elapsed time per iteration (s): 1.10 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.247919E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.889 | TFLOPs: 38.32 | 15: iteration 11710/ 125429 | consumed samples: 2997760 | consumed tokens: 6139412480 | elapsed time per iteration (s): 1.04 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.270848E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.315 | TFLOPs: 40.87 | 15: iteration 11720/ 125429 | consumed samples: 3000320 | consumed tokens: 6144655360 | elapsed time per iteration (s): 1.04 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.258187E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.376 | TFLOPs: 40.72 | 15: iteration 11730/ 125429 | consumed samples: 3002880 | consumed tokens: 6149898240 | elapsed time per iteration (s): 1.02 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.261037E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.214 | TFLOPs: 41.35 | 15: iteration 11740/ 125429 | consumed samples: 3005440 | consumed tokens: 6155141120 | elapsed time per iteration (s): 1.02 | learning rate: 1.969E-04 | global batch size: 256 | lm loss: 2.264692E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.062 | TFLOPs: 41.49 | 15: iteration 11750/ 125429 | 
consumed samples: 3008000 | consumed tokens: 6160384000 | elapsed time per iteration (s): 1.06 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.240153E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.330 | TFLOPs: 39.88 | 15: iteration 11760/ 125429 | consumed samples: 3010560 | consumed tokens: 6165626880 | elapsed time per iteration (s): 1.04 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.274179E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.933 | TFLOPs: 40.81 | 15: iteration 11770/ 125429 | consumed samples: 3013120 | consumed tokens: 6170869760 | elapsed time per iteration (s): 1.03 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.281550E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.180 | TFLOPs: 41.01 | 15: iteration 11780/ 125429 | consumed samples: 3015680 | consumed tokens: 6176112640 | elapsed time per iteration (s): 1.04 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.246968E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.155 | TFLOPs: 40.84 | 15: iteration 11790/ 125429 | consumed samples: 3018240 | consumed tokens: 6181355520 | elapsed time per iteration (s): 1.05 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.261319E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.337 | TFLOPs: 40.38 | 15: iteration 11800/ 125429 | consumed samples: 3020800 | consumed tokens: 6186598400 | elapsed time per iteration (s): 1.04 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.260797E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 247.118 | TFLOPs: 40.84 |
15: iteration 11810/ 125429 | consumed samples: 3023360 | consumed tokens: 6191841280 | elapsed time per iteration (s): 1.04 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.250161E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.244 | TFLOPs: 40.53 |
15: iteration 11820/ 125429 | consumed samples: 3025920 | consumed tokens: 6197084160 | elapsed time per iteration (s): 1.06 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.264572E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.584 | TFLOPs: 39.92 |
15: iteration 11830/ 125429 | consumed samples: 3028480 | consumed tokens: 6202327040 | elapsed time per iteration (s): 1.04 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.207834E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.373 | TFLOPs: 40.55 |
15: iteration 11840/ 125429 | consumed samples: 3031040 | consumed tokens: 6207569920 | elapsed time per iteration (s): 1.03 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.248498E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.615 | TFLOPs: 40.92 |
15: iteration 11850/ 125429 | consumed samples: 3033600 | consumed tokens: 6212812800 | elapsed time per iteration (s): 1.03 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.280703E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.844 | TFLOPs: 40.96 |
15: iteration 11860/ 125429 | consumed samples: 3036160 | consumed tokens: 6218055680 | elapsed time per iteration (s): 1.04 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.213765E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.479 | TFLOPs: 40.73 |
15: iteration 11870/ 125429 | consumed samples: 3038720 | consumed tokens: 6223298560 | elapsed time per iteration (s): 1.05 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.260769E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.736 | TFLOPs: 40.28 |
15: iteration 11880/ 125429 | consumed samples: 3041280 | consumed tokens: 6228541440 | elapsed time per iteration (s): 1.02 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.258872E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.903 | TFLOPs: 41.30 |
15: iteration 11890/ 125429 | consumed samples: 3043840 | consumed tokens: 6233784320 | elapsed time per iteration (s): 1.04 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.271619E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.034 | TFLOPs: 40.66 |
15: iteration 11900/ 125429 | consumed samples: 3046400 | consumed tokens: 6239027200 | elapsed time per iteration (s): 1.05 | learning rate: 1.968E-04 | global batch size: 256 | lm loss: 2.235212E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.062 | TFLOPs: 40.33 |
15: iteration 11910/ 125429 | consumed samples: 3048960 | consumed tokens: 6244270080 | elapsed time per iteration (s): 1.08 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.287990E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.891 | TFLOPs: 39.15 |
15: iteration 11920/ 125429 | consumed samples: 3051520 | consumed tokens: 6249512960 | elapsed time per iteration (s): 1.11 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.284687E+00 | grad norm: 0.351 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.412 | TFLOPs: 38.24 |
15: iteration 11930/ 125429 | consumed samples: 3054080 | consumed tokens: 6254755840 | elapsed time per iteration (s): 1.08 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.286742E+00 | grad norm: 0.448 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.073 | TFLOPs: 39.34 |
15: iteration 11940/ 125429 | consumed samples: 3056640 | consumed tokens: 6259998720 | elapsed time per iteration (s): 1.04 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.285827E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.896 | TFLOPs: 40.80 |
15: iteration 11950/ 125429 | consumed samples: 3059200 | consumed tokens: 6265241600 | elapsed time per iteration (s): 1.04 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.225853E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.467 | TFLOPs: 40.73 |
15: iteration 11960/ 125429 | consumed samples: 3061760 | consumed tokens: 6270484480 | elapsed time per iteration (s): 1.04 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.263559E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.614 | TFLOPs: 40.75 |
15: iteration 11970/ 125429 | consumed samples: 3064320 | consumed tokens: 6275727360 | elapsed time per iteration (s): 1.08 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.260848E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.323 | TFLOPs: 39.22 |
15: iteration 11980/ 125429 | consumed samples: 3066880 | consumed tokens: 6280970240 | elapsed time per iteration (s): 1.07 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.289470E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.359 | TFLOPs: 39.72 |
15: iteration 11990/ 125429 | consumed samples: 3069440 | consumed tokens: 6286213120 | elapsed time per iteration (s): 1.09 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.262401E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.440 | TFLOPs: 38.74 |
0: [2022-11-25 23:22:20,069] [INFO] [logging.py:68:log_dist] [Rank 0] step=12000, skipped=0, lr=[0.00019669448373009732, 0.00019669448373009732, 0.00019669448373009732], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
15: iteration 12000/ 125429 | consumed samples: 3072000 | consumed tokens: 6291456000 | elapsed time per iteration (s): 1.05 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.284188E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.850 | TFLOPs: 40.13 |
0: steps: 12000 loss: 2.3367 iter time (s): 1.173 samples/sec: 218.332
15: -------------------------------------------------------------------------------------------
15: valid loss at iteration 12000 | lm loss value: 2.249554E+00 | lm loss PPL: 9.483509E+00 |
15: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 12000 to checkpoints_1b5
0: [2022-11-25 23:22:20,416] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step12000 is begin to save!
0: [2022-11-25 23:22:20,424] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_01-model_00-model_states.pt...
0: [2022-11-25 23:22:20,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_01-model_00-model_states.pt. 0: [2022-11-25 23:22:20,652] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_03-model_00-model_states.pt... 0: [2022-11-25 23:22:20,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_03-model_00-model_states.pt. 0: [2022-11-25 23:22:20,760] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_04-model_00-model_states.pt... 0: [2022-11-25 23:22:20,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_04-model_00-model_states.pt. 0: [2022-11-25 23:22:20,869] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_05-model_00-model_states.pt... 0: [2022-11-25 23:22:20,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_05-model_00-model_states.pt. 0: [2022-11-25 23:22:20,975] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_06-model_00-model_states.pt... 0: [2022-11-25 23:22:21,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_06-model_00-model_states.pt. 0: [2022-11-25 23:22:21,082] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_07-model_00-model_states.pt... 0: [2022-11-25 23:22:21,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_07-model_00-model_states.pt. 0: [2022-11-25 23:22:21,183] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_08-model_00-model_states.pt... 
0: [2022-11-25 23:22:21,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_08-model_00-model_states.pt. 0: [2022-11-25 23:22:21,292] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_09-model_00-model_states.pt... 0: [2022-11-25 23:22:21,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_09-model_00-model_states.pt. 0: [2022-11-25 23:22:21,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_10-model_00-model_states.pt... 0: [2022-11-25 23:22:21,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_10-model_00-model_states.pt. 0: [2022-11-25 23:22:21,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_11-model_00-model_states.pt... 0: [2022-11-25 23:22:21,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_11-model_00-model_states.pt. 0: [2022-11-25 23:22:21,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_12-model_00-model_states.pt... 0: [2022-11-25 23:22:21,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_12-model_00-model_states.pt. 0: [2022-11-25 23:22:21,717] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_13-model_00-model_states.pt... 0: [2022-11-25 23:22:21,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_13-model_00-model_states.pt. 0: [2022-11-25 23:22:21,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_14-model_00-model_states.pt... 
0: [2022-11-25 23:22:21,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_14-model_00-model_states.pt. 0: [2022-11-25 23:22:21,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_15-model_00-model_states.pt... 0: [2022-11-25 23:22:22,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_15-model_00-model_states.pt. 0: [2022-11-25 23:22:22,040] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_16-model_00-model_states.pt... 0: [2022-11-25 23:22:22,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_16-model_00-model_states.pt. 0: [2022-11-25 23:22:22,146] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_17-model_00-model_states.pt... 0: [2022-11-25 23:22:22,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_17-model_00-model_states.pt. 0: [2022-11-25 23:22:22,253] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_18-model_00-model_states.pt... 0: [2022-11-25 23:22:22,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_18-model_00-model_states.pt. 0: [2022-11-25 23:22:22,360] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_19-model_00-model_states.pt... 0: [2022-11-25 23:22:22,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_19-model_00-model_states.pt. 0: [2022-11-25 23:22:22,469] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_20-model_00-model_states.pt... 
0: [2022-11-25 23:22:22,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_20-model_00-model_states.pt. 0: [2022-11-25 23:22:22,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_21-model_00-model_states.pt... 0: [2022-11-25 23:22:22,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_21-model_00-model_states.pt. 0: [2022-11-25 23:22:22,682] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_22-model_00-model_states.pt... 0: [2022-11-25 23:22:22,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_22-model_00-model_states.pt. 0: [2022-11-25 23:22:22,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_23-model_00-model_states.pt... 0: [2022-11-25 23:22:22,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_23-model_00-model_states.pt. 0: [2022-11-25 23:22:22,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_24-model_00-model_states.pt... 0: [2022-11-25 23:22:22,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_24-model_00-model_states.pt. 0: [2022-11-25 23:22:23,000] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_25-model_00-model_states.pt... 0: [2022-11-25 23:22:23,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_25-model_00-model_states.pt. 0: [2022-11-25 23:22:23,105] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_26-model_00-model_states.pt... 
0: [2022-11-25 23:22:23,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_26-model_00-model_states.pt. 0: [2022-11-25 23:22:23,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_27-model_00-model_states.pt... 0: [2022-11-25 23:22:23,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_27-model_00-model_states.pt. 0: [2022-11-25 23:22:23,320] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_28-model_00-model_states.pt... 0: [2022-11-25 23:22:23,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_28-model_00-model_states.pt. 0: [2022-11-25 23:22:23,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_29-model_00-model_states.pt... 0: [2022-11-25 23:22:23,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_29-model_00-model_states.pt. 0: [2022-11-25 23:22:23,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_30-model_00-model_states.pt... 0: [2022-11-25 23:22:23,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_30-model_00-model_states.pt. 0: [2022-11-25 23:22:23,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/layer_32-model_00-model_states.pt... 0: [2022-11-25 23:22:23,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/layer_32-model_00-model_states.pt. 
0: [2022-11-25 23:22:23,628] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step12000/mp_rank_00_model_states.pt 0: [2022-11-25 23:22:23,628] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/mp_rank_00_model_states.pt... 0: [2022-11-25 23:22:23,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/mp_rank_00_model_states.pt. 0: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
1: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
5: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
14: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
7: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
6: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:22:23,672] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:22:23,672] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:22:23,672] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:22:23,672] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:22:23,672] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
3: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
8: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
15: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
7: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:22:23,672] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:22:23,672] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:22:23,672] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
3: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:22:23,671] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step12000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:22:23,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:22:23,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-25 23:22:23,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 23:22:23,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-25 23:22:23,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:22:23,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 23:22:23,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-25 23:22:23,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:22:23,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 23:22:23,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 13: [2022-11-25 23:22:23,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:22:23,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 23:22:23,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 11: [2022-11-25 23:22:23,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-25 23:22:23,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:22:23,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 23:22:23,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 11: [2022-11-25 23:22:23,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-25 23:22:23,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-25 23:22:23,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:22:23,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 23:22:23,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-25 23:22:23,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:22:23,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 1: [2022-11-25 23:22:23,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:22:23,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
2: [2022-11-25 23:22:23,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:22:23,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 23:22:23,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 11: [2022-11-25 23:22:23,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:22:23,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 23:22:23,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 10: [2022-11-25 23:22:23,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:22:23,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-25 23:22:23,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 3: [2022-11-25 23:22:23,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:22:23,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 23:22:23,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
2: [2022-11-25 23:22:23,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:22:23,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 9: [2022-11-25 23:22:23,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:22:23,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 2: [2022-11-25 23:22:23,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-25 23:22:23,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 23:22:23,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 9: [2022-11-25 23:22:23,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-25 23:22:23,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:22:23,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 23:22:23,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 9: [2022-11-25 23:22:23,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-25 23:22:23,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:22:23,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 23:22:23,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 23:22:23,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 9: [2022-11-25 23:22:23,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-25 23:22:23,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:22:23,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 23:22:23,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 10: [2022-11-25 23:22:23,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:22:23,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-25 23:22:23,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 23:22:23,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 23:22:23,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 10: [2022-11-25 23:22:23,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 11: [2022-11-25 23:22:23,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:22:23,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 23:22:23,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-25 23:22:23,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:22:23,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:22:23,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
2: [2022-11-25 23:22:23,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 23:22:23,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 3: [2022-11-25 23:22:23,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 23:22:23,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-25 23:22:23,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-25 23:22:23,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-25 23:22:23,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:22:23,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 23:22:23,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-25 23:22:23,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:22:23,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 23:22:23,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
1: [2022-11-25 23:22:23,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:22:23,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 23:22:23,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 10: [2022-11-25 23:22:23,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:22:23,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 23:22:23,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-25 23:22:23,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:22:23,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 23:22:23,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 13: [2022-11-25 23:22:23,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:22:23,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 23:22:23,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
13: [2022-11-25 23:22:23,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:22:23,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 23:22:23,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 13: [2022-11-25 23:22:23,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:22:23,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 23:22:23,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 13: [2022-11-25 23:22:23,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:22:23,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 23:22:23,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-25 23:22:23,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:22:23,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 23:22:23,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
0: [2022-11-25 23:22:23,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:22:23,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 23:22:23,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-25 23:22:23,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:22:23,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 9: [2022-11-25 23:22:23,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:22:23,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 9: [2022-11-25 23:22:23,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 23:22:23,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-25 23:22:23,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:22:23,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 23:22:23,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
3: [2022-11-25 23:22:23,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:22:23,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:22:23,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 23:22:23,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 23:22:23,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 3: [2022-11-25 23:22:23,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-25 23:22:23,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:22:23,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 23:22:23,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 0: [2022-11-25 23:22:23,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:22:23,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 23:22:23,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
3: [2022-11-25 23:22:23,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:22:23,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 23:22:23,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 11: [2022-11-25 23:22:23,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:22:23,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:22:23,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:22:23,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 23:22:23,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 23:22:23,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 23:22:23,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 11: [2022-11-25 23:22:23,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 11: [2022-11-25 23:22:23,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
1: [2022-11-25 23:22:23,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:22:23,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:22:23,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 23:22:23,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 10: [2022-11-25 23:22:23,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:22:23,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 23:22:23,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 9: [2022-11-25 23:22:23,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:22:23,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:22:23,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 23:22:23,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 23:22:23,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
9: [2022-11-25 23:22:23,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-25 23:22:23,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:22:23,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:22:23,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 23:22:23,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 23:22:23,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-25 23:22:23,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-25 23:22:23,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 23:22:23,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-25 23:22:23,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:22:23,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 23:22:23,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
1: [2022-11-25 23:22:23,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:22:23,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 23:22:23,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-25 23:22:23,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:22:23,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:22:23,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 7: [2022-11-25 23:22:23,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:22:23,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:22:23,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 1: [2022-11-25 23:22:23,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-25 23:22:23,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 23:22:23,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
7: [2022-11-25 23:22:23,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 10: [2022-11-25 23:22:23,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 23:22:23,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-25 23:22:23,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:22:23,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 23:22:23,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-25 23:22:23,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:22:23,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 13: [2022-11-25 23:22:23,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:22:23,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 13: [2022-11-25 23:22:23,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 23:22:23,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
0: [2022-11-25 23:22:23,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 23:22:23,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-25 23:22:23,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:22:23,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:22:23,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 23:22:23,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 23:22:23,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 7: [2022-11-25 23:22:23,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-25 23:22:23,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:22:23,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 23:22:23,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-25 23:22:23,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-25 23:22:23,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 23:22:23,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-25 23:22:23,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:22:23,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 23:22:23,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-25 23:22:23,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:22:23,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 23:22:23,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-25 23:22:23,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:22:23,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 23:22:23,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-25 23:22:23,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-25 23:22:23,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:22:23,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 23:22:23,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-25 23:22:23,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 23:22:23,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 15: [2022-11-25 23:22:23,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:22:23,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:22:23,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 23:22:23,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 3: [2022-11-25 23:22:23,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:22:23,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 23:22:23,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
15: [2022-11-25 23:22:23,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 23:22:23,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 15: [2022-11-25 23:22:23,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:22:23,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:22:23,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 23:22:23,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 23:22:23,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 15: [2022-11-25 23:22:23,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 15: [2022-11-25 23:22:23,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:22:23,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 23:22:23,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 15: [2022-11-25 23:22:23,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
0: [2022-11-25 23:22:23,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:22:23,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 23:22:23,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-25 23:22:23,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:22:23,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 23:22:23,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 10: [2022-11-25 23:22:23,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:22:23,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 23:22:23,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-25 23:22:23,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:22:23,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 23:22:23,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
4: [2022-11-25 23:22:23,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:22:23,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 23:22:23,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-25 23:22:23,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:22:23,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 23:22:23,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:22:23,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-25 23:22:23,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 23:22:23,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 11: [2022-11-25 23:22:23,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:22:23,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-25 23:22:23,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
15: [2022-11-25 23:22:23,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:22:23,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 23:22:23,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 23:22:23,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 15: [2022-11-25 23:22:23,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 15: [2022-11-25 23:22:23,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:22:23,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 23:22:23,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 8: [2022-11-25 23:22:23,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:22:23,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:22:23,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:22:23,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-25 23:22:23,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:22:23,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:22:23,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-25 23:22:23,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 23:22:23,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 23:22:23,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 23:22:23,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 23:22:23,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 23:22:23,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 8: [2022-11-25 23:22:23,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 8: [2022-11-25 23:22:23,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 8: [2022-11-25 23:22:23,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
8: [2022-11-25 23:22:23,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 8: [2022-11-25 23:22:23,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 6: [2022-11-25 23:22:23,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:22:23,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 23:22:23,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-25 23:22:23,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:22:23,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:22:23,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 23:22:23,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 23:22:23,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 5: [2022-11-25 23:22:23,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 8: [2022-11-25 23:22:23,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-25 23:22:23,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 23:22:23,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 15: [2022-11-25 23:22:23,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:22:23,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:22:23,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 23:22:23,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 15: [2022-11-25 23:22:23,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 23:22:23,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 14: [2022-11-25 23:22:23,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:22:23,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:22:23,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:22:23,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-25 23:22:23,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:22:23,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:22:23,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 23:22:23,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 23:22:23,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 23:22:23,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:22:23,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 23:22:23,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 23:22:23,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 23:22:23,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 14: [2022-11-25 23:22:23,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
14: [2022-11-25 23:22:23,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 14: [2022-11-25 23:22:23,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 23:22:23,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 14: [2022-11-25 23:22:23,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 14: [2022-11-25 23:22:23,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 14: [2022-11-25 23:22:23,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 14: [2022-11-25 23:22:23,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:22:23,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 23:22:23,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 13: [2022-11-25 23:22:23,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:22:23,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 23:22:23,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-25 23:22:23,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-25 23:22:23,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 23:22:23,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-25 23:22:23,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:22:23,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 23:22:23,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 2: [2022-11-25 23:22:23,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:22:23,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 23:22:23,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 1: [2022-11-25 23:22:23,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:22:23,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 23:22:23,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 9: [2022-11-25 23:22:23,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-25 23:22:23,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 23:22:23,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 10: [2022-11-25 23:22:23,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:22:23,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 23:22:23,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 4: [2022-11-25 23:22:23,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:22:23,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 23:22:23,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 3: [2022-11-25 23:22:24,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:22:24,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 23:22:24,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 12: [2022-11-25 23:22:24,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-25 23:22:24,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:22:24,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:22:24,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:22:24,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:22:24,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:22:24,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:22:24,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-25 23:22:24,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 23:22:24,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 23:22:24,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 23:22:24,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 23:22:24,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 23:22:24,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 23:22:24,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 23:22:24,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 12: [2022-11-25 23:22:24,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 12: [2022-11-25 23:22:24,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 12: [2022-11-25 23:22:24,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 12: [2022-11-25 23:22:24,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
12: [2022-11-25 23:22:24,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 23:22:24,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 12: [2022-11-25 23:22:24,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 12: [2022-11-25 23:22:24,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 8: [2022-11-25 23:22:24,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:22:24,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step12000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 23:22:24,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step12000 is ready now! 
0: successfully saved checkpoint at iteration 12000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3663.35 15: iteration 12010/ 125429 | consumed samples: 3074560 | consumed tokens: 6296698880 | elapsed time per iteration (s): 1.43 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.261963E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 179.393 | TFLOPs: 29.65 | 15: iteration 12020/ 125429 | consumed samples: 3077120 | consumed tokens: 6301941760 | elapsed time per iteration (s): 1.04 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.223218E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.358 | TFLOPs: 40.71 | 15: iteration 12030/ 125429 | consumed samples: 3079680 | consumed tokens: 6307184640 | elapsed time per iteration (s): 1.03 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.235951E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.733 | TFLOPs: 41.27 | 15: iteration 12040/ 125429 | consumed samples: 3082240 | consumed tokens: 6312427520 | elapsed time per iteration (s): 1.06 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.252764E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.637 | TFLOPs: 39.77 | 15: iteration 12050/ 125429 | consumed samples: 3084800 | consumed tokens: 6317670400 | elapsed time per iteration (s): 1.03 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.235293E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.033 | TFLOPs: 40.99 | 15: iteration 12060/ 125429 | consumed samples: 3087360 | consumed tokens: 6322913280 | elapsed time per iteration (s): 1.07 | learning rate: 
1.967E-04 | global batch size: 256 | lm loss: 2.243110E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.973 | TFLOPs: 39.66 | 15: iteration 12070/ 125429 | consumed samples: 3089920 | consumed tokens: 6328156160 | elapsed time per iteration (s): 1.05 | learning rate: 1.967E-04 | global batch size: 256 | lm loss: 2.267959E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.589 | TFLOPs: 40.26 | 15: iteration 12080/ 125429 | consumed samples: 3092480 | consumed tokens: 6333399040 | elapsed time per iteration (s): 1.08 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.249212E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.731 | TFLOPs: 39.12 | 15: iteration 12090/ 125429 | consumed samples: 3095040 | consumed tokens: 6338641920 | elapsed time per iteration (s): 1.04 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.243905E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.281 | TFLOPs: 40.87 | 15: iteration 12100/ 125429 | consumed samples: 3097600 | consumed tokens: 6343884800 | elapsed time per iteration (s): 1.05 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.241026E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.608 | TFLOPs: 40.42 | 15: iteration 12110/ 125429 | consumed samples: 3100160 | consumed tokens: 6349127680 | elapsed time per iteration (s): 1.08 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.216154E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.723 | TFLOPs: 39.12 | 15: iteration 12120/ 125429 | consumed samples: 
3102720 | consumed tokens: 6354370560 | elapsed time per iteration (s): 1.03 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.269254E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.971 | TFLOPs: 40.98 | 15: iteration 12130/ 125429 | consumed samples: 3105280 | consumed tokens: 6359613440 | elapsed time per iteration (s): 1.04 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.273335E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.849 | TFLOPs: 40.79 | 15: iteration 12140/ 125429 | consumed samples: 3107840 | consumed tokens: 6364856320 | elapsed time per iteration (s): 1.03 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.269383E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.193 | TFLOPs: 41.18 | 15: iteration 12150/ 125429 | consumed samples: 3110400 | consumed tokens: 6370099200 | elapsed time per iteration (s): 1.05 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.280544E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.745 | TFLOPs: 40.28 | 15: iteration 12160/ 125429 | consumed samples: 3112960 | consumed tokens: 6375342080 | elapsed time per iteration (s): 1.11 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.346100E+00 | grad norm: 8.981 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.176 | TFLOPs: 38.20 | 15: iteration 12170/ 125429 | consumed samples: 3115520 | consumed tokens: 6380584960 | elapsed time per iteration (s): 1.07 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.484434E+00 | grad norm: 0.596 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 239.539 | TFLOPs: 39.59 | 15: iteration 12180/ 125429 | consumed samples: 3118080 | consumed tokens: 6385827840 | elapsed time per iteration (s): 1.04 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.338243E+00 | grad norm: 0.245 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.589 | TFLOPs: 40.75 | 15: iteration 12190/ 125429 | consumed samples: 3120640 | consumed tokens: 6391070720 | elapsed time per iteration (s): 1.02 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.291211E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.431 | TFLOPs: 41.55 | 15: iteration 12200/ 125429 | consumed samples: 3123200 | consumed tokens: 6396313600 | elapsed time per iteration (s): 1.05 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.304391E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.857 | TFLOPs: 40.13 | 15: iteration 12210/ 125429 | consumed samples: 3125760 | consumed tokens: 6401556480 | elapsed time per iteration (s): 2.35 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.292594E+00 | grad norm: 0.218 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 109.004 | TFLOPs: 18.01 | 15: iteration 12220/ 125429 | consumed samples: 3128320 | consumed tokens: 6406799360 | elapsed time per iteration (s): 1.04 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.275171E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.822 | TFLOPs: 40.62 | 15: iteration 12230/ 125429 | consumed samples: 3130880 | consumed tokens: 6412042240 | elapsed time per iteration (s): 1.05 | learning rate: 1.966E-04 | global batch size: 256 | lm loss: 2.274141E+00 | grad 
norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.608 | TFLOPs: 40.26 | 15: iteration 12240/ 125429 | consumed samples: 3133440 | consumed tokens: 6417285120 | elapsed time per iteration (s): 1.05 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.239418E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.336 | TFLOPs: 40.21 | 15: iteration 12250/ 125429 | consumed samples: 3136000 | consumed tokens: 6422528000 | elapsed time per iteration (s): 1.05 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.266985E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.791 | TFLOPs: 40.12 | 15: iteration 12260/ 125429 | consumed samples: 3138560 | consumed tokens: 6427770880 | elapsed time per iteration (s): 1.03 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.267959E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.845 | TFLOPs: 40.96 | 15: iteration 12270/ 125429 | consumed samples: 3141120 | consumed tokens: 6433013760 | elapsed time per iteration (s): 1.04 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.269217E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.268 | TFLOPs: 40.70 | 15: iteration 12280/ 125429 | consumed samples: 3143680 | consumed tokens: 6438256640 | elapsed time per iteration (s): 1.03 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.262454E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.611 | TFLOPs: 40.92 | 15: iteration 12290/ 125429 | consumed samples: 3146240 | consumed tokens: 6443499520 | elapsed time per 
iteration (s): 1.04 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.271391E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.036 | TFLOPs: 40.82 | 15: iteration 12300/ 125429 | consumed samples: 3148800 | consumed tokens: 6448742400 | elapsed time per iteration (s): 1.06 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.261098E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.972 | TFLOPs: 39.82 | 15: iteration 12310/ 125429 | consumed samples: 3151360 | consumed tokens: 6453985280 | elapsed time per iteration (s): 1.04 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.281467E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.972 | TFLOPs: 40.81 | 15: iteration 12320/ 125429 | consumed samples: 3153920 | consumed tokens: 6459228160 | elapsed time per iteration (s): 1.06 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.244836E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.450 | TFLOPs: 40.07 | 15: iteration 12330/ 125429 | consumed samples: 3156480 | consumed tokens: 6464471040 | elapsed time per iteration (s): 1.06 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.281913E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.375 | TFLOPs: 39.89 | 15: iteration 12340/ 125429 | consumed samples: 3159040 | consumed tokens: 6469713920 | elapsed time per iteration (s): 1.04 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.268027E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.748 | TFLOPs: 40.61 | 15: 
iteration 12350/ 125429 | consumed samples: 3161600 | consumed tokens: 6474956800 | elapsed time per iteration (s): 1.04 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.262733E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.491 | TFLOPs: 40.73 | 15: iteration 12360/ 125429 | consumed samples: 3164160 | consumed tokens: 6480199680 | elapsed time per iteration (s): 1.04 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.269292E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.285 | TFLOPs: 40.87 | 15: iteration 12370/ 125429 | consumed samples: 3166720 | consumed tokens: 6485442560 | elapsed time per iteration (s): 1.03 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.262149E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.696 | TFLOPs: 41.26 | 15: iteration 12380/ 125429 | consumed samples: 3169280 | consumed tokens: 6490685440 | elapsed time per iteration (s): 1.05 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.271967E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.849 | TFLOPs: 40.46 | 15: iteration 12390/ 125429 | consumed samples: 3171840 | consumed tokens: 6495928320 | elapsed time per iteration (s): 1.02 | learning rate: 1.965E-04 | global batch size: 256 | lm loss: 2.236424E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.319 | TFLOPs: 41.37 | 15: iteration 12400/ 125429 | consumed samples: 3174400 | consumed tokens: 6501171200 | elapsed time per iteration (s): 1.04 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.283281E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 246.641 | TFLOPs: 40.76 | 15: iteration 12410/ 125429 | consumed samples: 3176960 | consumed tokens: 6506414080 | elapsed time per iteration (s): 1.05 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.286605E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.031 | TFLOPs: 40.16 | 15: iteration 12420/ 125429 | consumed samples: 3179520 | consumed tokens: 6511656960 | elapsed time per iteration (s): 1.04 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.255901E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.719 | TFLOPs: 40.61 | 15: iteration 12430/ 125429 | consumed samples: 3182080 | consumed tokens: 6516899840 | elapsed time per iteration (s): 1.06 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.263821E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.172 | TFLOPs: 40.02 | 15: iteration 12440/ 125429 | consumed samples: 3184640 | consumed tokens: 6522142720 | elapsed time per iteration (s): 1.05 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.237077E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.807 | TFLOPs: 40.46 | 15: iteration 12450/ 125429 | consumed samples: 3187200 | consumed tokens: 6527385600 | elapsed time per iteration (s): 1.06 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.255959E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.206 | TFLOPs: 40.03 | 15: iteration 12460/ 125429 | consumed samples: 3189760 | consumed tokens: 6532628480 | elapsed time per iteration (s): 1.05 | learning rate: 1.964E-04 | global 
batch size: 256 | lm loss: 2.284845E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.795 | TFLOPs: 40.45 | 15: iteration 12470/ 125429 | consumed samples: 3192320 | consumed tokens: 6537871360 | elapsed time per iteration (s): 1.04 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.255116E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.324 | TFLOPs: 40.71 | 15: iteration 12480/ 125429 | consumed samples: 3194880 | consumed tokens: 6543114240 | elapsed time per iteration (s): 1.08 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.232447E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.502 | TFLOPs: 39.25 | 15: iteration 12490/ 125429 | consumed samples: 3197440 | consumed tokens: 6548357120 | elapsed time per iteration (s): 1.15 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.230635E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.327 | TFLOPs: 36.74 | 15: iteration 12500/ 125429 | consumed samples: 3200000 | consumed tokens: 6553600000 | elapsed time per iteration (s): 1.04 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.233429E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.099 | TFLOPs: 40.50 | 15: iteration 12510/ 125429 | consumed samples: 3202560 | consumed tokens: 6558842880 | elapsed time per iteration (s): 1.05 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.236726E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.884 | TFLOPs: 40.47 | 15: iteration 12520/ 125429 | consumed samples: 3205120 | consumed 
tokens: 6564085760 | elapsed time per iteration (s): 1.12 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.261669E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.154 | TFLOPs: 37.87 | 15: iteration 12530/ 125429 | consumed samples: 3207680 | consumed tokens: 6569328640 | elapsed time per iteration (s): 1.11 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.240582E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.905 | TFLOPs: 38.16 | 15: iteration 12540/ 125429 | consumed samples: 3210240 | consumed tokens: 6574571520 | elapsed time per iteration (s): 1.07 | learning rate: 1.964E-04 | global batch size: 256 | lm loss: 2.223289E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.825 | TFLOPs: 39.63 | 15: iteration 12550/ 125429 | consumed samples: 3212800 | consumed tokens: 6579814400 | elapsed time per iteration (s): 1.09 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.254604E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.593 | TFLOPs: 38.93 | 15: iteration 12560/ 125429 | consumed samples: 3215360 | consumed tokens: 6585057280 | elapsed time per iteration (s): 1.09 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.288476E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.850 | TFLOPs: 38.81 | 15: iteration 12570/ 125429 | consumed samples: 3217920 | consumed tokens: 6590300160 | elapsed time per iteration (s): 1.07 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.251424E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 240.171 | TFLOPs: 39.69 | 15: iteration 12580/ 125429 | consumed samples: 3220480 | consumed tokens: 6595543040 | elapsed time per iteration (s): 1.04 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.253613E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.442 | TFLOPs: 40.56 | 15: iteration 12590/ 125429 | consumed samples: 3223040 | consumed tokens: 6600785920 | elapsed time per iteration (s): 1.07 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.272112E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.851 | TFLOPs: 39.64 | 15: iteration 12600/ 125429 | consumed samples: 3225600 | consumed tokens: 6606028800 | elapsed time per iteration (s): 1.03 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.247456E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.273 | TFLOPs: 41.03 | 15: iteration 12610/ 125429 | consumed samples: 3228160 | consumed tokens: 6611271680 | elapsed time per iteration (s): 1.04 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.241143E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.604 | TFLOPs: 40.75 | 15: iteration 12620/ 125429 | consumed samples: 3230720 | consumed tokens: 6616514560 | elapsed time per iteration (s): 1.07 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.221278E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.819 | TFLOPs: 39.63 | 15: iteration 12630/ 125429 | consumed samples: 3233280 | consumed tokens: 6621757440 | elapsed time per iteration (s): 1.03 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.275872E+00 | grad norm: 0.161 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.436 | TFLOPs: 40.89 | 15: iteration 12640/ 125429 | consumed samples: 3235840 | consumed tokens: 6627000320 | elapsed time per iteration (s): 1.03 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.253557E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.849 | TFLOPs: 40.96 | 15: iteration 12650/ 125429 | consumed samples: 3238400 | consumed tokens: 6632243200 | elapsed time per iteration (s): 1.04 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.231592E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.668 | TFLOPs: 40.60 | 15: iteration 12660/ 125429 | consumed samples: 3240960 | consumed tokens: 6637486080 | elapsed time per iteration (s): 1.03 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.242438E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.743 | TFLOPs: 40.94 | 15: iteration 12670/ 125429 | consumed samples: 3243520 | consumed tokens: 6642728960 | elapsed time per iteration (s): 1.05 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.242099E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.661 | TFLOPs: 40.43 | 15: iteration 12680/ 125429 | consumed samples: 3246080 | consumed tokens: 6647971840 | elapsed time per iteration (s): 1.04 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.273002E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.890 | TFLOPs: 40.64 | 15: iteration 12690/ 125429 | consumed samples: 3248640 | consumed tokens: 6653214720 | elapsed time per iteration (s): 1.05 
| learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.235157E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.228 | TFLOPs: 40.20 | 15: iteration 12700/ 125429 | consumed samples: 3251200 | consumed tokens: 6658457600 | elapsed time per iteration (s): 1.02 | learning rate: 1.963E-04 | global batch size: 256 | lm loss: 2.250610E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.342 | TFLOPs: 41.37 | 15: iteration 12710/ 125429 | consumed samples: 3253760 | consumed tokens: 6663700480 | elapsed time per iteration (s): 1.05 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.236231E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.103 | TFLOPs: 40.17 | 15: iteration 12720/ 125429 | consumed samples: 3256320 | consumed tokens: 6668943360 | elapsed time per iteration (s): 1.06 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.233020E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.038 | TFLOPs: 40.00 | 15: iteration 12730/ 125429 | consumed samples: 3258880 | consumed tokens: 6674186240 | elapsed time per iteration (s): 1.04 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.228306E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.648 | TFLOPs: 40.60 | 15: iteration 12740/ 125429 | consumed samples: 3261440 | consumed tokens: 6679429120 | elapsed time per iteration (s): 1.06 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.239948E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.060 | TFLOPs: 39.84 | 15: iteration 12750/ 125429 | 
consumed samples: 3264000 | consumed tokens: 6684672000 | elapsed time per iteration (s): 1.04 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.274604E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.154 | TFLOPs: 40.51 | 15: iteration 12760/ 125429 | consumed samples: 3266560 | consumed tokens: 6689914880 | elapsed time per iteration (s): 1.06 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.247739E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.862 | TFLOPs: 39.80 | 15: iteration 12770/ 125429 | consumed samples: 3269120 | consumed tokens: 6695157760 | elapsed time per iteration (s): 1.03 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.235687E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.746 | TFLOPs: 41.11 | 15: iteration 12780/ 125429 | consumed samples: 3271680 | consumed tokens: 6700400640 | elapsed time per iteration (s): 1.06 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.252106E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.920 | TFLOPs: 39.98 | 15: iteration 12790/ 125429 | consumed samples: 3274240 | consumed tokens: 6705643520 | elapsed time per iteration (s): 1.03 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.237929E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.858 | TFLOPs: 40.96 | 15: iteration 12800/ 125429 | consumed samples: 3276800 | consumed tokens: 6710886400 | elapsed time per iteration (s): 1.04 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.256137E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 247.206 | TFLOPs: 40.85 | 15: iteration 12810/ 125429 | consumed samples: 3279360 | consumed tokens: 6716129280 | elapsed time per iteration (s): 1.03 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.249080E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.278 | TFLOPs: 41.03 | 15: iteration 12820/ 125429 | consumed samples: 3281920 | consumed tokens: 6721372160 | elapsed time per iteration (s): 1.05 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.240567E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.903 | TFLOPs: 40.47 | 15: iteration 12830/ 125429 | consumed samples: 3284480 | consumed tokens: 6726615040 | elapsed time per iteration (s): 1.04 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.244072E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.132 | TFLOPs: 40.84 | 15: iteration 12840/ 125429 | consumed samples: 3287040 | consumed tokens: 6731857920 | elapsed time per iteration (s): 1.05 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.241890E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.421 | TFLOPs: 40.39 | 15: iteration 12850/ 125429 | consumed samples: 3289600 | consumed tokens: 6737100800 | elapsed time per iteration (s): 1.06 | learning rate: 1.962E-04 | global batch size: 256 | lm loss: 2.204053E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.196 | TFLOPs: 39.86 | 15: iteration 12860/ 125429 | consumed samples: 3292160 | consumed tokens: 6742343680 | elapsed time per iteration (s): 1.05 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 
2.235252E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.830 | TFLOPs: 40.13 | 15: iteration 12870/ 125429 | consumed samples: 3294720 | consumed tokens: 6747586560 | elapsed time per iteration (s): 1.02 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.233959E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.041 | TFLOPs: 41.32 | 15: iteration 12880/ 125429 | consumed samples: 3297280 | consumed tokens: 6752829440 | elapsed time per iteration (s): 1.10 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.250680E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.186 | TFLOPs: 38.37 | 15: iteration 12890/ 125429 | consumed samples: 3299840 | consumed tokens: 6758072320 | elapsed time per iteration (s): 1.06 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.252006E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.378 | TFLOPs: 39.89 | 15: iteration 12900/ 125429 | consumed samples: 3302400 | consumed tokens: 6763315200 | elapsed time per iteration (s): 1.05 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.244614E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.690 | TFLOPs: 40.27 | 15: iteration 12910/ 125429 | consumed samples: 3304960 | consumed tokens: 6768558080 | elapsed time per iteration (s): 1.08 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.244668E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.844 | TFLOPs: 39.14 | 15: iteration 12920/ 125429 | consumed samples: 3307520 | consumed tokens: 6773800960 | 
elapsed time per iteration (s): 1.04 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.282026E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.350 | TFLOPs: 40.55 | 15: iteration 12930/ 125429 | consumed samples: 3310080 | consumed tokens: 6779043840 | elapsed time per iteration (s): 1.03 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.254756E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.758 | TFLOPs: 40.94 | 15: iteration 12940/ 125429 | consumed samples: 3312640 | consumed tokens: 6784286720 | elapsed time per iteration (s): 1.02 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.258503E+00 | grad norm: 0.251 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.143 | TFLOPs: 41.50 | 15: iteration 12950/ 125429 | consumed samples: 3315200 | consumed tokens: 6789529600 | elapsed time per iteration (s): 1.03 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.240170E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.914 | TFLOPs: 41.13 | 15: iteration 12960/ 125429 | consumed samples: 3317760 | consumed tokens: 6794772480 | elapsed time per iteration (s): 1.04 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.217174E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.149 | TFLOPs: 40.68 | 15: iteration 12970/ 125429 | consumed samples: 3320320 | consumed tokens: 6800015360 | elapsed time per iteration (s): 1.07 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.228154E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.193 | TFLOPs: 
39.36 | 15: iteration 12980/ 125429 | consumed samples: 3322880 | consumed tokens: 6805258240 | elapsed time per iteration (s): 1.07 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.259217E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.363 | TFLOPs: 39.72 | 15: iteration 12990/ 125429 | consumed samples: 3325440 | consumed tokens: 6810501120 | elapsed time per iteration (s): 1.05 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.213166E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.891 | TFLOPs: 40.47 | 15: iteration 13000/ 125429 | consumed samples: 3328000 | consumed tokens: 6815744000 | elapsed time per iteration (s): 1.06 | learning rate: 1.961E-04 | global batch size: 256 | lm loss: 2.252230E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.408 | TFLOPs: 40.06 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 13000 | lm loss value: 2.204823E+00 | lm loss PPL: 9.068644E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 13000 to checkpoints_1b5 0: [2022-11-25 23:40:08,032] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step13000 is begin to save! 0: [2022-11-25 23:40:08,041] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_01-model_00-model_states.pt... 0: [2022-11-25 23:40:08,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_01-model_00-model_states.pt. 0: [2022-11-25 23:40:08,305] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_03-model_00-model_states.pt... 
0: [2022-11-25 23:40:08,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_03-model_00-model_states.pt. 0: [2022-11-25 23:40:08,424] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_04-model_00-model_states.pt... 0: [2022-11-25 23:40:08,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_04-model_00-model_states.pt. 0: [2022-11-25 23:40:08,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_05-model_00-model_states.pt... 0: [2022-11-25 23:40:08,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_05-model_00-model_states.pt. 0: [2022-11-25 23:40:08,665] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_06-model_00-model_states.pt... 0: [2022-11-25 23:40:08,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_06-model_00-model_states.pt. 0: [2022-11-25 23:40:08,782] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_07-model_00-model_states.pt... 0: [2022-11-25 23:40:08,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_07-model_00-model_states.pt. 0: [2022-11-25 23:40:08,900] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_08-model_00-model_states.pt... 0: [2022-11-25 23:40:09,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_08-model_00-model_states.pt. 0: [2022-11-25 23:40:09,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_09-model_00-model_states.pt... 
0: [2022-11-25 23:40:09,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_09-model_00-model_states.pt. 0: [2022-11-25 23:40:09,124] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_10-model_00-model_states.pt... 0: [2022-11-25 23:40:09,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_10-model_00-model_states.pt. 0: [2022-11-25 23:40:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_11-model_00-model_states.pt... 0: [2022-11-25 23:40:09,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_11-model_00-model_states.pt. 0: [2022-11-25 23:40:09,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_12-model_00-model_states.pt... 0: [2022-11-25 23:40:09,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_12-model_00-model_states.pt. 0: [2022-11-25 23:40:09,466] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_13-model_00-model_states.pt... 0: [2022-11-25 23:40:09,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_13-model_00-model_states.pt. 0: [2022-11-25 23:40:09,578] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_14-model_00-model_states.pt... 0: [2022-11-25 23:40:09,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_14-model_00-model_states.pt. 0: [2022-11-25 23:40:09,694] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_15-model_00-model_states.pt... 
0: [2022-11-25 23:40:09,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_15-model_00-model_states.pt. 0: [2022-11-25 23:40:09,804] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_16-model_00-model_states.pt... 0: [2022-11-25 23:40:09,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_16-model_00-model_states.pt. 0: [2022-11-25 23:40:09,918] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_17-model_00-model_states.pt... 0: [2022-11-25 23:40:10,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_17-model_00-model_states.pt. 0: [2022-11-25 23:40:10,030] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_18-model_00-model_states.pt... 0: [2022-11-25 23:40:10,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_18-model_00-model_states.pt. 0: [2022-11-25 23:40:10,143] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_19-model_00-model_states.pt... 0: [2022-11-25 23:40:10,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_19-model_00-model_states.pt. 0: [2022-11-25 23:40:10,256] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_20-model_00-model_states.pt... 0: [2022-11-25 23:40:10,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_20-model_00-model_states.pt. 0: [2022-11-25 23:40:10,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_21-model_00-model_states.pt... 
0: [2022-11-25 23:40:10,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_21-model_00-model_states.pt. 0: [2022-11-25 23:40:10,475] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_22-model_00-model_states.pt... 0: [2022-11-25 23:40:10,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_22-model_00-model_states.pt. 0: [2022-11-25 23:40:10,585] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_23-model_00-model_states.pt... 0: [2022-11-25 23:40:10,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_23-model_00-model_states.pt. 0: [2022-11-25 23:40:10,694] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_24-model_00-model_states.pt... 0: [2022-11-25 23:40:10,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_24-model_00-model_states.pt. 0: [2022-11-25 23:40:10,803] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_25-model_00-model_states.pt... 0: [2022-11-25 23:40:10,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_25-model_00-model_states.pt. 0: [2022-11-25 23:40:10,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_26-model_00-model_states.pt... 0: [2022-11-25 23:40:11,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_26-model_00-model_states.pt. 0: [2022-11-25 23:40:11,019] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_27-model_00-model_states.pt... 
0: [2022-11-25 23:40:11,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_27-model_00-model_states.pt. 0: [2022-11-25 23:40:11,128] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_28-model_00-model_states.pt... 0: [2022-11-25 23:40:11,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_28-model_00-model_states.pt. 0: [2022-11-25 23:40:11,241] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_29-model_00-model_states.pt... 0: [2022-11-25 23:40:11,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_29-model_00-model_states.pt. 0: [2022-11-25 23:40:11,347] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_30-model_00-model_states.pt... 0: [2022-11-25 23:40:11,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_30-model_00-model_states.pt. 0: [2022-11-25 23:40:11,460] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/layer_32-model_00-model_states.pt... 0: [2022-11-25 23:40:11,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/layer_32-model_00-model_states.pt. 0: [2022-11-25 23:40:11,463] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step13000/mp_rank_00_model_states.pt 0: [2022-11-25 23:40:11,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/mp_rank_00_model_states.pt... 0: [2022-11-25 23:40:11,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/mp_rank_00_model_states.pt. 
0: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
5: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
14: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
11: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
1: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
7: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
9: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:40:11,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step13000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:40:11,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:40:11,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 23:40:11,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-25 23:40:11,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-25 23:40:11,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 23:40:11,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-25 23:40:11,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:40:11,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 23:40:11,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 14: [2022-11-25 23:40:11,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:40:11,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 23:40:11,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 10: [2022-11-25 23:40:11,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:40:11,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 23:40:11,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-25 23:40:11,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-25 23:40:11,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 23:40:11,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-25 23:40:11,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:40:11,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 23:40:11,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-25 23:40:11,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:40:11,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 23:40:11,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-25 23:40:11,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:40:11,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 23:40:11,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-25 23:40:11,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-25 23:40:11,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 23:40:11,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-25 23:40:11,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:40:11,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 23:40:11,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-25 23:40:11,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:40:11,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 23:40:11,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 10: [2022-11-25 23:40:11,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:40:11,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 23:40:11,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 10: [2022-11-25 23:40:11,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
7: [2022-11-25 23:40:11,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:40:11,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 7: [2022-11-25 23:40:11,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 23:40:11,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 10: [2022-11-25 23:40:11,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 10: [2022-11-25 23:40:11,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:40:11,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 23:40:11,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 11: [2022-11-25 23:40:11,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:40:11,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:40:11,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-25 23:40:11,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 10: [2022-11-25 23:40:11,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 23:40:11,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-25 23:40:11,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-25 23:40:11,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:40:11,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 23:40:11,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-25 23:40:11,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:40:11,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 23:40:11,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-25 23:40:11,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:40:11,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-25 23:40:11,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 23:40:11,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 23:40:11,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-25 23:40:11,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-25 23:40:11,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:40:11,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 23:40:11,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-25 23:40:11,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:40:11,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:40:11,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 23:40:11,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 8: [2022-11-25 23:40:11,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-25 23:40:11,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:40:11,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-25 23:40:11,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 8: [2022-11-25 23:40:11,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 11: [2022-11-25 23:40:11,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:40:11,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 11: [2022-11-25 23:40:11,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 8: [2022-11-25 23:40:11,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 8: [2022-11-25 23:40:11,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 11: [2022-11-25 23:40:11,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 8: [2022-11-25 23:40:11,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:40:11,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
8: [2022-11-25 23:40:11,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 11: [2022-11-25 23:40:11,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 8: [2022-11-25 23:40:11,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 11: [2022-11-25 23:40:11,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 8: [2022-11-25 23:40:11,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:40:11,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 23:40:11,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 8: [2022-11-25 23:40:11,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:40:11,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-25 23:40:11,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-25 23:40:11,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-25 23:40:11,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 23:40:11,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-25 23:40:11,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:40:11,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 23:40:11,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-25 23:40:11,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:40:11,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 23:40:11,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-25 23:40:11,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:40:11,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 23:40:11,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
1: [2022-11-25 23:40:11,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 23:40:11,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-25 23:40:11,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:40:11,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 23:40:11,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-25 23:40:11,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:40:11,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 23:40:11,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-25 23:40:11,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:40:11,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 23:40:11,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-25 23:40:11,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-25 23:40:11,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:40:11,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 23:40:11,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 23:40:11,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-25 23:40:11,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-25 23:40:11,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:40:11,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 23:40:11,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-25 23:40:11,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:40:11,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 23:40:11,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-25 23:40:11,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-25 23:40:11,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 23:40:11,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-25 23:40:11,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:40:11,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 23:40:11,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 1: [2022-11-25 23:40:11,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:40:11,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 23:40:11,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-25 23:40:11,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:40:11,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
5: [2022-11-25 23:40:11,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 10: [2022-11-25 23:40:11,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 5: [2022-11-25 23:40:11,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 10: [2022-11-25 23:40:11,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 5: [2022-11-25 23:40:11,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:40:11,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 23:40:11,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-25 23:40:11,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:40:11,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 23:40:11,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 11: [2022-11-25 23:40:11,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:40:11,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
7: [2022-11-25 23:40:11,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:40:11,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 23:40:11,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 7: [2022-11-25 23:40:11,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:40:11,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 23:40:11,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 9: [2022-11-25 23:40:11,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:40:11,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 23:40:11,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-25 23:40:11,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:40:11,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 23:40:11,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
6: [2022-11-25 23:40:11,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:40:11,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:40:11,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 23:40:11,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 6: [2022-11-25 23:40:11,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 23:40:11,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 9: [2022-11-25 23:40:11,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:40:11,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 8: [2022-11-25 23:40:11,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:40:11,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-25 23:40:11,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 23:40:11,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 23:40:11,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 9: [2022-11-25 23:40:11,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 11: [2022-11-25 23:40:11,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 11: [2022-11-25 23:40:11,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 8: [2022-11-25 23:40:11,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:40:11,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 8: [2022-11-25 23:40:11,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 11: [2022-11-25 23:40:11,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
8: [2022-11-25 23:40:11,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 11: [2022-11-25 23:40:11,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 8: [2022-11-25 23:40:11,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 11: [2022-11-25 23:40:11,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 8: [2022-11-25 23:40:11,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 11: [2022-11-25 23:40:11,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:40:11,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-25 23:40:11,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 9: [2022-11-25 23:40:11,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:40:11,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 23:40:11,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 9: [2022-11-25 23:40:11,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-25 23:40:11,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 23:40:11,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 9: [2022-11-25 23:40:11,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:40:11,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 23:40:11,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 9: [2022-11-25 23:40:11,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:40:11,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 23:40:11,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 15: [2022-11-25 23:40:11,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:40:11,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:40:11,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:40:11,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-25 23:40:11,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:40:11,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 23:40:11,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 23:40:11,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 23:40:11,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 23:40:11,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 23:40:11,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 15: [2022-11-25 23:40:11,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 15: [2022-11-25 23:40:11,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 15: [2022-11-25 23:40:11,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 15: [2022-11-25 23:40:11,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 15: [2022-11-25 23:40:11,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
14: [2022-11-25 23:40:11,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:40:11,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 23:40:11,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 14: [2022-11-25 23:40:11,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:40:11,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 23:40:11,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 14: [2022-11-25 23:40:11,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:40:11,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 23:40:11,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 14: [2022-11-25 23:40:11,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:40:11,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 23:40:11,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
14: [2022-11-25 23:40:11,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:40:11,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:40:11,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 23:40:11,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 23:40:11,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 14: [2022-11-25 23:40:11,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 14: [2022-11-25 23:40:11,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:40:11,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 23:40:11,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 10: [2022-11-25 23:40:11,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:40:11,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 23:40:11,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
2: [2022-11-25 23:40:11,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:40:11,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 23:40:11,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 15: [2022-11-25 23:40:11,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 23:40:11,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 15: [2022-11-25 23:40:11,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:40:11,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 23:40:11,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 15: [2022-11-25 23:40:11,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:40:11,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 23:40:11,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-25 23:40:11,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-25 23:40:11,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:40:11,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:40:11,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:40:11,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 23:40:11,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 23:40:11,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:40:11,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-25 23:40:11,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 23:40:11,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:40:11,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-25 23:40:11,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-25 23:40:11,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
3: [2022-11-25 23:40:11,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:40:11,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-25 23:40:11,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-25 23:40:11,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 23:40:11,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-25 23:40:11,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 23:40:11,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-25 23:40:11,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:40:11,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 3: [2022-11-25 23:40:11,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 23:40:11,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-25 23:40:11,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-25 23:40:11,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-25 23:40:11,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-25 23:40:11,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:40:11,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:40:11,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:40:11,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:40:11,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:40:11,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 23:40:11,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 23:40:11,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 23:40:11,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
4: [2022-11-25 23:40:11,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 23:40:11,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 23:40:11,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-25 23:40:11,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-25 23:40:11,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:40:11,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-25 23:40:11,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-25 23:40:11,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 23:40:11,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-25 23:40:11,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:40:11,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 23:40:11,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 4: [2022-11-25 23:40:11,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-25 23:40:11,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 23:40:11,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-25 23:40:11,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:40:11,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 23:40:11,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-25 23:40:11,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:40:11,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 23:40:11,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 12: [2022-11-25 23:40:11,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:40:11,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:40:11,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 23:40:11,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-25 23:40:11,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 23:40:11,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 12: [2022-11-25 23:40:11,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:40:11,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 12: [2022-11-25 23:40:11,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 23:40:11,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 23:40:11,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 12: [2022-11-25 23:40:11,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 12: [2022-11-25 23:40:11,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:40:11,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 23:40:11,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-25 23:40:11,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-25 23:40:11,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:40:11,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 23:40:11,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 23:40:11,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:40:11,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-25 23:40:11,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: [2022-11-25 23:40:11,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:40:11,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 23:40:11,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 12: [2022-11-25 23:40:11,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:40:11,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 23:40:11,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
12: [2022-11-25 23:40:11,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:40:11,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 23:40:11,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 12: [2022-11-25 23:40:11,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:40:11,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-25 23:40:11,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 8: [2022-11-25 23:40:11,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:40:11,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-25 23:40:11,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 13: [2022-11-25 23:40:11,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:40:11,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:40:11,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-25 23:40:11,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:40:11,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:40:11,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:40:11,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:40:11,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:40:11,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-25 23:40:11,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 23:40:11,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-25 23:40:11,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 23:40:11,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 23:40:11,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 23:40:11,884] 
[INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 23:40:11,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 13: [2022-11-25 23:40:11,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 13: [2022-11-25 23:40:11,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 13: [2022-11-25 23:40:11,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 13: [2022-11-25 23:40:11,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 13: [2022-11-25 23:40:11,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 13: [2022-11-25 23:40:11,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 13: [2022-11-25 23:40:11,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 23:40:11,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-25 23:40:11,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:40:11,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 23:40:11,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 
0: [2022-11-25 23:40:11,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 23:40:11,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 2: [2022-11-25 23:40:11,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:40:11,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step13000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 23:40:11,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step13000 is ready now! 0: successfully saved checkpoint at iteration 13000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3904.77 15: iteration 13010/ 125429 | consumed samples: 3330560 | consumed tokens: 6820986880 | elapsed time per iteration (s): 1.46 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.257113E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 175.466 | TFLOPs: 29.00 | 15: iteration 13020/ 125429 | consumed samples: 3333120 | consumed tokens: 6826229760 | elapsed time per iteration (s): 1.03 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.236032E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.746 | TFLOPs: 41.11 | 15: iteration 13030/ 125429 | consumed samples: 3335680 | consumed tokens: 6831472640 | elapsed time per iteration (s): 1.08 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.223705E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.002 | TFLOPs: 39.17 | 15: iteration 13040/ 125429 | consumed samples: 3338240 
| consumed tokens: 6836715520 | elapsed time per iteration (s): 1.05 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.243616E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.142 | TFLOPs: 40.18 | 15: iteration 13050/ 125429 | consumed samples: 3340800 | consumed tokens: 6841958400 | elapsed time per iteration (s): 1.06 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.206400E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.036 | TFLOPs: 40.00 | 15: iteration 13060/ 125429 | consumed samples: 3343360 | consumed tokens: 6847201280 | elapsed time per iteration (s): 1.05 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.232802E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.904 | TFLOPs: 40.31 | 15: iteration 13070/ 125429 | consumed samples: 3345920 | consumed tokens: 6852444160 | elapsed time per iteration (s): 1.05 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.248526E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.805 | TFLOPs: 40.13 | 15: iteration 13080/ 125429 | consumed samples: 3348480 | consumed tokens: 6857687040 | elapsed time per iteration (s): 1.05 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.264688E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.200 | TFLOPs: 40.19 | 15: iteration 13090/ 125429 | consumed samples: 3351040 | consumed tokens: 6862929920 | elapsed time per iteration (s): 1.05 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.232172E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 244.969 | TFLOPs: 40.48 | 15: iteration 13100/ 125429 | consumed samples: 3353600 | consumed tokens: 6868172800 | elapsed time per iteration (s): 1.04 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.265625E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.939 | TFLOPs: 40.81 | 15: iteration 13110/ 125429 | consumed samples: 3356160 | consumed tokens: 6873415680 | elapsed time per iteration (s): 1.09 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.235098E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.699 | TFLOPs: 38.95 | 15: iteration 13120/ 125429 | consumed samples: 3358720 | consumed tokens: 6878658560 | elapsed time per iteration (s): 1.08 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.242299E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.583 | TFLOPs: 39.26 | 15: iteration 13130/ 125429 | consumed samples: 3361280 | consumed tokens: 6883901440 | elapsed time per iteration (s): 1.10 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.220318E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.060 | TFLOPs: 38.35 | 15: iteration 13140/ 125429 | consumed samples: 3363840 | consumed tokens: 6889144320 | elapsed time per iteration (s): 1.04 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.198976E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.671 | TFLOPs: 40.60 | 15: iteration 13150/ 125429 | consumed samples: 3366400 | consumed tokens: 6894387200 | elapsed time per iteration (s): 1.02 | learning rate: 1.960E-04 | global batch size: 256 | lm loss: 2.246706E+00 | grad norm: 
0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.222 | TFLOPs: 41.35 | 15: iteration 13160/ 125429 | consumed samples: 3368960 | consumed tokens: 6899630080 | elapsed time per iteration (s): 1.06 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.248685E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.652 | TFLOPs: 39.77 | 15: iteration 13170/ 125429 | consumed samples: 3371520 | consumed tokens: 6904872960 | elapsed time per iteration (s): 1.05 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.239055E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.163 | TFLOPs: 40.18 | 15: iteration 13180/ 125429 | consumed samples: 3374080 | consumed tokens: 6910115840 | elapsed time per iteration (s): 1.06 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.208901E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.583 | TFLOPs: 39.92 | 15: iteration 13190/ 125429 | consumed samples: 3376640 | consumed tokens: 6915358720 | elapsed time per iteration (s): 1.06 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.247366E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.915 | TFLOPs: 39.98 | 15: iteration 13200/ 125429 | consumed samples: 3379200 | consumed tokens: 6920601600 | elapsed time per iteration (s): 1.03 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.235465E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.919 | TFLOPs: 41.14 | 15: iteration 13210/ 125429 | consumed samples: 3381760 | consumed tokens: 6925844480 | elapsed time per iteration (s): 
1.04 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.207088E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.823 | TFLOPs: 40.62 | 15: iteration 13220/ 125429 | consumed samples: 3384320 | consumed tokens: 6931087360 | elapsed time per iteration (s): 1.05 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.257309E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.880 | TFLOPs: 40.30 | 15: iteration 13230/ 125429 | consumed samples: 3386880 | consumed tokens: 6936330240 | elapsed time per iteration (s): 1.05 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.246106E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.909 | TFLOPs: 40.14 | 15: iteration 13240/ 125429 | consumed samples: 3389440 | consumed tokens: 6941573120 | elapsed time per iteration (s): 1.08 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.239177E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.101 | TFLOPs: 39.18 | 15: iteration 13250/ 125429 | consumed samples: 3392000 | consumed tokens: 6946816000 | elapsed time per iteration (s): 1.05 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.244486E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.253 | TFLOPs: 40.36 | 15: iteration 13260/ 125429 | consumed samples: 3394560 | consumed tokens: 6952058880 | elapsed time per iteration (s): 1.03 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.252930E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.279 | TFLOPs: 41.20 | 15: iteration 13270/ 
125429 | consumed samples: 3397120 | consumed tokens: 6957301760 | elapsed time per iteration (s): 1.06 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.254788E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.944 | TFLOPs: 39.82 | 15: iteration 13280/ 125429 | consumed samples: 3399680 | consumed tokens: 6962544640 | elapsed time per iteration (s): 1.06 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.209677E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.924 | TFLOPs: 39.98 | 15: iteration 13290/ 125429 | consumed samples: 3402240 | consumed tokens: 6967787520 | elapsed time per iteration (s): 1.05 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.254983E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.930 | TFLOPs: 40.31 | 15: iteration 13300/ 125429 | consumed samples: 3404800 | consumed tokens: 6973030400 | elapsed time per iteration (s): 1.07 | learning rate: 1.959E-04 | global batch size: 256 | lm loss: 2.245835E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.026 | TFLOPs: 39.50 | 15: iteration 13310/ 125429 | consumed samples: 3407360 | consumed tokens: 6978273280 | elapsed time per iteration (s): 1.05 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.246827E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.091 | TFLOPs: 40.17 | 15: iteration 13320/ 125429 | consumed samples: 3409920 | consumed tokens: 6983516160 | elapsed time per iteration (s): 1.02 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.231253E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 250.804 | TFLOPs: 41.45 | 15: iteration 13330/ 125429 | consumed samples: 3412480 | consumed tokens: 6988759040 | elapsed time per iteration (s): 1.03 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.230163E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.765 | TFLOPs: 40.95 | 15: iteration 13340/ 125429 | consumed samples: 3415040 | consumed tokens: 6994001920 | elapsed time per iteration (s): 1.05 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.231445E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.957 | TFLOPs: 40.48 | 15: iteration 13350/ 125429 | consumed samples: 3417600 | consumed tokens: 6999244800 | elapsed time per iteration (s): 1.04 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.228054E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.628 | TFLOPs: 40.76 | 15: iteration 13360/ 125429 | consumed samples: 3420160 | consumed tokens: 7004487680 | elapsed time per iteration (s): 1.04 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.238157E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.910 | TFLOPs: 40.64 | 15: iteration 13370/ 125429 | consumed samples: 3422720 | consumed tokens: 7009730560 | elapsed time per iteration (s): 1.04 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.235420E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.261 | TFLOPs: 40.53 | 15: iteration 13380/ 125429 | consumed samples: 3425280 | consumed tokens: 7014973440 | elapsed time per iteration (s): 1.03 | learning rate: 1.958E-04 | global batch size: 256 | 
lm loss: 2.231064E+00 | grad norm: 0.195 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.534 | TFLOPs: 40.91 | 15: iteration 13390/ 125429 | consumed samples: 3427840 | consumed tokens: 7020216320 | elapsed time per iteration (s): 1.07 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.238722E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.640 | TFLOPs: 39.60 | 15: iteration 13400/ 125429 | consumed samples: 3430400 | consumed tokens: 7025459200 | elapsed time per iteration (s): 1.02 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.241163E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.110 | TFLOPs: 41.33 | 15: iteration 13410/ 125429 | consumed samples: 3432960 | consumed tokens: 7030702080 | elapsed time per iteration (s): 1.03 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.263400E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.069 | TFLOPs: 41.00 | 15: iteration 13420/ 125429 | consumed samples: 3435520 | consumed tokens: 7035944960 | elapsed time per iteration (s): 1.07 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.272888E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.945 | TFLOPs: 39.49 | 15: iteration 13430/ 125429 | consumed samples: 3438080 | consumed tokens: 7041187840 | elapsed time per iteration (s): 1.04 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.220222E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.840 | TFLOPs: 40.63 | 15: iteration 13440/ 125429 | consumed samples: 3440640 | consumed tokens: 
7046430720 | elapsed time per iteration (s): 1.05 | learning rate: 1.958E-04 | global batch size: 256 | lm loss: 2.232598E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.182 | TFLOPs: 40.35 | 15: iteration 13450/ 125429 | consumed samples: 3443200 | consumed tokens: 7051673600 | elapsed time per iteration (s): 1.03 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.245575E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.475 | TFLOPs: 41.06 | 15: iteration 13460/ 125429 | consumed samples: 3445760 | consumed tokens: 7056916480 | elapsed time per iteration (s): 1.07 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.258508E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.762 | TFLOPs: 39.62 | 15: iteration 13470/ 125429 | consumed samples: 3448320 | consumed tokens: 7062159360 | elapsed time per iteration (s): 1.06 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.208587E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.614 | TFLOPs: 40.09 | 15: iteration 13480/ 125429 | consumed samples: 3450880 | consumed tokens: 7067402240 | elapsed time per iteration (s): 1.03 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.228280E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.820 | TFLOPs: 40.95 | 15: iteration 13490/ 125429 | consumed samples: 3453440 | consumed tokens: 7072645120 | elapsed time per iteration (s): 1.15 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.218640E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
222.405 | TFLOPs: 36.75 | 15: iteration 13500/ 125429 | consumed samples: 3456000 | consumed tokens: 7077888000 | elapsed time per iteration (s): 1.08 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.210788E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.231 | TFLOPs: 39.04 | 15: iteration 13510/ 125429 | consumed samples: 3458560 | consumed tokens: 7083130880 | elapsed time per iteration (s): 1.03 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.251063E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.636 | TFLOPs: 40.92 | 15: iteration 13520/ 125429 | consumed samples: 3461120 | consumed tokens: 7088373760 | elapsed time per iteration (s): 1.06 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.253633E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.063 | TFLOPs: 39.84 | 15: iteration 13530/ 125429 | consumed samples: 3463680 | consumed tokens: 7093616640 | elapsed time per iteration (s): 1.04 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.220864E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.310 | TFLOPs: 40.70 | 15: iteration 13540/ 125429 | consumed samples: 3466240 | consumed tokens: 7098859520 | elapsed time per iteration (s): 1.15 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.257317E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 223.435 | TFLOPs: 36.92 | 15: iteration 13550/ 125429 | consumed samples: 3468800 | consumed tokens: 7104102400 | elapsed time per iteration (s): 1.06 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.239096E+00 | grad norm: 0.147 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.397 | TFLOPs: 39.89 | 15: iteration 13560/ 125429 | consumed samples: 3471360 | consumed tokens: 7109345280 | elapsed time per iteration (s): 1.12 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.227966E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.927 | TFLOPs: 37.83 | 15: iteration 13570/ 125429 | consumed samples: 3473920 | consumed tokens: 7114588160 | elapsed time per iteration (s): 1.06 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.244551E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.354 | TFLOPs: 40.05 | 15: iteration 13580/ 125429 | consumed samples: 3476480 | consumed tokens: 7119831040 | elapsed time per iteration (s): 1.04 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.234099E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.599 | TFLOPs: 40.75 | 15: iteration 13590/ 125429 | consumed samples: 3479040 | consumed tokens: 7125073920 | elapsed time per iteration (s): 1.06 | learning rate: 1.957E-04 | global batch size: 256 | lm loss: 2.223315E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.284 | TFLOPs: 40.04 | 15: iteration 13600/ 125429 | consumed samples: 3481600 | consumed tokens: 7130316800 | elapsed time per iteration (s): 1.05 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.271600E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.199 | TFLOPs: 40.36 | 15: iteration 13610/ 125429 | consumed samples: 3484160 | consumed tokens: 7135559680 | elapsed time per iteration (s): 1.04 | 
learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.238415E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.131 | TFLOPs: 40.68 | 15: iteration 13620/ 125429 | consumed samples: 3486720 | consumed tokens: 7140802560 | elapsed time per iteration (s): 1.09 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.219884E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.323 | TFLOPs: 38.89 | 15: iteration 13630/ 125429 | consumed samples: 3489280 | consumed tokens: 7146045440 | elapsed time per iteration (s): 1.10 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.238784E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.689 | TFLOPs: 38.29 | 15: iteration 13640/ 125429 | consumed samples: 3491840 | consumed tokens: 7151288320 | elapsed time per iteration (s): 1.06 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.212621E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.312 | TFLOPs: 40.04 | 15: iteration 13650/ 125429 | consumed samples: 3494400 | consumed tokens: 7156531200 | elapsed time per iteration (s): 1.05 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.249149E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.008 | TFLOPs: 40.16 | 15: iteration 13660/ 125429 | consumed samples: 3496960 | consumed tokens: 7161774080 | elapsed time per iteration (s): 1.06 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.228657E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.437 | TFLOPs: 39.90 | 15: iteration 13670/ 125429 | 
consumed samples: 3499520 | consumed tokens: 7167016960 | elapsed time per iteration (s): 1.03 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.252980E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.047 | TFLOPs: 40.99 | 15: iteration 13680/ 125429 | consumed samples: 3502080 | consumed tokens: 7172259840 | elapsed time per iteration (s): 1.05 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.223497E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.535 | TFLOPs: 40.41 | 15: iteration 13690/ 125429 | consumed samples: 3504640 | consumed tokens: 7177502720 | elapsed time per iteration (s): 1.04 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.244667E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.378 | TFLOPs: 40.72 | 15: iteration 13700/ 125429 | consumed samples: 3507200 | consumed tokens: 7182745600 | elapsed time per iteration (s): 1.04 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.262203E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.719 | TFLOPs: 40.77 | 15: iteration 13710/ 125429 | consumed samples: 3509760 | consumed tokens: 7187988480 | elapsed time per iteration (s): 1.05 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.231411E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.910 | TFLOPs: 40.47 | 15: iteration 13720/ 125429 | consumed samples: 3512320 | consumed tokens: 7193231360 | elapsed time per iteration (s): 1.03 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.208660E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 248.566 | TFLOPs: 41.08 | 15: iteration 13730/ 125429 | consumed samples: 3514880 | consumed tokens: 7198474240 | elapsed time per iteration (s): 1.06 | learning rate: 1.956E-04 | global batch size: 256 | lm loss: 2.238947E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.528 | TFLOPs: 39.75 | 15: iteration 13740/ 125429 | consumed samples: 3517440 | consumed tokens: 7203717120 | elapsed time per iteration (s): 1.03 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.244556E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.633 | TFLOPs: 41.25 | 15: iteration 13750/ 125429 | consumed samples: 3520000 | consumed tokens: 7208960000 | elapsed time per iteration (s): 1.05 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.238638E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.447 | TFLOPs: 40.23 | 15: iteration 13760/ 125429 | consumed samples: 3522560 | consumed tokens: 7214202880 | elapsed time per iteration (s): 1.05 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.210449E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.692 | TFLOPs: 40.27 | 15: iteration 13770/ 125429 | consumed samples: 3525120 | consumed tokens: 7219445760 | elapsed time per iteration (s): 1.07 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.268889E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.079 | TFLOPs: 39.51 | 15: iteration 13780/ 125429 | consumed samples: 3527680 | consumed tokens: 7224688640 | elapsed time per iteration (s): 1.05 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 
2.212743E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.882 | TFLOPs: 40.14 | 15: iteration 13790/ 125429 | consumed samples: 3530240 | consumed tokens: 7229931520 | elapsed time per iteration (s): 1.03 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.242935E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.994 | TFLOPs: 41.15 | 15: iteration 13800/ 125429 | consumed samples: 3532800 | consumed tokens: 7235174400 | elapsed time per iteration (s): 1.05 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.245609E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.164 | TFLOPs: 40.35 | 15: iteration 13810/ 125429 | consumed samples: 3535360 | consumed tokens: 7240417280 | elapsed time per iteration (s): 1.05 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.214305E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.890 | TFLOPs: 40.47 | 15: iteration 13820/ 125429 | consumed samples: 3537920 | consumed tokens: 7245660160 | elapsed time per iteration (s): 1.04 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.224629E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.434 | TFLOPs: 40.56 | 15: iteration 13830/ 125429 | consumed samples: 3540480 | consumed tokens: 7250903040 | elapsed time per iteration (s): 1.04 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.247285E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.950 | TFLOPs: 40.65 | 15: iteration 13840/ 125429 | consumed samples: 3543040 | consumed tokens: 7256145920 | 
elapsed time per iteration (s): 1.04 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.217616E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.079 | TFLOPs: 40.83 | 15: iteration 13850/ 125429 | consumed samples: 3545600 | consumed tokens: 7261388800 | elapsed time per iteration (s): 1.19 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.241374E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.319 | TFLOPs: 35.58 | 15: iteration 13860/ 125429 | consumed samples: 3548160 | consumed tokens: 7266631680 | elapsed time per iteration (s): 1.03 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.273523E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.998 | TFLOPs: 40.98 | 15: iteration 13870/ 125429 | consumed samples: 3550720 | consumed tokens: 7271874560 | elapsed time per iteration (s): 1.28 | learning rate: 1.955E-04 | global batch size: 256 | lm loss: 2.239790E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 200.675 | TFLOPs: 33.16 | 15: iteration 13880/ 125429 | consumed samples: 3553280 | consumed tokens: 7277117440 | elapsed time per iteration (s): 1.05 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.200188E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.490 | TFLOPs: 40.24 | 15: iteration 13890/ 125429 | consumed samples: 3555840 | consumed tokens: 7282360320 | elapsed time per iteration (s): 1.05 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.225026E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.791 | TFLOPs: 
40.29 | 15: iteration 13900/ 125429 | consumed samples: 3558400 | consumed tokens: 7287603200 | elapsed time per iteration (s): 1.05 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.210518E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.355 | TFLOPs: 40.38 | 15: iteration 13910/ 125429 | consumed samples: 3560960 | consumed tokens: 7292846080 | elapsed time per iteration (s): 1.69 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.246169E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 151.164 | TFLOPs: 24.98 | 15: iteration 13920/ 125429 | consumed samples: 3563520 | consumed tokens: 7298088960 | elapsed time per iteration (s): 1.04 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.215654E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.693 | TFLOPs: 40.77 | 15: iteration 13930/ 125429 | consumed samples: 3566080 | consumed tokens: 7303331840 | elapsed time per iteration (s): 1.09 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.229605E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.344 | TFLOPs: 38.89 | 15: iteration 13940/ 125429 | consumed samples: 3568640 | consumed tokens: 7308574720 | elapsed time per iteration (s): 1.04 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.251677E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.320 | TFLOPs: 40.54 | 15: iteration 13950/ 125429 | consumed samples: 3571200 | consumed tokens: 7313817600 | elapsed time per iteration (s): 1.05 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.245841E+00 | grad norm: 0.147 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.770 | TFLOPs: 40.45 | 15: iteration 13960/ 125429 | consumed samples: 3573760 | consumed tokens: 7319060480 | elapsed time per iteration (s): 1.03 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.230562E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.202 | TFLOPs: 41.02 | 15: iteration 13970/ 125429 | consumed samples: 3576320 | consumed tokens: 7324303360 | elapsed time per iteration (s): 1.07 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.218156E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.134 | TFLOPs: 39.52 | 15: iteration 13980/ 125429 | consumed samples: 3578880 | consumed tokens: 7329546240 | elapsed time per iteration (s): 1.04 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.231543E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.645 | TFLOPs: 40.59 | 15: iteration 13990/ 125429 | consumed samples: 3581440 | consumed tokens: 7334789120 | elapsed time per iteration (s): 1.03 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.201170E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.605 | TFLOPs: 41.08 | 0: [2022-11-25 23:57:55,054] [INFO] [logging.py:68:log_dist] [Rank 0] step=14000, skipped=0, lr=[0.000195361184097867, 0.000195361184097867, 0.000195361184097867], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 14000/ 125429 | consumed samples: 3584000 | consumed tokens: 7340032000 | elapsed time per iteration (s): 1.09 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.207621E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 234.799 | TFLOPs: 38.80 |
0: steps: 14000 loss: 2.2260 iter time (s): 1.061 samples/sec: 241.356
15: -------------------------------------------------------------------------------------------
15: valid loss at iteration 14000 | lm loss value: 2.172874E+00 | lm loss PPL: 8.783493E+00 |
15: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 14000 to checkpoints_1b5
0: [2022-11-25 23:57:55,435] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step14000 is begin to save!
0: [2022-11-25 23:57:55,442] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_01-model_00-model_states.pt...
0: [2022-11-25 23:57:55,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_01-model_00-model_states.pt.
0: [2022-11-25 23:57:55,704] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_03-model_00-model_states.pt...
0: [2022-11-25 23:57:55,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_03-model_00-model_states.pt.
0: [2022-11-25 23:57:55,809] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_04-model_00-model_states.pt...
0: [2022-11-25 23:57:55,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_04-model_00-model_states.pt.
0: [2022-11-25 23:57:55,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_05-model_00-model_states.pt...
0: [2022-11-25 23:57:56,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_05-model_00-model_states.pt.
0: [2022-11-25 23:57:56,017] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_06-model_00-model_states.pt...
0: [2022-11-25 23:57:56,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_06-model_00-model_states.pt.
0: [2022-11-25 23:57:56,120] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_07-model_00-model_states.pt...
0: [2022-11-25 23:57:56,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_07-model_00-model_states.pt.
0: [2022-11-25 23:57:56,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_08-model_00-model_states.pt...
0: [2022-11-25 23:57:56,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_08-model_00-model_states.pt.
0: [2022-11-25 23:57:56,331] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_09-model_00-model_states.pt...
0: [2022-11-25 23:57:56,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_09-model_00-model_states.pt.
0: [2022-11-25 23:57:56,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_10-model_00-model_states.pt...
0: [2022-11-25 23:57:56,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_10-model_00-model_states.pt.
0: [2022-11-25 23:57:56,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_11-model_00-model_states.pt...
0: [2022-11-25 23:57:56,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_11-model_00-model_states.pt.
0: [2022-11-25 23:57:56,653] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_12-model_00-model_states.pt...
0: [2022-11-25 23:57:56,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_12-model_00-model_states.pt.
0: [2022-11-25 23:57:56,759] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_13-model_00-model_states.pt...
0: [2022-11-25 23:57:56,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_13-model_00-model_states.pt.
0: [2022-11-25 23:57:56,864] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_14-model_00-model_states.pt...
0: [2022-11-25 23:57:56,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_14-model_00-model_states.pt.
0: [2022-11-25 23:57:56,971] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_15-model_00-model_states.pt...
0: [2022-11-25 23:57:57,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_15-model_00-model_states.pt.
0: [2022-11-25 23:57:57,080] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_16-model_00-model_states.pt...
0: [2022-11-25 23:57:57,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_16-model_00-model_states.pt.
0: [2022-11-25 23:57:57,192] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_17-model_00-model_states.pt...
0: [2022-11-25 23:57:57,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_17-model_00-model_states.pt.
0: [2022-11-25 23:57:57,302] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_18-model_00-model_states.pt... 0: [2022-11-25 23:57:57,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_18-model_00-model_states.pt. 0: [2022-11-25 23:57:57,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_19-model_00-model_states.pt... 0: [2022-11-25 23:57:57,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_19-model_00-model_states.pt. 0: [2022-11-25 23:57:57,526] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_20-model_00-model_states.pt... 0: [2022-11-25 23:57:57,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_20-model_00-model_states.pt. 0: [2022-11-25 23:57:57,636] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_21-model_00-model_states.pt... 0: [2022-11-25 23:57:57,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_21-model_00-model_states.pt. 0: [2022-11-25 23:57:57,746] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_22-model_00-model_states.pt... 0: [2022-11-25 23:57:57,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_22-model_00-model_states.pt. 0: [2022-11-25 23:57:57,856] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_23-model_00-model_states.pt... 0: [2022-11-25 23:57:57,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_23-model_00-model_states.pt. 
0: [2022-11-25 23:57:57,968] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_24-model_00-model_states.pt... 0: [2022-11-25 23:57:58,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_24-model_00-model_states.pt. 0: [2022-11-25 23:57:58,078] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_25-model_00-model_states.pt... 0: [2022-11-25 23:57:58,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_25-model_00-model_states.pt. 0: [2022-11-25 23:57:58,188] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_26-model_00-model_states.pt... 0: [2022-11-25 23:57:58,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_26-model_00-model_states.pt. 0: [2022-11-25 23:57:58,299] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_27-model_00-model_states.pt... 0: [2022-11-25 23:57:58,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_27-model_00-model_states.pt. 0: [2022-11-25 23:57:58,408] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_28-model_00-model_states.pt... 0: [2022-11-25 23:57:58,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_28-model_00-model_states.pt. 0: [2022-11-25 23:57:58,517] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_29-model_00-model_states.pt... 0: [2022-11-25 23:57:58,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_29-model_00-model_states.pt. 
0: [2022-11-25 23:57:58,626] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_30-model_00-model_states.pt... 0: [2022-11-25 23:57:58,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_30-model_00-model_states.pt. 0: [2022-11-25 23:57:58,739] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/layer_32-model_00-model_states.pt... 0: [2022-11-25 23:57:58,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/layer_32-model_00-model_states.pt. 0: [2022-11-25 23:57:58,741] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step14000/mp_rank_00_model_states.pt 0: [2022-11-25 23:57:58,741] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/mp_rank_00_model_states.pt... 0: [2022-11-25 23:57:58,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/mp_rank_00_model_states.pt. 0: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
10: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
13: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
8: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
10: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
9: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
4: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 9: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
14: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
9: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:57:58,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step14000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-25 23:57:58,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:57:58,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-25 23:57:58,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 12: [2022-11-25 23:57:58,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:57:58,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-25 23:57:58,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-25 23:57:58,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:57:58,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-25 23:57:58,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-25 23:57:58,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-25 23:57:58,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-25 23:57:58,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 15: [2022-11-25 23:57:58,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:57:58,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:57:58,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-25 23:57:58,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-25 23:57:58,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:57:58,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-25 23:57:58,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-25 23:57:58,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:57:58,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-25 23:57:58,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
2: [2022-11-25 23:57:58,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:57:58,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-25 23:57:58,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 8: [2022-11-25 23:57:58,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:57:58,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-25 23:57:58,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 8: [2022-11-25 23:57:58,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:57:58,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-25 23:57:58,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-25 23:57:58,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:57:58,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-25 23:57:58,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-25 23:57:58,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 13: [2022-11-25 23:57:58,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:57:58,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-25 23:57:58,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 13: [2022-11-25 23:57:58,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:57:58,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-25 23:57:58,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 10: [2022-11-25 23:57:58,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:57:58,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-25 23:57:58,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 10: [2022-11-25 23:57:58,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-25 23:57:58,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 5: [2022-11-25 23:57:58,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:57:58,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 12: [2022-11-25 23:57:58,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:57:58,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-25 23:57:58,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 12: [2022-11-25 23:57:58,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-25 23:57:58,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 12: [2022-11-25 23:57:58,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-25 23:57:58,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-25 23:57:58,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 12: [2022-11-25 23:57:58,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-25 23:57:58,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-25 23:57:58,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-25 23:57:58,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:57:58,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:57:58,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-25 23:57:58,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-25 23:57:58,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-25 23:57:58,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 10: [2022-11-25 23:57:58,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:57:58,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-25 23:57:58,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-25 23:57:58,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-25 23:57:58,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-25 23:57:58,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 10: [2022-11-25 23:57:58,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:57:58,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-25 23:57:58,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-25 23:57:58,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:57:58,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:57:58,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-25 23:57:58,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-25 23:57:58,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-25 23:57:58,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-25 23:57:58,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-25 23:57:58,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:57:58,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:57:58,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-25 23:57:58,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-25 23:57:58,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:57:58,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-25 23:57:58,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-25 23:57:58,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:57:58,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-25 23:57:58,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-25 23:57:58,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-25 23:57:58,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-25 23:57:58,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:57:58,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:57:58,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-25 23:57:58,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-25 23:57:58,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-25 23:57:58,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-25 23:57:58,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-25 23:57:58,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:57:58,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-25 23:57:58,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-25 23:57:58,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-25 23:57:58,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-25 23:57:58,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-25 23:57:58,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:57:58,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-25 23:57:58,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-25 23:57:58,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:57:58,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-25 23:57:58,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 12: [2022-11-25 23:57:58,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-25 23:57:58,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-25 23:57:58,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-25 23:57:58,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:57:58,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:57:58,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-25 23:57:58,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-25 23:57:58,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-25 23:57:58,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 15: [2022-11-25 23:57:58,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-25 23:57:58,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 15: [2022-11-25 23:57:58,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-25 23:57:58,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-25 23:57:58,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 15: [2022-11-25 23:57:58,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:57:58,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-25 23:57:58,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 15: [2022-11-25 23:57:58,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:57:58,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-25 23:57:58,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 15: [2022-11-25 23:57:58,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:57:58,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-25 23:57:58,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-25 23:57:58,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
12: [2022-11-25 23:57:58,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:57:58,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-25 23:57:58,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-25 23:57:58,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-25 23:57:58,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 12: [2022-11-25 23:57:58,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 1: [2022-11-25 23:57:58,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-25 23:57:58,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 12: [2022-11-25 23:57:58,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 1: [2022-11-25 23:57:58,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:57:58,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-25 23:57:58,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-25 23:57:58,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-25 23:57:58,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:57:58,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-25 23:57:58,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-25 23:57:58,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-25 23:57:58,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-25 23:57:58,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-25 23:57:58,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:57:58,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-25 23:57:58,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 11: [2022-11-25 23:57:58,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
8: [2022-11-25 23:57:58,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:57:58,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:57:58,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-25 23:57:58,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 13: [2022-11-25 23:57:58,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:57:58,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 13: [2022-11-25 23:57:58,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 11: [2022-11-25 23:57:58,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 8: [2022-11-25 23:57:58,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 13: [2022-11-25 23:57:58,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 8: [2022-11-25 23:57:58,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 13: [2022-11-25 23:57:58,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
8: [2022-11-25 23:57:58,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:57:58,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 8: [2022-11-25 23:57:58,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 13: [2022-11-25 23:57:58,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 8: [2022-11-25 23:57:58,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 11: [2022-11-25 23:57:58,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:57:58,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 3: [2022-11-25 23:57:58,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:57:58,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-25 23:57:58,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-25 23:57:58,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 11: [2022-11-25 23:57:58,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
3: [2022-11-25 23:57:58,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:57:58,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:57:58,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 2: [2022-11-25 23:57:58,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:57:58,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 3: [2022-11-25 23:57:58,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 11: [2022-11-25 23:57:58,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 8: [2022-11-25 23:57:58,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-25 23:57:58,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-25 23:57:58,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 11: [2022-11-25 23:57:58,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:57:58,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
3: [2022-11-25 23:57:58,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:57:58,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 3: [2022-11-25 23:57:58,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 11: [2022-11-25 23:57:58,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-25 23:57:58,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 5: [2022-11-25 23:57:58,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-25 23:57:58,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-25 23:57:58,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 10: [2022-11-25 23:57:58,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:57:58,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-25 23:57:58,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-25 23:57:58,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-25 23:57:58,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 10: [2022-11-25 23:57:58,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 10: [2022-11-25 23:57:58,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:57:58,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-25 23:57:58,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 11: [2022-11-25 23:57:58,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:57:58,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-25 23:57:58,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 1: [2022-11-25 23:57:58,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-25 23:57:58,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
10: [2022-11-25 23:57:58,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-25 23:57:58,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-25 23:57:58,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 13: [2022-11-25 23:57:58,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:57:58,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-25 23:57:58,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 13: [2022-11-25 23:57:58,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-25 23:57:58,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-25 23:57:58,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 11: [2022-11-25 23:57:58,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-25 23:57:58,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-25 23:57:58,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-25 23:57:58,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-25 23:57:58,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-25 23:57:58,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 11: [2022-11-25 23:57:58,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 11: [2022-11-25 23:57:58,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 11: [2022-11-25 23:57:58,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-25 23:57:59,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:57:59,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-25 23:57:59,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 13: [2022-11-25 23:57:59,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-25 23:57:59,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-25 23:57:59,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-25 23:57:58,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:57:58,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-25 23:57:58,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-25 23:57:58,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:57:58,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:57:58,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:57:58,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-25 23:57:58,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-25 23:57:58,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-25 23:57:58,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
4: [2022-11-25 23:57:58,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-25 23:57:58,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 4: [2022-11-25 23:57:58,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-25 23:57:58,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-25 23:57:58,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-25 23:57:59,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:57:59,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:57:59,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:57:59,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-25 23:57:59,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
7: [2022-11-25 23:57:59,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-25 23:57:59,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-25 23:57:59,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-25 23:57:59,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 15: [2022-11-25 23:57:59,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:57:59,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-25 23:57:59,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-25 23:57:59,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:57:59,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-25 23:57:59,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-25 23:57:59,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-25 23:57:59,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-25 23:57:59,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 8: [2022-11-25 23:57:59,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:57:59,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:57:59,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-25 23:57:59,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-25 23:57:59,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-25 23:57:59,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-25 23:57:59,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 8: [2022-11-25 23:57:59,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 8: [2022-11-25 23:57:59,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-25 23:57:59,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-25 23:57:59,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-25 23:57:59,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 2: [2022-11-25 23:57:59,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-25 23:57:59,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-25 23:57:59,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-25 23:57:59,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:57:59,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-25 23:57:59,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-25 23:57:59,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:57:59,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-25 23:57:59,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-25 23:57:59,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-25 23:57:59,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-25 23:57:59,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 7: [2022-11-25 23:57:59,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-25 23:57:59,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-25 23:57:59,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 15: [2022-11-25 23:57:59,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:57:59,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-25 23:57:59,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-25 23:57:59,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-25 23:57:59,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 15: [2022-11-25 23:57:59,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 14: [2022-11-25 23:57:58,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-25 23:57:58,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-25 23:57:58,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 14: [2022-11-25 23:57:58,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:57:58,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-25 23:57:58,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 14: [2022-11-25 23:57:58,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:57:58,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-25 23:57:58,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 14: [2022-11-25 23:57:58,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:57:58,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-25 23:57:58,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 14: [2022-11-25 23:57:58,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-25 23:57:58,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-25 23:57:58,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 14: [2022-11-25 23:57:58,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:57:58,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-25 23:57:58,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 14: [2022-11-25 23:57:58,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:57:58,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-25 23:57:58,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 14: [2022-11-25 23:57:58,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-25 23:57:58,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-25 23:57:58,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-25 23:57:59,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-25 23:57:59,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:57:59,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:57:59,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:57:59,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:57:59,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:57:59,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-25 23:57:59,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-25 23:57:59,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-25 23:57:59,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-25 23:57:59,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-25 23:57:59,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-25 23:57:59,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-25 23:57:59,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-25 23:57:59,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-25 23:57:59,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-25 23:57:59,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-25 23:57:59,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-25 23:57:59,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-25 23:57:59,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
6: [2022-11-25 23:57:59,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-25 23:57:59,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-25 23:57:59,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 6: [2022-11-25 23:57:59,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 3: [2022-11-25 23:57:59,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-25 23:57:59,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-25 23:57:59,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 0: [2022-11-25 23:57:59,200] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-25 23:57:59,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 9: [2022-11-25 23:57:59,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:57:59,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:57:59,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:57:59,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-25 23:57:59,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:57:59,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:57:59,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:57:59,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-25 23:57:59,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-25 23:57:59,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-25 23:57:59,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-25 23:57:59,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-25 23:57:59,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-25 23:57:59,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
9: [2022-11-25 23:57:59,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-25 23:57:59,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-25 23:57:59,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step14000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-25 23:57:59,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 9: [2022-11-25 23:57:59,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 9: [2022-11-25 23:57:59,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 9: [2022-11-25 23:57:59,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 9: [2022-11-25 23:57:59,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 9: [2022-11-25 23:57:59,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 9: [2022-11-25 23:57:59,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step14000 is ready now! 
0: successfully saved checkpoint at iteration 14000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3807.95 15: iteration 14010/ 125429 | consumed samples: 3586560 | consumed tokens: 7345274880 | elapsed time per iteration (s): 1.43 | learning rate: 1.954E-04 | global batch size: 256 | lm loss: 2.262769E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 178.400 | TFLOPs: 29.48 | 15: iteration 14020/ 125429 | consumed samples: 3589120 | consumed tokens: 7350517760 | elapsed time per iteration (s): 1.09 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.240754E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.105 | TFLOPs: 38.85 | 15: iteration 14030/ 125429 | consumed samples: 3591680 | consumed tokens: 7355760640 | elapsed time per iteration (s): 1.07 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.246772E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.883 | TFLOPs: 39.64 | 15: iteration 14040/ 125429 | consumed samples: 3594240 | consumed tokens: 7361003520 | elapsed time per iteration (s): 1.08 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.220019E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.229 | TFLOPs: 39.04 | 15: iteration 14050/ 125429 | consumed samples: 3596800 | consumed tokens: 7366246400 | elapsed time per iteration (s): 1.04 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.209667E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.041 | TFLOPs: 40.83 | 15: iteration 14060/ 125429 | consumed samples: 3599360 | consumed tokens: 7371489280 | elapsed time per iteration (s): 1.04 | learning rate: 
1.953E-04 | global batch size: 256 | lm loss: 2.210089E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.538 | TFLOPs: 40.58 | 15: iteration 14070/ 125429 | consumed samples: 3601920 | consumed tokens: 7376732160 | elapsed time per iteration (s): 1.05 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.193174E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.020 | TFLOPs: 40.16 | 15: iteration 14080/ 125429 | consumed samples: 3604480 | consumed tokens: 7381975040 | elapsed time per iteration (s): 3.09 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.241132E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 82.764 | TFLOPs: 13.68 | 15: iteration 14090/ 125429 | consumed samples: 3607040 | consumed tokens: 7387217920 | elapsed time per iteration (s): 1.03 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.201940E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.542 | TFLOPs: 41.07 | 15: iteration 14100/ 125429 | consumed samples: 3609600 | consumed tokens: 7392460800 | elapsed time per iteration (s): 1.04 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.232309E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.650 | TFLOPs: 40.60 | 15: iteration 14110/ 125429 | consumed samples: 3612160 | consumed tokens: 7397703680 | elapsed time per iteration (s): 1.13 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.234380E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.357 | TFLOPs: 37.41 | 15: iteration 14120/ 125429 | consumed samples: 
3614720 | consumed tokens: 7402946560 | elapsed time per iteration (s): 1.10 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.233041E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.407 | TFLOPs: 38.41 | 15: iteration 14130/ 125429 | consumed samples: 3617280 | consumed tokens: 7408189440 | elapsed time per iteration (s): 1.07 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.252081E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.588 | TFLOPs: 39.43 | 15: iteration 14140/ 125429 | consumed samples: 3619840 | consumed tokens: 7413432320 | elapsed time per iteration (s): 1.04 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.206664E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.217 | TFLOPs: 40.52 | 15: iteration 14150/ 125429 | consumed samples: 3622400 | consumed tokens: 7418675200 | elapsed time per iteration (s): 1.03 | learning rate: 1.953E-04 | global batch size: 256 | lm loss: 2.245907E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.086 | TFLOPs: 41.16 | 15: iteration 14160/ 125429 | consumed samples: 3624960 | consumed tokens: 7423918080 | elapsed time per iteration (s): 1.03 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.257089E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.555 | TFLOPs: 40.91 | 15: iteration 14170/ 125429 | consumed samples: 3627520 | consumed tokens: 7429160960 | elapsed time per iteration (s): 1.07 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.219308E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 238.239 | TFLOPs: 39.37 | 15: iteration 14180/ 125429 | consumed samples: 3630080 | consumed tokens: 7434403840 | elapsed time per iteration (s): 1.05 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.262248E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.599 | TFLOPs: 40.26 | 15: iteration 14190/ 125429 | consumed samples: 3632640 | consumed tokens: 7439646720 | elapsed time per iteration (s): 1.05 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.225623E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.023 | TFLOPs: 40.33 | 15: iteration 14200/ 125429 | consumed samples: 3635200 | consumed tokens: 7444889600 | elapsed time per iteration (s): 1.03 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.228814E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.684 | TFLOPs: 41.26 | 15: iteration 14210/ 125429 | consumed samples: 3637760 | consumed tokens: 7450132480 | elapsed time per iteration (s): 1.09 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.206259E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.891 | TFLOPs: 38.98 | 15: iteration 14220/ 125429 | consumed samples: 3640320 | consumed tokens: 7455375360 | elapsed time per iteration (s): 1.04 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.239455E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.325 | TFLOPs: 40.54 | 15: iteration 14230/ 125429 | consumed samples: 3642880 | consumed tokens: 7460618240 | elapsed time per iteration (s): 1.03 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.220916E+00 | grad 
norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.830 | TFLOPs: 40.96 | 15: iteration 14240/ 125429 | consumed samples: 3645440 | consumed tokens: 7465861120 | elapsed time per iteration (s): 1.07 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.239122E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.767 | TFLOPs: 39.62 | 15: iteration 14250/ 125429 | consumed samples: 3648000 | consumed tokens: 7471104000 | elapsed time per iteration (s): 1.07 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.235425E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.982 | TFLOPs: 39.66 | 15: iteration 14260/ 125429 | consumed samples: 3650560 | consumed tokens: 7476346880 | elapsed time per iteration (s): 1.10 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.248870E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.317 | TFLOPs: 38.39 | 15: iteration 14270/ 125429 | consumed samples: 3653120 | consumed tokens: 7481589760 | elapsed time per iteration (s): 1.06 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.207138E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.953 | TFLOPs: 39.82 | 15: iteration 14280/ 125429 | consumed samples: 3655680 | consumed tokens: 7486832640 | elapsed time per iteration (s): 1.03 | learning rate: 1.952E-04 | global batch size: 256 | lm loss: 2.220540E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.443 | TFLOPs: 40.89 | 15: iteration 14290/ 125429 | consumed samples: 3658240 | consumed tokens: 7492075520 | elapsed time per 
iteration (s): 1.05 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.236194E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.207 | TFLOPs: 40.36 | 15: iteration 14300/ 125429 | consumed samples: 3660800 | consumed tokens: 7497318400 | elapsed time per iteration (s): 1.05 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.183952E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.024 | TFLOPs: 40.16 | 15: iteration 14310/ 125429 | consumed samples: 3663360 | consumed tokens: 7502561280 | elapsed time per iteration (s): 1.05 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.187029E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.662 | TFLOPs: 40.27 | 15: iteration 14320/ 125429 | consumed samples: 3665920 | consumed tokens: 7507804160 | elapsed time per iteration (s): 1.03 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.191736E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.318 | TFLOPs: 41.20 | 15: iteration 14330/ 125429 | consumed samples: 3668480 | consumed tokens: 7513047040 | elapsed time per iteration (s): 1.10 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.204671E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.783 | TFLOPs: 38.47 | 15: iteration 14340/ 125429 | consumed samples: 3671040 | consumed tokens: 7518289920 | elapsed time per iteration (s): 1.07 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.224324E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.128 | TFLOPs: 39.68 | 15: 
iteration 14350/ 125429 | consumed samples: 3673600 | consumed tokens: 7523532800 | elapsed time per iteration (s): 1.05 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.247077E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.683 | TFLOPs: 40.44 | 15: iteration 14360/ 125429 | consumed samples: 3676160 | consumed tokens: 7528775680 | elapsed time per iteration (s): 1.04 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.251894E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.498 | TFLOPs: 40.57 | 15: iteration 14370/ 125429 | consumed samples: 3678720 | consumed tokens: 7534018560 | elapsed time per iteration (s): 1.08 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.249701E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.666 | TFLOPs: 39.11 | 15: iteration 14380/ 125429 | consumed samples: 3681280 | consumed tokens: 7539261440 | elapsed time per iteration (s): 1.10 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.208235E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.064 | TFLOPs: 38.52 | 15: iteration 14390/ 125429 | consumed samples: 3683840 | consumed tokens: 7544504320 | elapsed time per iteration (s): 1.05 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.233253E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.802 | TFLOPs: 40.46 | 15: iteration 14400/ 125429 | consumed samples: 3686400 | consumed tokens: 7549747200 | elapsed time per iteration (s): 1.06 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.188837E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 241.153 | TFLOPs: 39.85 | 15: iteration 14410/ 125429 | consumed samples: 3688960 | consumed tokens: 7554990080 | elapsed time per iteration (s): 1.03 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.186568E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.424 | TFLOPs: 40.89 | 15: iteration 14420/ 125429 | consumed samples: 3691520 | consumed tokens: 7560232960 | elapsed time per iteration (s): 1.04 | learning rate: 1.951E-04 | global batch size: 256 | lm loss: 2.216086E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.320 | TFLOPs: 40.87 | 15: iteration 14430/ 125429 | consumed samples: 3694080 | consumed tokens: 7565475840 | elapsed time per iteration (s): 1.05 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.226563E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.618 | TFLOPs: 40.26 | 15: iteration 14440/ 125429 | consumed samples: 3696640 | consumed tokens: 7570718720 | elapsed time per iteration (s): 1.07 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.189654E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.302 | TFLOPs: 39.55 | 15: iteration 14450/ 125429 | consumed samples: 3699200 | consumed tokens: 7575961600 | elapsed time per iteration (s): 1.04 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.208486E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.786 | TFLOPs: 40.78 | 15: iteration 14460/ 125429 | consumed samples: 3701760 | consumed tokens: 7581204480 | elapsed time per iteration (s): 1.02 | learning rate: 1.950E-04 | global 
batch size: 256 | lm loss: 2.234191E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.114 | TFLOPs: 41.33 | 15: iteration 14470/ 125429 | consumed samples: 3704320 | consumed tokens: 7586447360 | elapsed time per iteration (s): 1.05 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.238971E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.352 | TFLOPs: 40.22 | 15: iteration 14480/ 125429 | consumed samples: 3706880 | consumed tokens: 7591690240 | elapsed time per iteration (s): 1.03 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.201755E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.823 | TFLOPs: 40.95 | 15: iteration 14490/ 125429 | consumed samples: 3709440 | consumed tokens: 7596933120 | elapsed time per iteration (s): 1.03 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.200484E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.979 | TFLOPs: 41.15 | 15: iteration 14500/ 125429 | consumed samples: 3712000 | consumed tokens: 7602176000 | elapsed time per iteration (s): 1.04 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.219771E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.474 | TFLOPs: 40.73 | 15: iteration 14510/ 125429 | consumed samples: 3714560 | consumed tokens: 7607418880 | elapsed time per iteration (s): 1.07 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.233757E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.227 | TFLOPs: 39.53 | 15: iteration 14520/ 125429 | consumed samples: 3717120 | consumed 
tokens: 7612661760 | elapsed time per iteration (s): 1.04 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.271096E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.558 | TFLOPs: 40.75 | 15: iteration 14530/ 125429 | consumed samples: 3719680 | consumed tokens: 7617904640 | elapsed time per iteration (s): 1.08 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.215101E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.977 | TFLOPs: 39.33 | 15: iteration 14540/ 125429 | consumed samples: 3722240 | consumed tokens: 7623147520 | elapsed time per iteration (s): 1.04 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.227193E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.103 | TFLOPs: 40.67 | 15: iteration 14550/ 125429 | consumed samples: 3724800 | consumed tokens: 7628390400 | elapsed time per iteration (s): 1.06 | learning rate: 1.950E-04 | global batch size: 256 | lm loss: 2.194304E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.428 | TFLOPs: 39.90 | 15: iteration 14560/ 125429 | consumed samples: 3727360 | consumed tokens: 7633633280 | elapsed time per iteration (s): 1.05 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.210986E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.528 | TFLOPs: 40.41 | 15: iteration 14570/ 125429 | consumed samples: 3729920 | consumed tokens: 7638876160 | elapsed time per iteration (s): 1.03 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.215465E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 249.750 | TFLOPs: 41.27 | 15: iteration 14580/ 125429 | consumed samples: 3732480 | consumed tokens: 7644119040 | elapsed time per iteration (s): 1.03 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.194060E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.739 | TFLOPs: 40.94 | 15: iteration 14590/ 125429 | consumed samples: 3735040 | consumed tokens: 7649361920 | elapsed time per iteration (s): 1.05 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.231595E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.847 | TFLOPs: 40.46 | 15: iteration 14600/ 125429 | consumed samples: 3737600 | consumed tokens: 7654604800 | elapsed time per iteration (s): 1.02 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.242327E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.384 | TFLOPs: 41.38 | 15: iteration 14610/ 125429 | consumed samples: 3740160 | consumed tokens: 7659847680 | elapsed time per iteration (s): 1.04 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.223503E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.923 | TFLOPs: 40.81 | 15: iteration 14620/ 125429 | consumed samples: 3742720 | consumed tokens: 7665090560 | elapsed time per iteration (s): 1.06 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.234967E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.693 | TFLOPs: 39.78 | 15: iteration 14630/ 125429 | consumed samples: 3745280 | consumed tokens: 7670333440 | elapsed time per iteration (s): 1.05 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.194744E+00 | grad norm: 0.156 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.470 | TFLOPs: 40.40 | 15: iteration 14640/ 125429 | consumed samples: 3747840 | consumed tokens: 7675576320 | elapsed time per iteration (s): 1.07 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.205150E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.076 | TFLOPs: 39.51 | 15: iteration 14650/ 125429 | consumed samples: 3750400 | consumed tokens: 7680819200 | elapsed time per iteration (s): 1.03 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.223176E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.801 | TFLOPs: 41.12 | 15: iteration 14660/ 125429 | consumed samples: 3752960 | consumed tokens: 7686062080 | elapsed time per iteration (s): 1.03 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.198543E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.062 | TFLOPs: 41.16 | 15: iteration 14670/ 125429 | consumed samples: 3755520 | consumed tokens: 7691304960 | elapsed time per iteration (s): 1.07 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.209780E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.348 | TFLOPs: 39.72 | 15: iteration 14680/ 125429 | consumed samples: 3758080 | consumed tokens: 7696547840 | elapsed time per iteration (s): 1.04 | learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.235531E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.560 | TFLOPs: 40.58 | 15: iteration 14690/ 125429 | consumed samples: 3760640 | consumed tokens: 7701790720 | elapsed time per iteration (s): 1.04 
| learning rate: 1.949E-04 | global batch size: 256 | lm loss: 2.210587E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.704 | TFLOPs: 40.60 | 15: iteration 14700/ 125429 | consumed samples: 3763200 | consumed tokens: 7707033600 | elapsed time per iteration (s): 1.07 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.240441E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.209 | TFLOPs: 39.37 | 15: iteration 14710/ 125429 | consumed samples: 3765760 | consumed tokens: 7712276480 | elapsed time per iteration (s): 1.06 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.211358E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.083 | TFLOPs: 39.84 | 15: iteration 14720/ 125429 | consumed samples: 3768320 | consumed tokens: 7717519360 | elapsed time per iteration (s): 1.08 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.226568E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.543 | TFLOPs: 39.26 | 15: iteration 14730/ 125429 | consumed samples: 3770880 | consumed tokens: 7722762240 | elapsed time per iteration (s): 1.04 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.339697E+00 | grad norm: 3.391 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.588 | TFLOPs: 40.75 | 15: iteration 14740/ 125429 | consumed samples: 3773440 | consumed tokens: 7728005120 | elapsed time per iteration (s): 1.09 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 3.249934E+00 | grad norm: 11.825 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.804 | TFLOPs: 38.80 | 15: iteration 14750/ 125429 | 
consumed samples: 3776000 | consumed tokens: 7733248000 | elapsed time per iteration (s): 1.09 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.790395E+00 | grad norm: 0.983 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.302 | TFLOPs: 38.89 | 15: iteration 14760/ 125429 | consumed samples: 3778560 | consumed tokens: 7738490880 | elapsed time per iteration (s): 1.10 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.451608E+00 | grad norm: 1.554 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.497 | TFLOPs: 38.59 | 15: iteration 14770/ 125429 | consumed samples: 3781120 | consumed tokens: 7743733760 | elapsed time per iteration (s): 1.04 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.364218E+00 | grad norm: 0.234 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.805 | TFLOPs: 40.62 | 15: iteration 14780/ 125429 | consumed samples: 3783680 | consumed tokens: 7748976640 | elapsed time per iteration (s): 1.02 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.329712E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.022 | TFLOPs: 41.48 | 15: iteration 14790/ 125429 | consumed samples: 3786240 | consumed tokens: 7754219520 | elapsed time per iteration (s): 1.03 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.295395E+00 | grad norm: 0.192 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.357 | TFLOPs: 40.88 | 15: iteration 14800/ 125429 | consumed samples: 3788800 | consumed tokens: 7759462400 | elapsed time per iteration (s): 1.06 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.250857E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 241.749 | TFLOPs: 39.95 | 15: iteration 14810/ 125429 | consumed samples: 3791360 | consumed tokens: 7764705280 | elapsed time per iteration (s): 1.06 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.280187E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.858 | TFLOPs: 39.80 | 15: iteration 14820/ 125429 | consumed samples: 3793920 | consumed tokens: 7769948160 | elapsed time per iteration (s): 1.04 | learning rate: 1.948E-04 | global batch size: 256 | lm loss: 2.224492E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.133 | TFLOPs: 40.68 | 15: iteration 14830/ 125429 | consumed samples: 3796480 | consumed tokens: 7775191040 | elapsed time per iteration (s): 1.05 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.248540E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.142 | TFLOPs: 40.18 | 15: iteration 14840/ 125429 | consumed samples: 3799040 | consumed tokens: 7780433920 | elapsed time per iteration (s): 1.05 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.228469E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.105 | TFLOPs: 40.34 | 15: iteration 14850/ 125429 | consumed samples: 3801600 | consumed tokens: 7785676800 | elapsed time per iteration (s): 1.05 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.224110E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.139 | TFLOPs: 40.35 | 15: iteration 14860/ 125429 | consumed samples: 3804160 | consumed tokens: 7790919680 | elapsed time per iteration (s): 1.37 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 
2.214041E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 187.331 | TFLOPs: 30.96 | 15: iteration 14870/ 125429 | consumed samples: 3806720 | consumed tokens: 7796162560 | elapsed time per iteration (s): 1.03 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.252435E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.439 | TFLOPs: 41.06 | 15: iteration 14880/ 125429 | consumed samples: 3809280 | consumed tokens: 7801405440 | elapsed time per iteration (s): 1.04 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.220754E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.077 | TFLOPs: 40.67 | 15: iteration 14890/ 125429 | consumed samples: 3811840 | consumed tokens: 7806648320 | elapsed time per iteration (s): 1.07 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.207221E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.518 | TFLOPs: 39.42 | 15: iteration 14900/ 125429 | consumed samples: 3814400 | consumed tokens: 7811891200 | elapsed time per iteration (s): 1.10 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.205870E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.407 | TFLOPs: 38.57 | 15: iteration 14910/ 125429 | consumed samples: 3816960 | consumed tokens: 7817134080 | elapsed time per iteration (s): 1.02 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.244043E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.153 | TFLOPs: 41.34 | 15: iteration 14920/ 125429 | consumed samples: 3819520 | consumed tokens: 7822376960 | 
elapsed time per iteration (s): 1.04 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.205468E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.343 | TFLOPs: 40.71 | 15: iteration 14930/ 125429 | consumed samples: 3822080 | consumed tokens: 7827619840 | elapsed time per iteration (s): 1.04 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.210057E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.974 | TFLOPs: 40.81 | 15: iteration 14940/ 125429 | consumed samples: 3824640 | consumed tokens: 7832862720 | elapsed time per iteration (s): 1.05 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.229230E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.786 | TFLOPs: 40.45 | 15: iteration 14950/ 125429 | consumed samples: 3827200 | consumed tokens: 7838105600 | elapsed time per iteration (s): 1.06 | learning rate: 1.947E-04 | global batch size: 256 | lm loss: 2.203087E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.415 | TFLOPs: 39.73 | 15: iteration 14960/ 125429 | consumed samples: 3829760 | consumed tokens: 7843348480 | elapsed time per iteration (s): 1.03 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.178685E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.407 | TFLOPs: 41.05 | 15: iteration 14970/ 125429 | consumed samples: 3832320 | consumed tokens: 7848591360 | elapsed time per iteration (s): 1.08 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.214400E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.702 | TFLOPs: 
39.28 |
15: iteration 14980/ 125429 | consumed samples: 3834880 | consumed tokens: 7853834240 | elapsed time per iteration (s): 1.03 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.221597E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.700 | TFLOPs: 40.93 |
15: iteration 14990/ 125429 | consumed samples: 3837440 | consumed tokens: 7859077120 | elapsed time per iteration (s): 1.03 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.262803E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.351 | TFLOPs: 40.88 |
15: iteration 15000/ 125429 | consumed samples: 3840000 | consumed tokens: 7864320000 | elapsed time per iteration (s): 1.06 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.209179E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.421 | TFLOPs: 40.06 |
15: -------------------------------------------------------------------------------------------
15: valid loss at iteration 15000 | lm loss value: 2.220967E+00 | lm loss PPL: 9.216241E+00 |
15: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 15000 to checkpoints_1b5
0: [2022-11-26 00:15:56,144] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step15000 is begin to save!
0: [2022-11-26 00:15:56,152] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_01-model_00-model_states.pt...
0: [2022-11-26 00:15:56,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_01-model_00-model_states.pt.
0: [2022-11-26 00:15:56,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_03-model_00-model_states.pt...
0: [2022-11-26 00:15:56,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_03-model_00-model_states.pt.
0: [2022-11-26 00:15:56,507] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_04-model_00-model_states.pt...
0: [2022-11-26 00:15:56,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_04-model_00-model_states.pt.
0: [2022-11-26 00:15:56,620] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_05-model_00-model_states.pt...
0: [2022-11-26 00:15:56,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_05-model_00-model_states.pt.
0: [2022-11-26 00:15:56,722] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_06-model_00-model_states.pt...
0: [2022-11-26 00:15:56,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_06-model_00-model_states.pt.
0: [2022-11-26 00:15:56,827] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_07-model_00-model_states.pt...
0: [2022-11-26 00:15:56,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_07-model_00-model_states.pt.
0: [2022-11-26 00:15:56,927] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_08-model_00-model_states.pt...
0: [2022-11-26 00:15:57,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_08-model_00-model_states.pt.
0: [2022-11-26 00:15:57,030] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_09-model_00-model_states.pt...
0: [2022-11-26 00:15:57,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_09-model_00-model_states.pt.
0: [2022-11-26 00:15:57,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_10-model_00-model_states.pt...
0: [2022-11-26 00:15:57,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_10-model_00-model_states.pt.
0: [2022-11-26 00:15:57,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_11-model_00-model_states.pt...
0: [2022-11-26 00:15:57,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_11-model_00-model_states.pt.
0: [2022-11-26 00:15:57,333] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_12-model_00-model_states.pt...
0: [2022-11-26 00:15:57,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_12-model_00-model_states.pt.
0: [2022-11-26 00:15:57,437] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_13-model_00-model_states.pt...
0: [2022-11-26 00:15:57,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_13-model_00-model_states.pt.
0: [2022-11-26 00:15:57,539] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_14-model_00-model_states.pt...
0: [2022-11-26 00:15:57,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_14-model_00-model_states.pt.
0: [2022-11-26 00:15:57,636] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_15-model_00-model_states.pt...
0: [2022-11-26 00:15:57,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_15-model_00-model_states.pt.
0: [2022-11-26 00:15:57,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_16-model_00-model_states.pt...
0: [2022-11-26 00:15:57,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_16-model_00-model_states.pt.
0: [2022-11-26 00:15:57,844] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_17-model_00-model_states.pt...
0: [2022-11-26 00:15:57,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_17-model_00-model_states.pt.
0: [2022-11-26 00:15:57,946] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_18-model_00-model_states.pt...
0: [2022-11-26 00:15:58,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_18-model_00-model_states.pt.
0: [2022-11-26 00:15:58,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_19-model_00-model_states.pt...
0: [2022-11-26 00:15:58,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_19-model_00-model_states.pt.
0: [2022-11-26 00:15:58,148] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_20-model_00-model_states.pt...
0: [2022-11-26 00:15:58,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_20-model_00-model_states.pt.
0: [2022-11-26 00:15:58,258] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_21-model_00-model_states.pt...
0: [2022-11-26 00:15:58,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_21-model_00-model_states.pt.
0: [2022-11-26 00:15:58,360] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_22-model_00-model_states.pt...
0: [2022-11-26 00:15:58,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_22-model_00-model_states.pt.
0: [2022-11-26 00:15:58,467] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_23-model_00-model_states.pt...
0: [2022-11-26 00:15:58,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_23-model_00-model_states.pt.
0: [2022-11-26 00:15:58,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_24-model_00-model_states.pt...
0: [2022-11-26 00:15:58,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_24-model_00-model_states.pt.
0: [2022-11-26 00:15:58,672] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_25-model_00-model_states.pt...
0: [2022-11-26 00:15:58,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_25-model_00-model_states.pt.
0: [2022-11-26 00:15:58,776] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_26-model_00-model_states.pt...
0: [2022-11-26 00:15:58,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_26-model_00-model_states.pt.
0: [2022-11-26 00:15:58,878] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_27-model_00-model_states.pt...
0: [2022-11-26 00:15:58,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_27-model_00-model_states.pt. 0: [2022-11-26 00:15:58,984] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_28-model_00-model_states.pt... 0: [2022-11-26 00:15:59,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_28-model_00-model_states.pt. 0: [2022-11-26 00:15:59,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_29-model_00-model_states.pt... 0: [2022-11-26 00:15:59,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_29-model_00-model_states.pt. 0: [2022-11-26 00:15:59,188] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_30-model_00-model_states.pt... 0: [2022-11-26 00:15:59,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_30-model_00-model_states.pt. 0: [2022-11-26 00:15:59,290] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/layer_32-model_00-model_states.pt... 0: [2022-11-26 00:15:59,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/layer_32-model_00-model_states.pt. 0: [2022-11-26 00:15:59,296] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step15000/mp_rank_00_model_states.pt 0: [2022-11-26 00:15:59,296] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/mp_rank_00_model_states.pt... 0: [2022-11-26 00:15:59,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/mp_rank_00_model_states.pt. 
0: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
5: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
4: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
13: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
11: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
10: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
7: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
11: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:15:59,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step15000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:15:59,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:15:59,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 00:15:59,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-26 00:15:59,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 00:15:59,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 00:15:59,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 12: [2022-11-26 00:15:59,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:15:59,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 00:15:59,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 10: [2022-11-26 00:15:59,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:15:59,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 00:15:59,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 12: [2022-11-26 00:15:59,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:15:59,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 00:15:59,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 15: [2022-11-26 00:15:59,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 00:15:59,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 00:15:59,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 14: [2022-11-26 00:15:59,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:15:59,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 00:15:59,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-26 00:15:59,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:15:59,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 00:15:59,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 12: [2022-11-26 00:15:59,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:15:59,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 00:15:59,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: [2022-11-26 00:15:59,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
13: [2022-11-26 00:15:59,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:15:59,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 00:15:59,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 15: [2022-11-26 00:15:59,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:15:59,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 00:15:59,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 12: [2022-11-26 00:15:59,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:15:59,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 00:15:59,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 13: [2022-11-26 00:15:59,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:15:59,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 00:15:59,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
8: [2022-11-26 00:15:59,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:15:59,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 00:15:59,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-26 00:15:59,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:15:59,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 00:15:59,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-26 00:15:59,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:15:59,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 00:15:59,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 9: [2022-11-26 00:15:59,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:15:59,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 00:15:59,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 00:15:59,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 00:15:59,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 9: [2022-11-26 00:15:59,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 12: [2022-11-26 00:15:59,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:15:59,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 00:15:59,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 14: [2022-11-26 00:15:59,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:15:59,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:15:59,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 1: [2022-11-26 00:15:59,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:15:59,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
3: [2022-11-26 00:15:59,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 00:15:59,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 8: [2022-11-26 00:15:59,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:15:59,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 00:15:59,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 12: [2022-11-26 00:15:59,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:15:59,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 00:15:59,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 15: [2022-11-26 00:15:59,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:15:59,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 00:15:59,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 15: [2022-11-26 00:15:59,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 00:15:59,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 00:15:59,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-26 00:15:59,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:15:59,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 00:15:59,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-26 00:15:59,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:15:59,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 00:15:59,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-26 00:15:59,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:15:59,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 00:15:59,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-26 00:15:59,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 00:15:59,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 00:15:59,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 15: [2022-11-26 00:15:59,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:15:59,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 00:15:59,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 10: [2022-11-26 00:15:59,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:15:59,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:15:59,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 0: [2022-11-26 00:15:59,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:15:59,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: [2022-11-26 00:15:59,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 00:15:59,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 00:15:59,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 00:15:59,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: [2022-11-26 00:15:59,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: [2022-11-26 00:15:59,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 00:15:59,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 8: [2022-11-26 00:15:59,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:15:59,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 00:15:59,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 8: [2022-11-26 00:15:59,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:15:59,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 00:15:59,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
5: [2022-11-26 00:15:59,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:15:59,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:15:59,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 00:15:59,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 00:15:59,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-26 00:15:59,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-26 00:15:59,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:15:59,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 00:15:59,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 15: [2022-11-26 00:15:59,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:15:59,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 00:15:59,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
13: [2022-11-26 00:15:59,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:15:59,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:15:59,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:15:59,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 9: [2022-11-26 00:15:59,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 00:15:59,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 00:15:59,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 9: [2022-11-26 00:15:59,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 13: [2022-11-26 00:15:59,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-26 00:15:59,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:15:59,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 00:15:59,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
12: [2022-11-26 00:15:59,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:15:59,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:15:59,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 5: [2022-11-26 00:15:59,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 12: [2022-11-26 00:15:59,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-26 00:15:59,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 13: [2022-11-26 00:15:59,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:15:59,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 00:15:59,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 13: [2022-11-26 00:15:59,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:15:59,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 3: [2022-11-26 00:15:59,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
13: [2022-11-26 00:15:59,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-26 00:15:59,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 6: [2022-11-26 00:15:59,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:15:59,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-26 00:15:59,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 00:15:59,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:15:59,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 6: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
6: [2022-11-26 00:15:59,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 0: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 14: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:15:59,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 00:15:59,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 13: [2022-11-26 00:15:59,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 0: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:15:59,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 6: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 14: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
13: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: [2022-11-26 00:15:59,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:15:59,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 1: [2022-11-26 00:15:59,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 00:15:59,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-26 00:15:59,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-26 00:15:59,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 00:15:59,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 11: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:15:59,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 00:15:59,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 11: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:15:59,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:15:59,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 4: [2022-11-26 00:15:59,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 1: [2022-11-26 00:15:59,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 00:15:59,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
15: [2022-11-26 00:15:59,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:15:59,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-26 00:15:59,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 11: [2022-11-26 00:15:59,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:15:59,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:15:59,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 14: [2022-11-26 00:15:59,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 00:15:59,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 11: [2022-11-26 00:15:59,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-26 00:15:59,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:15:59,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 00:15:59,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
14: [2022-11-26 00:15:59,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:15:59,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 00:15:59,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 14: [2022-11-26 00:15:59,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:15:59,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:15:59,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 11: [2022-11-26 00:15:59,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 00:15:59,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 14: [2022-11-26 00:15:59,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 15: [2022-11-26 00:15:59,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 00:15:59,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-26 00:15:59,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 00:15:59,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 00:15:59,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 15: [2022-11-26 00:15:59,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:15:59,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:15:59,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 00:15:59,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 11: [2022-11-26 00:15:59,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:15:59,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 00:15:59,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 14: [2022-11-26 00:15:59,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:15:59,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 8: [2022-11-26 00:15:59,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
14: [2022-11-26 00:15:59,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 8: [2022-11-26 00:15:59,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 00:15:59,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 10: [2022-11-26 00:15:59,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:15:59,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 00:15:59,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 11: [2022-11-26 00:15:59,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:15:59,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 00:15:59,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 9: [2022-11-26 00:15:59,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:15:59,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 00:15:59,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
4: [2022-11-26 00:15:59,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:15:59,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 00:15:59,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 10: [2022-11-26 00:15:59,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:15:59,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 00:15:59,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 15: [2022-11-26 00:15:59,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 00:15:59,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-26 00:15:59,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:15:59,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:15:59,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 00:15:59,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 00:15:59,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 00:15:59,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-26 00:15:59,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-26 00:15:59,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 00:15:59,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-26 00:15:59,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:15:59,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:15:59,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 10: [2022-11-26 00:15:59,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 3: [2022-11-26 00:15:59,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 10: [2022-11-26 00:15:59,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:15:59,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
10: [2022-11-26 00:15:59,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 00:15:59,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 9: [2022-11-26 00:15:59,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:15:59,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 00:15:59,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 1: [2022-11-26 00:15:59,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:15:59,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 00:15:59,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 6: [2022-11-26 00:15:59,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:15:59,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 00:15:59,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 8: [2022-11-26 00:15:59,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 00:15:59,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 00:15:59,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-26 00:15:59,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:15:59,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:15:59,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 00:15:59,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-26 00:15:59,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 00:15:59,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 13: [2022-11-26 00:15:59,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:15:59,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 00:15:59,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 11: [2022-11-26 00:15:59,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 00:15:59,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 00:15:59,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 11: [2022-11-26 00:15:59,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:15:59,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 00:15:59,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: [2022-11-26 00:15:59,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:15:59,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 00:15:59,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 10: [2022-11-26 00:15:59,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:15:59,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 00:15:59,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 2: [2022-11-26 00:15:59,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 00:15:59,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:15:59,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 10: [2022-11-26 00:15:59,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:15:59,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 00:15:59,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 2: [2022-11-26 00:15:59,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 10: [2022-11-26 00:15:59,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 00:15:59,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 13: [2022-11-26 00:15:59,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:15:59,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 00:15:59,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 2: [2022-11-26 00:15:59,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 00:15:59,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 00:15:59,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-26 00:15:59,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:15:59,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:15:59,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:15:59,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:15:59,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:15:59,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 00:15:59,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 00:15:59,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 00:15:59,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 00:15:59,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 00:15:59,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 00:15:59,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-26 00:15:59,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 00:15:59,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-26 00:15:59,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-26 00:15:59,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-26 00:15:59,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-26 00:15:59,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
12: [2022-11-26 00:15:59,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:15:59,549] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 00:15:59,549] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 9: [2022-11-26 00:15:59,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:15:59,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 00:15:59,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 8: [2022-11-26 00:15:59,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:15:59,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 00:15:59,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 9: [2022-11-26 00:15:59,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:15:59,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 00:15:59,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
14: [2022-11-26 00:15:59,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:15:59,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 00:15:59,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-26 00:15:59,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:15:59,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 00:15:59,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 5: [2022-11-26 00:15:59,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:15:59,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 00:15:59,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 2: [2022-11-26 00:15:59,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:15:59,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 00:15:59,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
2: [2022-11-26 00:15:59,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:15:59,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 00:15:59,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 2: [2022-11-26 00:15:59,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:15:59,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 00:15:59,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 3: [2022-11-26 00:15:59,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:15:59,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 00:15:59,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 2: [2022-11-26 00:15:59,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:15:59,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 00:15:59,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 
2: [2022-11-26 00:15:59,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:15:59,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 00:15:59,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 4: [2022-11-26 00:15:59,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:15:59,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 00:15:59,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: [2022-11-26 00:15:59,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 00:15:59,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-26 00:15:59,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:15:59,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 00:15:59,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 7: [2022-11-26 00:15:59,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 00:15:59,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step15000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 00:15:59,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step15000 is ready now! 0: successfully saved checkpoint at iteration 15000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3627.14 15: iteration 15010/ 125429 | consumed samples: 3842560 | consumed tokens: 7869562880 | elapsed time per iteration (s): 1.43 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.224053E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 179.353 | TFLOPs: 29.64 | 15: iteration 15020/ 125429 | consumed samples: 3845120 | consumed tokens: 7874805760 | elapsed time per iteration (s): 1.05 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.205666E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.129 | TFLOPs: 40.34 | 15: iteration 15030/ 125429 | consumed samples: 3847680 | consumed tokens: 7880048640 | elapsed time per iteration (s): 1.04 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.231864E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.938 | TFLOPs: 40.64 | 15: iteration 15040/ 125429 | consumed samples: 3850240 | consumed tokens: 7885291520 | elapsed time per iteration (s): 1.05 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.255884E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.960 | TFLOPs: 40.32 | 15: iteration 15050/ 125429 | consumed samples: 3852800 | consumed tokens: 7890534400 | elapsed time per iteration (s): 1.04 | learning rate: 1.946E-04 | global batch size: 256 | lm 
loss: 2.198298E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.012 | TFLOPs: 40.49 | 15: iteration 15060/ 125429 | consumed samples: 3855360 | consumed tokens: 7895777280 | elapsed time per iteration (s): 1.04 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.196574E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.272 | TFLOPs: 40.53 | 15: iteration 15070/ 125429 | consumed samples: 3857920 | consumed tokens: 7901020160 | elapsed time per iteration (s): 1.06 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.213679E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.410 | TFLOPs: 40.06 | 15: iteration 15080/ 125429 | consumed samples: 3860480 | consumed tokens: 7906263040 | elapsed time per iteration (s): 1.09 | learning rate: 1.946E-04 | global batch size: 256 | lm loss: 2.199392E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.802 | TFLOPs: 38.80 | 15: iteration 15090/ 125429 | consumed samples: 3863040 | consumed tokens: 7911505920 | elapsed time per iteration (s): 1.07 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.201978E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.388 | TFLOPs: 39.40 | 15: iteration 15100/ 125429 | consumed samples: 3865600 | consumed tokens: 7916748800 | elapsed time per iteration (s): 1.04 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.201172E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.370 | TFLOPs: 40.55 | 15: iteration 15110/ 125429 | consumed samples: 3868160 | consumed tokens: 7921991680 | 
elapsed time per iteration (s): 1.09 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.222788E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.500 | TFLOPs: 38.92 | 15: iteration 15120/ 125429 | consumed samples: 3870720 | consumed tokens: 7927234560 | elapsed time per iteration (s): 1.26 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.210410E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 203.933 | TFLOPs: 33.70 | 15: iteration 15130/ 125429 | consumed samples: 3873280 | consumed tokens: 7932477440 | elapsed time per iteration (s): 1.07 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.193499E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.187 | TFLOPs: 39.53 | 15: iteration 15140/ 125429 | consumed samples: 3875840 | consumed tokens: 7937720320 | elapsed time per iteration (s): 1.03 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.234887E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.706 | TFLOPs: 40.94 | 15: iteration 15150/ 125429 | consumed samples: 3878400 | consumed tokens: 7942963200 | elapsed time per iteration (s): 1.03 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.191655E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.137 | TFLOPs: 41.17 | 15: iteration 15160/ 125429 | consumed samples: 3880960 | consumed tokens: 7948206080 | elapsed time per iteration (s): 1.05 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.208279E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.575 | TFLOPs: 
40.25 | 15: iteration 15170/ 125429 | consumed samples: 3883520 | consumed tokens: 7953448960 | elapsed time per iteration (s): 1.49 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.225702E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 172.183 | TFLOPs: 28.45 | 15: iteration 15180/ 125429 | consumed samples: 3886080 | consumed tokens: 7958691840 | elapsed time per iteration (s): 1.06 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.202409E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.532 | TFLOPs: 40.08 | 15: iteration 15190/ 125429 | consumed samples: 3888640 | consumed tokens: 7963934720 | elapsed time per iteration (s): 1.07 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.204725E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.765 | TFLOPs: 39.62 | 15: iteration 15200/ 125429 | consumed samples: 3891200 | consumed tokens: 7969177600 | elapsed time per iteration (s): 1.05 | learning rate: 1.945E-04 | global batch size: 256 | lm loss: 2.226498E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.907 | TFLOPs: 40.47 | 15: iteration 15210/ 125429 | consumed samples: 3893760 | consumed tokens: 7974420480 | elapsed time per iteration (s): 1.04 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.203143E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.818 | TFLOPs: 40.79 | 15: iteration 15220/ 125429 | consumed samples: 3896320 | consumed tokens: 7979663360 | elapsed time per iteration (s): 1.03 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.229561E+00 | grad norm: 0.159 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.301 | TFLOPs: 41.03 | 15: iteration 15230/ 125429 | consumed samples: 3898880 | consumed tokens: 7984906240 | elapsed time per iteration (s): 1.07 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.209108E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.165 | TFLOPs: 39.69 | 15: iteration 15240/ 125429 | consumed samples: 3901440 | consumed tokens: 7990149120 | elapsed time per iteration (s): 1.03 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.219460E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.302 | TFLOPs: 41.20 | 15: iteration 15250/ 125429 | consumed samples: 3904000 | consumed tokens: 7995392000 | elapsed time per iteration (s): 1.06 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.202131E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.852 | TFLOPs: 39.97 | 15: iteration 15260/ 125429 | consumed samples: 3906560 | consumed tokens: 8000634880 | elapsed time per iteration (s): 1.03 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.224323E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.258 | TFLOPs: 41.19 | 15: iteration 15270/ 125429 | consumed samples: 3909120 | consumed tokens: 8005877760 | elapsed time per iteration (s): 1.04 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.202535E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.988 | TFLOPs: 40.82 | 15: iteration 15280/ 125429 | consumed samples: 3911680 | consumed tokens: 8011120640 | elapsed time per iteration (s): 1.03 | learning rate: 1.944E-04 
| global batch size: 256 | lm loss: 2.220667E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.052 | TFLOPs: 41.16 | 15: iteration 15290/ 125429 | consumed samples: 3914240 | consumed tokens: 8016363520 | elapsed time per iteration (s): 1.03 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.196173E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.514 | TFLOPs: 41.07 | 15: iteration 15300/ 125429 | consumed samples: 3916800 | consumed tokens: 8021606400 | elapsed time per iteration (s): 1.04 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.198998E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.874 | TFLOPs: 40.63 | 15: iteration 15310/ 125429 | consumed samples: 3919360 | consumed tokens: 8026849280 | elapsed time per iteration (s): 1.04 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.214338E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.544 | TFLOPs: 40.74 | 15: iteration 15320/ 125429 | consumed samples: 3921920 | consumed tokens: 8032092160 | elapsed time per iteration (s): 1.04 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.206038E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.969 | TFLOPs: 40.81 | 15: iteration 15330/ 125429 | consumed samples: 3924480 | consumed tokens: 8037335040 | elapsed time per iteration (s): 1.03 | learning rate: 1.944E-04 | global batch size: 256 | lm loss: 2.192785E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.610 | TFLOPs: 41.08 | 15: iteration 15340/ 125429 | consumed samples: 3927040 | 
consumed tokens: 8042577920 | elapsed time per iteration (s): 1.06 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.220245E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.056 | TFLOPs: 40.00 | 15: iteration 15350/ 125429 | consumed samples: 3929600 | consumed tokens: 8047820800 | elapsed time per iteration (s): 1.03 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.204818E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.059 | TFLOPs: 41.16 | 15: iteration 15360/ 125429 | consumed samples: 3932160 | consumed tokens: 8053063680 | elapsed time per iteration (s): 1.05 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.210107E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.440 | TFLOPs: 40.23 | 15: iteration 15370/ 125429 | consumed samples: 3934720 | consumed tokens: 8058306560 | elapsed time per iteration (s): 1.04 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.218196E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.637 | TFLOPs: 40.76 | 15: iteration 15380/ 125429 | consumed samples: 3937280 | consumed tokens: 8063549440 | elapsed time per iteration (s): 1.04 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.185210E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.310 | TFLOPs: 40.70 | 15: iteration 15390/ 125429 | consumed samples: 3939840 | consumed tokens: 8068792320 | elapsed time per iteration (s): 1.03 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.222733E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 249.059 | TFLOPs: 41.16 | 15: iteration 15400/ 125429 | consumed samples: 3942400 | consumed tokens: 8074035200 | elapsed time per iteration (s): 1.04 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.193483E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.914 | TFLOPs: 40.64 | 15: iteration 15410/ 125429 | consumed samples: 3944960 | consumed tokens: 8079278080 | elapsed time per iteration (s): 1.04 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.234122E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.194 | TFLOPs: 40.52 | 15: iteration 15420/ 125429 | consumed samples: 3947520 | consumed tokens: 8084520960 | elapsed time per iteration (s): 1.06 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.227706E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.012 | TFLOPs: 39.99 | 15: iteration 15430/ 125429 | consumed samples: 3950080 | consumed tokens: 8089763840 | elapsed time per iteration (s): 1.03 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.203654E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.636 | TFLOPs: 41.25 | 15: iteration 15440/ 125429 | consumed samples: 3952640 | consumed tokens: 8095006720 | elapsed time per iteration (s): 1.05 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.202118E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.342 | TFLOPs: 40.21 | 15: iteration 15450/ 125429 | consumed samples: 3955200 | consumed tokens: 8100249600 | elapsed time per iteration (s): 1.03 | learning rate: 1.943E-04 | global batch size: 256 | lm loss: 2.214731E+00 | grad norm: 
0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.309 | TFLOPs: 41.20 | 15: iteration 15460/ 125429 | consumed samples: 3957760 | consumed tokens: 8105492480 | elapsed time per iteration (s): 1.04 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.212617E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.327 | TFLOPs: 40.87 | 15: iteration 15470/ 125429 | consumed samples: 3960320 | consumed tokens: 8110735360 | elapsed time per iteration (s): 1.03 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.212294E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.840 | TFLOPs: 41.12 | 15: iteration 15480/ 125429 | consumed samples: 3962880 | consumed tokens: 8115978240 | elapsed time per iteration (s): 1.11 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.214907E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.139 | TFLOPs: 38.03 | 15: iteration 15490/ 125429 | consumed samples: 3965440 | consumed tokens: 8121221120 | elapsed time per iteration (s): 1.04 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.203888E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.367 | TFLOPs: 40.71 | 15: iteration 15500/ 125429 | consumed samples: 3968000 | consumed tokens: 8126464000 | elapsed time per iteration (s): 1.03 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.208060E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.682 | TFLOPs: 41.26 | 15: iteration 15510/ 125429 | consumed samples: 3970560 | consumed tokens: 8131706880 | elapsed time per iteration (s): 
1.04 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.216778E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.945 | TFLOPs: 40.64 | 15: iteration 15520/ 125429 | consumed samples: 3973120 | consumed tokens: 8136949760 | elapsed time per iteration (s): 1.07 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.209500E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.261 | TFLOPs: 39.70 | 15: iteration 15530/ 125429 | consumed samples: 3975680 | consumed tokens: 8142192640 | elapsed time per iteration (s): 1.04 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.227956E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.449 | TFLOPs: 40.56 | 15: iteration 15540/ 125429 | consumed samples: 3978240 | consumed tokens: 8147435520 | elapsed time per iteration (s): 1.05 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.219659E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.716 | TFLOPs: 40.44 | 15: iteration 15550/ 125429 | consumed samples: 3980800 | consumed tokens: 8152678400 | elapsed time per iteration (s): 1.04 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.187531E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.488 | TFLOPs: 40.57 | 15: iteration 15560/ 125429 | consumed samples: 3983360 | consumed tokens: 8157921280 | elapsed time per iteration (s): 1.04 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.211126E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.311 | TFLOPs: 40.87 | 15: iteration 15570/ 
125429 | consumed samples: 3985920 | consumed tokens: 8163164160 | elapsed time per iteration (s): 1.04 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.199152E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.354 | TFLOPs: 40.71 | 15: iteration 15580/ 125429 | consumed samples: 3988480 | consumed tokens: 8168407040 | elapsed time per iteration (s): 1.03 | learning rate: 1.942E-04 | global batch size: 256 | lm loss: 2.205244E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.680 | TFLOPs: 41.10 | 15: iteration 15590/ 125429 | consumed samples: 3991040 | consumed tokens: 8173649920 | elapsed time per iteration (s): 1.03 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.229851E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.144 | TFLOPs: 41.01 | 15: iteration 15600/ 125429 | consumed samples: 3993600 | consumed tokens: 8178892800 | elapsed time per iteration (s): 1.06 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.209244E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.233 | TFLOPs: 39.87 | 15: iteration 15610/ 125429 | consumed samples: 3996160 | consumed tokens: 8184135680 | elapsed time per iteration (s): 1.04 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.221327E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.109 | TFLOPs: 40.67 | 15: iteration 15620/ 125429 | consumed samples: 3998720 | consumed tokens: 8189378560 | elapsed time per iteration (s): 1.06 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.185488E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 240.442 | TFLOPs: 39.73 | 15: iteration 15630/ 125429 | consumed samples: 4001280 | consumed tokens: 8194621440 | elapsed time per iteration (s): 1.04 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.217826E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.413 | TFLOPs: 40.56 | 15: iteration 15640/ 125429 | consumed samples: 4003840 | consumed tokens: 8199864320 | elapsed time per iteration (s): 1.04 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.210654E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.207 | TFLOPs: 40.85 | 15: iteration 15650/ 125429 | consumed samples: 4006400 | consumed tokens: 8205107200 | elapsed time per iteration (s): 1.08 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.242658E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.984 | TFLOPs: 39.33 | 15: iteration 15660/ 125429 | consumed samples: 4008960 | consumed tokens: 8210350080 | elapsed time per iteration (s): 1.06 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.177859E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.202 | TFLOPs: 39.86 | 15: iteration 15670/ 125429 | consumed samples: 4011520 | consumed tokens: 8215592960 | elapsed time per iteration (s): 1.05 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.204574E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.636 | TFLOPs: 40.26 | 15: iteration 15680/ 125429 | consumed samples: 4014080 | consumed tokens: 8220835840 | elapsed time per iteration (s): 1.09 | learning rate: 1.941E-04 | global batch size: 256 | 
lm loss: 2.191162E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.687 | TFLOPs: 38.78 | 15: iteration 15690/ 125429 | consumed samples: 4016640 | consumed tokens: 8226078720 | elapsed time per iteration (s): 1.06 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.228849E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.283 | TFLOPs: 39.87 | 15: iteration 15700/ 125429 | consumed samples: 4019200 | consumed tokens: 8231321600 | elapsed time per iteration (s): 1.09 | learning rate: 1.941E-04 | global batch size: 256 | lm loss: 2.209067E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.029 | TFLOPs: 38.84 | 15: iteration 15710/ 125429 | consumed samples: 4021760 | consumed tokens: 8236564480 | elapsed time per iteration (s): 1.04 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.218131E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.734 | TFLOPs: 40.77 | 15: iteration 15720/ 125429 | consumed samples: 4024320 | consumed tokens: 8241807360 | elapsed time per iteration (s): 1.06 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.225886E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.024 | TFLOPs: 39.83 | 15: iteration 15730/ 125429 | consumed samples: 4026880 | consumed tokens: 8247050240 | elapsed time per iteration (s): 1.04 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.215473E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.510 | TFLOPs: 40.74 | 15: iteration 15740/ 125429 | consumed samples: 4029440 | consumed tokens: 
8252293120 | elapsed time per iteration (s): 1.04 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.211259E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.045 | TFLOPs: 40.66 | 15: iteration 15750/ 125429 | consumed samples: 4032000 | consumed tokens: 8257536000 | elapsed time per iteration (s): 1.02 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.202897E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.409 | TFLOPs: 41.38 | 15: iteration 15760/ 125429 | consumed samples: 4034560 | consumed tokens: 8262778880 | elapsed time per iteration (s): 1.02 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.214709E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.033 | TFLOPs: 41.32 | 15: iteration 15770/ 125429 | consumed samples: 4037120 | consumed tokens: 8268021760 | elapsed time per iteration (s): 1.02 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.211133E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.190 | TFLOPs: 41.51 | 15: iteration 15780/ 125429 | consumed samples: 4039680 | consumed tokens: 8273264640 | elapsed time per iteration (s): 1.04 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.189210E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.005 | TFLOPs: 40.65 | 15: iteration 15790/ 125429 | consumed samples: 4042240 | consumed tokens: 8278507520 | elapsed time per iteration (s): 1.05 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.188484E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
244.915 | TFLOPs: 40.47 | 15: iteration 15800/ 125429 | consumed samples: 4044800 | consumed tokens: 8283750400 | elapsed time per iteration (s): 1.03 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.185827E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.225 | TFLOPs: 41.19 | 15: iteration 15810/ 125429 | consumed samples: 4047360 | consumed tokens: 8288993280 | elapsed time per iteration (s): 1.06 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.169618E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.798 | TFLOPs: 39.96 | 15: iteration 15820/ 125429 | consumed samples: 4049920 | consumed tokens: 8294236160 | elapsed time per iteration (s): 1.03 | learning rate: 1.940E-04 | global batch size: 256 | lm loss: 2.224739E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.595 | TFLOPs: 40.92 | 15: iteration 15830/ 125429 | consumed samples: 4052480 | consumed tokens: 8299479040 | elapsed time per iteration (s): 1.06 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.204543E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.237 | TFLOPs: 39.87 | 15: iteration 15840/ 125429 | consumed samples: 4055040 | consumed tokens: 8304721920 | elapsed time per iteration (s): 1.06 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.217747E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.271 | TFLOPs: 40.04 | 15: iteration 15850/ 125429 | consumed samples: 4057600 | consumed tokens: 8309964800 | elapsed time per iteration (s): 1.05 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.216219E+00 | grad norm: 0.144 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.634 | TFLOPs: 40.26 | 15: iteration 15860/ 125429 | consumed samples: 4060160 | consumed tokens: 8315207680 | elapsed time per iteration (s): 1.05 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.200608E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.398 | TFLOPs: 40.22 | 15: iteration 15870/ 125429 | consumed samples: 4062720 | consumed tokens: 8320450560 | elapsed time per iteration (s): 1.03 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.184348E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.852 | TFLOPs: 41.12 | 15: iteration 15880/ 125429 | consumed samples: 4065280 | consumed tokens: 8325693440 | elapsed time per iteration (s): 1.03 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.195573E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.355 | TFLOPs: 41.21 | 15: iteration 15890/ 125429 | consumed samples: 4067840 | consumed tokens: 8330936320 | elapsed time per iteration (s): 1.04 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.202684E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.294 | TFLOPs: 40.70 | 15: iteration 15900/ 125429 | consumed samples: 4070400 | consumed tokens: 8336179200 | elapsed time per iteration (s): 1.04 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.210704E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.869 | TFLOPs: 40.63 | 15: iteration 15910/ 125429 | consumed samples: 4072960 | consumed tokens: 8341422080 | elapsed time per iteration (s): 1.03 | 
learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.211957E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.760 | TFLOPs: 41.11 | 15: iteration 15920/ 125429 | consumed samples: 4075520 | consumed tokens: 8346664960 | elapsed time per iteration (s): 1.08 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.168682E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.934 | TFLOPs: 39.32 | 15: iteration 15930/ 125429 | consumed samples: 4078080 | consumed tokens: 8351907840 | elapsed time per iteration (s): 1.03 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.193143E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.072 | TFLOPs: 41.16 | 15: iteration 15940/ 125429 | consumed samples: 4080640 | consumed tokens: 8357150720 | elapsed time per iteration (s): 1.04 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.199705E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.091 | TFLOPs: 40.67 | 15: iteration 15950/ 125429 | consumed samples: 4083200 | consumed tokens: 8362393600 | elapsed time per iteration (s): 1.52 | learning rate: 1.939E-04 | global batch size: 256 | lm loss: 2.210487E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 168.604 | TFLOPs: 27.86 | 15: iteration 15960/ 125429 | consumed samples: 4085760 | consumed tokens: 8367636480 | elapsed time per iteration (s): 1.05 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.188455E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.116 | TFLOPs: 40.18 | 15: iteration 15970/ 125429 | 
consumed samples: 4088320 | consumed tokens: 8372879360 | elapsed time per iteration (s): 1.05 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.204202E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.608 | TFLOPs: 40.26 | 15: iteration 15980/ 125429 | consumed samples: 4090880 | consumed tokens: 8378122240 | elapsed time per iteration (s): 1.02 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.224088E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.359 | TFLOPs: 41.54 | 15: iteration 15990/ 125429 | consumed samples: 4093440 | consumed tokens: 8383365120 | elapsed time per iteration (s): 1.05 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.215653E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.694 | TFLOPs: 40.44 | 0: [2022-11-26 00:33:36,307] [INFO] [logging.py:68:log_dist] [Rank 0] step=16000, skipped=0, lr=[0.00019380937976738922, 0.00019380937976738922, 0.00019380937976738922], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 16000/ 125429 | consumed samples: 4096000 | consumed tokens: 8388608000 | elapsed time per iteration (s): 1.03 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.202071E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.518 | TFLOPs: 41.23 | 0: steps: 16000 loss: 2.3027 iter time (s): 1.064 samples/sec: 240.590 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 16000 | lm loss value: 2.101533E+00 | lm loss PPL: 8.178696E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 16000 to checkpoints_1b5 
0: [2022-11-26 00:33:36,653] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step16000 is begin to save! 0: [2022-11-26 00:33:36,737] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_01-model_00-model_states.pt... 0: [2022-11-26 00:33:36,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_01-model_00-model_states.pt. 0: [2022-11-26 00:33:36,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_03-model_00-model_states.pt... 0: [2022-11-26 00:33:37,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_03-model_00-model_states.pt. 0: [2022-11-26 00:33:37,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_04-model_00-model_states.pt... 0: [2022-11-26 00:33:37,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_04-model_00-model_states.pt. 0: [2022-11-26 00:33:37,174] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_05-model_00-model_states.pt... 0: [2022-11-26 00:33:37,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_05-model_00-model_states.pt. 0: [2022-11-26 00:33:37,280] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_06-model_00-model_states.pt... 0: [2022-11-26 00:33:37,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_06-model_00-model_states.pt. 0: [2022-11-26 00:33:37,387] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_07-model_00-model_states.pt... 
0: [2022-11-26 00:33:37,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_07-model_00-model_states.pt. 0: [2022-11-26 00:33:37,495] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_08-model_00-model_states.pt... 0: [2022-11-26 00:33:37,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_08-model_00-model_states.pt. 0: [2022-11-26 00:33:37,606] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_09-model_00-model_states.pt... 0: [2022-11-26 00:33:37,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_09-model_00-model_states.pt. 0: [2022-11-26 00:33:37,708] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_10-model_00-model_states.pt... 0: [2022-11-26 00:33:37,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_10-model_00-model_states.pt. 0: [2022-11-26 00:33:37,815] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_11-model_00-model_states.pt... 0: [2022-11-26 00:33:37,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_11-model_00-model_states.pt. 0: [2022-11-26 00:33:37,922] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_12-model_00-model_states.pt... 0: [2022-11-26 00:33:38,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_12-model_00-model_states.pt. 0: [2022-11-26 00:33:38,024] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_13-model_00-model_states.pt... 
0: [2022-11-26 00:33:38,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_13-model_00-model_states.pt. 0: [2022-11-26 00:33:38,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_14-model_00-model_states.pt... 0: [2022-11-26 00:33:38,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_14-model_00-model_states.pt. 0: [2022-11-26 00:33:38,240] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_15-model_00-model_states.pt... 0: [2022-11-26 00:33:38,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_15-model_00-model_states.pt. 0: [2022-11-26 00:33:38,340] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_16-model_00-model_states.pt... 0: [2022-11-26 00:33:38,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_16-model_00-model_states.pt. 0: [2022-11-26 00:33:38,447] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_17-model_00-model_states.pt... 0: [2022-11-26 00:33:38,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_17-model_00-model_states.pt. 0: [2022-11-26 00:33:38,552] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_18-model_00-model_states.pt... 0: [2022-11-26 00:33:38,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_18-model_00-model_states.pt. 0: [2022-11-26 00:33:38,654] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_19-model_00-model_states.pt... 
0: [2022-11-26 00:33:38,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_19-model_00-model_states.pt. 0: [2022-11-26 00:33:38,762] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_20-model_00-model_states.pt... 0: [2022-11-26 00:33:38,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_20-model_00-model_states.pt. 0: [2022-11-26 00:33:38,864] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_21-model_00-model_states.pt... 0: [2022-11-26 00:33:38,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_21-model_00-model_states.pt. 0: [2022-11-26 00:33:38,975] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_22-model_00-model_states.pt... 0: [2022-11-26 00:33:39,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_22-model_00-model_states.pt. 0: [2022-11-26 00:33:39,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_23-model_00-model_states.pt... 0: [2022-11-26 00:33:39,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_23-model_00-model_states.pt. 0: [2022-11-26 00:33:39,182] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_24-model_00-model_states.pt... 0: [2022-11-26 00:33:39,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_24-model_00-model_states.pt. 0: [2022-11-26 00:33:39,291] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_25-model_00-model_states.pt... 
0: [2022-11-26 00:33:39,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_25-model_00-model_states.pt. 0: [2022-11-26 00:33:39,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_26-model_00-model_states.pt... 0: [2022-11-26 00:33:39,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_26-model_00-model_states.pt. 0: [2022-11-26 00:33:39,498] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_27-model_00-model_states.pt... 0: [2022-11-26 00:33:39,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_27-model_00-model_states.pt. 0: [2022-11-26 00:33:39,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_28-model_00-model_states.pt... 0: [2022-11-26 00:33:39,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_28-model_00-model_states.pt. 0: [2022-11-26 00:33:39,711] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_29-model_00-model_states.pt... 0: [2022-11-26 00:33:39,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_29-model_00-model_states.pt. 0: [2022-11-26 00:33:39,815] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_30-model_00-model_states.pt... 0: [2022-11-26 00:33:39,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_30-model_00-model_states.pt. 0: [2022-11-26 00:33:39,917] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/layer_32-model_00-model_states.pt... 
0: [2022-11-26 00:33:39,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/layer_32-model_00-model_states.pt. 0: [2022-11-26 00:33:39,922] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step16000/mp_rank_00_model_states.pt 0: [2022-11-26 00:33:39,922] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/mp_rank_00_model_states.pt... 0: [2022-11-26 00:33:39,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/mp_rank_00_model_states.pt. 0: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
8: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
10: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
5: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
9: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
6: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
12: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
0: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
12: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:33:39,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step16000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:33:40,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 00:33:40,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 00:33:40,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-26 00:33:40,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:33:40,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 00:33:40,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-26 00:33:40,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:33:40,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 00:33:40,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 10: [2022-11-26 00:33:40,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:33:40,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 00:33:40,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 10: [2022-11-26 00:33:40,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 00:33:40,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 00:33:40,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-26 00:33:40,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:33:40,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:33:40,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 00:33:40,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-26 00:33:40,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:33:40,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 00:33:40,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-26 00:33:40,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:33:40,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 00:33:40,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
9: [2022-11-26 00:33:40,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:33:40,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:33:40,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 6: [2022-11-26 00:33:40,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 9: [2022-11-26 00:33:40,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-26 00:33:40,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 9: [2022-11-26 00:33:40,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:33:40,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:33:40,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 6: [2022-11-26 00:33:40,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 9: [2022-11-26 00:33:40,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-26 00:33:40,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
6: [2022-11-26 00:33:40,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:33:40,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 00:33:40,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 13: [2022-11-26 00:33:40,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:33:40,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 00:33:40,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 14: [2022-11-26 00:33:40,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:33:40,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 00:33:40,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 14: [2022-11-26 00:33:40,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:33:40,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 00:33:40,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
2: [2022-11-26 00:33:40,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:33:40,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 00:33:40,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-26 00:33:40,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:33:40,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 00:33:40,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-26 00:33:40,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:33:40,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 00:33:40,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 10: [2022-11-26 00:33:40,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:33:40,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 00:33:40,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
9: [2022-11-26 00:33:40,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:33:40,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:33:40,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 6: [2022-11-26 00:33:40,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 2: [2022-11-26 00:33:40,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:33:40,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 9: [2022-11-26 00:33:40,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-26 00:33:40,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-26 00:33:40,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 9: [2022-11-26 00:33:40,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:33:40,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 00:33:40,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
7: [2022-11-26 00:33:40,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:33:40,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 00:33:40,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 10: [2022-11-26 00:33:40,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:33:40,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 00:33:40,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 9: [2022-11-26 00:33:40,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:33:40,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 00:33:40,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 10: [2022-11-26 00:33:40,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:33:40,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 00:33:40,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 00:33:40,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 7: [2022-11-26 00:33:40,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:33:40,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 00:33:40,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 7: [2022-11-26 00:33:40,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:33:40,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 00:33:40,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-26 00:33:40,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:33:40,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 00:33:40,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 12: [2022-11-26 00:33:40,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 00:33:40,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 00:33:40,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 12: [2022-11-26 00:33:40,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:33:40,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 00:33:40,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 13: [2022-11-26 00:33:40,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:33:40,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 00:33:40,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 13: [2022-11-26 00:33:40,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:33:40,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 00:33:40,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 13: [2022-11-26 00:33:40,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 00:33:40,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:33:40,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 00:33:40,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 00:33:40,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 13: [2022-11-26 00:33:40,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-26 00:33:40,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:33:40,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 00:33:40,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 13: [2022-11-26 00:33:40,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:33:40,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 00:33:40,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 11: [2022-11-26 00:33:40,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 00:33:40,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:33:40,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:33:40,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 00:33:40,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 9: [2022-11-26 00:33:40,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:33:40,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 00:33:40,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-26 00:33:40,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:33:40,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 00:33:40,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 9: [2022-11-26 00:33:40,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 00:33:40,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 00:33:40,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-26 00:33:40,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:33:40,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 00:33:40,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-26 00:33:40,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:33:40,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:33:40,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:33:40,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 12: [2022-11-26 00:33:40,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 2: [2022-11-26 00:33:40,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
2: [2022-11-26 00:33:40,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 12: [2022-11-26 00:33:40,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-26 00:33:40,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-26 00:33:40,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:33:40,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:33:40,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 00:33:40,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-26 00:33:40,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 00:33:40,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-26 00:33:40,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:33:40,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:33:40,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 00:33:40,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 00:33:40,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 00:33:40,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-26 00:33:40,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-26 00:33:40,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 00:33:40,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-26 00:33:40,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:33:40,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 00:33:40,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 12: [2022-11-26 00:33:40,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:33:40,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 00:33:40,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
10: [2022-11-26 00:33:40,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:33:40,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 00:33:40,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 14: [2022-11-26 00:33:40,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:33:40,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:33:40,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 00:33:40,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 00:33:40,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 14: [2022-11-26 00:33:40,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 14: [2022-11-26 00:33:40,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:33:40,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 00:33:40,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
11: [2022-11-26 00:33:40,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 00:33:40,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 00:33:40,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 00:33:40,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 11: [2022-11-26 00:33:40,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 11: [2022-11-26 00:33:40,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 9: [2022-11-26 00:33:40,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:33:40,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 00:33:40,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 7: [2022-11-26 00:33:40,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:33:40,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 00:33:40,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
7: [2022-11-26 00:33:40,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:33:40,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 00:33:40,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 7: [2022-11-26 00:33:40,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:33:40,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 00:33:40,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 7: [2022-11-26 00:33:40,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:33:40,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 10: [2022-11-26 00:33:40,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:33:40,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 10: [2022-11-26 00:33:40,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 00:33:40,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
1: [2022-11-26 00:33:40,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:33:40,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 00:33:40,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-26 00:33:40,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:33:40,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 00:33:40,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-26 00:33:40,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:33:40,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:33:40,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 00:33:40,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 00:33:40,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-26 00:33:40,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
2: [2022-11-26 00:33:40,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:33:40,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 00:33:40,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-26 00:33:40,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:33:40,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 00:33:40,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 8: [2022-11-26 00:33:40,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:33:40,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:33:40,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:33:40,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:33:40,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:33:40,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 00:33:40,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:33:40,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:33:40,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 00:33:40,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 00:33:40,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 00:33:40,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 00:33:40,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 00:33:40,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 00:33:40,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 00:33:40,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 00:33:40,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
8: [2022-11-26 00:33:40,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 8: [2022-11-26 00:33:40,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 8: [2022-11-26 00:33:40,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 8: [2022-11-26 00:33:40,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 8: [2022-11-26 00:33:40,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 8: [2022-11-26 00:33:40,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 8: [2022-11-26 00:33:40,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-26 00:33:40,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:33:40,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:33:40,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 00:33:40,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 00:33:40,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 1: [2022-11-26 00:33:40,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-26 00:33:40,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 00:33:40,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 00:33:40,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 12: [2022-11-26 00:33:40,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:33:40,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 00:33:40,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 11: [2022-11-26 00:33:40,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:33:40,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:33:40,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 00:33:40,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 11: [2022-11-26 00:33:40,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 00:33:40,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 11: [2022-11-26 00:33:40,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 00:33:40,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:33:40,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 00:33:40,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 00:33:40,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 11: [2022-11-26 00:33:40,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 11: [2022-11-26 00:33:40,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:33:40,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 00:33:40,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 11: [2022-11-26 00:33:40,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:33:40,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 00:33:40,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 12: [2022-11-26 00:33:40,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 00:33:40,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 00:33:40,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 12: [2022-11-26 00:33:40,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:33:40,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 00:33:40,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 10: [2022-11-26 00:33:40,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:33:40,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 00:33:40,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 6: [2022-11-26 00:33:40,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:33:40,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 00:33:40,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-26 00:33:40,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 00:33:40,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 00:33:40,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 5: [2022-11-26 00:33:40,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:33:40,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 00:33:40,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-26 00:33:40,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:33:40,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:33:40,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:33:40,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:33:40,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 00:33:40,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 00:33:40,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 00:33:40,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 00:33:40,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 00:33:40,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 00:33:40,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-26 00:33:40,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-26 00:33:40,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-26 00:33:40,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-26 00:33:40,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 13: [2022-11-26 00:33:40,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 00:33:40,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 00:33:40,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 13: [2022-11-26 00:33:40,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:33:40,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 00:33:40,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 14: [2022-11-26 00:33:40,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:33:40,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 00:33:40,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 0: [2022-11-26 00:33:40,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 00:33:40,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 2: [2022-11-26 00:33:40,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 00:33:40,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 00:33:40,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-26 00:33:40,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:33:40,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:33:40,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:33:40,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:33:40,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:33:40,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 00:33:40,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 00:33:40,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 00:33:40,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 00:33:40,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 00:33:40,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:33:40,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 00:33:40,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 00:33:40,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-26 00:33:40,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-26 00:33:40,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-26 00:33:40,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-26 00:33:40,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
4: [2022-11-26 00:33:40,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 00:33:40,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-26 00:33:40,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 4: [2022-11-26 00:33:40,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:33:40,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 00:33:40,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 14: [2022-11-26 00:33:40,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:33:40,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 00:33:40,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 14: [2022-11-26 00:33:40,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:33:40,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 00:33:40,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
3: [2022-11-26 00:33:40,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:33:40,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 00:33:40,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-26 00:33:40,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:33:40,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 00:33:40,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 3: [2022-11-26 00:33:40,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:33:40,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 00:33:40,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 15: [2022-11-26 00:33:40,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:33:40,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 00:33:40,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 00:33:40,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 00:33:40,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 15: [2022-11-26 00:33:40,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 15: [2022-11-26 00:33:40,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:33:40,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 00:33:40,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 15: [2022-11-26 00:33:40,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:33:40,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 00:33:40,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 15: [2022-11-26 00:33:40,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:33:40,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-26 00:33:40,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:33:40,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 00:33:40,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:33:40,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 00:33:40,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 00:33:40,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 15: [2022-11-26 00:33:40,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 15: [2022-11-26 00:33:40,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 15: [2022-11-26 00:33:40,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step16000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 00:33:40,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step16000 is ready now! 
0: successfully saved checkpoint at iteration 16000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3703.03
15: iteration 16010/ 125429 | consumed samples: 4098560 | consumed tokens: 8393850880 | elapsed time per iteration (s): 1.44 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.194727E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 178.266 | TFLOPs: 29.46 |
15: iteration 16020/ 125429 | consumed samples: 4101120 | consumed tokens: 8399093760 | elapsed time per iteration (s): 1.04 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.198434E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.959 | TFLOPs: 40.65 |
15: iteration 16030/ 125429 | consumed samples: 4103680 | consumed tokens: 8404336640 | elapsed time per iteration (s): 1.05 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.188286E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.834 | TFLOPs: 40.30 |
15: iteration 16040/ 125429 | consumed samples: 4106240 | consumed tokens: 8409579520 | elapsed time per iteration (s): 1.04 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.227012E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.058 | TFLOPs: 40.50 |
15: iteration 16050/ 125429 | consumed samples: 4108800 | consumed tokens: 8414822400 | elapsed time per iteration (s): 1.04 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.168261E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.311 | TFLOPs: 40.54 |
15: iteration 16060/ 125429 | consumed samples: 4111360 | consumed tokens: 8420065280 | elapsed time per iteration (s): 1.06 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.232769E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.483 | TFLOPs: 39.91 |
15: iteration 16070/ 125429 | consumed samples: 4113920 | consumed tokens: 8425308160 | elapsed time per iteration (s): 1.05 | learning rate: 1.938E-04 | global batch size: 256 | lm loss: 2.199684E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.089 | TFLOPs: 40.17 |
15: iteration 16080/ 125429 | consumed samples: 4116480 | consumed tokens: 8430551040 | elapsed time per iteration (s): 1.05 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.194140E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.153 | TFLOPs: 40.35 |
15: iteration 16090/ 125429 | consumed samples: 4119040 | consumed tokens: 8435793920 | elapsed time per iteration (s): 1.03 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.197167E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.529 | TFLOPs: 41.24 |
15: iteration 16100/ 125429 | consumed samples: 4121600 | consumed tokens: 8441036800 | elapsed time per iteration (s): 1.04 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.179863E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.142 | TFLOPs: 40.84 |
15: iteration 16110/ 125429 | consumed samples: 4124160 | consumed tokens: 8446279680 | elapsed time per iteration (s): 1.15 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.211218E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.410 | TFLOPs: 36.76 |
15: iteration 16120/ 125429 | consumed samples: 4126720 | consumed tokens: 8451522560 | elapsed time per iteration (s): 1.08 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.220678E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.073 | TFLOPs: 39.18 |
15: iteration 16130/ 125429 | consumed samples: 4129280 | consumed tokens: 8456765440 | elapsed time per iteration (s): 1.03 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.188683E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.785 | TFLOPs: 41.11 |
15: iteration 16140/ 125429 | consumed samples: 4131840 | consumed tokens: 8462008320 | elapsed time per iteration (s): 1.05 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.174133E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.111 | TFLOPs: 40.18 |
15: iteration 16150/ 125429 | consumed samples: 4134400 | consumed tokens: 8467251200 | elapsed time per iteration (s): 1.06 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.205192E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.882 | TFLOPs: 39.81 |
15: iteration 16160/ 125429 | consumed samples: 4136960 | consumed tokens: 8472494080 | elapsed time per iteration (s): 1.04 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.198693E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.439 | TFLOPs: 40.56 |
15: iteration 16170/ 125429 | consumed samples: 4139520 | consumed tokens: 8477736960 | elapsed time per iteration (s): 1.19 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.185170E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.353 | TFLOPs: 35.59 |
15: iteration 16180/ 125429 | consumed samples: 4142080 | consumed tokens: 8482979840 | elapsed time per iteration (s): 1.18 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.212401E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.639 | TFLOPs: 35.97 |
15: iteration 16190/ 125429 | consumed samples: 4144640 | consumed tokens: 8488222720 | elapsed time per iteration (s): 1.02 | learning rate: 1.937E-04 | global batch size: 256 | lm loss: 2.177953E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.185 | TFLOPs: 41.34 |
15: iteration 16200/ 125429 | consumed samples: 4147200 | consumed tokens: 8493465600 | elapsed time per iteration (s): 1.12 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.192845E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.622 | TFLOPs: 37.62 |
15: iteration 16210/ 125429 | consumed samples: 4149760 | consumed tokens: 8498708480 | elapsed time per iteration (s): 1.04 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.192448E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.728 | TFLOPs: 40.61 |
15: iteration 16220/ 125429 | consumed samples: 4152320 | consumed tokens: 8503951360 | elapsed time per iteration (s): 1.07 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.177811E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.087 | TFLOPs: 39.51 |
15: iteration 16230/ 125429 | consumed samples: 4154880 | consumed tokens: 8509194240 | elapsed time per iteration (s): 1.04 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.192168E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.018 | TFLOPs: 40.66 |
15: iteration 16240/ 125429 | consumed samples: 4157440 | consumed tokens: 8514437120 | elapsed time per iteration (s): 1.06 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.181487E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.236 | TFLOPs: 39.87 |
15: iteration 16250/ 125429 | consumed samples: 4160000 | consumed tokens: 8519680000 | elapsed time per iteration (s): 1.07 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.206122E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.687 | TFLOPs: 39.44 |
15: iteration 16260/ 125429 | consumed samples: 4162560 | consumed tokens: 8524922880 | elapsed time per iteration (s): 2.76 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.185022E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 92.787 | TFLOPs: 15.33 |
15: iteration 16270/ 125429 | consumed samples: 4165120 | consumed tokens: 8530165760 | elapsed time per iteration (s): 1.05 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.176952E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.460 | TFLOPs: 40.40 |
15: iteration 16280/ 125429 | consumed samples: 4167680 | consumed tokens: 8535408640 | elapsed time per iteration (s): 1.03 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.190190E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.764 | TFLOPs: 40.94 |
15: iteration 16290/ 125429 | consumed samples: 4170240 | consumed tokens: 8540651520 | elapsed time per iteration (s): 1.05 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.202076E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.504 | TFLOPs: 40.24 |
15: iteration 16300/ 125429 | consumed samples: 4172800 | consumed tokens: 8545894400 | elapsed time per iteration (s): 1.04 | learning rate: 1.936E-04 | global batch size: 256 | lm loss: 2.185854E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.038 | TFLOPs: 40.49 |
15: iteration 16310/ 125429 | consumed samples: 4175360 | consumed tokens: 8551137280 | elapsed time per iteration (s): 1.03 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.208913E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.086 | TFLOPs: 41.00 |
15: iteration 16320/ 125429 | consumed samples: 4177920 | consumed tokens: 8556380160 | elapsed time per iteration (s): 1.08 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.211322E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.956 | TFLOPs: 38.99 |
15: iteration 16330/ 125429 | consumed samples: 4180480 | consumed tokens: 8561623040 | elapsed time per iteration (s): 1.05 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.182980E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.363 | TFLOPs: 40.22 |
15: iteration 16340/ 125429 | consumed samples: 4183040 | consumed tokens: 8566865920 | elapsed time per iteration (s): 1.04 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.202128E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.233 | TFLOPs: 40.69 |
15: iteration 16350/ 
125429 | consumed samples: 4185600 | consumed tokens: 8572108800 | elapsed time per iteration (s): 1.04 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.188449E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.845 | TFLOPs: 40.79 | 15: iteration 16360/ 125429 | consumed samples: 4188160 | consumed tokens: 8577351680 | elapsed time per iteration (s): 1.07 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.206978E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.641 | TFLOPs: 39.44 | 15: iteration 16370/ 125429 | consumed samples: 4190720 | consumed tokens: 8582594560 | elapsed time per iteration (s): 1.12 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.212985E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.345 | TFLOPs: 37.74 | 15: iteration 16380/ 125429 | consumed samples: 4193280 | consumed tokens: 8587837440 | elapsed time per iteration (s): 1.03 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.213474E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.153 | TFLOPs: 41.17 | 15: iteration 16390/ 125429 | consumed samples: 4195840 | consumed tokens: 8593080320 | elapsed time per iteration (s): 1.02 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.201600E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.467 | TFLOPs: 41.39 | 15: iteration 16400/ 125429 | consumed samples: 4198400 | consumed tokens: 8598323200 | elapsed time per iteration (s): 1.09 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.206650E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 234.266 | TFLOPs: 38.71 | 15: iteration 16410/ 125429 | consumed samples: 4200960 | consumed tokens: 8603566080 | elapsed time per iteration (s): 1.04 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.189451E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.291 | TFLOPs: 40.87 | 15: iteration 16420/ 125429 | consumed samples: 4203520 | consumed tokens: 8608808960 | elapsed time per iteration (s): 1.04 | learning rate: 1.935E-04 | global batch size: 256 | lm loss: 2.186381E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.556 | TFLOPs: 40.75 | 15: iteration 16430/ 125429 | consumed samples: 4206080 | consumed tokens: 8614051840 | elapsed time per iteration (s): 1.04 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.200172E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.078 | TFLOPs: 40.83 | 15: iteration 16440/ 125429 | consumed samples: 4208640 | consumed tokens: 8619294720 | elapsed time per iteration (s): 1.06 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.189704E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.110 | TFLOPs: 40.01 | 15: iteration 16450/ 125429 | consumed samples: 4211200 | consumed tokens: 8624537600 | elapsed time per iteration (s): 1.04 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.181122E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.930 | TFLOPs: 40.81 | 15: iteration 16460/ 125429 | consumed samples: 4213760 | consumed tokens: 8629780480 | elapsed time per iteration (s): 1.02 | learning rate: 1.934E-04 | global batch size: 256 | 
lm loss: 2.184714E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.210 | TFLOPs: 41.35 | 15: iteration 16470/ 125429 | consumed samples: 4216320 | consumed tokens: 8635023360 | elapsed time per iteration (s): 1.02 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.198895E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.794 | TFLOPs: 41.28 | 15: iteration 16480/ 125429 | consumed samples: 4218880 | consumed tokens: 8640266240 | elapsed time per iteration (s): 1.02 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.215002E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.133 | TFLOPs: 41.50 | 15: iteration 16490/ 125429 | consumed samples: 4221440 | consumed tokens: 8645509120 | elapsed time per iteration (s): 1.05 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.180773E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.975 | TFLOPs: 40.48 | 15: iteration 16500/ 125429 | consumed samples: 4224000 | consumed tokens: 8650752000 | elapsed time per iteration (s): 1.06 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.207312E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.016 | TFLOPs: 39.83 | 15: iteration 16510/ 125429 | consumed samples: 4226560 | consumed tokens: 8655994880 | elapsed time per iteration (s): 1.05 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.187239E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.562 | TFLOPs: 40.42 | 15: iteration 16520/ 125429 | consumed samples: 4229120 | consumed tokens: 
8661237760 | elapsed time per iteration (s): 1.03 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.205495E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.932 | TFLOPs: 40.97 | 15: iteration 16530/ 125429 | consumed samples: 4231680 | consumed tokens: 8666480640 | elapsed time per iteration (s): 1.04 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.199579E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.429 | TFLOPs: 40.72 | 15: iteration 16540/ 125429 | consumed samples: 4234240 | consumed tokens: 8671723520 | elapsed time per iteration (s): 1.08 | learning rate: 1.934E-04 | global batch size: 256 | lm loss: 2.188568E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.629 | TFLOPs: 39.10 | 15: iteration 16550/ 125429 | consumed samples: 4236800 | consumed tokens: 8676966400 | elapsed time per iteration (s): 1.03 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.208308E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.812 | TFLOPs: 40.95 | 15: iteration 16560/ 125429 | consumed samples: 4239360 | consumed tokens: 8682209280 | elapsed time per iteration (s): 1.19 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.169384E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.942 | TFLOPs: 35.52 | 15: iteration 16570/ 125429 | consumed samples: 4241920 | consumed tokens: 8687452160 | elapsed time per iteration (s): 1.03 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.174869E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
247.417 | TFLOPs: 40.89 | 15: iteration 16580/ 125429 | consumed samples: 4244480 | consumed tokens: 8692695040 | elapsed time per iteration (s): 1.08 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.208332E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.179 | TFLOPs: 39.03 | 15: iteration 16590/ 125429 | consumed samples: 4247040 | consumed tokens: 8697937920 | elapsed time per iteration (s): 1.04 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.211900E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.057 | TFLOPs: 40.66 | 15: iteration 16600/ 125429 | consumed samples: 4249600 | consumed tokens: 8703180800 | elapsed time per iteration (s): 1.03 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.159892E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.122 | TFLOPs: 41.17 | 15: iteration 16610/ 125429 | consumed samples: 4252160 | consumed tokens: 8708423680 | elapsed time per iteration (s): 1.06 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.214530E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.476 | TFLOPs: 39.91 | 15: iteration 16620/ 125429 | consumed samples: 4254720 | consumed tokens: 8713666560 | elapsed time per iteration (s): 1.04 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.195454E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.284 | TFLOPs: 40.54 | 15: iteration 16630/ 125429 | consumed samples: 4257280 | consumed tokens: 8718909440 | elapsed time per iteration (s): 1.05 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.198669E+00 | grad norm: 0.146 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.205 | TFLOPs: 40.19 | 15: iteration 16640/ 125429 | consumed samples: 4259840 | consumed tokens: 8724152320 | elapsed time per iteration (s): 1.04 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.190374E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.890 | TFLOPs: 40.80 | 15: iteration 16650/ 125429 | consumed samples: 4262400 | consumed tokens: 8729395200 | elapsed time per iteration (s): 1.03 | learning rate: 1.933E-04 | global batch size: 256 | lm loss: 2.177304E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.475 | TFLOPs: 41.06 | 15: iteration 16660/ 125429 | consumed samples: 4264960 | consumed tokens: 8734638080 | elapsed time per iteration (s): 1.04 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.189255E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.932 | TFLOPs: 40.81 | 15: iteration 16670/ 125429 | consumed samples: 4267520 | consumed tokens: 8739880960 | elapsed time per iteration (s): 1.03 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.171890E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.435 | TFLOPs: 41.22 | 15: iteration 16680/ 125429 | consumed samples: 4270080 | consumed tokens: 8745123840 | elapsed time per iteration (s): 1.07 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.197590E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.902 | TFLOPs: 39.48 | 15: iteration 16690/ 125429 | consumed samples: 4272640 | consumed tokens: 8750366720 | elapsed time per iteration (s): 1.08 | 
learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.166736E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.760 | TFLOPs: 39.13 | 15: iteration 16700/ 125429 | consumed samples: 4275200 | consumed tokens: 8755609600 | elapsed time per iteration (s): 1.07 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.202603E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.027 | TFLOPs: 39.67 | 15: iteration 16710/ 125429 | consumed samples: 4277760 | consumed tokens: 8760852480 | elapsed time per iteration (s): 1.03 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.195630E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.331 | TFLOPs: 41.04 | 15: iteration 16720/ 125429 | consumed samples: 4280320 | consumed tokens: 8766095360 | elapsed time per iteration (s): 1.17 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.145227E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 218.467 | TFLOPs: 36.10 | 15: iteration 16730/ 125429 | consumed samples: 4282880 | consumed tokens: 8771338240 | elapsed time per iteration (s): 1.04 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.222186E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.158 | TFLOPs: 40.84 | 15: iteration 16740/ 125429 | consumed samples: 4285440 | consumed tokens: 8776581120 | elapsed time per iteration (s): 1.06 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.194164E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.505 | TFLOPs: 40.08 | 15: iteration 16750/ 125429 | 
consumed samples: 4288000 | consumed tokens: 8781824000 | elapsed time per iteration (s): 1.05 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.199257E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.271 | TFLOPs: 40.20 | 15: iteration 16760/ 125429 | consumed samples: 4290560 | consumed tokens: 8787066880 | elapsed time per iteration (s): 1.04 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.195607E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.235 | TFLOPs: 40.69 | 15: iteration 16770/ 125429 | consumed samples: 4293120 | consumed tokens: 8792309760 | elapsed time per iteration (s): 1.04 | learning rate: 1.932E-04 | global batch size: 256 | lm loss: 2.169089E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.637 | TFLOPs: 40.59 | 15: iteration 16780/ 125429 | consumed samples: 4295680 | consumed tokens: 8797552640 | elapsed time per iteration (s): 1.06 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.192247E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.224 | TFLOPs: 40.03 | 15: iteration 16790/ 125429 | consumed samples: 4298240 | consumed tokens: 8802795520 | elapsed time per iteration (s): 1.07 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.165235E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.402 | TFLOPs: 39.56 | 15: iteration 16800/ 125429 | consumed samples: 4300800 | consumed tokens: 8808038400 | elapsed time per iteration (s): 1.06 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.190905E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 242.604 | TFLOPs: 40.09 | 15: iteration 16810/ 125429 | consumed samples: 4303360 | consumed tokens: 8813281280 | elapsed time per iteration (s): 1.04 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.168429E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.302 | TFLOPs: 40.54 | 15: iteration 16820/ 125429 | consumed samples: 4305920 | consumed tokens: 8818524160 | elapsed time per iteration (s): 1.02 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.166831E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.883 | TFLOPs: 41.30 | 15: iteration 16830/ 125429 | consumed samples: 4308480 | consumed tokens: 8823767040 | elapsed time per iteration (s): 1.04 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.188069E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.696 | TFLOPs: 40.60 | 15: iteration 16840/ 125429 | consumed samples: 4311040 | consumed tokens: 8829009920 | elapsed time per iteration (s): 1.05 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.167128E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.847 | TFLOPs: 40.46 | 15: iteration 16850/ 125429 | consumed samples: 4313600 | consumed tokens: 8834252800 | elapsed time per iteration (s): 1.05 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.185741E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.692 | TFLOPs: 40.27 | 15: iteration 16860/ 125429 | consumed samples: 4316160 | consumed tokens: 8839495680 | elapsed time per iteration (s): 1.10 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 
2.197842E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.594 | TFLOPs: 38.60 | 15: iteration 16870/ 125429 | consumed samples: 4318720 | consumed tokens: 8844738560 | elapsed time per iteration (s): 1.06 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.203495E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.595 | TFLOPs: 40.09 | 15: iteration 16880/ 125429 | consumed samples: 4321280 | consumed tokens: 8849981440 | elapsed time per iteration (s): 1.05 | learning rate: 1.931E-04 | global batch size: 256 | lm loss: 2.166532E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.067 | TFLOPs: 40.17 | 15: iteration 16890/ 125429 | consumed samples: 4323840 | consumed tokens: 8855224320 | elapsed time per iteration (s): 1.07 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.199660E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.560 | TFLOPs: 39.59 | 15: iteration 16900/ 125429 | consumed samples: 4326400 | consumed tokens: 8860467200 | elapsed time per iteration (s): 1.05 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.175305E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.832 | TFLOPs: 40.30 | 15: iteration 16910/ 125429 | consumed samples: 4328960 | consumed tokens: 8865710080 | elapsed time per iteration (s): 1.04 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.167912E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.157 | TFLOPs: 40.84 | 15: iteration 16920/ 125429 | consumed samples: 4331520 | consumed tokens: 8870952960 | 
elapsed time per iteration (s): 1.07 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.165499E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.606 | TFLOPs: 39.60 | 15: iteration 16930/ 125429 | consumed samples: 4334080 | consumed tokens: 8876195840 | elapsed time per iteration (s): 1.06 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.170438E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.458 | TFLOPs: 39.90 | 15: iteration 16940/ 125429 | consumed samples: 4336640 | consumed tokens: 8881438720 | elapsed time per iteration (s): 1.04 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.185897E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.328 | TFLOPs: 40.54 | 15: iteration 16950/ 125429 | consumed samples: 4339200 | consumed tokens: 8886681600 | elapsed time per iteration (s): 1.55 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.166437E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 165.086 | TFLOPs: 27.28 | 15: iteration 16960/ 125429 | consumed samples: 4341760 | consumed tokens: 8891924480 | elapsed time per iteration (s): 1.03 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.183088E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.639 | TFLOPs: 41.09 | 15: iteration 16970/ 125429 | consumed samples: 4344320 | consumed tokens: 8897167360 | elapsed time per iteration (s): 1.03 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.182688E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.374 | TFLOPs: 
41.21 | 15: iteration 16980/ 125429 | consumed samples: 4346880 | consumed tokens: 8902410240 | elapsed time per iteration (s): 1.05 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.211020E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.661 | TFLOPs: 40.27 | 15: iteration 16990/ 125429 | consumed samples: 4349440 | consumed tokens: 8907653120 | elapsed time per iteration (s): 1.04 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.201660E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.390 | TFLOPs: 40.72 | 15: iteration 17000/ 125429 | consumed samples: 4352000 | consumed tokens: 8912896000 | elapsed time per iteration (s): 1.04 | learning rate: 1.930E-04 | global batch size: 256 | lm loss: 2.197619E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.389 | TFLOPs: 40.55 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 17000 | lm loss value: 2.184040E+00 | lm loss PPL: 8.882120E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 17000 to checkpoints_1b5 0: [2022-11-26 00:51:38,260] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step17000 is begin to save! 0: [2022-11-26 00:51:38,288] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_01-model_00-model_states.pt... 0: [2022-11-26 00:51:38,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_01-model_00-model_states.pt. 0: [2022-11-26 00:51:38,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_03-model_00-model_states.pt... 
0: [2022-11-26 00:51:38,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_03-model_00-model_states.pt. 0: [2022-11-26 00:51:38,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_04-model_00-model_states.pt... 0: [2022-11-26 00:51:38,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_04-model_00-model_states.pt. 0: [2022-11-26 00:51:38,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_05-model_00-model_states.pt... 0: [2022-11-26 00:51:38,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_05-model_00-model_states.pt. 0: [2022-11-26 00:51:38,899] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_06-model_00-model_states.pt... 0: [2022-11-26 00:51:39,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_06-model_00-model_states.pt. 0: [2022-11-26 00:51:39,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_07-model_00-model_states.pt... 0: [2022-11-26 00:51:39,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_07-model_00-model_states.pt. 0: [2022-11-26 00:51:39,126] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_08-model_00-model_states.pt... 0: [2022-11-26 00:51:39,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_08-model_00-model_states.pt. 0: [2022-11-26 00:51:39,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_09-model_00-model_states.pt... 
0: [2022-11-26 00:51:39,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_09-model_00-model_states.pt. 0: [2022-11-26 00:51:39,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_10-model_00-model_states.pt... 0: [2022-11-26 00:51:39,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_10-model_00-model_states.pt. 0: [2022-11-26 00:51:39,459] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_11-model_00-model_states.pt... 0: [2022-11-26 00:51:39,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_11-model_00-model_states.pt. 0: [2022-11-26 00:51:39,556] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_12-model_00-model_states.pt... 0: [2022-11-26 00:51:39,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_12-model_00-model_states.pt. 0: [2022-11-26 00:51:39,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_13-model_00-model_states.pt... 0: [2022-11-26 00:51:39,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_13-model_00-model_states.pt. 0: [2022-11-26 00:51:39,762] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_14-model_00-model_states.pt... 0: [2022-11-26 00:51:39,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_14-model_00-model_states.pt. 0: [2022-11-26 00:51:39,863] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_15-model_00-model_states.pt... 
0: [2022-11-26 00:51:39,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_15-model_00-model_states.pt. 0: [2022-11-26 00:51:39,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_16-model_00-model_states.pt... 0: [2022-11-26 00:51:40,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_16-model_00-model_states.pt. 0: [2022-11-26 00:51:40,061] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_17-model_00-model_states.pt... 0: [2022-11-26 00:51:40,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_17-model_00-model_states.pt. 0: [2022-11-26 00:51:40,165] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_18-model_00-model_states.pt... 0: [2022-11-26 00:51:40,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_18-model_00-model_states.pt. 0: [2022-11-26 00:51:40,263] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_19-model_00-model_states.pt... 0: [2022-11-26 00:51:40,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_19-model_00-model_states.pt. 0: [2022-11-26 00:51:40,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_20-model_00-model_states.pt... 0: [2022-11-26 00:51:40,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_20-model_00-model_states.pt. 0: [2022-11-26 00:51:40,464] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_21-model_00-model_states.pt... 
0: [2022-11-26 00:51:40,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_21-model_00-model_states.pt. 0: [2022-11-26 00:51:40,564] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_22-model_00-model_states.pt... 0: [2022-11-26 00:51:40,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_22-model_00-model_states.pt. 0: [2022-11-26 00:51:40,664] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_23-model_00-model_states.pt... 0: [2022-11-26 00:51:40,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_23-model_00-model_states.pt. 0: [2022-11-26 00:51:40,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_24-model_00-model_states.pt... 0: [2022-11-26 00:51:40,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_24-model_00-model_states.pt. 0: [2022-11-26 00:51:40,864] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_25-model_00-model_states.pt... 0: [2022-11-26 00:51:40,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_25-model_00-model_states.pt. 0: [2022-11-26 00:51:40,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_26-model_00-model_states.pt... 0: [2022-11-26 00:51:41,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_26-model_00-model_states.pt. 0: [2022-11-26 00:51:41,071] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_27-model_00-model_states.pt... 
0: [2022-11-26 00:51:41,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_27-model_00-model_states.pt. 0: [2022-11-26 00:51:41,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_28-model_00-model_states.pt... 0: [2022-11-26 00:51:41,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_28-model_00-model_states.pt. 0: [2022-11-26 00:51:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_29-model_00-model_states.pt... 0: [2022-11-26 00:51:41,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_29-model_00-model_states.pt. 0: [2022-11-26 00:51:41,373] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_30-model_00-model_states.pt... 0: [2022-11-26 00:51:41,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_30-model_00-model_states.pt. 0: [2022-11-26 00:51:41,475] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/layer_32-model_00-model_states.pt... 0: [2022-11-26 00:51:41,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/layer_32-model_00-model_states.pt. 0: [2022-11-26 00:51:41,481] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step17000/mp_rank_00_model_states.pt 0: [2022-11-26 00:51:41,481] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/mp_rank_00_model_states.pt... 0: [2022-11-26 00:51:41,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/mp_rank_00_model_states.pt. 
0: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
10: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
15: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
14: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
13: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
8: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
4: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
6: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
3: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 5: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
2: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 6: [2022-11-26 00:51:41,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step17000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 10: [2022-11-26 00:51:41,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:51:41,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 00:51:41,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-26 00:51:41,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 00:51:41,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 00:51:41,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 10: [2022-11-26 00:51:41,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:51:41,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 00:51:41,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 10: [2022-11-26 00:51:41,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:51:41,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 00:51:41,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 7: [2022-11-26 00:51:41,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:51:41,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 00:51:41,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-26 00:51:41,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 00:51:41,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 00:51:41,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 12: [2022-11-26 00:51:41,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:51:41,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 00:51:41,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-26 00:51:41,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:51:41,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 00:51:41,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 12: [2022-11-26 00:51:41,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:51:41,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 00:51:41,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 12: [2022-11-26 00:51:41,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 00:51:41,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 00:51:41,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 9: [2022-11-26 00:51:41,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:51:41,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 00:51:41,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 9: [2022-11-26 00:51:41,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:51:41,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 00:51:41,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-26 00:51:41,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:51:41,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 00:51:41,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 12: [2022-11-26 00:51:41,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 00:51:41,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 00:51:41,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-26 00:51:41,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:51:41,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 00:51:41,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-26 00:51:41,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:51:41,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:51:41,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:51:41,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
3: [2022-11-26 00:51:41,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 10: [2022-11-26 00:51:41,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 00:51:41,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 5: [2022-11-26 00:51:41,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 3: [2022-11-26 00:51:41,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 10: [2022-11-26 00:51:41,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 10: [2022-11-26 00:51:41,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-26 00:51:41,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-26 00:51:41,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:51:41,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 00:51:41,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 7: [2022-11-26 00:51:41,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 00:51:41,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 00:51:41,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 9: [2022-11-26 00:51:41,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:51:41,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 00:51:41,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 10: [2022-11-26 00:51:41,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:51:41,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 00:51:41,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-26 00:51:41,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:51:41,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:51:41,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 00:51:41,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
10: [2022-11-26 00:51:41,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 00:51:41,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 00:51:41,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 7: [2022-11-26 00:51:41,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:51:41,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:51:41,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 00:51:41,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:51:41,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 00:51:41,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 7: [2022-11-26 00:51:41,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 7: [2022-11-26 00:51:41,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 00:51:41,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
3: [2022-11-26 00:51:41,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:51:41,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 00:51:41,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 12: [2022-11-26 00:51:41,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:51:41,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 00:51:41,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 12: [2022-11-26 00:51:41,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:51:41,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 00:51:41,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-26 00:51:41,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:51:41,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 00:51:41,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
7: [2022-11-26 00:51:41,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:51:41,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 00:51:41,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 12: [2022-11-26 00:51:41,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 00:51:41,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 00:51:41,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 12: [2022-11-26 00:51:41,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:51:41,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:51:41,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 12: [2022-11-26 00:51:41,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 7: [2022-11-26 00:51:41,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 12: [2022-11-26 00:51:41,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
7: [2022-11-26 00:51:41,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 00:51:41,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 00:51:41,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-26 00:51:41,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:51:41,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 00:51:41,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-26 00:51:41,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:51:41,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 00:51:41,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:51:41,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-26 00:51:41,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 00:51:41,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 00:51:41,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:51:41,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 00:51:41,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-26 00:51:41,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-26 00:51:41,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 00:51:41,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-26 00:51:41,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:51:41,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 00:51:41,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-26 00:51:41,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 00:51:41,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 00:51:41,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-26 00:51:41,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:51:41,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 00:51:41,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-26 00:51:41,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:51:41,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 00:51:41,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-26 00:51:41,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:51:41,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 00:51:41,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-26 00:51:41,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 00:51:41,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 00:51:41,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 3: [2022-11-26 00:51:41,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:51:41,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 00:51:41,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 13: [2022-11-26 00:51:41,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:51:41,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:51:41,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:51:41,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:51:41,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 00:51:41,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 00:51:41,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 00:51:41,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 00:51:41,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 00:51:41,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 9: [2022-11-26 00:51:41,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 9: [2022-11-26 00:51:41,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 00:51:41,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 00:51:41,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 9: [2022-11-26 00:51:41,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 9: [2022-11-26 00:51:41,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-26 00:51:41,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 00:51:41,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 00:51:41,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
2: [2022-11-26 00:51:41,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:51:41,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 00:51:41,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-26 00:51:41,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:51:41,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:51:41,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:51:41,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:51:41,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 00:51:41,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 00:51:41,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 00:51:41,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 00:51:41,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 00:51:41,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 00:51:41,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 00:51:41,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-26 00:51:41,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-26 00:51:41,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 00:51:41,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-26 00:51:41,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-26 00:51:41,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-26 00:51:41,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
3: [2022-11-26 00:51:41,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 00:51:41,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 00:51:41,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-26 00:51:41,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:51:41,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 00:51:41,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 2: [2022-11-26 00:51:41,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 00:51:41,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 00:51:41,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 13: [2022-11-26 00:51:41,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 00:51:41,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 13: [2022-11-26 00:51:41,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 00:51:41,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 00:51:41,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 13: [2022-11-26 00:51:41,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:51:41,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 00:51:41,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 13: [2022-11-26 00:51:41,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:51:41,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 00:51:41,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 13: [2022-11-26 00:51:41,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:51:41,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 00:51:41,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 13: [2022-11-26 00:51:41,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 00:51:41,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 00:51:41,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 13: [2022-11-26 00:51:41,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:51:41,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 00:51:41,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 13: [2022-11-26 00:51:41,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 00:51:41,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 00:51:41,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-26 00:51:41,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:51:41,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 00:51:41,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 5: [2022-11-26 00:51:41,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 00:51:41,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 00:51:41,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-26 00:51:41,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:51:41,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 00:51:41,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-26 00:51:41,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 00:51:41,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 00:51:41,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 11: [2022-11-26 00:51:41,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:51:41,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 00:51:41,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 11: [2022-11-26 00:51:41,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 00:51:41,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 00:51:41,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 11: [2022-11-26 00:51:41,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:51:41,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:51:41,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:51:41,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 00:51:41,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 00:51:41,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 00:51:41,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 11: [2022-11-26 00:51:41,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 11: [2022-11-26 00:51:41,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 11: [2022-11-26 00:51:41,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-26 00:51:41,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 00:51:41,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 11: [2022-11-26 00:51:41,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:51:41,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 00:51:41,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 00:51:41,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 00:51:41,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 11: [2022-11-26 00:51:41,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: [2022-11-26 00:51:41,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 00:51:41,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 8: [2022-11-26 00:51:41,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:51:41,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 00:51:41,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:51:41,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 00:51:41,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 00:51:41,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 8: [2022-11-26 00:51:41,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 00:51:41,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 8: [2022-11-26 00:51:41,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 8: [2022-11-26 00:51:41,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:51:41,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 00:51:41,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 8: [2022-11-26 00:51:41,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:51:41,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 00:51:41,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:51:41,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 00:51:41,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 00:51:41,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 00:51:41,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 00:51:41,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 00:51:41,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 8: [2022-11-26 00:51:41,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 8: [2022-11-26 00:51:41,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 8: [2022-11-26 00:51:41,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-26 00:51:41,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 00:51:41,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 00:51:41,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-26 00:51:41,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:51:41,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:51:41,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:51:41,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:51:41,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:51:41,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:51:41,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 00:51:41,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 00:51:41,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 00:51:41,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 00:51:41,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 00:51:41,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 00:51:41,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 00:51:41,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 00:51:41,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 00:51:41,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 00:51:41,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-26 00:51:41,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-26 00:51:41,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-26 00:51:41,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
4: [2022-11-26 00:51:41,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-26 00:51:41,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-26 00:51:41,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 4: [2022-11-26 00:51:41,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 14: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 00:51:41,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 00:51:41,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 00:51:41,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 00:51:41,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:51:41,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 15: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 14: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 15: [2022-11-26 00:51:41,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 00:51:41,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 14: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 14: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 15: [2022-11-26 00:51:41,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 00:51:41,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 00:51:41,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 14: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 
14: [2022-11-26 00:51:41,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 15: [2022-11-26 00:51:41,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 14: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 15: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 15: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 15: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 15: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 15: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 15: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 14: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 00:51:41,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 00:51:41,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 14: [2022-11-26 00:51:41,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 15: [2022-11-26 00:51:41,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:51:41,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 00:51:41,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 00:51:41,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 00:51:41,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 15: [2022-11-26 00:51:41,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 1: [2022-11-26 00:51:41,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 00:51:41,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 00:51:41,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-26 00:51:42,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:51:42,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:51:42,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 00:51:42,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 00:51:42,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-26 00:51:42,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-26 00:51:42,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:51:42,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:51:42,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:51:42,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 00:51:42,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 00:51:42,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 00:51:42,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 00:51:42,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 00:51:42,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 00:51:42,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 00:51:42,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-26 00:51:42,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-26 00:51:42,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-26 00:51:42,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-26 00:51:42,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 6: [2022-11-26 00:51:42,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 00:51:42,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step17000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 00:51:42,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step17000 is ready now! 0: successfully saved checkpoint at iteration 17000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3818.01 15: iteration 17010/ 125429 | consumed samples: 4354560 | consumed tokens: 8918138880 | elapsed time per iteration (s): 1.45 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.194977E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 176.465 | TFLOPs: 29.16 | 15: iteration 17020/ 125429 | consumed samples: 4357120 | consumed tokens: 8923381760 | elapsed time per iteration (s): 1.02 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.203975E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.309 | TFLOPs: 41.37 | 15: iteration 17030/ 125429 | consumed samples: 4359680 | consumed tokens: 8928624640 | elapsed time per iteration (s): 1.08 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.202004E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.928 | TFLOPs: 39.15 | 15: iteration 17040/ 125429 | consumed samples: 4362240 | consumed tokens: 8933867520 | elapsed time per iteration (s): 1.03 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.200317E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.550 | TFLOPs: 40.91 | 15: iteration 17050/ 125429 | consumed samples: 4364800 | consumed tokens: 8939110400 | elapsed time per iteration (s): 1.03 | learning rate: 1.929E-04 | global batch size: 256 | lm 
loss: 2.208428E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.471 | TFLOPs: 41.23 | 15: iteration 17060/ 125429 | consumed samples: 4367360 | consumed tokens: 8944353280 | elapsed time per iteration (s): 1.03 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.136554E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.521 | TFLOPs: 40.90 | 15: iteration 17070/ 125429 | consumed samples: 4369920 | consumed tokens: 8949596160 | elapsed time per iteration (s): 1.09 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.189145E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.920 | TFLOPs: 38.66 | 15: iteration 17080/ 125429 | consumed samples: 4372480 | consumed tokens: 8954839040 | elapsed time per iteration (s): 1.05 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.192491E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.316 | TFLOPs: 40.21 | 15: iteration 17090/ 125429 | consumed samples: 4375040 | consumed tokens: 8960081920 | elapsed time per iteration (s): 1.04 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.188495E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.049 | TFLOPs: 40.50 | 15: iteration 17100/ 125429 | consumed samples: 4377600 | consumed tokens: 8965324800 | elapsed time per iteration (s): 1.04 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.185808E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.196 | TFLOPs: 40.52 | 15: iteration 17110/ 125429 | consumed samples: 4380160 | consumed tokens: 8970567680 | 
elapsed time per iteration (s): 1.04 | learning rate: 1.929E-04 | global batch size: 256 | lm loss: 2.175030E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.039 | TFLOPs: 40.49 | 15: iteration 17120/ 125429 | consumed samples: 4382720 | consumed tokens: 8975810560 | elapsed time per iteration (s): 1.04 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.218677E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.771 | TFLOPs: 40.62 | 15: iteration 17130/ 125429 | consumed samples: 4385280 | consumed tokens: 8981053440 | elapsed time per iteration (s): 1.05 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.169599E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.291 | TFLOPs: 40.21 | 15: iteration 17140/ 125429 | consumed samples: 4387840 | consumed tokens: 8986296320 | elapsed time per iteration (s): 1.03 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.181604E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.704 | TFLOPs: 41.27 | 15: iteration 17150/ 125429 | consumed samples: 4390400 | consumed tokens: 8991539200 | elapsed time per iteration (s): 1.06 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.184614E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.594 | TFLOPs: 39.76 | 15: iteration 17160/ 125429 | consumed samples: 4392960 | consumed tokens: 8996782080 | elapsed time per iteration (s): 1.03 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.177962E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.716 | TFLOPs: 
40.94 | 15: iteration 17170/ 125429 | consumed samples: 4395520 | consumed tokens: 9002024960 | elapsed time per iteration (s): 1.06 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.200362E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.482 | TFLOPs: 40.07 | 15: iteration 17180/ 125429 | consumed samples: 4398080 | consumed tokens: 9007267840 | elapsed time per iteration (s): 1.03 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.177237E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.333 | TFLOPs: 41.04 | 15: iteration 17190/ 125429 | consumed samples: 4400640 | consumed tokens: 9012510720 | elapsed time per iteration (s): 1.03 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.203181E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.538 | TFLOPs: 41.07 | 15: iteration 17200/ 125429 | consumed samples: 4403200 | consumed tokens: 9017753600 | elapsed time per iteration (s): 1.05 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.147096E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.853 | TFLOPs: 40.13 | 15: iteration 17210/ 125429 | consumed samples: 4405760 | consumed tokens: 9022996480 | elapsed time per iteration (s): 1.04 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.165006E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.118 | TFLOPs: 40.84 | 15: iteration 17220/ 125429 | consumed samples: 4408320 | consumed tokens: 9028239360 | elapsed time per iteration (s): 1.06 | learning rate: 1.928E-04 | global batch size: 256 | lm loss: 2.164190E+00 | grad norm: 0.142 | num zeros: 0.0 | number 
of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.975 | TFLOPs: 39.99 | 15: iteration 17230/ 125429 | consumed samples: 4410880 | consumed tokens: 9033482240 | elapsed time per iteration (s): 1.03 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.205153E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.546 | TFLOPs: 41.07 | 15: iteration 17240/ 125429 | consumed samples: 4413440 | consumed tokens: 9038725120 | elapsed time per iteration (s): 1.07 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.205427E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.646 | TFLOPs: 39.44 | 15: iteration 17250/ 125429 | consumed samples: 4416000 | consumed tokens: 9043968000 | elapsed time per iteration (s): 1.03 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.173671E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.104 | TFLOPs: 41.00 | 15: iteration 17260/ 125429 | consumed samples: 4418560 | consumed tokens: 9049210880 | elapsed time per iteration (s): 1.04 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.183570E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.136 | TFLOPs: 40.51 | 15: iteration 17270/ 125429 | consumed samples: 4421120 | consumed tokens: 9054453760 | elapsed time per iteration (s): 1.06 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.204398E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.378 | TFLOPs: 39.89 | 15: iteration 17280/ 125429 | consumed samples: 4423680 | consumed tokens: 9059696640 | elapsed time per iteration (s): 1.03 | learning rate: 1.927E-04 
| global batch size: 256 | lm loss: 2.186460E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.023 | TFLOPs: 40.99 | 15: iteration 17290/ 125429 | consumed samples: 4426240 | consumed tokens: 9064939520 | elapsed time per iteration (s): 1.04 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.191320E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.294 | TFLOPs: 40.54 | 15: iteration 17300/ 125429 | consumed samples: 4428800 | consumed tokens: 9070182400 | elapsed time per iteration (s): 1.04 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.194038E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.973 | TFLOPs: 40.65 | 15: iteration 17310/ 125429 | consumed samples: 4431360 | consumed tokens: 9075425280 | elapsed time per iteration (s): 1.07 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.164697E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.335 | TFLOPs: 39.72 | 15: iteration 17320/ 125429 | consumed samples: 4433920 | consumed tokens: 9080668160 | elapsed time per iteration (s): 1.07 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.173144E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.474 | TFLOPs: 39.57 | 15: iteration 17330/ 125429 | consumed samples: 4436480 | consumed tokens: 9085911040 | elapsed time per iteration (s): 1.03 | learning rate: 1.927E-04 | global batch size: 256 | lm loss: 2.195221E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.200 | TFLOPs: 41.02 | 15: iteration 17340/ 125429 | consumed samples: 4439040 | 
consumed tokens: 9091153920 | elapsed time per iteration (s): 1.03 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.171701E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.858 | TFLOPs: 40.96 | 15: iteration 17350/ 125429 | consumed samples: 4441600 | consumed tokens: 9096396800 | elapsed time per iteration (s): 1.04 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.164585E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.125 | TFLOPs: 40.84 | 15: iteration 17360/ 125429 | consumed samples: 4444160 | consumed tokens: 9101639680 | elapsed time per iteration (s): 1.03 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.156996E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.549 | TFLOPs: 40.91 | 15: iteration 17370/ 125429 | consumed samples: 4446720 | consumed tokens: 9106882560 | elapsed time per iteration (s): 1.04 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.162440E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.061 | TFLOPs: 40.66 | 15: iteration 17380/ 125429 | consumed samples: 4449280 | consumed tokens: 9112125440 | elapsed time per iteration (s): 1.05 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.192757E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.991 | TFLOPs: 40.16 | 15: iteration 17390/ 125429 | consumed samples: 4451840 | consumed tokens: 9117368320 | elapsed time per iteration (s): 1.04 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.211193E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 246.049 | TFLOPs: 40.66 | 15: iteration 17400/ 125429 | consumed samples: 4454400 | consumed tokens: 9122611200 | elapsed time per iteration (s): 1.06 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.188590E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.598 | TFLOPs: 39.76 | 15: iteration 17410/ 125429 | consumed samples: 4456960 | consumed tokens: 9127854080 | elapsed time per iteration (s): 1.05 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.178188E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.084 | TFLOPs: 40.34 | 15: iteration 17420/ 125429 | consumed samples: 4459520 | consumed tokens: 9133096960 | elapsed time per iteration (s): 1.03 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.179654E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.496 | TFLOPs: 40.90 | 15: iteration 17430/ 125429 | consumed samples: 4462080 | consumed tokens: 9138339840 | elapsed time per iteration (s): 1.05 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.199294E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.098 | TFLOPs: 40.17 | 15: iteration 17440/ 125429 | consumed samples: 4464640 | consumed tokens: 9143582720 | elapsed time per iteration (s): 1.05 | learning rate: 1.926E-04 | global batch size: 256 | lm loss: 2.166586E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.656 | TFLOPs: 40.43 | 15: iteration 17450/ 125429 | consumed samples: 4467200 | consumed tokens: 9148825600 | elapsed time per iteration (s): 1.06 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.193779E+00 | grad norm: 
0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.331 | TFLOPs: 40.05 | 15: iteration 17460/ 125429 | consumed samples: 4469760 | consumed tokens: 9154068480 | elapsed time per iteration (s): 1.05 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.198981E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.036 | TFLOPs: 40.33 | 15: iteration 17470/ 125429 | consumed samples: 4472320 | consumed tokens: 9159311360 | elapsed time per iteration (s): 1.03 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.151556E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.295 | TFLOPs: 41.20 | 15: iteration 17480/ 125429 | consumed samples: 4474880 | consumed tokens: 9164554240 | elapsed time per iteration (s): 1.07 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.184311E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.505 | TFLOPs: 39.41 | 15: iteration 17490/ 125429 | consumed samples: 4477440 | consumed tokens: 9169797120 | elapsed time per iteration (s): 1.04 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.205292E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.726 | TFLOPs: 40.77 | 15: iteration 17500/ 125429 | consumed samples: 4480000 | consumed tokens: 9175040000 | elapsed time per iteration (s): 1.06 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.159542E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.072 | TFLOPs: 40.00 | 15: iteration 17510/ 125429 | consumed samples: 4482560 | consumed tokens: 9180282880 | elapsed time per iteration (s): 
1.05 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.163275E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.509 | TFLOPs: 40.24 | 15: iteration 17520/ 125429 | consumed samples: 4485120 | consumed tokens: 9185525760 | elapsed time per iteration (s): 1.05 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.184288E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.793 | TFLOPs: 40.45 | 15: iteration 17530/ 125429 | consumed samples: 4487680 | consumed tokens: 9190768640 | elapsed time per iteration (s): 1.05 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.213268E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.306 | TFLOPs: 40.21 | 15: iteration 17540/ 125429 | consumed samples: 4490240 | consumed tokens: 9196011520 | elapsed time per iteration (s): 1.04 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.214761E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.525 | TFLOPs: 40.74 | 15: iteration 17550/ 125429 | consumed samples: 4492800 | consumed tokens: 9201254400 | elapsed time per iteration (s): 1.06 | learning rate: 1.925E-04 | global batch size: 256 | lm loss: 2.189872E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.228 | TFLOPs: 40.03 | 15: iteration 17560/ 125429 | consumed samples: 4495360 | consumed tokens: 9206497280 | elapsed time per iteration (s): 1.03 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.168311E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.973 | TFLOPs: 40.98 | 15: iteration 17570/ 
125429 | consumed samples: 4497920 | consumed tokens: 9211740160 | elapsed time per iteration (s): 1.07 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.168148E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.092 | TFLOPs: 39.68 | 15: iteration 17580/ 125429 | consumed samples: 4500480 | consumed tokens: 9216983040 | elapsed time per iteration (s): 1.06 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.193831E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.103 | TFLOPs: 39.84 | 15: iteration 17590/ 125429 | consumed samples: 4503040 | consumed tokens: 9222225920 | elapsed time per iteration (s): 1.07 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.165062E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.468 | TFLOPs: 39.41 | 15: iteration 17600/ 125429 | consumed samples: 4505600 | consumed tokens: 9227468800 | elapsed time per iteration (s): 1.03 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.200596E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.586 | TFLOPs: 41.25 | 15: iteration 17610/ 125429 | consumed samples: 4508160 | consumed tokens: 9232711680 | elapsed time per iteration (s): 1.05 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.186052E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.595 | TFLOPs: 40.26 | 15: iteration 17620/ 125429 | consumed samples: 4510720 | consumed tokens: 9237954560 | elapsed time per iteration (s): 1.05 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.195765E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 243.155 | TFLOPs: 40.18 | 15: iteration 17630/ 125429 | consumed samples: 4513280 | consumed tokens: 9243197440 | elapsed time per iteration (s): 1.03 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.166727E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.206 | TFLOPs: 41.18 | 15: iteration 17640/ 125429 | consumed samples: 4515840 | consumed tokens: 9248440320 | elapsed time per iteration (s): 1.05 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.166219E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.267 | TFLOPs: 40.37 | 15: iteration 17650/ 125429 | consumed samples: 4518400 | consumed tokens: 9253683200 | elapsed time per iteration (s): 1.05 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.189935E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.863 | TFLOPs: 40.47 | 15: iteration 17660/ 125429 | consumed samples: 4520960 | consumed tokens: 9258926080 | elapsed time per iteration (s): 1.03 | learning rate: 1.924E-04 | global batch size: 256 | lm loss: 2.192532E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.267 | TFLOPs: 41.19 | 15: iteration 17670/ 125429 | consumed samples: 4523520 | consumed tokens: 9264168960 | elapsed time per iteration (s): 1.05 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.142374E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.608 | TFLOPs: 40.26 | 15: iteration 17680/ 125429 | consumed samples: 4526080 | consumed tokens: 9269411840 | elapsed time per iteration (s): 1.05 | learning rate: 1.923E-04 | global batch size: 256 | 
lm loss: 2.178908E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.827 | TFLOPs: 40.46 | 15: iteration 17690/ 125429 | consumed samples: 4528640 | consumed tokens: 9274654720 | elapsed time per iteration (s): 1.05 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.208385E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.119 | TFLOPs: 40.18 | 15: iteration 17700/ 125429 | consumed samples: 4531200 | consumed tokens: 9279897600 | elapsed time per iteration (s): 1.06 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.186682E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.205 | TFLOPs: 40.03 | 15: iteration 17710/ 125429 | consumed samples: 4533760 | consumed tokens: 9285140480 | elapsed time per iteration (s): 1.04 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.177416E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.969 | TFLOPs: 40.81 | 15: iteration 17720/ 125429 | consumed samples: 4536320 | consumed tokens: 9290383360 | elapsed time per iteration (s): 1.05 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.183412E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.139 | TFLOPs: 40.35 | 15: iteration 17730/ 125429 | consumed samples: 4538880 | consumed tokens: 9295626240 | elapsed time per iteration (s): 1.09 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.141962E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.057 | TFLOPs: 38.84 | 15: iteration 17740/ 125429 | consumed samples: 4541440 | consumed tokens: 
9300869120 | elapsed time per iteration (s): 1.07 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.179283E+00 | grad norm: 1.747 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.551 | TFLOPs: 39.59 | 15: iteration 17750/ 125429 | consumed samples: 4544000 | consumed tokens: 9306112000 | elapsed time per iteration (s): 1.13 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.599082E+00 | grad norm: 9.719 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.009 | TFLOPs: 37.52 | 15: iteration 17760/ 125429 | consumed samples: 4546560 | consumed tokens: 9311354880 | elapsed time per iteration (s): 1.07 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.489683E+00 | grad norm: 0.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.217 | TFLOPs: 39.37 | 15: iteration 17770/ 125429 | consumed samples: 4549120 | consumed tokens: 9316597760 | elapsed time per iteration (s): 1.07 | learning rate: 1.923E-04 | global batch size: 256 | lm loss: 2.306040E+00 | grad norm: 0.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.326 | TFLOPs: 39.55 | 15: iteration 17780/ 125429 | consumed samples: 4551680 | consumed tokens: 9321840640 | elapsed time per iteration (s): 1.05 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.250030E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.697 | TFLOPs: 40.11 | 15: iteration 17790/ 125429 | consumed samples: 4554240 | consumed tokens: 9327083520 | elapsed time per iteration (s): 1.05 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.192831E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
243.876 | TFLOPs: 40.30 | 15: iteration 17800/ 125429 | consumed samples: 4556800 | consumed tokens: 9332326400 | elapsed time per iteration (s): 1.05 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.216461E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.745 | TFLOPs: 40.45 | 15: iteration 17810/ 125429 | consumed samples: 4559360 | consumed tokens: 9337569280 | elapsed time per iteration (s): 1.06 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.206311E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.491 | TFLOPs: 39.91 | 15: iteration 17820/ 125429 | consumed samples: 4561920 | consumed tokens: 9342812160 | elapsed time per iteration (s): 1.05 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.196030E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.038 | TFLOPs: 40.16 | 15: iteration 17830/ 125429 | consumed samples: 4564480 | consumed tokens: 9348055040 | elapsed time per iteration (s): 1.07 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.204976E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.662 | TFLOPs: 39.44 | 15: iteration 17840/ 125429 | consumed samples: 4567040 | consumed tokens: 9353297920 | elapsed time per iteration (s): 1.04 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.187155E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.517 | TFLOPs: 40.57 | 15: iteration 17850/ 125429 | consumed samples: 4569600 | consumed tokens: 9358540800 | elapsed time per iteration (s): 1.04 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.186469E+00 | grad norm: 0.151 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.675 | TFLOPs: 40.60 | 15: iteration 17860/ 125429 | consumed samples: 4572160 | consumed tokens: 9363783680 | elapsed time per iteration (s): 1.06 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.174295E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.839 | TFLOPs: 39.97 | 15: iteration 17870/ 125429 | consumed samples: 4574720 | consumed tokens: 9369026560 | elapsed time per iteration (s): 1.06 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.196852E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.408 | TFLOPs: 40.06 | 15: iteration 17880/ 125429 | consumed samples: 4577280 | consumed tokens: 9374269440 | elapsed time per iteration (s): 1.11 | learning rate: 1.922E-04 | global batch size: 256 | lm loss: 2.196444E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.335 | TFLOPs: 38.23 | 15: iteration 17890/ 125429 | consumed samples: 4579840 | consumed tokens: 9379512320 | elapsed time per iteration (s): 1.04 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.167265E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.576 | TFLOPs: 40.58 | 15: iteration 17900/ 125429 | consumed samples: 4582400 | consumed tokens: 9384755200 | elapsed time per iteration (s): 1.06 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.223603E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.265 | TFLOPs: 40.04 | 15: iteration 17910/ 125429 | consumed samples: 4584960 | consumed tokens: 9389998080 | elapsed time per iteration (s): 1.09 | 
learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.137837E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.034 | TFLOPs: 38.84 | 15: iteration 17920/ 125429 | consumed samples: 4587520 | consumed tokens: 9395240960 | elapsed time per iteration (s): 1.11 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.180375E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.698 | TFLOPs: 37.96 | 15: iteration 17930/ 125429 | consumed samples: 4590080 | consumed tokens: 9400483840 | elapsed time per iteration (s): 1.04 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.198350E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.408 | TFLOPs: 40.72 | 15: iteration 17940/ 125429 | consumed samples: 4592640 | consumed tokens: 9405726720 | elapsed time per iteration (s): 1.07 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.184583E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.884 | TFLOPs: 39.64 | 15: iteration 17950/ 125429 | consumed samples: 4595200 | consumed tokens: 9410969600 | elapsed time per iteration (s): 1.03 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.193264E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.204 | TFLOPs: 41.02 | 15: iteration 17960/ 125429 | consumed samples: 4597760 | consumed tokens: 9416212480 | elapsed time per iteration (s): 1.04 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.169204E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.103 | TFLOPs: 40.67 | 15: iteration 17970/ 125429 | 
consumed samples: 4600320 | consumed tokens: 9421455360 | elapsed time per iteration (s): 1.03 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.177784E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.399 | TFLOPs: 41.22 | 15: iteration 17980/ 125429 | consumed samples: 4602880 | consumed tokens: 9426698240 | elapsed time per iteration (s): 1.05 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.195909E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.142 | TFLOPs: 40.35 | 15: iteration 17990/ 125429 | consumed samples: 4605440 | consumed tokens: 9431941120 | elapsed time per iteration (s): 1.08 | learning rate: 1.921E-04 | global batch size: 256 | lm loss: 2.181824E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.154 | TFLOPs: 39.19 | 0: [2022-11-26 01:09:13,626] [INFO] [logging.py:68:log_dist] [Rank 0] step=18000, skipped=0, lr=[0.00019204304299538178, 0.00019204304299538178, 0.00019204304299538178], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 18000/ 125429 | consumed samples: 4608000 | consumed tokens: 9437184000 | elapsed time per iteration (s): 1.11 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.183304E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.814 | TFLOPs: 37.98 | 0: steps: 18000 loss: 2.1472 iter time (s): 1.062 samples/sec: 241.068 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 18000 | lm loss value: 2.046705E+00 | lm loss PPL: 7.742352E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 18000 to checkpoints_1b5 
0: [2022-11-26 01:09:14,074] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step18000 is begin to save! 0: [2022-11-26 01:09:14,082] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_01-model_00-model_states.pt... 0: [2022-11-26 01:09:14,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_01-model_00-model_states.pt. 0: [2022-11-26 01:09:14,377] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_03-model_00-model_states.pt... 0: [2022-11-26 01:09:14,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_03-model_00-model_states.pt. 0: [2022-11-26 01:09:14,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_04-model_00-model_states.pt... 0: [2022-11-26 01:09:14,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_04-model_00-model_states.pt. 0: [2022-11-26 01:09:14,619] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_05-model_00-model_states.pt... 0: [2022-11-26 01:09:14,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_05-model_00-model_states.pt. 0: [2022-11-26 01:09:14,738] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_06-model_00-model_states.pt... 0: [2022-11-26 01:09:14,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_06-model_00-model_states.pt. 0: [2022-11-26 01:09:14,857] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_07-model_00-model_states.pt... 
0: [2022-11-26 01:09:14,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_07-model_00-model_states.pt. 0: [2022-11-26 01:09:14,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_08-model_00-model_states.pt... 0: [2022-11-26 01:09:15,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_08-model_00-model_states.pt. 0: [2022-11-26 01:09:15,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_09-model_00-model_states.pt... 0: [2022-11-26 01:09:15,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_09-model_00-model_states.pt. 0: [2022-11-26 01:09:15,202] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_10-model_00-model_states.pt... 0: [2022-11-26 01:09:15,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_10-model_00-model_states.pt. 0: [2022-11-26 01:09:15,319] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_11-model_00-model_states.pt... 0: [2022-11-26 01:09:15,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_11-model_00-model_states.pt. 0: [2022-11-26 01:09:15,433] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_12-model_00-model_states.pt... 0: [2022-11-26 01:09:15,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_12-model_00-model_states.pt. 0: [2022-11-26 01:09:15,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_13-model_00-model_states.pt... 
0: [2022-11-26 01:09:15,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_13-model_00-model_states.pt. 0: [2022-11-26 01:09:15,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_14-model_00-model_states.pt... 0: [2022-11-26 01:09:15,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_14-model_00-model_states.pt. 0: [2022-11-26 01:09:15,774] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_15-model_00-model_states.pt... 0: [2022-11-26 01:09:15,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_15-model_00-model_states.pt. 0: [2022-11-26 01:09:15,886] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_16-model_00-model_states.pt... 0: [2022-11-26 01:09:15,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_16-model_00-model_states.pt. 0: [2022-11-26 01:09:16,000] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_17-model_00-model_states.pt... 0: [2022-11-26 01:09:16,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_17-model_00-model_states.pt. 0: [2022-11-26 01:09:16,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_18-model_00-model_states.pt... 0: [2022-11-26 01:09:16,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_18-model_00-model_states.pt. 0: [2022-11-26 01:09:16,221] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_19-model_00-model_states.pt... 
0: [2022-11-26 01:09:16,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_19-model_00-model_states.pt. 0: [2022-11-26 01:09:16,333] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_20-model_00-model_states.pt... 0: [2022-11-26 01:09:16,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_20-model_00-model_states.pt. 0: [2022-11-26 01:09:16,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_21-model_00-model_states.pt... 0: [2022-11-26 01:09:16,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_21-model_00-model_states.pt. 0: [2022-11-26 01:09:16,552] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_22-model_00-model_states.pt... 0: [2022-11-26 01:09:16,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_22-model_00-model_states.pt. 0: [2022-11-26 01:09:16,664] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_23-model_00-model_states.pt... 0: [2022-11-26 01:09:16,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_23-model_00-model_states.pt. 0: [2022-11-26 01:09:16,776] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_24-model_00-model_states.pt... 0: [2022-11-26 01:09:16,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_24-model_00-model_states.pt. 0: [2022-11-26 01:09:16,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_25-model_00-model_states.pt... 
0: [2022-11-26 01:09:16,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_25-model_00-model_states.pt. 0: [2022-11-26 01:09:16,998] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_26-model_00-model_states.pt... 0: [2022-11-26 01:09:17,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_26-model_00-model_states.pt. 0: [2022-11-26 01:09:17,110] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_27-model_00-model_states.pt... 0: [2022-11-26 01:09:17,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_27-model_00-model_states.pt. 0: [2022-11-26 01:09:17,220] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_28-model_00-model_states.pt... 0: [2022-11-26 01:09:17,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_28-model_00-model_states.pt. 0: [2022-11-26 01:09:17,329] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_29-model_00-model_states.pt... 0: [2022-11-26 01:09:17,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_29-model_00-model_states.pt. 0: [2022-11-26 01:09:17,438] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_30-model_00-model_states.pt... 0: [2022-11-26 01:09:17,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_30-model_00-model_states.pt. 0: [2022-11-26 01:09:17,548] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/layer_32-model_00-model_states.pt... 
0: [2022-11-26 01:09:17,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/layer_32-model_00-model_states.pt. 0: [2022-11-26 01:09:17,555] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step18000/mp_rank_00_model_states.pt 0: [2022-11-26 01:09:17,555] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/mp_rank_00_model_states.pt... 0: [2022-11-26 01:09:17,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/mp_rank_00_model_states.pt. 0: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
5: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
9: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
14: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
13: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
11: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
0: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
14: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
3: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:09:17,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step18000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:09:17,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 01:09:17,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 01:09:17,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 12: [2022-11-26 01:09:17,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:09:17,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 01:09:17,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-26 01:09:17,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:09:17,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 01:09:17,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-26 01:09:17,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:09:17,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:09:17,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 01:09:17,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 01:09:17,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-26 01:09:17,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 01:09:17,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-26 01:09:17,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:09:17,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 01:09:17,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-26 01:09:17,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:09:17,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 01:09:17,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 12: [2022-11-26 01:09:17,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 01:09:17,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 01:09:17,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 11: [2022-11-26 01:09:17,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:09:17,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 01:09:17,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 15: [2022-11-26 01:09:17,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:09:17,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 01:09:17,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 15: [2022-11-26 01:09:17,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:09:17,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 01:09:17,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-26 01:09:17,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
4: [2022-11-26 01:09:17,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:09:17,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 5: [2022-11-26 01:09:17,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 4: [2022-11-26 01:09:17,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-26 01:09:17,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-26 01:09:17,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:09:17,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 01:09:17,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-26 01:09:17,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:09:17,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:09:17,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
12: [2022-11-26 01:09:17,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 01:09:17,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-26 01:09:17,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 1: [2022-11-26 01:09:17,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:09:17,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:09:17,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-26 01:09:17,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 01:09:17,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-26 01:09:17,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:09:17,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 01:09:17,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-26 01:09:17,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 01:09:17,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 01:09:17,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 13: [2022-11-26 01:09:17,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:09:17,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 8: [2022-11-26 01:09:17,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:09:17,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:09:17,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:09:17,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 8: [2022-11-26 01:09:17,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 01:09:17,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 01:09:17,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 01:09:17,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
8: [2022-11-26 01:09:17,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 8: [2022-11-26 01:09:17,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-26 01:09:17,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:09:17,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:09:17,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 01:09:17,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 01:09:17,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-26 01:09:17,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-26 01:09:17,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:09:17,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 01:09:17,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 01:09:17,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 01:09:17,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-26 01:09:17,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-26 01:09:17,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:09:17,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 01:09:17,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-26 01:09:17,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:09:17,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 01:09:17,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-26 01:09:17,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 01:09:17,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
1: [2022-11-26 01:09:17,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:09:17,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 01:09:17,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-26 01:09:17,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:09:17,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 1: [2022-11-26 01:09:17,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 11: [2022-11-26 01:09:17,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-26 01:09:17,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 11: [2022-11-26 01:09:17,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:09:17,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 01:09:17,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-26 01:09:17,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 01:09:17,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 01:09:17,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-26 01:09:17,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:09:17,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:09:17,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 01:09:17,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 15: [2022-11-26 01:09:17,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:09:17,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:09:17,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 01:09:17,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 01:09:17,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 15: [2022-11-26 01:09:17,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
4: [2022-11-26 01:09:17,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:09:17,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 01:09:17,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-26 01:09:17,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:09:17,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 01:09:17,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 4: [2022-11-26 01:09:17,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:09:17,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 01:09:17,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-26 01:09:17,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:09:17,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 01:09:17,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
4: [2022-11-26 01:09:17,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:09:17,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 01:09:17,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-26 01:09:17,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:09:17,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 01:09:17,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 15: [2022-11-26 01:09:17,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:09:17,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:09:17,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 01:09:17,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 4: [2022-11-26 01:09:17,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 01:09:17,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 01:09:17,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-26 01:09:17,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:09:17,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 01:09:17,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-26 01:09:17,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:09:17,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 01:09:17,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 5: [2022-11-26 01:09:17,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:09:17,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 01:09:17,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 4: [2022-11-26 01:09:17,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 01:09:17,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 01:09:17,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 4: [2022-11-26 01:09:17,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:09:17,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 01:09:17,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 12: [2022-11-26 01:09:17,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:09:17,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:09:17,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:09:17,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 01:09:17,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 01:09:17,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 01:09:17,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 01:09:17,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 01:09:17,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 12: [2022-11-26 01:09:17,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 12: [2022-11-26 01:09:17,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 12: [2022-11-26 01:09:17,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 11: [2022-11-26 01:09:17,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:09:17,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:09:17,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:09:17,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
13: [2022-11-26 01:09:17,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 11: [2022-11-26 01:09:17,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 8: [2022-11-26 01:09:17,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 13: [2022-11-26 01:09:17,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 11: [2022-11-26 01:09:17,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 8: [2022-11-26 01:09:17,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 13: [2022-11-26 01:09:17,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:09:17,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 8: [2022-11-26 01:09:17,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:09:17,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 11: [2022-11-26 01:09:17,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
8: [2022-11-26 01:09:17,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 13: [2022-11-26 01:09:17,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 8: [2022-11-26 01:09:17,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 13: [2022-11-26 01:09:17,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:09:17,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 01:09:17,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-26 01:09:17,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:09:17,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 01:09:17,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-26 01:09:17,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 01:09:17,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-26 01:09:17,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 01:09:17,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 01:09:17,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-26 01:09:17,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:09:17,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 01:09:17,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-26 01:09:17,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:09:17,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 01:09:17,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-26 01:09:17,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:09:17,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 01:09:17,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 11: [2022-11-26 01:09:17,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-26 01:09:17,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:09:17,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:09:17,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 01:09:17,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 1: [2022-11-26 01:09:17,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:09:17,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 01:09:17,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 13: [2022-11-26 01:09:17,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:09:17,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 01:09:17,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 12: [2022-11-26 01:09:17,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 01:09:17,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 01:09:17,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 13: [2022-11-26 01:09:17,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:09:17,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 3: [2022-11-26 01:09:17,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:09:17,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-26 01:09:17,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 01:09:17,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 2: [2022-11-26 01:09:17,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:09:17,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 01:09:17,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-26 01:09:17,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 01:09:17,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 01:09:17,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 14: [2022-11-26 01:09:17,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:09:17,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:09:17,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 01:09:17,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 01:09:17,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 14: [2022-11-26 01:09:17,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:09:17,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 14: [2022-11-26 01:09:17,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 01:09:17,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 14: [2022-11-26 01:09:17,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 01:09:17,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 01:09:17,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-26 01:09:17,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:09:17,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 01:09:17,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 11: [2022-11-26 01:09:17,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 01:09:17,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 01:09:17,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 11: [2022-11-26 01:09:17,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 11: [2022-11-26 01:09:17,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:09:17,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 01:09:17,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
15: [2022-11-26 01:09:17,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 01:09:17,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 15: [2022-11-26 01:09:17,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:09:17,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 01:09:17,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 15: [2022-11-26 01:09:17,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:09:17,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 01:09:17,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-26 01:09:17,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:09:17,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:09:17,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:09:17,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 01:09:17,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 01:09:17,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 01:09:17,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 01:09:17,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 01:09:17,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-26 01:09:17,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-26 01:09:17,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 6: [2022-11-26 01:09:17,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 8: [2022-11-26 01:09:17,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:09:17,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 01:09:17,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-26 01:09:17,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 01:09:17,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:09:17,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:09:17,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:09:17,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 01:09:17,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 01:09:17,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 01:09:17,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 01:09:17,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-26 01:09:17,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-26 01:09:17,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 3: [2022-11-26 01:09:17,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 13: [2022-11-26 01:09:17,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 01:09:17,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 01:09:17,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 11: [2022-11-26 01:09:17,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:09:17,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 01:09:17,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 8: [2022-11-26 01:09:17,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:09:17,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 01:09:17,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 14: [2022-11-26 01:09:17,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:09:17,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 01:09:17,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 14: [2022-11-26 01:09:17,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 01:09:17,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 01:09:17,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 14: [2022-11-26 01:09:17,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:09:17,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 01:09:17,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 14: [2022-11-26 01:09:17,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:09:17,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 01:09:17,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-26 01:09:17,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:09:17,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:09:17,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 01:09:17,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 01:09:17,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 01:09:17,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 01:09:17,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-26 01:09:17,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-26 01:09:17,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-26 01:09:17,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:09:17,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 01:09:17,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-26 01:09:17,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:09:17,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 01:09:17,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
7: [2022-11-26 01:09:17,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:09:17,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 01:09:17,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-26 01:09:17,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:09:17,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 01:09:17,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 7: [2022-11-26 01:09:17,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:09:17,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 01:09:17,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 9: [2022-11-26 01:09:17,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:09:17,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:09:17,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 01:09:17,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:09:17,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:09:17,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:09:17,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:09:17,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:09:17,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 01:09:17,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 01:09:17,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 01:09:17,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 01:09:17,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 01:09:17,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 01:09:17,861] [INFO] 
[engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 01:09:17,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 01:09:17,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 9: [2022-11-26 01:09:17,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 9: [2022-11-26 01:09:17,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 9: [2022-11-26 01:09:17,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 9: [2022-11-26 01:09:17,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 9: [2022-11-26 01:09:17,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 9: [2022-11-26 01:09:17,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 9: [2022-11-26 01:09:17,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: [2022-11-26 01:09:17,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 01:09:17,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 10: [2022-11-26 01:09:17,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:09:17,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 01:09:17,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:09:17,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:09:17,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:09:17,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:09:17,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:09:17,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 01:09:17,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 01:09:17,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 01:09:17,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 01:09:17,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 01:09:17,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 01:09:17,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 01:09:17,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 10: [2022-11-26 01:09:17,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 01:09:17,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step18000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 01:09:17,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 10: [2022-11-26 01:09:17,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 10: [2022-11-26 01:09:17,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 
10: [2022-11-26 01:09:17,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 10: [2022-11-26 01:09:17,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 10: [2022-11-26 01:09:17,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 10: [2022-11-26 01:09:17,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step18000 is ready now! 0: successfully saved checkpoint at iteration 18000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3911.46 15: iteration 18010/ 125429 | consumed samples: 4610560 | consumed tokens: 9442426880 | elapsed time per iteration (s): 1.46 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.200686E+00 | grad norm: 0.202 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 175.827 | TFLOPs: 29.06 | 15: iteration 18020/ 125429 | consumed samples: 4613120 | consumed tokens: 9447669760 | elapsed time per iteration (s): 1.04 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.192451E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.321 | TFLOPs: 40.87 | 15: iteration 18030/ 125429 | consumed samples: 4615680 | consumed tokens: 9452912640 | elapsed time per iteration (s): 1.08 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.160005E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.664 | TFLOPs: 39.28 | 15: iteration 18040/ 125429 | consumed samples: 4618240 | consumed tokens: 9458155520 | elapsed time per iteration (s): 1.05 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.184159E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.169 | 
TFLOPs: 40.35 | 15: iteration 18050/ 125429 | consumed samples: 4620800 | consumed tokens: 9463398400 | elapsed time per iteration (s): 1.05 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.180790E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.034 | TFLOPs: 40.33 | 15: iteration 18060/ 125429 | consumed samples: 4623360 | consumed tokens: 9468641280 | elapsed time per iteration (s): 1.05 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.158718E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.086 | TFLOPs: 40.34 | 15: iteration 18070/ 125429 | consumed samples: 4625920 | consumed tokens: 9473884160 | elapsed time per iteration (s): 1.05 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.191077E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.393 | TFLOPs: 40.39 | 15: iteration 18080/ 125429 | consumed samples: 4628480 | consumed tokens: 9479127040 | elapsed time per iteration (s): 1.05 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.165506E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.407 | TFLOPs: 40.39 | 15: iteration 18090/ 125429 | consumed samples: 4631040 | consumed tokens: 9484369920 | elapsed time per iteration (s): 1.05 | learning rate: 1.920E-04 | global batch size: 256 | lm loss: 2.191883E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.888 | TFLOPs: 40.30 | 15: iteration 18100/ 125429 | consumed samples: 4633600 | consumed tokens: 9489612800 | elapsed time per iteration (s): 1.04 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.175041E+00 | grad norm: 0.141 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.467 | TFLOPs: 40.73 | 15: iteration 18110/ 125429 | consumed samples: 4636160 | consumed tokens: 9494855680 | elapsed time per iteration (s): 1.07 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.206559E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.599 | TFLOPs: 39.43 | 15: iteration 18120/ 125429 | consumed samples: 4638720 | consumed tokens: 9500098560 | elapsed time per iteration (s): 1.02 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.205698E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.964 | TFLOPs: 41.31 | 15: iteration 18130/ 125429 | consumed samples: 4641280 | consumed tokens: 9505341440 | elapsed time per iteration (s): 1.07 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.191797E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.261 | TFLOPs: 39.70 | 15: iteration 18140/ 125429 | consumed samples: 4643840 | consumed tokens: 9510584320 | elapsed time per iteration (s): 1.05 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.163139E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.941 | TFLOPs: 40.48 | 15: iteration 18150/ 125429 | consumed samples: 4646400 | consumed tokens: 9515827200 | elapsed time per iteration (s): 1.04 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.153572E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.968 | TFLOPs: 40.65 | 15: iteration 18160/ 125429 | consumed samples: 4648960 | consumed tokens: 9521070080 | elapsed time per iteration (s): 1.06 | learning rate: 
1.919E-04 | global batch size: 256 | lm loss: 2.137925E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.969 | TFLOPs: 39.99 | 15: iteration 18170/ 125429 | consumed samples: 4651520 | consumed tokens: 9526312960 | elapsed time per iteration (s): 1.05 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.159090E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.848 | TFLOPs: 40.13 | 15: iteration 18180/ 125429 | consumed samples: 4654080 | consumed tokens: 9531555840 | elapsed time per iteration (s): 1.03 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.167132E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.747 | TFLOPs: 40.94 | 15: iteration 18190/ 125429 | consumed samples: 4656640 | consumed tokens: 9536798720 | elapsed time per iteration (s): 1.06 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.162194E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.250 | TFLOPs: 39.87 | 15: iteration 18200/ 125429 | consumed samples: 4659200 | consumed tokens: 9542041600 | elapsed time per iteration (s): 1.07 | learning rate: 1.919E-04 | global batch size: 256 | lm loss: 2.192455E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.715 | TFLOPs: 39.45 | 15: iteration 18210/ 125429 | consumed samples: 4661760 | consumed tokens: 9547284480 | elapsed time per iteration (s): 1.04 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.142864E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.477 | TFLOPs: 40.73 | 15: iteration 18220/ 125429 | consumed samples: 
4664320 | consumed tokens: 9552527360 | elapsed time per iteration (s): 1.08 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.201396E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.800 | TFLOPs: 39.30 | 15: iteration 18230/ 125429 | consumed samples: 4666880 | consumed tokens: 9557770240 | elapsed time per iteration (s): 1.05 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.161088E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.310 | TFLOPs: 40.37 | 15: iteration 18240/ 125429 | consumed samples: 4669440 | consumed tokens: 9563013120 | elapsed time per iteration (s): 1.05 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.153316E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.311 | TFLOPs: 40.37 | 15: iteration 18250/ 125429 | consumed samples: 4672000 | consumed tokens: 9568256000 | elapsed time per iteration (s): 1.03 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.173907E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.550 | TFLOPs: 41.24 | 15: iteration 18260/ 125429 | consumed samples: 4674560 | consumed tokens: 9573498880 | elapsed time per iteration (s): 1.06 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.143089E+00 | grad norm: 0.203 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.560 | TFLOPs: 39.92 | 15: iteration 18270/ 125429 | consumed samples: 4677120 | consumed tokens: 9578741760 | elapsed time per iteration (s): 1.04 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.161002E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | 
samples per second: 245.334 | TFLOPs: 40.54 | 15: iteration 18280/ 125429 | consumed samples: 4679680 | consumed tokens: 9583984640 | elapsed time per iteration (s): 1.04 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.204031E+00 | grad norm: 0.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.880 | TFLOPs: 40.80 | 15: iteration 18290/ 125429 | consumed samples: 4682240 | consumed tokens: 9589227520 | elapsed time per iteration (s): 1.05 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.189057E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.531 | TFLOPs: 40.41 | 15: iteration 18300/ 125429 | consumed samples: 4684800 | consumed tokens: 9594470400 | elapsed time per iteration (s): 1.07 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.203319E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.589 | TFLOPs: 39.59 | 15: iteration 18310/ 125429 | consumed samples: 4687360 | consumed tokens: 9599713280 | elapsed time per iteration (s): 1.04 | learning rate: 1.918E-04 | global batch size: 256 | lm loss: 2.210570E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.244 | TFLOPs: 40.69 | 15: iteration 18320/ 125429 | consumed samples: 4689920 | consumed tokens: 9604956160 | elapsed time per iteration (s): 1.04 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.160456E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.598 | TFLOPs: 40.75 | 15: iteration 18330/ 125429 | consumed samples: 4692480 | consumed tokens: 9610199040 | elapsed time per iteration (s): 1.04 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.179751E+00 | grad 
norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.235 | TFLOPs: 40.53 | 15: iteration 18340/ 125429 | consumed samples: 4695040 | consumed tokens: 9615441920 | elapsed time per iteration (s): 1.07 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.182905E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.275 | TFLOPs: 39.71 | 15: iteration 18350/ 125429 | consumed samples: 4697600 | consumed tokens: 9620684800 | elapsed time per iteration (s): 1.06 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.184930E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.642 | TFLOPs: 39.93 | 15: iteration 18360/ 125429 | consumed samples: 4700160 | consumed tokens: 9625927680 | elapsed time per iteration (s): 2.86 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.151379E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 89.641 | TFLOPs: 14.81 | 15: iteration 18370/ 125429 | consumed samples: 4702720 | consumed tokens: 9631170560 | elapsed time per iteration (s): 1.04 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.209875E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.865 | TFLOPs: 40.63 | 15: iteration 18380/ 125429 | consumed samples: 4705280 | consumed tokens: 9636413440 | elapsed time per iteration (s): 1.04 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.179204E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.324 | TFLOPs: 40.71 | 15: iteration 18390/ 125429 | consumed samples: 4707840 | consumed tokens: 9641656320 | elapsed time per iteration 
(s): 1.07 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.198023E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.940 | TFLOPs: 39.65 | 15: iteration 18400/ 125429 | consumed samples: 4710400 | consumed tokens: 9646899200 | elapsed time per iteration (s): 1.05 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.167764E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.323 | TFLOPs: 40.38 | 15: iteration 18410/ 125429 | consumed samples: 4712960 | consumed tokens: 9652142080 | elapsed time per iteration (s): 1.05 | learning rate: 1.917E-04 | global batch size: 256 | lm loss: 2.147169E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.066 | TFLOPs: 40.17 | 15: iteration 18420/ 125429 | consumed samples: 4715520 | consumed tokens: 9657384960 | elapsed time per iteration (s): 1.04 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.171743E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.449 | TFLOPs: 40.56 | 15: iteration 18430/ 125429 | consumed samples: 4718080 | consumed tokens: 9662627840 | elapsed time per iteration (s): 1.02 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.187986E+00 | grad norm: 0.193 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.439 | TFLOPs: 41.39 | 15: iteration 18440/ 125429 | consumed samples: 4720640 | consumed tokens: 9667870720 | elapsed time per iteration (s): 1.10 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.194507E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.564 | TFLOPs: 38.60 | 15: iteration 18450/ 
125429 | consumed samples: 4723200 | consumed tokens: 9673113600 | elapsed time per iteration (s): 1.08 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.158665E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.943 | TFLOPs: 39.32 | 15: iteration 18460/ 125429 | consumed samples: 4725760 | consumed tokens: 9678356480 | elapsed time per iteration (s): 1.08 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.145637E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.042 | TFLOPs: 39.01 | 15: iteration 18470/ 125429 | consumed samples: 4728320 | consumed tokens: 9683599360 | elapsed time per iteration (s): 1.09 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.164928E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.469 | TFLOPs: 38.75 | 15: iteration 18480/ 125429 | consumed samples: 4730880 | consumed tokens: 9688842240 | elapsed time per iteration (s): 1.07 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.152671E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.103 | TFLOPs: 39.68 | 15: iteration 18490/ 125429 | consumed samples: 4733440 | consumed tokens: 9694085120 | elapsed time per iteration (s): 1.08 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.169969E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.079 | TFLOPs: 39.34 | 15: iteration 18500/ 125429 | consumed samples: 4736000 | consumed tokens: 9699328000 | elapsed time per iteration (s): 1.05 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.179341E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 244.268 | TFLOPs: 40.37 | 15: iteration 18510/ 125429 | consumed samples: 4738560 | consumed tokens: 9704570880 | elapsed time per iteration (s): 1.03 | learning rate: 1.916E-04 | global batch size: 256 | lm loss: 2.167352E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.701 | TFLOPs: 40.93 | 15: iteration 18520/ 125429 | consumed samples: 4741120 | consumed tokens: 9709813760 | elapsed time per iteration (s): 1.05 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.145551E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.799 | TFLOPs: 40.45 | 15: iteration 18530/ 125429 | consumed samples: 4743680 | consumed tokens: 9715056640 | elapsed time per iteration (s): 1.02 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.169655E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.525 | TFLOPs: 41.40 | 15: iteration 18540/ 125429 | consumed samples: 4746240 | consumed tokens: 9720299520 | elapsed time per iteration (s): 1.07 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.188557E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.681 | TFLOPs: 39.44 | 15: iteration 18550/ 125429 | consumed samples: 4748800 | consumed tokens: 9725542400 | elapsed time per iteration (s): 1.03 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.189788E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.151 | TFLOPs: 41.17 | 15: iteration 18560/ 125429 | consumed samples: 4751360 | consumed tokens: 9730785280 | elapsed time per iteration (s): 1.05 | learning rate: 1.915E-04 | global batch size: 256 | 
lm loss: 2.181272E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.957 | TFLOPs: 40.15 | 15: iteration 18570/ 125429 | consumed samples: 4753920 | consumed tokens: 9736028160 | elapsed time per iteration (s): 1.13 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.144173E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.013 | TFLOPs: 37.52 | 15: iteration 18580/ 125429 | consumed samples: 4756480 | consumed tokens: 9741271040 | elapsed time per iteration (s): 1.06 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.184995E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.545 | TFLOPs: 39.75 | 15: iteration 18590/ 125429 | consumed samples: 4759040 | consumed tokens: 9746513920 | elapsed time per iteration (s): 1.04 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.163225E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.916 | TFLOPs: 40.80 | 15: iteration 18600/ 125429 | consumed samples: 4761600 | consumed tokens: 9751756800 | elapsed time per iteration (s): 1.03 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.160863E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.060 | TFLOPs: 41.16 | 15: iteration 18610/ 125429 | consumed samples: 4764160 | consumed tokens: 9756999680 | elapsed time per iteration (s): 1.04 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.178937E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.944 | TFLOPs: 40.64 | 15: iteration 18620/ 125429 | consumed samples: 4766720 | consumed tokens: 
9762242560 | elapsed time per iteration (s): 1.02 | learning rate: 1.915E-04 | global batch size: 256 | lm loss: 2.171563E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.523 | TFLOPs: 41.40 | 15: iteration 18630/ 125429 | consumed samples: 4769280 | consumed tokens: 9767485440 | elapsed time per iteration (s): 1.11 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.181422E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.341 | TFLOPs: 38.07 | 15: iteration 18640/ 125429 | consumed samples: 4771840 | consumed tokens: 9772728320 | elapsed time per iteration (s): 1.04 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.171572E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.042 | TFLOPs: 40.83 | 15: iteration 18650/ 125429 | consumed samples: 4774400 | consumed tokens: 9777971200 | elapsed time per iteration (s): 1.03 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.169953E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.658 | TFLOPs: 40.93 | 15: iteration 18660/ 125429 | consumed samples: 4776960 | consumed tokens: 9783214080 | elapsed time per iteration (s): 1.04 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.193463E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.635 | TFLOPs: 40.59 | 15: iteration 18670/ 125429 | consumed samples: 4779520 | consumed tokens: 9788456960 | elapsed time per iteration (s): 1.06 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.150024E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
241.867 | TFLOPs: 39.97 | 15: iteration 18680/ 125429 | consumed samples: 4782080 | consumed tokens: 9793699840 | elapsed time per iteration (s): 1.05 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.213972E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.229 | TFLOPs: 40.20 | 15: iteration 18690/ 125429 | consumed samples: 4784640 | consumed tokens: 9798942720 | elapsed time per iteration (s): 1.08 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.167429E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.903 | TFLOPs: 39.32 | 15: iteration 18700/ 125429 | consumed samples: 4787200 | consumed tokens: 9804185600 | elapsed time per iteration (s): 1.06 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.176209E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.386 | TFLOPs: 39.73 | 15: iteration 18710/ 125429 | consumed samples: 4789760 | consumed tokens: 9809428480 | elapsed time per iteration (s): 1.03 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.185270E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.384 | TFLOPs: 40.88 | 15: iteration 18720/ 125429 | consumed samples: 4792320 | consumed tokens: 9814671360 | elapsed time per iteration (s): 1.05 | learning rate: 1.914E-04 | global batch size: 256 | lm loss: 2.173659E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.523 | TFLOPs: 40.41 | 15: iteration 18730/ 125429 | consumed samples: 4794880 | consumed tokens: 9819914240 | elapsed time per iteration (s): 1.04 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.172481E+00 | grad norm: 0.149 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.995 | TFLOPs: 40.65 | 15: iteration 18740/ 125429 | consumed samples: 4797440 | consumed tokens: 9825157120 | elapsed time per iteration (s): 1.03 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.181920E+00 | grad norm: 0.258 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.786 | TFLOPs: 41.11 | 15: iteration 18750/ 125429 | consumed samples: 4800000 | consumed tokens: 9830400000 | elapsed time per iteration (s): 1.03 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.172674E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.488 | TFLOPs: 40.90 | 15: iteration 18760/ 125429 | consumed samples: 4802560 | consumed tokens: 9835642880 | elapsed time per iteration (s): 1.10 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.187866E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.931 | TFLOPs: 38.49 | 15: iteration 18770/ 125429 | consumed samples: 4805120 | consumed tokens: 9840885760 | elapsed time per iteration (s): 1.04 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.133642E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.815 | TFLOPs: 40.79 | 15: iteration 18780/ 125429 | consumed samples: 4807680 | consumed tokens: 9846128640 | elapsed time per iteration (s): 1.03 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.191910E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.533 | TFLOPs: 40.91 | 15: iteration 18790/ 125429 | consumed samples: 4810240 | consumed tokens: 9851371520 | elapsed time per iteration (s): 1.06 | 
learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.169219E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.698 | TFLOPs: 39.78 | 15: iteration 18800/ 125429 | consumed samples: 4812800 | consumed tokens: 9856614400 | elapsed time per iteration (s): 1.05 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.177180E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.655 | TFLOPs: 40.10 | 15: iteration 18810/ 125429 | consumed samples: 4815360 | consumed tokens: 9861857280 | elapsed time per iteration (s): 1.02 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.139507E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.658 | TFLOPs: 41.42 | 15: iteration 18820/ 125429 | consumed samples: 4817920 | consumed tokens: 9867100160 | elapsed time per iteration (s): 1.05 | learning rate: 1.913E-04 | global batch size: 256 | lm loss: 2.180970E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.357 | TFLOPs: 40.38 | 15: iteration 18830/ 125429 | consumed samples: 4820480 | consumed tokens: 9872343040 | elapsed time per iteration (s): 1.05 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.173013E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.212 | TFLOPs: 40.19 | 15: iteration 18840/ 125429 | consumed samples: 4823040 | consumed tokens: 9877585920 | elapsed time per iteration (s): 1.04 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.136990E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.449 | TFLOPs: 40.56 | 15: iteration 18850/ 125429 | 
consumed samples: 4825600 | consumed tokens: 9882828800 | elapsed time per iteration (s): 1.03 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.165967E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.516 | TFLOPs: 41.07 | 15: iteration 18860/ 125429 | consumed samples: 4828160 | consumed tokens: 9888071680 | elapsed time per iteration (s): 1.06 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.163955E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.478 | TFLOPs: 40.07 | 15: iteration 18870/ 125429 | consumed samples: 4830720 | consumed tokens: 9893314560 | elapsed time per iteration (s): 1.07 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.171960E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.407 | TFLOPs: 39.56 | 15: iteration 18880/ 125429 | consumed samples: 4833280 | consumed tokens: 9898557440 | elapsed time per iteration (s): 1.06 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.168502E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.419 | TFLOPs: 39.90 | 15: iteration 18890/ 125429 | consumed samples: 4835840 | consumed tokens: 9903800320 | elapsed time per iteration (s): 1.08 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.189452E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.998 | TFLOPs: 39.17 | 15: iteration 18900/ 125429 | consumed samples: 4838400 | consumed tokens: 9909043200 | elapsed time per iteration (s): 1.04 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.160139E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 246.626 | TFLOPs: 40.76 | 15: iteration 18910/ 125429 | consumed samples: 4840960 | consumed tokens: 9914286080 | elapsed time per iteration (s): 1.03 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.173909E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.364 | TFLOPs: 40.88 | 15: iteration 18920/ 125429 | consumed samples: 4843520 | consumed tokens: 9919528960 | elapsed time per iteration (s): 1.03 | learning rate: 1.912E-04 | global batch size: 256 | lm loss: 2.147959E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.848 | TFLOPs: 40.96 | 15: iteration 18930/ 125429 | consumed samples: 4846080 | consumed tokens: 9924771840 | elapsed time per iteration (s): 1.05 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.164215E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.042 | TFLOPs: 40.33 | 15: iteration 18940/ 125429 | consumed samples: 4848640 | consumed tokens: 9930014720 | elapsed time per iteration (s): 1.08 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.205949E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.381 | TFLOPs: 39.23 | 15: iteration 18950/ 125429 | consumed samples: 4851200 | consumed tokens: 9935257600 | elapsed time per iteration (s): 1.04 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.165612E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.668 | TFLOPs: 40.60 | 15: iteration 18960/ 125429 | consumed samples: 4853760 | consumed tokens: 9940500480 | elapsed time per iteration (s): 1.03 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 
2.137908E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.054 | TFLOPs: 40.99 | 15: iteration 18970/ 125429 | consumed samples: 4856320 | consumed tokens: 9945743360 | elapsed time per iteration (s): 1.11 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.187616E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.635 | TFLOPs: 37.95 | 15: iteration 18980/ 125429 | consumed samples: 4858880 | consumed tokens: 9950986240 | elapsed time per iteration (s): 1.05 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.185171E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.764 | TFLOPs: 40.28 | 15: iteration 18990/ 125429 | consumed samples: 4861440 | consumed tokens: 9956229120 | elapsed time per iteration (s): 1.08 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.176439E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.134 | TFLOPs: 39.35 | 15: iteration 19000/ 125429 | consumed samples: 4864000 | consumed tokens: 9961472000 | elapsed time per iteration (s): 1.08 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.189087E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.963 | TFLOPs: 39.33 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 19000 | lm loss value: 2.117363E+00 | lm loss PPL: 8.309199E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 19000 to checkpoints_1b5 0: [2022-11-26 01:27:08,868] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step19000 is begin to save! 0: [2022-11-26 01:27:08,878] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_01-model_00-model_states.pt... 0: [2022-11-26 01:27:09,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_01-model_00-model_states.pt. 0: [2022-11-26 01:27:09,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_03-model_00-model_states.pt... 0: [2022-11-26 01:27:09,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_03-model_00-model_states.pt. 0: [2022-11-26 01:27:09,241] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_04-model_00-model_states.pt... 0: [2022-11-26 01:27:09,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_04-model_00-model_states.pt. 0: [2022-11-26 01:27:09,356] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_05-model_00-model_states.pt... 0: [2022-11-26 01:27:09,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_05-model_00-model_states.pt. 0: [2022-11-26 01:27:09,461] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_06-model_00-model_states.pt... 0: [2022-11-26 01:27:09,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_06-model_00-model_states.pt. 0: [2022-11-26 01:27:09,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_07-model_00-model_states.pt... 0: [2022-11-26 01:27:09,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 01:27:09,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_08-model_00-model_states.pt... 0: [2022-11-26 01:27:09,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_08-model_00-model_states.pt. 0: [2022-11-26 01:27:09,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_09-model_00-model_states.pt... 0: [2022-11-26 01:27:09,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_09-model_00-model_states.pt. 0: [2022-11-26 01:27:09,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_10-model_00-model_states.pt... 0: [2022-11-26 01:27:09,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_10-model_00-model_states.pt. 0: [2022-11-26 01:27:09,983] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_11-model_00-model_states.pt... 0: [2022-11-26 01:27:10,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_11-model_00-model_states.pt. 0: [2022-11-26 01:27:10,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_12-model_00-model_states.pt... 0: [2022-11-26 01:27:10,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_12-model_00-model_states.pt. 0: [2022-11-26 01:27:10,192] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_13-model_00-model_states.pt... 0: [2022-11-26 01:27:10,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 01:27:10,298] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_14-model_00-model_states.pt... 0: [2022-11-26 01:27:10,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_14-model_00-model_states.pt. 0: [2022-11-26 01:27:10,403] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_15-model_00-model_states.pt... 0: [2022-11-26 01:27:10,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_15-model_00-model_states.pt. 0: [2022-11-26 01:27:10,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_16-model_00-model_states.pt... 0: [2022-11-26 01:27:10,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_16-model_00-model_states.pt. 0: [2022-11-26 01:27:10,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_17-model_00-model_states.pt... 0: [2022-11-26 01:27:10,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_17-model_00-model_states.pt. 0: [2022-11-26 01:27:10,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_18-model_00-model_states.pt... 0: [2022-11-26 01:27:10,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_18-model_00-model_states.pt. 0: [2022-11-26 01:27:10,821] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_19-model_00-model_states.pt... 0: [2022-11-26 01:27:10,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 01:27:10,927] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_20-model_00-model_states.pt... 0: [2022-11-26 01:27:11,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_20-model_00-model_states.pt. 0: [2022-11-26 01:27:11,028] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_21-model_00-model_states.pt... 0: [2022-11-26 01:27:11,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_21-model_00-model_states.pt. 0: [2022-11-26 01:27:11,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_22-model_00-model_states.pt... 0: [2022-11-26 01:27:11,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_22-model_00-model_states.pt. 0: [2022-11-26 01:27:11,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_23-model_00-model_states.pt... 0: [2022-11-26 01:27:11,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_23-model_00-model_states.pt. 0: [2022-11-26 01:27:11,339] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_24-model_00-model_states.pt... 0: [2022-11-26 01:27:11,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_24-model_00-model_states.pt. 0: [2022-11-26 01:27:11,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_25-model_00-model_states.pt... 0: [2022-11-26 01:27:11,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 01:27:11,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_26-model_00-model_states.pt... 0: [2022-11-26 01:27:11,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_26-model_00-model_states.pt. 0: [2022-11-26 01:27:11,652] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_27-model_00-model_states.pt... 0: [2022-11-26 01:27:11,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_27-model_00-model_states.pt. 0: [2022-11-26 01:27:11,756] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_28-model_00-model_states.pt... 0: [2022-11-26 01:27:11,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_28-model_00-model_states.pt. 0: [2022-11-26 01:27:11,870] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_29-model_00-model_states.pt... 0: [2022-11-26 01:27:11,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_29-model_00-model_states.pt. 0: [2022-11-26 01:27:11,971] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_30-model_00-model_states.pt... 0: [2022-11-26 01:27:12,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_30-model_00-model_states.pt. 0: [2022-11-26 01:27:12,076] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/layer_32-model_00-model_states.pt... 0: [2022-11-26 01:27:12,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 01:27:12,083] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step19000/mp_rank_00_model_states.pt 0: [2022-11-26 01:27:12,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/mp_rank_00_model_states.pt... 0: [2022-11-26 01:27:12,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/mp_rank_00_model_states.pt. 0: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
1: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
5: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
2: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
3: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
8: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
9: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
6: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
11: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:27:12,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step19000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:27:12,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 01:27:12,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 01:27:12,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 10: [2022-11-26 01:27:12,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:27:12,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 01:27:12,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-26 01:27:12,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:27:12,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 01:27:12,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-26 01:27:12,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:27:12,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 01:27:12,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-26 01:27:12,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
5: [2022-11-26 01:27:12,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:27:12,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 01:27:12,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-26 01:27:12,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:27:12,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 01:27:12,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-26 01:27:12,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:27:12,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 01:27:12,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-26 01:27:12,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 13: [2022-11-26 01:27:12,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:27:12,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
13: [2022-11-26 01:27:12,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 01:27:12,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 13: [2022-11-26 01:27:12,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:27:12,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 01:27:12,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-26 01:27:12,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:27:12,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 01:27:12,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-26 01:27:12,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:27:12,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
5: [2022-11-26 01:27:12,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 1: [2022-11-26 01:27:12,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 5: [2022-11-26 01:27:12,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-26 01:27:12,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-26 01:27:12,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:27:12,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 01:27:12,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 15: [2022-11-26 01:27:12,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:27:12,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 01:27:12,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-26 01:27:12,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 01:27:12,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 01:27:12,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:27:12,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-26 01:27:12,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 01:27:12,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 15: [2022-11-26 01:27:12,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:27:12,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 01:27:12,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-26 01:27:12,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:27:12,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 01:27:12,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 10: [2022-11-26 01:27:12,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 01:27:12,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 01:27:12,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-26 01:27:12,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:27:12,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 01:27:12,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:27:12,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-26 01:27:12,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 01:27:12,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 15: [2022-11-26 01:27:12,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:27:12,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 01:27:12,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-26 01:27:12,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 01:27:12,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 01:27:12,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-26 01:27:12,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:27:12,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 01:27:12,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-26 01:27:12,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:27:12,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 01:27:12,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 15: [2022-11-26 01:27:12,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:27:12,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:27:12,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 01:27:12,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
10: [2022-11-26 01:27:12,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:27:12,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:27:12,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 01:27:12,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 10: [2022-11-26 01:27:12,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 9: [2022-11-26 01:27:12,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:27:12,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:27:12,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 9: [2022-11-26 01:27:12,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 01:27:12,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 01:27:12,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 9: [2022-11-26 01:27:12,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
7: [2022-11-26 01:27:12,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:27:12,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 01:27:12,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 10: [2022-11-26 01:27:12,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:27:12,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:27:12,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:27:12,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:27:12,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 01:27:12,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 01:27:12,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 01:27:12,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-26 01:27:12,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
3: [2022-11-26 01:27:12,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 10: [2022-11-26 01:27:12,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 01:27:12,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 13: [2022-11-26 01:27:12,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:27:12,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 01:27:12,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 13: [2022-11-26 01:27:12,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:27:12,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 10: [2022-11-26 01:27:12,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:27:12,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 13: [2022-11-26 01:27:12,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 01:27:12,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 10: [2022-11-26 01:27:12,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 13: [2022-11-26 01:27:12,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 10: [2022-11-26 01:27:12,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 10: [2022-11-26 01:27:12,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:27:12,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 01:27:12,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-26 01:27:12,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:27:12,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 14: [2022-11-26 01:27:12,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:27:12,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
14: [2022-11-26 01:27:12,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 01:27:12,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-26 01:27:12,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:27:12,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:27:12,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 7: [2022-11-26 01:27:12,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 6: [2022-11-26 01:27:12,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-26 01:27:12,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-26 01:27:12,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:27:12,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 01:27:12,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 14: [2022-11-26 01:27:12,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 01:27:12,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:27:12,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:27:12,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 01:27:12,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 01:27:12,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 01:27:12,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 14: [2022-11-26 01:27:12,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 14: [2022-11-26 01:27:12,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 14: [2022-11-26 01:27:12,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:27:12,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:27:12,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 01:27:12,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 01:27:12,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 01:27:12,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 3: [2022-11-26 01:27:12,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:27:12,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 14: [2022-11-26 01:27:12,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 3: [2022-11-26 01:27:12,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 01:27:12,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 14: [2022-11-26 01:27:12,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-26 01:27:12,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:27:12,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 01:27:12,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
2: [2022-11-26 01:27:12,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:27:12,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 01:27:12,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 9: [2022-11-26 01:27:12,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:27:12,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:27:12,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 01:27:12,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 01:27:12,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 9: [2022-11-26 01:27:12,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 10: [2022-11-26 01:27:12,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:27:12,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 01:27:12,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
13: [2022-11-26 01:27:12,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:27:12,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:27:12,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 01:27:12,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 01:27:12,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 13: [2022-11-26 01:27:12,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-26 01:27:12,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:27:12,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:27:12,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:27:12,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 01:27:12,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 01:27:12,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 01:27:12,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 01:27:12,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 01:27:12,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-26 01:27:12,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-26 01:27:12,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-26 01:27:12,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-26 01:27:12,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:27:12,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 01:27:12,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-26 01:27:12,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
0: [2022-11-26 01:27:12,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:27:12,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 0: [2022-11-26 01:27:12,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:27:12,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 5: [2022-11-26 01:27:12,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-26 01:27:12,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-26 01:27:12,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:27:12,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 01:27:12,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-26 01:27:12,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:27:12,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 01:27:12,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
4: [2022-11-26 01:27:12,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:27:12,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 01:27:12,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-26 01:27:12,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:27:12,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 01:27:12,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 15: [2022-11-26 01:27:12,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 01:27:12,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 15: [2022-11-26 01:27:12,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:27:12,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 01:27:12,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 15: [2022-11-26 01:27:12,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 01:27:12,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 01:27:12,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-26 01:27:12,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:27:12,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:27:12,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 01:27:12,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 01:27:12,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-26 01:27:12,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-26 01:27:12,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:27:12,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 01:27:12,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-26 01:27:12,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 01:27:12,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 01:27:12,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 2: [2022-11-26 01:27:12,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:27:12,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 01:27:12,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-26 01:27:12,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:27:12,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 01:27:12,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-26 01:27:12,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:27:12,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 01:27:12,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-26 01:27:12,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 01:27:12,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 01:27:12,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-26 01:27:12,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:27:12,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 01:27:12,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 15: [2022-11-26 01:27:12,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:27:12,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 01:27:12,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-26 01:27:12,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:27:12,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 01:27:12,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 6: [2022-11-26 01:27:12,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 01:27:12,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 01:27:12,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 9: [2022-11-26 01:27:12,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:27:12,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 01:27:12,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-26 01:27:12,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:27:12,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:27:12,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 01:27:12,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 01:27:12,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-26 01:27:12,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-26 01:27:12,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 01:27:12,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:27:12,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:27:12,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:27:12,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 01:27:12,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 01:27:12,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 01:27:12,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 01:27:12,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-26 01:27:12,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-26 01:27:12,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 1: [2022-11-26 01:27:12,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 4: [2022-11-26 01:27:12,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 01:27:12,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 01:27:12,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 7: [2022-11-26 01:27:12,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:27:12,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 01:27:12,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 13: [2022-11-26 01:27:12,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:27:12,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 01:27:12,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-26 01:27:12,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:27:12,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 01:27:12,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 9: [2022-11-26 01:27:12,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 01:27:12,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 01:27:12,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 9: [2022-11-26 01:27:12,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:27:12,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 01:27:12,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 9: [2022-11-26 01:27:12,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:27:12,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 01:27:12,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 5: [2022-11-26 01:27:12,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:27:12,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 01:27:12,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 14: [2022-11-26 01:27:12,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 01:27:12,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 01:27:12,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 11: [2022-11-26 01:27:12,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:27:12,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:27:12,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:27:12,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:27:12,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:27:12,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:27:12,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:27:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 01:27:12,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 01:27:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 01:27:12,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 11: [2022-11-26 01:27:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 01:27:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 01:27:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 01:27:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 01:27:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 01:27:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 01:27:12,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 11: [2022-11-26 01:27:12,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 11: [2022-11-26 01:27:12,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 11: [2022-11-26 01:27:12,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
11: [2022-11-26 01:27:12,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 11: [2022-11-26 01:27:12,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 11: [2022-11-26 01:27:12,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 15: [2022-11-26 01:27:12,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:27:12,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 01:27:12,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 12: [2022-11-26 01:27:12,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:27:12,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:27:12,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:27:12,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:27:12,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:27:12,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 01:27:12,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:27:12,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:27:12,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 01:27:12,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 01:27:12,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 01:27:12,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 01:27:12,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 01:27:12,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 01:27:12,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 12: [2022-11-26 01:27:12,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
12: [2022-11-26 01:27:12,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 01:27:12,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 12: [2022-11-26 01:27:12,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 01:27:12,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 12: [2022-11-26 01:27:12,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 12: [2022-11-26 01:27:12,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 12: [2022-11-26 01:27:12,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 12: [2022-11-26 01:27:12,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 0: [2022-11-26 01:27:12,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 01:27:12,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 8: [2022-11-26 01:27:12,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:27:12,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:27:12,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 01:27:12,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:27:12,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:27:12,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:27:12,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:27:12,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:27:12,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 01:27:12,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 01:27:12,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 01:27:12,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 01:27:12,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 01:27:12,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 01:27:12,533] [INFO] 
[engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 01:27:12,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 8: [2022-11-26 01:27:12,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 8: [2022-11-26 01:27:12,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step19000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 01:27:12,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 8: [2022-11-26 01:27:12,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 8: [2022-11-26 01:27:12,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 8: [2022-11-26 01:27:12,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 8: [2022-11-26 01:27:12,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 8: [2022-11-26 01:27:12,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step19000 is ready now! 
0: successfully saved checkpoint at iteration 19000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3682.23 15: iteration 19010/ 125429 | consumed samples: 4866560 | consumed tokens: 9966714880 | elapsed time per iteration (s): 1.45 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.191893E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 176.778 | TFLOPs: 29.21 | 15: iteration 19020/ 125429 | consumed samples: 4869120 | consumed tokens: 9971957760 | elapsed time per iteration (s): 1.04 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.184967E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.190 | TFLOPs: 40.68 | 15: iteration 19030/ 125429 | consumed samples: 4871680 | consumed tokens: 9977200640 | elapsed time per iteration (s): 1.03 | learning rate: 1.911E-04 | global batch size: 256 | lm loss: 2.152208E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.044 | TFLOPs: 41.16 | 15: iteration 19040/ 125429 | consumed samples: 4874240 | consumed tokens: 9982443520 | elapsed time per iteration (s): 1.08 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.188837E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.333 | TFLOPs: 39.06 | 15: iteration 19050/ 125429 | consumed samples: 4876800 | consumed tokens: 9987686400 | elapsed time per iteration (s): 1.06 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.157768E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.579 | TFLOPs: 40.09 | 15: iteration 19060/ 125429 | consumed samples: 4879360 | consumed tokens: 9992929280 | elapsed time per iteration (s): 1.06 | learning rate: 
1.910E-04 | global batch size: 256 | lm loss: 2.156668E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.509 | TFLOPs: 39.91 | 15: iteration 19070/ 125429 | consumed samples: 4881920 | consumed tokens: 9998172160 | elapsed time per iteration (s): 1.02 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.145354E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.055 | TFLOPs: 41.32 | 15: iteration 19080/ 125429 | consumed samples: 4884480 | consumed tokens: 10003415040 | elapsed time per iteration (s): 1.06 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.173560E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.281 | TFLOPs: 40.04 | 15: iteration 19090/ 125429 | consumed samples: 4887040 | consumed tokens: 10008657920 | elapsed time per iteration (s): 1.03 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.192544E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.498 | TFLOPs: 41.23 | 15: iteration 19100/ 125429 | consumed samples: 4889600 | consumed tokens: 10013900800 | elapsed time per iteration (s): 1.04 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.194510E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.232 | TFLOPs: 40.53 | 15: iteration 19110/ 125429 | consumed samples: 4892160 | consumed tokens: 10019143680 | elapsed time per iteration (s): 1.02 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.197092E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.016 | TFLOPs: 41.32 | 15: iteration 19120/ 125429 | consumed 
samples: 4894720 | consumed tokens: 10024386560 | elapsed time per iteration (s): 1.08 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.160751E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.649 | TFLOPs: 39.11 | 15: iteration 19130/ 125429 | consumed samples: 4897280 | consumed tokens: 10029629440 | elapsed time per iteration (s): 1.07 | learning rate: 1.910E-04 | global batch size: 256 | lm loss: 2.155007E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.437 | TFLOPs: 39.57 | 15: iteration 19140/ 125429 | consumed samples: 4899840 | consumed tokens: 10034872320 | elapsed time per iteration (s): 1.09 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.159592E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.034 | TFLOPs: 38.84 | 15: iteration 19150/ 125429 | consumed samples: 4902400 | consumed tokens: 10040115200 | elapsed time per iteration (s): 1.04 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.154418E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.029 | TFLOPs: 40.49 | 15: iteration 19160/ 125429 | consumed samples: 4904960 | consumed tokens: 10045358080 | elapsed time per iteration (s): 1.05 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.179743E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.066 | TFLOPs: 40.17 | 15: iteration 19170/ 125429 | consumed samples: 4907520 | consumed tokens: 10050600960 | elapsed time per iteration (s): 1.03 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.156591E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 249.133 | TFLOPs: 41.17 | 15: iteration 19180/ 125429 | consumed samples: 4910080 | consumed tokens: 10055843840 | elapsed time per iteration (s): 1.06 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.172266E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.417 | TFLOPs: 39.73 | 15: iteration 19190/ 125429 | consumed samples: 4912640 | consumed tokens: 10061086720 | elapsed time per iteration (s): 1.09 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.171520E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.668 | TFLOPs: 38.78 | 15: iteration 19200/ 125429 | consumed samples: 4915200 | consumed tokens: 10066329600 | elapsed time per iteration (s): 1.04 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.174248E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.735 | TFLOPs: 40.61 | 15: iteration 19210/ 125429 | consumed samples: 4917760 | consumed tokens: 10071572480 | elapsed time per iteration (s): 1.05 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.163251E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.000 | TFLOPs: 40.32 | 15: iteration 19220/ 125429 | consumed samples: 4920320 | consumed tokens: 10076815360 | elapsed time per iteration (s): 1.04 | learning rate: 1.909E-04 | global batch size: 256 | lm loss: 2.168185E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.277 | TFLOPs: 40.53 | 15: iteration 19230/ 125429 | consumed samples: 4922880 | consumed tokens: 10082058240 | elapsed time per iteration (s): 1.06 | learning rate: 1.909E-04 | global batch size: 256 | lm 
loss: 2.168361E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.948 | TFLOPs: 39.82 | 15: iteration 19240/ 125429 | consumed samples: 4925440 | consumed tokens: 10087301120 | elapsed time per iteration (s): 1.04 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.172797E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.814 | TFLOPs: 40.62 | 15: iteration 19250/ 125429 | consumed samples: 4928000 | consumed tokens: 10092544000 | elapsed time per iteration (s): 1.03 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.195585E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.582 | TFLOPs: 41.08 | 15: iteration 19260/ 125429 | consumed samples: 4930560 | consumed tokens: 10097786880 | elapsed time per iteration (s): 1.34 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.159882E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 191.348 | TFLOPs: 31.62 | 15: iteration 19270/ 125429 | consumed samples: 4933120 | consumed tokens: 10103029760 | elapsed time per iteration (s): 1.03 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.141277E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.028 | TFLOPs: 40.99 | 15: iteration 19280/ 125429 | consumed samples: 4935680 | consumed tokens: 10108272640 | elapsed time per iteration (s): 1.04 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.160271E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.859 | TFLOPs: 40.63 | 15: iteration 19290/ 125429 | consumed samples: 4938240 | consumed tokens: 
10113515520 | elapsed time per iteration (s): 1.05 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.174516E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.024 | TFLOPs: 40.16 | 15: iteration 19300/ 125429 | consumed samples: 4940800 | consumed tokens: 10118758400 | elapsed time per iteration (s): 1.03 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.163926E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.747 | TFLOPs: 40.94 | 15: iteration 19310/ 125429 | consumed samples: 4943360 | consumed tokens: 10124001280 | elapsed time per iteration (s): 1.04 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.156078E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.190 | TFLOPs: 40.68 | 15: iteration 19320/ 125429 | consumed samples: 4945920 | consumed tokens: 10129244160 | elapsed time per iteration (s): 1.04 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.146439E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.084 | TFLOPs: 40.67 | 15: iteration 19330/ 125429 | consumed samples: 4948480 | consumed tokens: 10134487040 | elapsed time per iteration (s): 1.05 | learning rate: 1.908E-04 | global batch size: 256 | lm loss: 2.142122E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.900 | TFLOPs: 40.47 | 15: iteration 19340/ 125429 | consumed samples: 4951040 | consumed tokens: 10139729920 | elapsed time per iteration (s): 1.03 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.167968E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
247.496 | TFLOPs: 40.90 | 15: iteration 19350/ 125429 | consumed samples: 4953600 | consumed tokens: 10144972800 | elapsed time per iteration (s): 1.05 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.159716E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.364 | TFLOPs: 40.22 | 15: iteration 19360/ 125429 | consumed samples: 4956160 | consumed tokens: 10150215680 | elapsed time per iteration (s): 1.05 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.154339E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.063 | TFLOPs: 40.33 | 15: iteration 19370/ 125429 | consumed samples: 4958720 | consumed tokens: 10155458560 | elapsed time per iteration (s): 1.03 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.158099E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.068 | TFLOPs: 41.00 | 15: iteration 19380/ 125429 | consumed samples: 4961280 | consumed tokens: 10160701440 | elapsed time per iteration (s): 1.04 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.141896E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.111 | TFLOPs: 40.51 | 15: iteration 19390/ 125429 | consumed samples: 4963840 | consumed tokens: 10165944320 | elapsed time per iteration (s): 1.06 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.165927E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.819 | TFLOPs: 39.96 | 15: iteration 19400/ 125429 | consumed samples: 4966400 | consumed tokens: 10171187200 | elapsed time per iteration (s): 1.03 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.166687E+00 | grad norm: 0.151 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.512 | TFLOPs: 41.07 | 15: iteration 19410/ 125429 | consumed samples: 4968960 | consumed tokens: 10176430080 | elapsed time per iteration (s): 1.03 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.167098E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.635 | TFLOPs: 40.92 | 15: iteration 19420/ 125429 | consumed samples: 4971520 | consumed tokens: 10181672960 | elapsed time per iteration (s): 1.04 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.174394E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.182 | TFLOPs: 40.85 | 15: iteration 19430/ 125429 | consumed samples: 4974080 | consumed tokens: 10186915840 | elapsed time per iteration (s): 1.03 | learning rate: 1.907E-04 | global batch size: 256 | lm loss: 2.184279E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.111 | TFLOPs: 41.00 | 15: iteration 19440/ 125429 | consumed samples: 4976640 | consumed tokens: 10192158720 | elapsed time per iteration (s): 1.03 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.167204E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.395 | TFLOPs: 41.21 | 15: iteration 19450/ 125429 | consumed samples: 4979200 | consumed tokens: 10197401600 | elapsed time per iteration (s): 1.04 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.181745E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.479 | TFLOPs: 40.73 | 15: iteration 19460/ 125429 | consumed samples: 4981760 | consumed tokens: 10202644480 | elapsed time per iteration (s): 
1.07 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.155452E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.624 | TFLOPs: 39.43 | 15: iteration 19470/ 125429 | consumed samples: 4984320 | consumed tokens: 10207887360 | elapsed time per iteration (s): 1.03 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.140831E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.467 | TFLOPs: 40.90 | 15: iteration 19480/ 125429 | consumed samples: 4986880 | consumed tokens: 10213130240 | elapsed time per iteration (s): 1.03 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.148618E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.867 | TFLOPs: 40.96 | 15: iteration 19490/ 125429 | consumed samples: 4989440 | consumed tokens: 10218373120 | elapsed time per iteration (s): 1.07 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.166324E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.977 | TFLOPs: 39.66 | 15: iteration 19500/ 125429 | consumed samples: 4992000 | consumed tokens: 10223616000 | elapsed time per iteration (s): 1.04 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.151045E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.276 | TFLOPs: 40.53 | 15: iteration 19510/ 125429 | consumed samples: 4994560 | consumed tokens: 10228858880 | elapsed time per iteration (s): 1.05 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.176042E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.255 | TFLOPs: 40.20 | 15: iteration 19520/ 
125429 | consumed samples: 4997120 | consumed tokens: 10234101760 | elapsed time per iteration (s): 1.04 | learning rate: 1.906E-04 | global batch size: 256 | lm loss: 2.174304E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.201 | TFLOPs: 40.69 | 15: iteration 19530/ 125429 | consumed samples: 4999680 | consumed tokens: 10239344640 | elapsed time per iteration (s): 1.04 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.147847E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.268 | TFLOPs: 40.86 | 15: iteration 19540/ 125429 | consumed samples: 5002240 | consumed tokens: 10244587520 | elapsed time per iteration (s): 1.06 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.166465E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.022 | TFLOPs: 39.83 | 15: iteration 19550/ 125429 | consumed samples: 5004800 | consumed tokens: 10249830400 | elapsed time per iteration (s): 1.04 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.189794E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.562 | TFLOPs: 40.75 | 15: iteration 19560/ 125429 | consumed samples: 5007360 | consumed tokens: 10255073280 | elapsed time per iteration (s): 1.03 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.157002E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.566 | TFLOPs: 41.24 | 15: iteration 19570/ 125429 | consumed samples: 5009920 | consumed tokens: 10260316160 | elapsed time per iteration (s): 1.02 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.145998E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | samples per second: 250.959 | TFLOPs: 41.47 | 15: iteration 19580/ 125429 | consumed samples: 5012480 | consumed tokens: 10265559040 | elapsed time per iteration (s): 1.05 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.168964E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.260 | TFLOPs: 40.37 | 15: iteration 19590/ 125429 | consumed samples: 5015040 | consumed tokens: 10270801920 | elapsed time per iteration (s): 1.05 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.144648E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.680 | TFLOPs: 40.44 | 15: iteration 19600/ 125429 | consumed samples: 5017600 | consumed tokens: 10276044800 | elapsed time per iteration (s): 1.05 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.137759E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.872 | TFLOPs: 40.47 | 15: iteration 19610/ 125429 | consumed samples: 5020160 | consumed tokens: 10281287680 | elapsed time per iteration (s): 1.04 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.167233E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.090 | TFLOPs: 40.67 | 15: iteration 19620/ 125429 | consumed samples: 5022720 | consumed tokens: 10286530560 | elapsed time per iteration (s): 1.04 | learning rate: 1.905E-04 | global batch size: 256 | lm loss: 2.173199E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.045 | TFLOPs: 40.66 | 15: iteration 19630/ 125429 | consumed samples: 5025280 | consumed tokens: 10291773440 | elapsed time per iteration (s): 1.04 | learning rate: 1.904E-04 | global batch 
size: 256 | lm loss: 2.170481E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.018 | TFLOPs: 40.82 | 15: iteration 19640/ 125429 | consumed samples: 5027840 | consumed tokens: 10297016320 | elapsed time per iteration (s): 1.02 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.185884E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.478 | TFLOPs: 41.39 | 15: iteration 19650/ 125429 | consumed samples: 5030400 | consumed tokens: 10302259200 | elapsed time per iteration (s): 1.02 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.134654E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.249 | TFLOPs: 41.36 | 15: iteration 19660/ 125429 | consumed samples: 5032960 | consumed tokens: 10307502080 | elapsed time per iteration (s): 1.10 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.164165E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.248 | TFLOPs: 38.38 | 15: iteration 19670/ 125429 | consumed samples: 5035520 | consumed tokens: 10312744960 | elapsed time per iteration (s): 1.04 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.150088E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.030 | TFLOPs: 40.82 | 15: iteration 19680/ 125429 | consumed samples: 5038080 | consumed tokens: 10317987840 | elapsed time per iteration (s): 1.05 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.141510E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.731 | TFLOPs: 40.44 | 15: iteration 19690/ 125429 | consumed samples: 5040640 | consumed 
tokens: 10323230720 | elapsed time per iteration (s): 1.04 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.146965E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.059 | TFLOPs: 40.50 | 15: iteration 19700/ 125429 | consumed samples: 5043200 | consumed tokens: 10328473600 | elapsed time per iteration (s): 1.04 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.156513E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.982 | TFLOPs: 40.65 | 15: iteration 19710/ 125429 | consumed samples: 5045760 | consumed tokens: 10333716480 | elapsed time per iteration (s): 1.05 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.149173E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.837 | TFLOPs: 40.46 | 15: iteration 19720/ 125429 | consumed samples: 5048320 | consumed tokens: 10338959360 | elapsed time per iteration (s): 1.03 | learning rate: 1.904E-04 | global batch size: 256 | lm loss: 2.137574E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.321 | TFLOPs: 41.04 | 15: iteration 19730/ 125429 | consumed samples: 5050880 | consumed tokens: 10344202240 | elapsed time per iteration (s): 1.10 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.154238E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.594 | TFLOPs: 38.44 | 15: iteration 19740/ 125429 | consumed samples: 5053440 | consumed tokens: 10349445120 | elapsed time per iteration (s): 1.03 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.185214E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 247.723 | TFLOPs: 40.94 | 15: iteration 19750/ 125429 | consumed samples: 5056000 | consumed tokens: 10354688000 | elapsed time per iteration (s): 1.08 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.160624E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.171 | TFLOPs: 39.03 | 15: iteration 19760/ 125429 | consumed samples: 5058560 | consumed tokens: 10359930880 | elapsed time per iteration (s): 1.03 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.154351E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.611 | TFLOPs: 41.08 | 15: iteration 19770/ 125429 | consumed samples: 5061120 | consumed tokens: 10365173760 | elapsed time per iteration (s): 1.12 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.160439E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.840 | TFLOPs: 37.82 | 15: iteration 19780/ 125429 | consumed samples: 5063680 | consumed tokens: 10370416640 | elapsed time per iteration (s): 1.05 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.170051E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.664 | TFLOPs: 40.27 | 15: iteration 19790/ 125429 | consumed samples: 5066240 | consumed tokens: 10375659520 | elapsed time per iteration (s): 1.05 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.160083E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.535 | TFLOPs: 40.41 | 15: iteration 19800/ 125429 | consumed samples: 5068800 | consumed tokens: 10380902400 | elapsed time per iteration (s): 1.03 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.134144E+00 | grad norm: 
0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.469 | TFLOPs: 41.23 | 15: iteration 19810/ 125429 | consumed samples: 5071360 | consumed tokens: 10386145280 | elapsed time per iteration (s): 1.10 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.165692E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.026 | TFLOPs: 38.51 | 15: iteration 19820/ 125429 | consumed samples: 5073920 | consumed tokens: 10391388160 | elapsed time per iteration (s): 1.03 | learning rate: 1.903E-04 | global batch size: 256 | lm loss: 2.160253E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.975 | TFLOPs: 40.98 | 15: iteration 19830/ 125429 | consumed samples: 5076480 | consumed tokens: 10396631040 | elapsed time per iteration (s): 1.22 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.134089E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 210.426 | TFLOPs: 34.77 | 15: iteration 19840/ 125429 | consumed samples: 5079040 | consumed tokens: 10401873920 | elapsed time per iteration (s): 1.15 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.140801E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.423 | TFLOPs: 36.76 | 15: iteration 19850/ 125429 | consumed samples: 5081600 | consumed tokens: 10407116800 | elapsed time per iteration (s): 1.10 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.177161E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.807 | TFLOPs: 38.31 | 15: iteration 19860/ 125429 | consumed samples: 5084160 | consumed tokens: 10412359680 | elapsed time per 
iteration (s): 1.11 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.172140E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.291 | TFLOPs: 38.06 | 15: iteration 19870/ 125429 | consumed samples: 5086720 | consumed tokens: 10417602560 | elapsed time per iteration (s): 1.10 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.179789E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.879 | TFLOPs: 38.32 | 15: iteration 19880/ 125429 | consumed samples: 5089280 | consumed tokens: 10422845440 | elapsed time per iteration (s): 1.04 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.157516E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.903 | TFLOPs: 40.64 | 15: iteration 19890/ 125429 | consumed samples: 5091840 | consumed tokens: 10428088320 | elapsed time per iteration (s): 1.05 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.156855E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.196 | TFLOPs: 40.36 | 15: iteration 19900/ 125429 | consumed samples: 5094400 | consumed tokens: 10433331200 | elapsed time per iteration (s): 1.04 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.161859E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.833 | TFLOPs: 40.63 | 15: iteration 19910/ 125429 | consumed samples: 5096960 | consumed tokens: 10438574080 | elapsed time per iteration (s): 1.06 | learning rate: 1.902E-04 | global batch size: 256 | lm loss: 2.176973E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.224 | TFLOPs: 40.03 | 15: 
iteration 19920/ 125429 | consumed samples: 5099520 | consumed tokens: 10443816960 | elapsed time per iteration (s): 1.03 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.183458E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.638 | TFLOPs: 41.25 | 15: iteration 19930/ 125429 | consumed samples: 5102080 | consumed tokens: 10449059840 | elapsed time per iteration (s): 1.08 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.125223E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.376 | TFLOPs: 39.23 | 15: iteration 19940/ 125429 | consumed samples: 5104640 | consumed tokens: 10454302720 | elapsed time per iteration (s): 1.03 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.183063E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.185 | TFLOPs: 41.01 | 15: iteration 19950/ 125429 | consumed samples: 5107200 | consumed tokens: 10459545600 | elapsed time per iteration (s): 1.06 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.136816E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.538 | TFLOPs: 40.08 | 15: iteration 19960/ 125429 | consumed samples: 5109760 | consumed tokens: 10464788480 | elapsed time per iteration (s): 1.03 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.133354E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.385 | TFLOPs: 40.88 | 15: iteration 19970/ 125429 | consumed samples: 5112320 | consumed tokens: 10470031360 | elapsed time per iteration (s): 1.04 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.171421E+00 | grad norm: 0.138 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.982 | TFLOPs: 40.82 | 15: iteration 19980/ 125429 | consumed samples: 5114880 | consumed tokens: 10475274240 | elapsed time per iteration (s): 1.02 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.176104E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.430 | TFLOPs: 41.39 | 15: iteration 19990/ 125429 | consumed samples: 5117440 | consumed tokens: 10480517120 | elapsed time per iteration (s): 1.06 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.171348E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.520 | TFLOPs: 39.75 | 0: [2022-11-26 01:44:46,822] [INFO] [logging.py:68:log_dist] [Rank 0] step=20000, skipped=0, lr=[0.00019006669519151633, 0.00019006669519151633, 0.00019006669519151633], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 20000/ 125429 | consumed samples: 5120000 | consumed tokens: 10485760000 | elapsed time per iteration (s): 1.04 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.189405E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.219 | TFLOPs: 40.52 | 0: steps: 20000 loss: 2.1563 iter time (s): 1.060 samples/sec: 241.543 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 20000 | lm loss value: 2.080214E+00 | lm loss PPL: 8.006180E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 20000 to checkpoints_1b5 0: [2022-11-26 01:44:47,363] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step20000 is begin to save! 
0: [2022-11-26 01:44:47,373] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_01-model_00-model_states.pt... 0: [2022-11-26 01:44:47,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_01-model_00-model_states.pt. 0: [2022-11-26 01:44:47,630] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_03-model_00-model_states.pt... 0: [2022-11-26 01:44:47,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_03-model_00-model_states.pt. 0: [2022-11-26 01:44:47,749] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_04-model_00-model_states.pt... 0: [2022-11-26 01:44:47,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_04-model_00-model_states.pt. 0: [2022-11-26 01:44:47,856] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_05-model_00-model_states.pt... 0: [2022-11-26 01:44:47,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_05-model_00-model_states.pt. 0: [2022-11-26 01:44:47,965] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_06-model_00-model_states.pt... 0: [2022-11-26 01:44:48,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_06-model_00-model_states.pt. 0: [2022-11-26 01:44:48,074] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_07-model_00-model_states.pt... 0: [2022-11-26 01:44:48,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 01:44:48,180] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_08-model_00-model_states.pt... 0: [2022-11-26 01:44:48,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_08-model_00-model_states.pt. 0: [2022-11-26 01:44:48,294] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_09-model_00-model_states.pt... 0: [2022-11-26 01:44:48,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_09-model_00-model_states.pt. 0: [2022-11-26 01:44:48,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_10-model_00-model_states.pt... 0: [2022-11-26 01:44:48,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_10-model_00-model_states.pt. 0: [2022-11-26 01:44:48,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_11-model_00-model_states.pt... 0: [2022-11-26 01:44:48,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_11-model_00-model_states.pt. 0: [2022-11-26 01:44:48,649] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_12-model_00-model_states.pt... 0: [2022-11-26 01:44:48,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_12-model_00-model_states.pt. 0: [2022-11-26 01:44:48,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_13-model_00-model_states.pt... 0: [2022-11-26 01:44:48,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 01:44:48,881] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_14-model_00-model_states.pt... 0: [2022-11-26 01:44:48,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_14-model_00-model_states.pt. 0: [2022-11-26 01:44:48,996] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_15-model_00-model_states.pt... 0: [2022-11-26 01:44:49,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_15-model_00-model_states.pt. 0: [2022-11-26 01:44:49,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_16-model_00-model_states.pt... 0: [2022-11-26 01:44:49,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_16-model_00-model_states.pt. 0: [2022-11-26 01:44:49,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_17-model_00-model_states.pt... 0: [2022-11-26 01:44:49,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_17-model_00-model_states.pt. 0: [2022-11-26 01:44:49,349] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_18-model_00-model_states.pt... 0: [2022-11-26 01:44:49,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_18-model_00-model_states.pt. 0: [2022-11-26 01:44:49,464] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_19-model_00-model_states.pt... 0: [2022-11-26 01:44:49,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 01:44:49,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_20-model_00-model_states.pt... 0: [2022-11-26 01:44:49,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_20-model_00-model_states.pt. 0: [2022-11-26 01:44:49,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_21-model_00-model_states.pt... 0: [2022-11-26 01:44:49,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_21-model_00-model_states.pt. 0: [2022-11-26 01:44:49,804] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_22-model_00-model_states.pt... 0: [2022-11-26 01:44:49,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_22-model_00-model_states.pt. 0: [2022-11-26 01:44:49,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_23-model_00-model_states.pt... 0: [2022-11-26 01:44:50,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_23-model_00-model_states.pt. 0: [2022-11-26 01:44:50,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_24-model_00-model_states.pt... 0: [2022-11-26 01:44:50,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_24-model_00-model_states.pt. 0: [2022-11-26 01:44:50,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_25-model_00-model_states.pt... 0: [2022-11-26 01:44:50,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 01:44:50,242] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_26-model_00-model_states.pt... 0: [2022-11-26 01:44:50,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_26-model_00-model_states.pt. 0: [2022-11-26 01:44:50,351] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_27-model_00-model_states.pt... 0: [2022-11-26 01:44:50,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_27-model_00-model_states.pt. 0: [2022-11-26 01:44:50,459] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_28-model_00-model_states.pt... 0: [2022-11-26 01:44:50,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_28-model_00-model_states.pt. 0: [2022-11-26 01:44:50,575] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_29-model_00-model_states.pt... 0: [2022-11-26 01:44:50,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_29-model_00-model_states.pt. 0: [2022-11-26 01:44:50,681] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_30-model_00-model_states.pt... 0: [2022-11-26 01:44:50,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_30-model_00-model_states.pt. 0: [2022-11-26 01:44:50,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/layer_32-model_00-model_states.pt... 0: [2022-11-26 01:44:50,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 01:44:50,794] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step20000/mp_rank_00_model_states.pt 0: [2022-11-26 01:44:50,795] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/mp_rank_00_model_states.pt... 0: [2022-11-26 01:44:50,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/mp_rank_00_model_states.pt. 0: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
5: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
9: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
14: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
3: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
11: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
0: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
4: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 9: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 8: [2022-11-26 01:44:50,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step20000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 01:44:51,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 01:44:51,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 01:44:51,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 9: [2022-11-26 01:44:51,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:44:51,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 01:44:51,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 11: [2022-11-26 01:44:51,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:44:51,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:44:51,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 01:44:51,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-26 01:44:51,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:44:51,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 01:44:51,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
10: [2022-11-26 01:44:51,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:44:51,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 01:44:51,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-26 01:44:51,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:44:51,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 12: [2022-11-26 01:44:51,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:44:51,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 01:44:51,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 5: [2022-11-26 01:44:51,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:44:51,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 01:44:51,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 13: [2022-11-26 01:44:51,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 01:44:51,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 01:44:51,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 5: [2022-11-26 01:44:51,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:44:51,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 01:44:51,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 6: [2022-11-26 01:44:51,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:44:51,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 01:44:51,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 13: [2022-11-26 01:44:51,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:44:51,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 01:44:51,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 12: [2022-11-26 01:44:51,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 01:44:51,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 01:44:51,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 12: [2022-11-26 01:44:51,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:44:51,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 01:44:51,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 13: [2022-11-26 01:44:51,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:44:51,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 01:44:51,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 6: [2022-11-26 01:44:51,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:44:51,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 01:44:51,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 5: [2022-11-26 01:44:51,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 01:44:51,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 01:44:51,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 13: [2022-11-26 01:44:51,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:44:51,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 01:44:51,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 6: [2022-11-26 01:44:51,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:44:51,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:44:51,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 01:44:51,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 01:44:51,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 6: [2022-11-26 01:44:51,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-26 01:44:51,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 01:44:51,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 01:44:51,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 5: [2022-11-26 01:44:51,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:44:51,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 01:44:51,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-26 01:44:51,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:44:51,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 01:44:51,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:44:51,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-26 01:44:51,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:44:51,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 01:44:51,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
12: [2022-11-26 01:44:51,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:44:51,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 01:44:51,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-26 01:44:51,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:44:51,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:44:51,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 01:44:51,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-26 01:44:51,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 01:44:51,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-26 01:44:51,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-26 01:44:51,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 01:44:51,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 01:44:51,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-26 01:44:51,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:44:51,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 01:44:51,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 10: [2022-11-26 01:44:51,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:44:51,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 01:44:51,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-26 01:44:51,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:44:51,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:44:51,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 01:44:51,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
12: [2022-11-26 01:44:51,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:44:51,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 01:44:51,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 12: [2022-11-26 01:44:51,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:44:51,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 01:44:51,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-26 01:44:51,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:44:51,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 01:44:51,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 01:44:51,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-26 01:44:51,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 12: [2022-11-26 01:44:51,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 01:44:51,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 01:44:51,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-26 01:44:51,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:44:51,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 01:44:51,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:44:51,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-26 01:44:51,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 01:44:51,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 9: [2022-11-26 01:44:51,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:44:51,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:44:51,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 01:44:51,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 01:44:51,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 01:44:51,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 9: [2022-11-26 01:44:51,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 01:44:51,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 9: [2022-11-26 01:44:51,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-26 01:44:51,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:44:51,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 01:44:51,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 01:44:51,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 01:44:51,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 1: [2022-11-26 01:44:51,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
6: [2022-11-26 01:44:51,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:44:51,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 01:44:51,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 10: [2022-11-26 01:44:51,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:44:51,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 01:44:51,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 5: [2022-11-26 01:44:51,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:44:51,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:44:51,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 01:44:51,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 01:44:51,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 5: [2022-11-26 01:44:51,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
12: [2022-11-26 01:44:51,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 01:44:51,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 01:44:51,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-26 01:44:51,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:44:51,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 01:44:51,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 13: [2022-11-26 01:44:51,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:44:51,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:44:51,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:44:51,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:44:51,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 01:44:51,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
13: [2022-11-26 01:44:51,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:44:51,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 13: [2022-11-26 01:44:51,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 8: [2022-11-26 01:44:51,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 13: [2022-11-26 01:44:51,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 8: [2022-11-26 01:44:51,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 01:44:51,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 8: [2022-11-26 01:44:51,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 8: [2022-11-26 01:44:51,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 8: [2022-11-26 01:44:51,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:44:51,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 01:44:51,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
8: [2022-11-26 01:44:51,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:44:51,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 01:44:51,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 8: [2022-11-26 01:44:51,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:44:51,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 01:44:51,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 8: [2022-11-26 01:44:51,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:44:51,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 0: [2022-11-26 01:44:51,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:44:51,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-26 01:44:51,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 6: [2022-11-26 01:44:51,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
0: [2022-11-26 01:44:51,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 6: [2022-11-26 01:44:51,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 01:44:51,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-26 01:44:51,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:44:51,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 01:44:51,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-26 01:44:51,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 01:44:51,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 01:44:51,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 6: [2022-11-26 01:44:51,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:44:51,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 01:44:51,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
14: [2022-11-26 01:44:51,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:44:51,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 01:44:51,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 14: [2022-11-26 01:44:51,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:44:51,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 01:44:51,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 14: [2022-11-26 01:44:51,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:44:51,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 01:44:51,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 14: [2022-11-26 01:44:51,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:44:51,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 01:44:51,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
14: [2022-11-26 01:44:51,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:44:51,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 01:44:51,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 14: [2022-11-26 01:44:51,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:44:51,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 01:44:51,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 10: [2022-11-26 01:44:51,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:44:51,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 01:44:51,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 9: [2022-11-26 01:44:51,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:44:51,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 01:44:51,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
11: [2022-11-26 01:44:51,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:44:51,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 9: [2022-11-26 01:44:51,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 11: [2022-11-26 01:44:51,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 11: [2022-11-26 01:44:51,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:44:51,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 9: [2022-11-26 01:44:51,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 11: [2022-11-26 01:44:51,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 11: [2022-11-26 01:44:51,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:44:51,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 01:44:51,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 5: [2022-11-26 01:44:51,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 01:44:51,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 01:44:51,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 6: [2022-11-26 01:44:51,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 01:44:51,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 01:44:51,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 10: [2022-11-26 01:44:51,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:44:51,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 01:44:51,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 10: [2022-11-26 01:44:51,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 01:44:51,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 01:44:51,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 3: [2022-11-26 01:44:51,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 01:44:51,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:44:51,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:44:51,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:44:51,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:44:51,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:44:51,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 01:44:51,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 01:44:51,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 01:44:51,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 01:44:51,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 01:44:51,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 01:44:51,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 01:44:51,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 01:44:51,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 01:44:51,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 01:44:51,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 3: [2022-11-26 01:44:51,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 3: [2022-11-26 01:44:51,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 3: [2022-11-26 01:44:51,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
3: [2022-11-26 01:44:51,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 3: [2022-11-26 01:44:51,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 3: [2022-11-26 01:44:51,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 3: [2022-11-26 01:44:51,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-26 01:44:51,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:44:51,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:44:51,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 01:44:51,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 01:44:51,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-26 01:44:51,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 13: [2022-11-26 01:44:51,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 8: [2022-11-26 01:44:51,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 01:44:51,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 01:44:51,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 13: [2022-11-26 01:44:51,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 01:44:51,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 13: [2022-11-26 01:44:51,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 01:44:51,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 01:44:51,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 5: [2022-11-26 01:44:51,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 01:44:51,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 01:44:51,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 4: [2022-11-26 01:44:51,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:44:51,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 01:44:51,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:44:51,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:44:51,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:44:51,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:44:51,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:44:51,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 01:44:51,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 01:44:51,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 01:44:51,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 01:44:51,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 01:44:51,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
4: [2022-11-26 01:44:51,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 4: [2022-11-26 01:44:51,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 01:44:51,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 4: [2022-11-26 01:44:51,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 4: [2022-11-26 01:44:51,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 01:44:51,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 4: [2022-11-26 01:44:51,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 01:44:51,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 4: [2022-11-26 01:44:51,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 4: [2022-11-26 01:44:51,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 01:44:51,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 11: [2022-11-26 01:44:51,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-26 01:44:51,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 01:44:51,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 11: [2022-11-26 01:44:51,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:44:51,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 01:44:51,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 11: [2022-11-26 01:44:51,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:44:51,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 01:44:51,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 11: [2022-11-26 01:44:51,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 01:44:51,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 01:44:51,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 14: [2022-11-26 01:44:51,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 01:44:51,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 01:44:51,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 01:44:51,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 14: [2022-11-26 01:44:51,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 01:44:51,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-26 01:44:51,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:44:51,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 01:44:51,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 01:44:51,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 01:44:51,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 2: [2022-11-26 01:44:51,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 9: [2022-11-26 01:44:51,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 01:44:51,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 01:44:51,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 9: [2022-11-26 01:44:51,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 01:44:51,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 01:44:51,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 0: [2022-11-26 01:44:51,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 01:44:51,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 15: [2022-11-26 01:44:51,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:44:51,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:44:51,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:44:51,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:44:51,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-26 01:44:51,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:44:51,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:44:51,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 01:44:51,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 01:44:51,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 01:44:51,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 01:44:51,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 01:44:51,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 01:44:51,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 01:44:51,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 01:44:51,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: 
[2022-11-26 01:44:51,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 15: [2022-11-26 01:44:51,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 15: [2022-11-26 01:44:51,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 15: [2022-11-26 01:44:51,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 15: [2022-11-26 01:44:51,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 15: [2022-11-26 01:44:51,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 15: [2022-11-26 01:44:51,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 15: [2022-11-26 01:44:51,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 7: [2022-11-26 01:44:51,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:44:51,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:44:51,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:44:51,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:44:51,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 01:44:51,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 01:44:51,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 01:44:51,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 01:44:51,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:44:51,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:44:51,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 01:44:51,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 01:44:51,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 01:44:51,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 7: [2022-11-26 01:44:51,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 7: [2022-11-26 01:44:51,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
7: [2022-11-26 01:44:51,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 01:44:51,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 01:44:51,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 7: [2022-11-26 01:44:51,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 7: [2022-11-26 01:44:51,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step20000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 01:44:51,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 7: [2022-11-26 01:44:51,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 7: [2022-11-26 01:44:51,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step20000 is ready now! 
0: successfully saved checkpoint at iteration 20000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 4078.38 15: iteration 20010/ 125429 | consumed samples: 5122560 | consumed tokens: 10491002880 | elapsed time per iteration (s): 1.51 | learning rate: 1.901E-04 | global batch size: 256 | lm loss: 2.163850E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 169.934 | TFLOPs: 28.08 | 15: iteration 20020/ 125429 | consumed samples: 5125120 | consumed tokens: 10496245760 | elapsed time per iteration (s): 1.05 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.139268E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.179 | TFLOPs: 40.35 | 15: iteration 20030/ 125429 | consumed samples: 5127680 | consumed tokens: 10501488640 | elapsed time per iteration (s): 1.04 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.166454E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.212 | TFLOPs: 40.69 | 15: iteration 20040/ 125429 | consumed samples: 5130240 | consumed tokens: 10506731520 | elapsed time per iteration (s): 1.06 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.166136E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.060 | TFLOPs: 40.00 | 15: iteration 20050/ 125429 | consumed samples: 5132800 | consumed tokens: 10511974400 | elapsed time per iteration (s): 1.07 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.145830E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.422 | TFLOPs: 39.40 | 15: iteration 20060/ 125429 | consumed samples: 5135360 | consumed tokens: 10517217280 | elapsed time per iteration (s): 1.06 | learning 
rate: 1.900E-04 | global batch size: 256 | lm loss: 2.164558E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.191 | TFLOPs: 40.02 | 15: iteration 20070/ 125429 | consumed samples: 5137920 | consumed tokens: 10522460160 | elapsed time per iteration (s): 1.03 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.131258E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.221 | TFLOPs: 41.02 | 15: iteration 20080/ 125429 | consumed samples: 5140480 | consumed tokens: 10527703040 | elapsed time per iteration (s): 1.02 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.145739E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.828 | TFLOPs: 41.45 | 15: iteration 20090/ 125429 | consumed samples: 5143040 | consumed tokens: 10532945920 | elapsed time per iteration (s): 1.05 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.119370E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.541 | TFLOPs: 40.25 | 15: iteration 20100/ 125429 | consumed samples: 5145600 | consumed tokens: 10538188800 | elapsed time per iteration (s): 1.04 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.175873E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.993 | TFLOPs: 40.49 | 15: iteration 20110/ 125429 | consumed samples: 5148160 | consumed tokens: 10543431680 | elapsed time per iteration (s): 1.04 | learning rate: 1.900E-04 | global batch size: 256 | lm loss: 2.134149E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.188 | TFLOPs: 40.68 | 15: iteration 20120/ 125429 | 
consumed samples: 5150720 | consumed tokens: 10548674560 | elapsed time per iteration (s): 1.07 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.145938E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.622 | TFLOPs: 39.60 | 15: iteration 20130/ 125429 | consumed samples: 5153280 | consumed tokens: 10553917440 | elapsed time per iteration (s): 1.02 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.155605E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.810 | TFLOPs: 41.28 | 15: iteration 20140/ 125429 | consumed samples: 5155840 | consumed tokens: 10559160320 | elapsed time per iteration (s): 1.06 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.149444E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.536 | TFLOPs: 39.92 | 15: iteration 20150/ 125429 | consumed samples: 5158400 | consumed tokens: 10564403200 | elapsed time per iteration (s): 1.10 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.166632E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.546 | TFLOPs: 38.43 | 15: iteration 20160/ 125429 | consumed samples: 5160960 | consumed tokens: 10569646080 | elapsed time per iteration (s): 1.03 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.194229E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.746 | TFLOPs: 41.27 | 15: iteration 20170/ 125429 | consumed samples: 5163520 | consumed tokens: 10574888960 | elapsed time per iteration (s): 1.05 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.153366E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 242.667 | TFLOPs: 40.10 | 15: iteration 20180/ 125429 | consumed samples: 5166080 | consumed tokens: 10580131840 | elapsed time per iteration (s): 1.04 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.158275E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.257 | TFLOPs: 40.70 | 15: iteration 20190/ 125429 | consumed samples: 5168640 | consumed tokens: 10585374720 | elapsed time per iteration (s): 1.04 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.158531E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.092 | TFLOPs: 40.83 | 15: iteration 20200/ 125429 | consumed samples: 5171200 | consumed tokens: 10590617600 | elapsed time per iteration (s): 1.05 | learning rate: 1.899E-04 | global batch size: 256 | lm loss: 2.164427E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.450 | TFLOPs: 40.40 | 15: iteration 20210/ 125429 | consumed samples: 5173760 | consumed tokens: 10595860480 | elapsed time per iteration (s): 1.03 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.116615E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.658 | TFLOPs: 41.26 | 15: iteration 20220/ 125429 | consumed samples: 5176320 | consumed tokens: 10601103360 | elapsed time per iteration (s): 1.05 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.138291E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.241 | TFLOPs: 40.36 | 15: iteration 20230/ 125429 | consumed samples: 5178880 | consumed tokens: 10606346240 | elapsed time per iteration (s): 1.03 | learning rate: 1.898E-04 | global batch size: 
256 | lm loss: 2.154934E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.169 | TFLOPs: 41.01 | 15: iteration 20240/ 125429 | consumed samples: 5181440 | consumed tokens: 10611589120 | elapsed time per iteration (s): 1.05 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.152762E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.168 | TFLOPs: 40.35 | 15: iteration 20250/ 125429 | consumed samples: 5184000 | consumed tokens: 10616832000 | elapsed time per iteration (s): 1.02 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.141936E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.052 | TFLOPs: 41.32 | 15: iteration 20260/ 125429 | consumed samples: 5186560 | consumed tokens: 10622074880 | elapsed time per iteration (s): 1.06 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.148605E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.548 | TFLOPs: 40.08 | 15: iteration 20270/ 125429 | consumed samples: 5189120 | consumed tokens: 10627317760 | elapsed time per iteration (s): 1.04 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.150918E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.349 | TFLOPs: 40.55 | 15: iteration 20280/ 125429 | consumed samples: 5191680 | consumed tokens: 10632560640 | elapsed time per iteration (s): 1.04 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.160329E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.689 | TFLOPs: 40.60 | 15: iteration 20290/ 125429 | consumed samples: 5194240 | consumed 
tokens: 10637803520 | elapsed time per iteration (s): 1.06 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.189363E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.492 | TFLOPs: 40.07 | 15: iteration 20300/ 125429 | consumed samples: 5196800 | consumed tokens: 10643046400 | elapsed time per iteration (s): 1.06 | learning rate: 1.898E-04 | global batch size: 256 | lm loss: 2.140710E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.606 | TFLOPs: 39.76 | 15: iteration 20310/ 125429 | consumed samples: 5199360 | consumed tokens: 10648289280 | elapsed time per iteration (s): 1.04 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.142346E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.991 | TFLOPs: 40.65 | 15: iteration 20320/ 125429 | consumed samples: 5201920 | consumed tokens: 10653532160 | elapsed time per iteration (s): 1.05 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.156618E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.000 | TFLOPs: 40.16 | 15: iteration 20330/ 125429 | consumed samples: 5204480 | consumed tokens: 10658775040 | elapsed time per iteration (s): 1.04 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.174471E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.482 | TFLOPs: 40.57 | 15: iteration 20340/ 125429 | consumed samples: 5207040 | consumed tokens: 10664017920 | elapsed time per iteration (s): 1.03 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.145140E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 248.968 | TFLOPs: 41.14 | 15: iteration 20350/ 125429 | consumed samples: 5209600 | consumed tokens: 10669260800 | elapsed time per iteration (s): 1.04 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.117662E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.046 | TFLOPs: 40.66 | 15: iteration 20360/ 125429 | consumed samples: 5212160 | consumed tokens: 10674503680 | elapsed time per iteration (s): 1.04 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.159971E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.216 | TFLOPs: 40.85 | 15: iteration 20370/ 125429 | consumed samples: 5214720 | consumed tokens: 10679746560 | elapsed time per iteration (s): 1.03 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.133575E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.494 | TFLOPs: 41.07 | 15: iteration 20380/ 125429 | consumed samples: 5217280 | consumed tokens: 10684989440 | elapsed time per iteration (s): 1.05 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.158286E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.852 | TFLOPs: 40.46 | 15: iteration 20390/ 125429 | consumed samples: 5219840 | consumed tokens: 10690232320 | elapsed time per iteration (s): 1.04 | learning rate: 1.897E-04 | global batch size: 256 | lm loss: 2.130874E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.531 | TFLOPs: 40.74 | 15: iteration 20400/ 125429 | consumed samples: 5222400 | consumed tokens: 10695475200 | elapsed time per iteration (s): 1.03 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.179597E+00 | grad norm: 
0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.421 | TFLOPs: 41.05 | 15: iteration 20410/ 125429 | consumed samples: 5224960 | consumed tokens: 10700718080 | elapsed time per iteration (s): 1.06 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.137255E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.829 | TFLOPs: 39.80 | 15: iteration 20420/ 125429 | consumed samples: 5227520 | consumed tokens: 10705960960 | elapsed time per iteration (s): 1.02 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.168810E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.828 | TFLOPs: 41.29 | 15: iteration 20430/ 125429 | consumed samples: 5230080 | consumed tokens: 10711203840 | elapsed time per iteration (s): 1.03 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.163371E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.339 | TFLOPs: 41.04 | 15: iteration 20440/ 125429 | consumed samples: 5232640 | consumed tokens: 10716446720 | elapsed time per iteration (s): 1.06 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.163611E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.965 | TFLOPs: 39.99 | 15: iteration 20450/ 125429 | consumed samples: 5235200 | consumed tokens: 10721689600 | elapsed time per iteration (s): 1.04 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.149997E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.355 | TFLOPs: 40.55 | 15: iteration 20460/ 125429 | consumed samples: 5237760 | consumed tokens: 10726932480 | elapsed time per 
iteration (s): 1.03 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.149891E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.425 | TFLOPs: 41.05 | 15: iteration 20470/ 125429 | consumed samples: 5240320 | consumed tokens: 10732175360 | elapsed time per iteration (s): 1.06 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.150317E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.634 | TFLOPs: 40.10 | 15: iteration 20480/ 125429 | consumed samples: 5242880 | consumed tokens: 10737418240 | elapsed time per iteration (s): 1.05 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.159469E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.644 | TFLOPs: 40.43 | 15: iteration 20490/ 125429 | consumed samples: 5245440 | consumed tokens: 10742661120 | elapsed time per iteration (s): 1.05 | learning rate: 1.896E-04 | global batch size: 256 | lm loss: 2.153450E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.879 | TFLOPs: 40.14 | 15: iteration 20500/ 125429 | consumed samples: 5248000 | consumed tokens: 10747904000 | elapsed time per iteration (s): 1.09 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.167599E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.366 | TFLOPs: 38.90 | 15: iteration 20510/ 125429 | consumed samples: 5250560 | consumed tokens: 10753146880 | elapsed time per iteration (s): 1.08 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.156707E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.683 | TFLOPs: 39.11 | 15: 
iteration 20520/ 125429 | consumed samples: 5253120 | consumed tokens: 10758389760 | elapsed time per iteration (s): 1.05 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.201561E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.725 | TFLOPs: 40.28 | 15: iteration 20530/ 125429 | consumed samples: 5255680 | consumed tokens: 10763632640 | elapsed time per iteration (s): 1.04 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.166562E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.577 | TFLOPs: 40.58 | 15: iteration 20540/ 125429 | consumed samples: 5258240 | consumed tokens: 10768875520 | elapsed time per iteration (s): 1.04 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.137962E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.664 | TFLOPs: 40.76 | 15: iteration 20550/ 125429 | consumed samples: 5260800 | consumed tokens: 10774118400 | elapsed time per iteration (s): 1.06 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.168947E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.664 | TFLOPs: 39.94 | 15: iteration 20560/ 125429 | consumed samples: 5263360 | consumed tokens: 10779361280 | elapsed time per iteration (s): 1.07 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.188609E+00 | grad norm: 1.022 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.292 | TFLOPs: 39.54 | 15: iteration 20570/ 125429 | consumed samples: 5265920 | consumed tokens: 10784604160 | elapsed time per iteration (s): 1.07 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.524808E+00 | grad norm: 7.637 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.996 | TFLOPs: 39.50 | 15: iteration 20580/ 125429 | consumed samples: 5268480 | consumed tokens: 10789847040 | elapsed time per iteration (s): 1.03 | learning rate: 1.895E-04 | global batch size: 256 | lm loss: 2.473602E+00 | grad norm: 0.904 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.528 | TFLOPs: 41.07 | 15: iteration 20590/ 125429 | consumed samples: 5271040 | consumed tokens: 10795089920 | elapsed time per iteration (s): 1.04 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.292712E+00 | grad norm: 0.343 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.184 | TFLOPs: 40.52 | 15: iteration 20600/ 125429 | consumed samples: 5273600 | consumed tokens: 10800332800 | elapsed time per iteration (s): 1.04 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.208369E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.420 | TFLOPs: 40.56 | 15: iteration 20610/ 125429 | consumed samples: 5276160 | consumed tokens: 10805575680 | elapsed time per iteration (s): 1.05 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.187538E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.823 | TFLOPs: 40.46 | 15: iteration 20620/ 125429 | consumed samples: 5278720 | consumed tokens: 10810818560 | elapsed time per iteration (s): 1.07 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.164443E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.708 | TFLOPs: 39.61 | 15: iteration 20630/ 125429 | consumed samples: 5281280 | consumed tokens: 10816061440 | elapsed time per iteration (s): 1.11 | learning rate: 
1.894E-04 | global batch size: 256 | lm loss: 2.205828E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.771 | TFLOPs: 38.14 | 15: iteration 20640/ 125429 | consumed samples: 5283840 | consumed tokens: 10821304320 | elapsed time per iteration (s): 1.06 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.179970E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.286 | TFLOPs: 40.04 | 15: iteration 20650/ 125429 | consumed samples: 5286400 | consumed tokens: 10826547200 | elapsed time per iteration (s): 1.04 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.171260E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.158 | TFLOPs: 40.51 | 15: iteration 20660/ 125429 | consumed samples: 5288960 | consumed tokens: 10831790080 | elapsed time per iteration (s): 1.09 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.193327E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.347 | TFLOPs: 38.73 | 15: iteration 20670/ 125429 | consumed samples: 5291520 | consumed tokens: 10837032960 | elapsed time per iteration (s): 1.05 | learning rate: 1.894E-04 | global batch size: 256 | lm loss: 2.167720E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.698 | TFLOPs: 40.11 | 15: iteration 20680/ 125429 | consumed samples: 5294080 | consumed tokens: 10842275840 | elapsed time per iteration (s): 1.03 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.151003E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.417 | TFLOPs: 40.89 | 15: iteration 20690/ 125429 | consumed 
samples: 5296640 | consumed tokens: 10847518720 | elapsed time per iteration (s): 1.04 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.160155E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.460 | TFLOPs: 40.56 | 15: iteration 20700/ 125429 | consumed samples: 5299200 | consumed tokens: 10852761600 | elapsed time per iteration (s): 1.04 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.170407E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.661 | TFLOPs: 40.76 | 15: iteration 20710/ 125429 | consumed samples: 5301760 | consumed tokens: 10858004480 | elapsed time per iteration (s): 1.05 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.142788E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.851 | TFLOPs: 40.13 | 15: iteration 20720/ 125429 | consumed samples: 5304320 | consumed tokens: 10863247360 | elapsed time per iteration (s): 1.07 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.148494E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.861 | TFLOPs: 39.47 | 15: iteration 20730/ 125429 | consumed samples: 5306880 | consumed tokens: 10868490240 | elapsed time per iteration (s): 1.02 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.161064E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.339 | TFLOPs: 41.37 | 15: iteration 20740/ 125429 | consumed samples: 5309440 | consumed tokens: 10873733120 | elapsed time per iteration (s): 1.06 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.147057E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 240.817 | TFLOPs: 39.80 |
15: iteration 20750/ 125429 | consumed samples: 5312000 | consumed tokens: 10878976000 | elapsed time per iteration (s): 1.10 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.160876E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.501 | TFLOPs: 38.42 |
15: iteration 20760/ 125429 | consumed samples: 5314560 | consumed tokens: 10884218880 | elapsed time per iteration (s): 1.09 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.167409E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.436 | TFLOPs: 38.74 |
15: iteration 20770/ 125429 | consumed samples: 5317120 | consumed tokens: 10889461760 | elapsed time per iteration (s): 4.01 | learning rate: 1.893E-04 | global batch size: 256 | lm loss: 2.158774E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 63.841 | TFLOPs: 10.55 |
15: iteration 20780/ 125429 | consumed samples: 5319680 | consumed tokens: 10894704640 | elapsed time per iteration (s): 1.03 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.157075E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.857 | TFLOPs: 40.96 |
15: iteration 20790/ 125429 | consumed samples: 5322240 | consumed tokens: 10899947520 | elapsed time per iteration (s): 1.02 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.159076E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.893 | TFLOPs: 41.30 |
15: iteration 20800/ 125429 | consumed samples: 5324800 | consumed tokens: 10905190400 | elapsed time per iteration (s): 1.04 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.182316E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.055 | TFLOPs: 40.50 |
15: iteration 20810/ 125429 | consumed samples: 5327360 | consumed tokens: 10910433280 | elapsed time per iteration (s): 1.03 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.160726E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.470 | TFLOPs: 40.90 |
15: iteration 20820/ 125429 | consumed samples: 5329920 | consumed tokens: 10915676160 | elapsed time per iteration (s): 1.09 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.164740E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.264 | TFLOPs: 38.88 |
15: iteration 20830/ 125429 | consumed samples: 5332480 | consumed tokens: 10920919040 | elapsed time per iteration (s): 1.07 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.130014E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.654 | TFLOPs: 39.44 |
15: iteration 20840/ 125429 | consumed samples: 5335040 | consumed tokens: 10926161920 | elapsed time per iteration (s): 1.05 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.149630E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.567 | TFLOPs: 40.42 |
15: iteration 20850/ 125429 | consumed samples: 5337600 | consumed tokens: 10931404800 | elapsed time per iteration (s): 6.35 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.160765E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 40.321 | TFLOPs: 6.66 |
15: iteration 20860/ 125429 | consumed samples: 5340160 | consumed tokens: 10936647680 | elapsed time per iteration (s): 1.02 | learning rate: 1.892E-04 | global batch size: 256 | lm loss: 2.160743E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.148 | TFLOPs: 41.34 |
15: iteration 20870/ 125429 | consumed samples: 5342720 | consumed tokens: 10941890560 | elapsed time per iteration (s): 1.03 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.173990E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.959 | TFLOPs: 41.14 |
15: iteration 20880/ 125429 | consumed samples: 5345280 | consumed tokens: 10947133440 | elapsed time per iteration (s): 1.03 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.160343E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.591 | TFLOPs: 41.25 |
15: iteration 20890/ 125429 | consumed samples: 5347840 | consumed tokens: 10952376320 | elapsed time per iteration (s): 9.88 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.143530E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 25.919 | TFLOPs: 4.28 |
15: iteration 20900/ 125429 | consumed samples: 5350400 | consumed tokens: 10957619200 | elapsed time per iteration (s): 1.07 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.147952E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.642 | TFLOPs: 39.60 |
15: iteration 20910/ 125429 | consumed samples: 5352960 | consumed tokens: 10962862080 | elapsed time per iteration (s): 1.02 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.145313E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.211 | TFLOPs: 41.35 |
15: iteration 20920/ 125429 | consumed samples: 5355520 | consumed tokens: 10968104960 | elapsed time per iteration (s): 1.03 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.149537E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.143 | TFLOPs: 41.01 |
15: iteration 20930/ 125429 | consumed samples: 5358080 | consumed tokens: 10973347840 | elapsed time per iteration (s): 1.04 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.152947E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.759 | TFLOPs: 40.61 |
15: iteration 20940/ 125429 | consumed samples: 5360640 | consumed tokens: 10978590720 | elapsed time per iteration (s): 1.02 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.127250E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.465 | TFLOPs: 41.39 |
15: iteration 20950/ 125429 | consumed samples: 5363200 | consumed tokens: 10983833600 | elapsed time per iteration (s): 1.04 | learning rate: 1.891E-04 | global batch size: 256 | lm loss: 2.141319E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.085 | TFLOPs: 40.83 |
15: iteration 20960/ 125429 | consumed samples: 5365760 | consumed tokens: 10989076480 | elapsed time per iteration (s): 1.03 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.146732E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.588 | TFLOPs: 41.25 |
15: iteration 20970/ 125429 | consumed samples: 5368320 | consumed tokens: 10994319360 | elapsed time per iteration (s): 1.12 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.140163E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.971 | TFLOPs: 37.67 |
15: iteration 20980/ 125429 | consumed samples: 5370880 | consumed tokens: 10999562240 | elapsed time per iteration (s): 1.04 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.127782E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.326 | TFLOPs: 40.87 |
15: iteration 20990/ 125429 | consumed samples: 5373440 | consumed tokens: 11004805120 | elapsed time per iteration (s): 1.03 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.154096E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.452 | TFLOPs: 41.06 |
15: iteration 21000/ 125429 | consumed samples: 5376000 | consumed tokens: 11010048000 | elapsed time per iteration (s): 1.04 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.159750E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.499 | TFLOPs: 40.57 |
15: -------------------------------------------------------------------------------------------
15: valid loss at iteration 21000 | lm loss value: 2.127930E+00 | lm loss PPL: 8.397463E+00 |
15: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 21000 to checkpoints_1b5
0: [2022-11-26 02:05:11,197] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step21000 is begin to save!
0: [2022-11-26 02:05:11,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_01-model_00-model_states.pt...
0: [2022-11-26 02:05:11,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_01-model_00-model_states.pt.
0: [2022-11-26 02:05:11,438] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_03-model_00-model_states.pt...
0: [2022-11-26 02:05:11,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_03-model_00-model_states.pt.
0: [2022-11-26 02:05:11,548] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_04-model_00-model_states.pt...
0: [2022-11-26 02:05:11,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_04-model_00-model_states.pt.
0: [2022-11-26 02:05:11,650] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_05-model_00-model_states.pt...
0: [2022-11-26 02:05:11,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_05-model_00-model_states.pt.
0: [2022-11-26 02:05:11,750] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_06-model_00-model_states.pt...
0: [2022-11-26 02:05:11,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_06-model_00-model_states.pt.
0: [2022-11-26 02:05:11,852] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_07-model_00-model_states.pt...
0: [2022-11-26 02:05:11,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_07-model_00-model_states.pt.
0: [2022-11-26 02:05:11,955] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_08-model_00-model_states.pt...
0: [2022-11-26 02:05:12,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_08-model_00-model_states.pt.
0: [2022-11-26 02:05:12,057] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_09-model_00-model_states.pt...
0: [2022-11-26 02:05:12,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_09-model_00-model_states.pt.
0: [2022-11-26 02:05:12,159] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_10-model_00-model_states.pt...
0: [2022-11-26 02:05:12,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_10-model_00-model_states.pt.
0: [2022-11-26 02:05:12,260] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_11-model_00-model_states.pt...
0: [2022-11-26 02:05:12,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_11-model_00-model_states.pt.
0: [2022-11-26 02:05:12,362] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_12-model_00-model_states.pt...
0: [2022-11-26 02:05:12,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_12-model_00-model_states.pt.
0: [2022-11-26 02:05:12,464] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_13-model_00-model_states.pt...
0: [2022-11-26 02:05:12,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_13-model_00-model_states.pt.
0: [2022-11-26 02:05:12,566] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_14-model_00-model_states.pt...
0: [2022-11-26 02:05:12,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_14-model_00-model_states.pt.
0: [2022-11-26 02:05:12,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_15-model_00-model_states.pt...
0: [2022-11-26 02:05:12,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_15-model_00-model_states.pt.
0: [2022-11-26 02:05:12,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_16-model_00-model_states.pt...
0: [2022-11-26 02:05:12,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_16-model_00-model_states.pt.
0: [2022-11-26 02:05:12,872] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_17-model_00-model_states.pt...
0: [2022-11-26 02:05:12,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_17-model_00-model_states.pt.
0: [2022-11-26 02:05:12,976] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_18-model_00-model_states.pt...
0: [2022-11-26 02:05:13,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_18-model_00-model_states.pt.
0: [2022-11-26 02:05:13,078] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_19-model_00-model_states.pt...
0: [2022-11-26 02:05:13,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_19-model_00-model_states.pt.
0: [2022-11-26 02:05:13,191] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_20-model_00-model_states.pt...
0: [2022-11-26 02:05:13,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_20-model_00-model_states.pt.
0: [2022-11-26 02:05:13,296] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_21-model_00-model_states.pt...
0: [2022-11-26 02:05:13,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_21-model_00-model_states.pt.
0: [2022-11-26 02:05:13,406] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_22-model_00-model_states.pt...
0: [2022-11-26 02:05:13,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_22-model_00-model_states.pt.
0: [2022-11-26 02:05:13,512] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_23-model_00-model_states.pt...
0: [2022-11-26 02:05:13,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_23-model_00-model_states.pt.
0: [2022-11-26 02:05:13,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_24-model_00-model_states.pt...
0: [2022-11-26 02:05:13,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_24-model_00-model_states.pt.
0: [2022-11-26 02:05:13,726] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_25-model_00-model_states.pt...
0: [2022-11-26 02:05:13,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_25-model_00-model_states.pt.
0: [2022-11-26 02:05:13,834] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_26-model_00-model_states.pt...
0: [2022-11-26 02:05:13,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_26-model_00-model_states.pt.
0: [2022-11-26 02:05:13,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_27-model_00-model_states.pt...
0: [2022-11-26 02:05:14,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_27-model_00-model_states.pt.
0: [2022-11-26 02:05:14,052] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_28-model_00-model_states.pt...
0: [2022-11-26 02:05:14,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_28-model_00-model_states.pt.
0: [2022-11-26 02:05:14,158] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_29-model_00-model_states.pt...
0: [2022-11-26 02:05:14,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_29-model_00-model_states.pt.
0: [2022-11-26 02:05:14,262] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_30-model_00-model_states.pt...
0: [2022-11-26 02:05:14,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_30-model_00-model_states.pt.
0: [2022-11-26 02:05:14,368] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/layer_32-model_00-model_states.pt...
0: [2022-11-26 02:05:14,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/layer_32-model_00-model_states.pt.
0: [2022-11-26 02:05:14,373] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step21000/mp_rank_00_model_states.pt
0: [2022-11-26 02:05:14,373] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/mp_rank_00_model_states.pt...
0: [2022-11-26 02:05:14,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/mp_rank_00_model_states.pt.
0: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
0: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt...
0: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt...
0: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt...
0: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt...
0: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt...
8: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt...
8: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt...
8: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt...
8: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt...
8: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt...
8: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt...
0: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt...
1: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt...
1: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt...
1: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt...
1: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt...
1: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt...
1: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt...
10: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt...
10: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt...
10: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt...
10: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt...
10: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt...
10: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt...
5: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt...
5: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt...
5: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt...
5: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt...
5: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt...
5: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt...
15: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt...
15: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt...
15: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt...
15: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt...
15: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt...
15: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt...
9: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt...
9: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt...
9: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt...
9: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt...
9: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt...
9: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt...
14: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt...
14: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt...
14: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt...
14: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt...
14: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt...
14: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt...
4: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt...
4: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt...
4: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt...
4: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt...
4: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt...
4: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt...
7: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt...
7: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt...
7: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt...
7: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt...
7: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt...
7: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt...
6: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt...
6: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt...
6: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt...
6: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt...
6: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt...
6: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt...
13: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt...
13: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt...
13: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt...
13: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt...
13: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt...
13: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt...
2: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt...
2: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt...
2: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt...
2: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt...
2: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt...
2: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt...
3: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt...
3: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt...
3: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt...
3: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt...
3: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt...
3: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt...
12: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt...
12: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt...
12: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt...
12: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt...
12: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt...
12: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt...
11: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt...
11: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt...
11: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt...
11: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt...
11: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt...
11: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt...
8: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt...
8: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt...
0: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt...
1: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt...
10: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt...
10: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt...
5: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt...
15: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt...
15: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt...
14: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt...
14: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt...
4: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt...
4: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
12: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:05:14,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step21000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:05:14,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:05:14,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 02:05:14,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 02:05:14,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 9: [2022-11-26 02:05:14,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:05:14,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 02:05:14,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 9: [2022-11-26 02:05:14,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:05:14,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 02:05:14,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 12: [2022-11-26 02:05:14,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:05:14,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 02:05:14,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-26 02:05:14,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 02:05:14,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 9: [2022-11-26 02:05:14,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:05:14,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 02:05:14,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-26 02:05:14,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 9: [2022-11-26 02:05:14,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:05:14,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 02:05:14,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 15: [2022-11-26 02:05:14,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:05:14,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:05:14,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:05:14,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 02:05:14,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 02:05:14,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 02:05:14,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-26 02:05:14,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 02:05:14,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-26 02:05:14,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 12: [2022-11-26 02:05:14,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:05:14,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 02:05:14,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 9: [2022-11-26 02:05:14,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:05:14,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 02:05:14,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
0: [2022-11-26 02:05:14,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:05:14,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 02:05:14,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-26 02:05:14,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:05:14,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 02:05:14,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-26 02:05:14,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:05:14,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 02:05:14,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-26 02:05:14,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:05:14,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 02:05:14,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
10: [2022-11-26 02:05:14,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:05:14,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 02:05:14,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 8: [2022-11-26 02:05:14,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:05:14,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:05:14,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:05:14,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 02:05:14,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-26 02:05:14,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:05:14,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 02:05:14,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 0: [2022-11-26 02:05:14,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 10: [2022-11-26 02:05:14,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-26 02:05:14,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-26 02:05:14,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:05:14,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 02:05:14,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 8: [2022-11-26 02:05:14,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 02:05:14,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 02:05:14,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 8: [2022-11-26 02:05:14,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 8: [2022-11-26 02:05:14,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 02:05:14,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 02:05:14,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 8: [2022-11-26 02:05:14,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:05:14,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 02:05:14,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 9: [2022-11-26 02:05:14,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:05:14,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 02:05:14,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 8: [2022-11-26 02:05:14,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:05:14,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 02:05:14,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-26 02:05:14,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 02:05:14,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 02:05:14,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 12: [2022-11-26 02:05:14,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:05:14,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 4: [2022-11-26 02:05:14,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:05:14,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-26 02:05:14,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 02:05:14,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-26 02:05:14,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:05:14,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 02:05:14,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 8: [2022-11-26 02:05:14,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 02:05:14,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 02:05:14,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 12: [2022-11-26 02:05:14,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:05:14,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:05:14,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 02:05:14,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 02:05:14,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 12: [2022-11-26 02:05:14,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 10: [2022-11-26 02:05:14,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:05:14,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:05:14,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 02:05:14,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
10: [2022-11-26 02:05:14,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 02:05:14,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 8: [2022-11-26 02:05:14,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:05:14,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 02:05:14,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-26 02:05:14,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:05:14,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 02:05:14,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-26 02:05:14,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:05:14,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:05:14,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 02:05:14,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
10: [2022-11-26 02:05:14,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:05:14,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 10: [2022-11-26 02:05:14,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 12: [2022-11-26 02:05:14,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 10: [2022-11-26 02:05:14,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:05:14,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 10: [2022-11-26 02:05:14,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 02:05:14,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-26 02:05:14,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:05:14,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 02:05:14,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-26 02:05:14,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 02:05:14,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:05:14,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:05:14,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 02:05:14,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 02:05:14,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 02:05:14,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-26 02:05:14,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 4: [2022-11-26 02:05:14,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 12: [2022-11-26 02:05:14,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:05:14,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 02:05:14,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 14: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 02:05:14,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 02:05:14,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 02:05:14,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 02:05:14,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 02:05:14,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 02:05:14,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 02:05:14,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 02:05:14,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 14: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 14: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 14: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
14: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 14: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-26 02:05:14,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 14: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:05:14,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 02:05:14,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 10: [2022-11-26 02:05:14,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:05:14,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 02:05:14,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
5: [2022-11-26 02:05:14,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:05:14,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 02:05:14,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-26 02:05:14,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:05:14,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 02:05:14,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 9: [2022-11-26 02:05:14,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:05:14,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 02:05:14,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-26 02:05:14,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:05:14,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 02:05:14,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
10: [2022-11-26 02:05:14,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:05:14,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 02:05:14,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-26 02:05:14,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:05:14,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 02:05:14,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-26 02:05:14,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:05:14,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 02:05:14,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-26 02:05:14,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:05:14,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 02:05:14,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
15: [2022-11-26 02:05:14,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 02:05:14,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 15: [2022-11-26 02:05:14,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:05:14,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 02:05:14,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 15: [2022-11-26 02:05:14,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:05:14,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 02:05:14,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 15: [2022-11-26 02:05:14,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:05:14,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 02:05:14,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 15: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 02:05:14,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 02:05:14,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 15: [2022-11-26 02:05:14,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:05:14,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 02:05:14,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-26 02:05:14,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:05:14,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 02:05:14,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-26 02:05:14,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:05:14,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 02:05:14,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-26 02:05:14,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 02:05:14,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:05:14,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:05:14,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:05:14,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:05:14,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:05:14,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:05:14,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 02:05:14,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 02:05:14,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 02:05:14,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 02:05:14,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 02:05:14,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 02:05:14,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 02:05:14,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 02:05:14,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-26 02:05:14,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 02:05:14,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-26 02:05:14,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-26 02:05:14,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
2: [2022-11-26 02:05:14,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-26 02:05:14,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-26 02:05:14,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 2: [2022-11-26 02:05:14,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-26 02:05:14,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:05:14,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:05:14,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 02:05:14,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-26 02:05:14,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 02:05:14,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 3: [2022-11-26 02:05:14,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:05:14,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 02:05:14,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
13: [2022-11-26 02:05:14,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:05:14,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:05:14,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 02:05:14,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 02:05:14,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 13: [2022-11-26 02:05:14,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 13: [2022-11-26 02:05:14,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:05:14,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 02:05:14,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:05:14,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 13: [2022-11-26 02:05:14,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 02:05:14,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 02:05:14,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 02:05:14,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 13: [2022-11-26 02:05:14,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 15: [2022-11-26 02:05:14,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:05:14,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:05:14,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 02:05:14,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-26 02:05:14,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:05:14,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 02:05:14,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 7: [2022-11-26 02:05:14,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 02:05:14,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 02:05:14,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-26 02:05:14,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:05:14,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 02:05:14,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 15: [2022-11-26 02:05:14,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 02:05:14,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 15: [2022-11-26 02:05:14,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:05:14,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 02:05:14,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-26 02:05:14,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:05:14,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 02:05:14,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:05:14,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:05:14,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 02:05:14,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 02:05:14,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-26 02:05:14,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 02:05:14,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 02:05:14,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-26 02:05:14,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-26 02:05:14,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 13: [2022-11-26 02:05:14,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 02:05:14,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 02:05:14,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 11: [2022-11-26 02:05:14,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:05:14,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:05:14,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:05:14,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 02:05:14,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 02:05:14,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 02:05:14,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 11: [2022-11-26 02:05:14,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 11: [2022-11-26 02:05:14,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-26 02:05:14,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 02:05:14,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 02:05:14,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-26 02:05:14,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:05:14,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:05:14,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 02:05:14,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-26 02:05:14,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 02:05:14,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 6: [2022-11-26 02:05:14,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:05:14,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 02:05:14,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 11: [2022-11-26 02:05:14,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 02:05:14,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 02:05:14,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 11: [2022-11-26 02:05:14,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:05:14,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 02:05:14,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 11: [2022-11-26 02:05:14,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:05:14,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 02:05:14,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 11: [2022-11-26 02:05:14,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:05:14,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 02:05:14,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-26 02:05:14,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 02:05:14,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 02:05:14,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 11: [2022-11-26 02:05:14,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:05:14,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 02:05:14,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 5: [2022-11-26 02:05:14,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:05:14,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 02:05:14,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 13: [2022-11-26 02:05:14,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:05:14,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 02:05:14,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 13: [2022-11-26 02:05:14,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 02:05:14,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 02:05:14,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-26 02:05:14,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:05:14,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:05:14,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:05:14,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 02:05:14,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:05:14,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 02:05:14,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 02:05:14,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-26 02:05:14,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 02:05:14,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
1: [2022-11-26 02:05:14,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-26 02:05:14,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-26 02:05:14,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:05:14,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 02:05:14,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-26 02:05:14,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:05:14,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 02:05:14,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-26 02:05:14,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:05:14,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 02:05:14,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 1: [2022-11-26 02:05:14,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 02:05:14,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 02:05:14,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-26 02:05:14,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:05:14,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:05:14,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 02:05:14,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 02:05:14,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-26 02:05:14,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 0: [2022-11-26 02:05:14,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step21000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 02:05:14,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step21000 is ready now! 
0: successfully saved checkpoint at iteration 21000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3649.08 15: iteration 21010/ 125429 | consumed samples: 5378560 | consumed tokens: 11015290880 | elapsed time per iteration (s): 1.46 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.168337E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 175.145 | TFLOPs: 28.94 | 15: iteration 21020/ 125429 | consumed samples: 5381120 | consumed tokens: 11020533760 | elapsed time per iteration (s): 1.10 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.162560E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.111 | TFLOPs: 38.52 | 15: iteration 21030/ 125429 | consumed samples: 5383680 | consumed tokens: 11025776640 | elapsed time per iteration (s): 1.03 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.176876E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.106 | TFLOPs: 41.17 | 15: iteration 21040/ 125429 | consumed samples: 5386240 | consumed tokens: 11031019520 | elapsed time per iteration (s): 1.04 | learning rate: 1.890E-04 | global batch size: 256 | lm loss: 2.136082E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.310 | TFLOPs: 40.54 | 15: iteration 21050/ 125429 | consumed samples: 5388800 | consumed tokens: 11036262400 | elapsed time per iteration (s): 1.06 | learning rate: 1.889E-04 | global batch size: 256 | lm loss: 2.153010E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.162 | TFLOPs: 39.85 | 15: iteration 21060/ 125429 | consumed samples: 5391360 | consumed tokens: 11041505280 | elapsed time per iteration (s): 1.08 | learning 
rate: 1.889E-04 | global batch size: 256 | lm loss: 2.179004E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.710 | TFLOPs: 39.12 | 15: iteration 21070/ 125429 | consumed samples: 5393920 | consumed tokens: 11046748160 | elapsed time per iteration (s): 1.05 | learning rate: 1.889E-04 | global batch size: 256 | lm loss: 2.132520E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.918 | TFLOPs: 40.14 | 15: iteration 21080/ 125429 | consumed samples: 5396480 | consumed tokens: 11051991040 | elapsed time per iteration (s): 1.03 | learning rate: 1.889E-04 | global batch size: 256 | lm loss: 2.129099E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.708 | TFLOPs: 41.10 | 15: iteration 21090/ 125429 | consumed samples: 5399040 | consumed tokens: 11057233920 | elapsed time per iteration (s): 1.08 | learning rate: 1.889E-04 | global batch size: 256 | lm loss: 2.132026E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.657 | TFLOPs: 39.27 | 15: iteration 21100/ 125429 | consumed samples: 5401600 | consumed tokens: 11062476800 | elapsed time per iteration (s): 1.05 | learning rate: 1.889E-04 | global batch size: 256 | lm loss: 2.136013E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.371 | TFLOPs: 40.22 | 15: iteration 21110/ 125429 | consumed samples: 5404160 | consumed tokens: 11067719680 | elapsed time per iteration (s): 1.04 | learning rate: 1.889E-04 | global batch size: 256 | lm loss: 2.172205E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.805 | TFLOPs: 40.62 | 15: iteration 21120/ 125429 | 
consumed samples: 5406720 | consumed tokens: 11072962560 | elapsed time per iteration (s): 1.02 | learning rate: 1.889E-04 | global batch size: 256 | lm loss: 2.171412E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.937 | TFLOPs: 41.30 | 15: iteration 21130/ 125429 | consumed samples: 5409280 | consumed tokens: 11078205440 | elapsed time per iteration (s): 1.03 | learning rate: 1.889E-04 | global batch size: 256 | lm loss: 2.134496E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.597 | TFLOPs: 40.92 | 15: iteration 21140/ 125429 | consumed samples: 5411840 | consumed tokens: 11083448320 | elapsed time per iteration (s): 1.05 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.150591E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.958 | TFLOPs: 40.48 | 15: iteration 21150/ 125429 | consumed samples: 5414400 | consumed tokens: 11088691200 | elapsed time per iteration (s): 1.02 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.166835E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.174 | TFLOPs: 41.34 | 15: iteration 21160/ 125429 | consumed samples: 5416960 | consumed tokens: 11093934080 | elapsed time per iteration (s): 1.06 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.154321E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.219 | TFLOPs: 39.86 | 15: iteration 21170/ 125429 | consumed samples: 5419520 | consumed tokens: 11099176960 | elapsed time per iteration (s): 1.04 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.135472E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 245.777 | TFLOPs: 40.62 | 15: iteration 21180/ 125429 | consumed samples: 5422080 | consumed tokens: 11104419840 | elapsed time per iteration (s): 1.06 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.135596E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.516 | TFLOPs: 39.91 | 15: iteration 21190/ 125429 | consumed samples: 5424640 | consumed tokens: 11109662720 | elapsed time per iteration (s): 1.07 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.151859E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.639 | TFLOPs: 39.60 | 15: iteration 21200/ 125429 | consumed samples: 5427200 | consumed tokens: 11114905600 | elapsed time per iteration (s): 1.08 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.139312E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.719 | TFLOPs: 39.28 | 15: iteration 21210/ 125429 | consumed samples: 5429760 | consumed tokens: 11120148480 | elapsed time per iteration (s): 1.05 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.131551E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.099 | TFLOPs: 40.34 | 15: iteration 21220/ 125429 | consumed samples: 5432320 | consumed tokens: 11125391360 | elapsed time per iteration (s): 1.07 | learning rate: 1.888E-04 | global batch size: 256 | lm loss: 2.146456E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.766 | TFLOPs: 39.46 | 15: iteration 21230/ 125429 | consumed samples: 5434880 | consumed tokens: 11130634240 | elapsed time per iteration (s): 1.10 | learning rate: 1.887E-04 | global batch size: 
256 | lm loss: 2.126230E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.211 | TFLOPs: 38.54 | 15: iteration 21240/ 125429 | consumed samples: 5437440 | consumed tokens: 11135877120 | elapsed time per iteration (s): 1.05 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.115465E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.230 | TFLOPs: 40.36 | 15: iteration 21250/ 125429 | consumed samples: 5440000 | consumed tokens: 11141120000 | elapsed time per iteration (s): 1.05 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.177296E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.125 | TFLOPs: 40.34 | 15: iteration 21260/ 125429 | consumed samples: 5442560 | consumed tokens: 11146362880 | elapsed time per iteration (s): 1.04 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.152759E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.549 | TFLOPs: 40.74 | 15: iteration 21270/ 125429 | consumed samples: 5445120 | consumed tokens: 11151605760 | elapsed time per iteration (s): 1.05 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.132648E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.933 | TFLOPs: 40.15 | 15: iteration 21280/ 125429 | consumed samples: 5447680 | consumed tokens: 11156848640 | elapsed time per iteration (s): 1.04 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.145096E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.934 | TFLOPs: 40.81 | 15: iteration 21290/ 125429 | consumed samples: 5450240 | consumed 
tokens: 11162091520 | elapsed time per iteration (s): 1.03 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.187146E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.304 | TFLOPs: 41.03 | 15: iteration 21300/ 125429 | consumed samples: 5452800 | consumed tokens: 11167334400 | elapsed time per iteration (s): 1.03 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.141294E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.728 | TFLOPs: 40.94 | 15: iteration 21310/ 125429 | consumed samples: 5455360 | consumed tokens: 11172577280 | elapsed time per iteration (s): 1.04 | learning rate: 1.887E-04 | global batch size: 256 | lm loss: 2.136196E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.056 | TFLOPs: 40.83 | 15: iteration 21320/ 125429 | consumed samples: 5457920 | consumed tokens: 11177820160 | elapsed time per iteration (s): 1.04 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.151693E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.305 | TFLOPs: 40.54 | 15: iteration 21330/ 125429 | consumed samples: 5460480 | consumed tokens: 11183063040 | elapsed time per iteration (s): 1.05 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.131196E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.532 | TFLOPs: 40.41 | 15: iteration 21340/ 125429 | consumed samples: 5463040 | consumed tokens: 11188305920 | elapsed time per iteration (s): 1.04 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.139366E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 246.294 | TFLOPs: 40.70 | 15: iteration 21350/ 125429 | consumed samples: 5465600 | consumed tokens: 11193548800 | elapsed time per iteration (s): 1.02 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.165294E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.831 | TFLOPs: 41.29 | 15: iteration 21360/ 125429 | consumed samples: 5468160 | consumed tokens: 11198791680 | elapsed time per iteration (s): 1.05 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.164797E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.252 | TFLOPs: 40.36 | 15: iteration 21370/ 125429 | consumed samples: 5470720 | consumed tokens: 11204034560 | elapsed time per iteration (s): 1.04 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.167751E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.117 | TFLOPs: 40.67 | 15: iteration 21380/ 125429 | consumed samples: 5473280 | consumed tokens: 11209277440 | elapsed time per iteration (s): 1.07 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.117445E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.637 | TFLOPs: 39.44 | 15: iteration 21390/ 125429 | consumed samples: 5475840 | consumed tokens: 11214520320 | elapsed time per iteration (s): 1.07 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.150704E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.935 | TFLOPs: 39.65 | 15: iteration 21400/ 125429 | consumed samples: 5478400 | consumed tokens: 11219763200 | elapsed time per iteration (s): 1.08 | learning rate: 1.886E-04 | global batch size: 256 | lm loss: 2.144708E+00 | grad norm: 
0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.587 | TFLOPs: 39.26 | 15: iteration 21410/ 125429 | consumed samples: 5480960 | consumed tokens: 11225006080 | elapsed time per iteration (s): 1.03 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.168357E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.596 | TFLOPs: 41.08 | 15: iteration 21420/ 125429 | consumed samples: 5483520 | consumed tokens: 11230248960 | elapsed time per iteration (s): 1.05 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.146331E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.704 | TFLOPs: 40.27 | 15: iteration 21430/ 125429 | consumed samples: 5486080 | consumed tokens: 11235491840 | elapsed time per iteration (s): 1.05 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.177459E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.492 | TFLOPs: 40.40 | 15: iteration 21440/ 125429 | consumed samples: 5488640 | consumed tokens: 11240734720 | elapsed time per iteration (s): 1.05 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.169417E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.456 | TFLOPs: 40.40 | 15: iteration 21450/ 125429 | consumed samples: 5491200 | consumed tokens: 11245977600 | elapsed time per iteration (s): 1.06 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.165872E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.771 | TFLOPs: 39.95 | 15: iteration 21460/ 125429 | consumed samples: 5493760 | consumed tokens: 11251220480 | elapsed time per 
iteration (s): 1.06 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.144802E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.007 | TFLOPs: 39.99 | 15: iteration 21470/ 125429 | consumed samples: 5496320 | consumed tokens: 11256463360 | elapsed time per iteration (s): 1.03 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.146199E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.669 | TFLOPs: 40.93 | 15: iteration 21480/ 125429 | consumed samples: 5498880 | consumed tokens: 11261706240 | elapsed time per iteration (s): 1.03 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.149143E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.917 | TFLOPs: 40.97 | 15: iteration 21490/ 125429 | consumed samples: 5501440 | consumed tokens: 11266949120 | elapsed time per iteration (s): 1.06 | learning rate: 1.885E-04 | global batch size: 256 | lm loss: 2.175558E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.292 | TFLOPs: 40.04 | 15: iteration 21500/ 125429 | consumed samples: 5504000 | consumed tokens: 11272192000 | elapsed time per iteration (s): 1.25 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.152064E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 205.169 | TFLOPs: 33.91 | 15: iteration 21510/ 125429 | consumed samples: 5506560 | consumed tokens: 11277434880 | elapsed time per iteration (s): 1.08 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.118361E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.337 | TFLOPs: 39.22 | 15: 
iteration 21520/ 125429 | consumed samples: 5509120 | consumed tokens: 11282677760 | elapsed time per iteration (s): 1.04 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.157930E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.992 | TFLOPs: 40.49 | 15: iteration 21530/ 125429 | consumed samples: 5511680 | consumed tokens: 11287920640 | elapsed time per iteration (s): 1.05 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.122737E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.809 | TFLOPs: 40.29 | 15: iteration 21540/ 125429 | consumed samples: 5514240 | consumed tokens: 11293163520 | elapsed time per iteration (s): 1.03 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.109448E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.497 | TFLOPs: 40.90 | 15: iteration 21550/ 125429 | consumed samples: 5516800 | consumed tokens: 11298406400 | elapsed time per iteration (s): 1.07 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.160149E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.718 | TFLOPs: 39.45 | 15: iteration 21560/ 125429 | consumed samples: 5519360 | consumed tokens: 11303649280 | elapsed time per iteration (s): 1.06 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.134972E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.550 | TFLOPs: 39.75 | 15: iteration 21570/ 125429 | consumed samples: 5521920 | consumed tokens: 11308892160 | elapsed time per iteration (s): 1.04 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.135796E+00 | grad norm: 0.139 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.615 | TFLOPs: 40.75 | 15: iteration 21580/ 125429 | consumed samples: 5524480 | consumed tokens: 11314135040 | elapsed time per iteration (s): 1.05 | learning rate: 1.884E-04 | global batch size: 256 | lm loss: 2.146131E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.714 | TFLOPs: 40.44 | 15: iteration 21590/ 125429 | consumed samples: 5527040 | consumed tokens: 11319377920 | elapsed time per iteration (s): 1.04 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.111985E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.298 | TFLOPs: 40.70 | 15: iteration 21600/ 125429 | consumed samples: 5529600 | consumed tokens: 11324620800 | elapsed time per iteration (s): 1.08 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.155274E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.066 | TFLOPs: 39.18 | 15: iteration 21610/ 125429 | consumed samples: 5532160 | consumed tokens: 11329863680 | elapsed time per iteration (s): 1.05 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.168211E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.764 | TFLOPs: 40.28 | 15: iteration 21620/ 125429 | consumed samples: 5534720 | consumed tokens: 11335106560 | elapsed time per iteration (s): 1.05 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.163009E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.576 | TFLOPs: 40.25 | 15: iteration 21630/ 125429 | consumed samples: 5537280 | consumed tokens: 11340349440 | elapsed time per iteration (s): 1.04 | learning rate: 
1.883E-04 | global batch size: 256 | lm loss: 2.173413E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.042 | TFLOPs: 40.66 | 15: iteration 21640/ 125429 | consumed samples: 5539840 | consumed tokens: 11345592320 | elapsed time per iteration (s): 1.03 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.157125E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.962 | TFLOPs: 40.98 | 15: iteration 21650/ 125429 | consumed samples: 5542400 | consumed tokens: 11350835200 | elapsed time per iteration (s): 1.06 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.134928E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.184 | TFLOPs: 40.02 | 15: iteration 21660/ 125429 | consumed samples: 5544960 | consumed tokens: 11356078080 | elapsed time per iteration (s): 1.04 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.149493E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.212 | TFLOPs: 40.52 | 15: iteration 21670/ 125429 | consumed samples: 5547520 | consumed tokens: 11361320960 | elapsed time per iteration (s): 1.04 | learning rate: 1.883E-04 | global batch size: 256 | lm loss: 2.129122E+00 | grad norm: 1.455 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.176 | TFLOPs: 40.68 | 15: iteration 21680/ 125429 | consumed samples: 5550080 | consumed tokens: 11366563840 | elapsed time per iteration (s): 1.03 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.235489E+00 | grad norm: 1.503 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.296 | TFLOPs: 41.03 | 15: iteration 21690/ 125429 | consumed 
samples: 5552640 | consumed tokens: 11371806720 | elapsed time per iteration (s): 1.05 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.167099E+00 | grad norm: 0.211 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.318 | TFLOPs: 40.38 | 15: iteration 21700/ 125429 | consumed samples: 5555200 | consumed tokens: 11377049600 | elapsed time per iteration (s): 1.02 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.140947E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.026 | TFLOPs: 41.32 | 15: iteration 21710/ 125429 | consumed samples: 5557760 | consumed tokens: 11382292480 | elapsed time per iteration (s): 1.04 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.144406E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.814 | TFLOPs: 40.79 | 15: iteration 21720/ 125429 | consumed samples: 5560320 | consumed tokens: 11387535360 | elapsed time per iteration (s): 1.03 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.151542E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.811 | TFLOPs: 40.95 | 15: iteration 21730/ 125429 | consumed samples: 5562880 | consumed tokens: 11392778240 | elapsed time per iteration (s): 1.05 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.139477E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.169 | TFLOPs: 40.19 | 15: iteration 21740/ 125429 | consumed samples: 5565440 | consumed tokens: 11398021120 | elapsed time per iteration (s): 1.05 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.120831E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 243.373 | TFLOPs: 40.22 | 15: iteration 21750/ 125429 | consumed samples: 5568000 | consumed tokens: 11403264000 | elapsed time per iteration (s): 1.06 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.142774E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.547 | TFLOPs: 40.08 | 15: iteration 21760/ 125429 | consumed samples: 5570560 | consumed tokens: 11408506880 | elapsed time per iteration (s): 1.03 | learning rate: 1.882E-04 | global batch size: 256 | lm loss: 2.159757E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.911 | TFLOPs: 40.97 | 15: iteration 21770/ 125429 | consumed samples: 5573120 | consumed tokens: 11413749760 | elapsed time per iteration (s): 1.04 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.139100E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.810 | TFLOPs: 40.62 | 15: iteration 21780/ 125429 | consumed samples: 5575680 | consumed tokens: 11418992640 | elapsed time per iteration (s): 1.07 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.131958E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.672 | TFLOPs: 39.61 | 15: iteration 21790/ 125429 | consumed samples: 5578240 | consumed tokens: 11424235520 | elapsed time per iteration (s): 1.02 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.148176E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.789 | TFLOPs: 41.28 | 15: iteration 21800/ 125429 | consumed samples: 5580800 | consumed tokens: 11429478400 | elapsed time per iteration (s): 1.02 | learning rate: 1.881E-04 | global batch size: 256 | lm 
loss: 2.148534E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.907 | TFLOPs: 41.30 | 15: iteration 21810/ 125429 | consumed samples: 5583360 | consumed tokens: 11434721280 | elapsed time per iteration (s): 1.02 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.135633E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.452 | TFLOPs: 41.55 | 15: iteration 21820/ 125429 | consumed samples: 5585920 | consumed tokens: 11439964160 | elapsed time per iteration (s): 1.03 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.145990E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.135 | TFLOPs: 41.01 | 15: iteration 21830/ 125429 | consumed samples: 5588480 | consumed tokens: 11445207040 | elapsed time per iteration (s): 1.08 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.154736E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.915 | TFLOPs: 39.15 | 15: iteration 21840/ 125429 | consumed samples: 5591040 | consumed tokens: 11450449920 | elapsed time per iteration (s): 1.03 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.153669E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.487 | TFLOPs: 40.90 | 15: iteration 21850/ 125429 | consumed samples: 5593600 | consumed tokens: 11455692800 | elapsed time per iteration (s): 1.04 | learning rate: 1.881E-04 | global batch size: 256 | lm loss: 2.167941E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.722 | TFLOPs: 40.77 | 15: iteration 21860/ 125429 | consumed samples: 5596160 | consumed tokens: 
11460935680 | elapsed time per iteration (s): 1.07 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.169865E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.078 | TFLOPs: 39.67 | 15: iteration 21870/ 125429 | consumed samples: 5598720 | consumed tokens: 11466178560 | elapsed time per iteration (s): 1.06 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.160811E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.002 | TFLOPs: 39.83 | 15: iteration 21880/ 125429 | consumed samples: 5601280 | consumed tokens: 11471421440 | elapsed time per iteration (s): 1.03 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.126979E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.874 | TFLOPs: 40.96 | 15: iteration 21890/ 125429 | consumed samples: 5603840 | consumed tokens: 11476664320 | elapsed time per iteration (s): 1.04 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.122027E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.721 | TFLOPs: 40.77 | 15: iteration 21900/ 125429 | consumed samples: 5606400 | consumed tokens: 11481907200 | elapsed time per iteration (s): 1.05 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.159571E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.126 | TFLOPs: 40.34 | 15: iteration 21910/ 125429 | consumed samples: 5608960 | consumed tokens: 11487150080 | elapsed time per iteration (s): 1.06 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.150211E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
242.020 | TFLOPs: 40.00 | 15: iteration 21920/ 125429 | consumed samples: 5611520 | consumed tokens: 11492392960 | elapsed time per iteration (s): 1.08 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.125018E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.489 | TFLOPs: 39.08 | 15: iteration 21930/ 125429 | consumed samples: 5614080 | consumed tokens: 11497635840 | elapsed time per iteration (s): 1.06 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.129199E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.612 | TFLOPs: 39.76 | 15: iteration 21940/ 125429 | consumed samples: 5616640 | consumed tokens: 11502878720 | elapsed time per iteration (s): 1.05 | learning rate: 1.880E-04 | global batch size: 256 | lm loss: 2.167809E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.317 | TFLOPs: 40.38 | 15: iteration 21950/ 125429 | consumed samples: 5619200 | consumed tokens: 11508121600 | elapsed time per iteration (s): 1.07 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.164763E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.298 | TFLOPs: 39.55 | 15: iteration 21960/ 125429 | consumed samples: 5621760 | consumed tokens: 11513364480 | elapsed time per iteration (s): 1.03 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.154734E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.045 | TFLOPs: 40.99 | 15: iteration 21970/ 125429 | consumed samples: 5624320 | consumed tokens: 11518607360 | elapsed time per iteration (s): 1.04 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.155910E+00 | grad norm: 0.151 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.141 | TFLOPs: 40.84 | 15: iteration 21980/ 125429 | consumed samples: 5626880 | consumed tokens: 11523850240 | elapsed time per iteration (s): 1.04 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.134263E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.142 | TFLOPs: 40.68 | 15: iteration 21990/ 125429 | consumed samples: 5629440 | consumed tokens: 11529093120 | elapsed time per iteration (s): 1.08 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.174944E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.239 | TFLOPs: 39.21 | 0: [2022-11-26 02:22:45,913] [INFO] [logging.py:68:log_dist] [Rank 0] step=22000, skipped=0, lr=[0.00018788539534466566, 0.00018788539534466566, 0.00018788539534466566], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 22000/ 125429 | consumed samples: 5632000 | consumed tokens: 11534336000 | elapsed time per iteration (s): 1.03 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.133154E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.496 | TFLOPs: 41.07 | 0: steps: 22000 loss: 2.1177 iter time (s): 1.132 samples/sec: 226.056 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 22000 | lm loss value: 2.096421E+00 | lm loss PPL: 8.136999E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 22000 to checkpoints_1b5 0: [2022-11-26 02:22:46,282] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step22000 is begin to save! 
0: [2022-11-26 02:22:46,290] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_01-model_00-model_states.pt... 0: [2022-11-26 02:22:46,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_01-model_00-model_states.pt. 0: [2022-11-26 02:22:46,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_03-model_00-model_states.pt... 0: [2022-11-26 02:22:46,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_03-model_00-model_states.pt. 0: [2022-11-26 02:22:46,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_04-model_00-model_states.pt... 0: [2022-11-26 02:22:46,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_04-model_00-model_states.pt. 0: [2022-11-26 02:22:46,818] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_05-model_00-model_states.pt... 0: [2022-11-26 02:22:46,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_05-model_00-model_states.pt. 0: [2022-11-26 02:22:46,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_06-model_00-model_states.pt... 0: [2022-11-26 02:22:47,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_06-model_00-model_states.pt. 0: [2022-11-26 02:22:47,049] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_07-model_00-model_states.pt... 0: [2022-11-26 02:22:47,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 02:22:47,157] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_08-model_00-model_states.pt... 0: [2022-11-26 02:22:47,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_08-model_00-model_states.pt. 0: [2022-11-26 02:22:47,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_09-model_00-model_states.pt... 0: [2022-11-26 02:22:47,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_09-model_00-model_states.pt. 0: [2022-11-26 02:22:47,373] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_10-model_00-model_states.pt... 0: [2022-11-26 02:22:47,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_10-model_00-model_states.pt. 0: [2022-11-26 02:22:47,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_11-model_00-model_states.pt... 0: [2022-11-26 02:22:47,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_11-model_00-model_states.pt. 0: [2022-11-26 02:22:47,587] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_12-model_00-model_states.pt... 0: [2022-11-26 02:22:47,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_12-model_00-model_states.pt. 0: [2022-11-26 02:22:47,696] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_13-model_00-model_states.pt... 0: [2022-11-26 02:22:47,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 02:22:47,805] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_14-model_00-model_states.pt... 0: [2022-11-26 02:22:47,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_14-model_00-model_states.pt. 0: [2022-11-26 02:22:47,914] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_15-model_00-model_states.pt... 0: [2022-11-26 02:22:48,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_15-model_00-model_states.pt. 0: [2022-11-26 02:22:48,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_16-model_00-model_states.pt... 0: [2022-11-26 02:22:48,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_16-model_00-model_states.pt. 0: [2022-11-26 02:22:48,135] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_17-model_00-model_states.pt... 0: [2022-11-26 02:22:48,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_17-model_00-model_states.pt. 0: [2022-11-26 02:22:48,250] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_18-model_00-model_states.pt... 0: [2022-11-26 02:22:48,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_18-model_00-model_states.pt. 0: [2022-11-26 02:22:48,360] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_19-model_00-model_states.pt... 0: [2022-11-26 02:22:48,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 02:22:48,471] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_20-model_00-model_states.pt... 0: [2022-11-26 02:22:48,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_20-model_00-model_states.pt. 0: [2022-11-26 02:22:48,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_21-model_00-model_states.pt... 0: [2022-11-26 02:22:48,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_21-model_00-model_states.pt. 0: [2022-11-26 02:22:48,691] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_22-model_00-model_states.pt... 0: [2022-11-26 02:22:48,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_22-model_00-model_states.pt. 0: [2022-11-26 02:22:48,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_23-model_00-model_states.pt... 0: [2022-11-26 02:22:48,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_23-model_00-model_states.pt. 0: [2022-11-26 02:22:48,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_24-model_00-model_states.pt... 0: [2022-11-26 02:22:49,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_24-model_00-model_states.pt. 0: [2022-11-26 02:22:49,017] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_25-model_00-model_states.pt... 0: [2022-11-26 02:22:49,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 02:22:49,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_26-model_00-model_states.pt... 0: [2022-11-26 02:22:49,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_26-model_00-model_states.pt. 0: [2022-11-26 02:22:49,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_27-model_00-model_states.pt... 0: [2022-11-26 02:22:49,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_27-model_00-model_states.pt. 0: [2022-11-26 02:22:49,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_28-model_00-model_states.pt... 0: [2022-11-26 02:22:49,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_28-model_00-model_states.pt. 0: [2022-11-26 02:22:49,457] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_29-model_00-model_states.pt... 0: [2022-11-26 02:22:49,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_29-model_00-model_states.pt. 0: [2022-11-26 02:22:49,568] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_30-model_00-model_states.pt... 0: [2022-11-26 02:22:49,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_30-model_00-model_states.pt. 0: [2022-11-26 02:22:49,676] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/layer_32-model_00-model_states.pt... 0: [2022-11-26 02:22:49,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 02:22:49,683] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step22000/mp_rank_00_model_states.pt 0: [2022-11-26 02:22:49,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/mp_rank_00_model_states.pt... 0: [2022-11-26 02:22:49,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/mp_rank_00_model_states.pt. 0: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
8: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
15: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
14: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
6: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
2: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 
12: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
4: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
12: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:22:49,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step22000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:22:49,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:22:49,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 02:22:49,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 02:22:49,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 9: [2022-11-26 02:22:49,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:22:49,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 02:22:49,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 15: [2022-11-26 02:22:49,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:22:49,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:22:49,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 02:22:49,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-26 02:22:49,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:22:49,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 02:22:49,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
0: [2022-11-26 02:22:49,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:22:49,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 02:22:49,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 12: [2022-11-26 02:22:49,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:22:49,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:22:49,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 6: [2022-11-26 02:22:49,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:22:49,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 6: [2022-11-26 02:22:49,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 12: [2022-11-26 02:22:49,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 9: [2022-11-26 02:22:49,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-26 02:22:49,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
15: [2022-11-26 02:22:49,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 02:22:49,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 15: [2022-11-26 02:22:49,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:22:49,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 02:22:49,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-26 02:22:49,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:22:49,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 02:22:49,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 8: [2022-11-26 02:22:49,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:22:49,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 02:22:49,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-26 02:22:49,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 02:22:49,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 02:22:49,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-26 02:22:49,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:22:49,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 02:22:49,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-26 02:22:49,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:22:49,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 02:22:49,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 11: [2022-11-26 02:22:49,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:22:49,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 02:22:49,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 13: [2022-11-26 02:22:49,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
11: [2022-11-26 02:22:49,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:22:49,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:22:49,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 02:22:49,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 5: [2022-11-26 02:22:49,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:22:49,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:22:49,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 02:22:49,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 02:22:49,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 5: [2022-11-26 02:22:49,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 12: [2022-11-26 02:22:49,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 02:22:49,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 02:22:49,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 14: [2022-11-26 02:22:49,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:22:49,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 02:22:49,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-26 02:22:49,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:22:49,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:22:49,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:22:49,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 02:22:49,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
13: [2022-11-26 02:22:49,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 4: [2022-11-26 02:22:49,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 13: [2022-11-26 02:22:49,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-26 02:22:49,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 4: [2022-11-26 02:22:49,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-26 02:22:49,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-26 02:22:49,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:22:49,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:22:49,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 2: [2022-11-26 02:22:49,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 4: [2022-11-26 02:22:49,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-26 02:22:49,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
5: [2022-11-26 02:22:49,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:22:49,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 12: [2022-11-26 02:22:49,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:22:49,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 12: [2022-11-26 02:22:49,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 02:22:49,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-26 02:22:49,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:22:49,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 02:22:49,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 5: [2022-11-26 02:22:49,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:22:49,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 02:22:49,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
12: [2022-11-26 02:22:49,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:22:49,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:22:49,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 11: [2022-11-26 02:22:49,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 12: [2022-11-26 02:22:49,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 11: [2022-11-26 02:22:49,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 8: [2022-11-26 02:22:49,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:22:49,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 02:22:49,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 3: [2022-11-26 02:22:49,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:22:49,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 02:22:49,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
9: [2022-11-26 02:22:49,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:22:49,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 02:22:49,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 14: [2022-11-26 02:22:49,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:22:49,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 02:22:49,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 3: [2022-11-26 02:22:49,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:22:49,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 02:22:49,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 15: [2022-11-26 02:22:49,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 02:22:49,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 15: [2022-11-26 02:22:49,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-26 02:22:49,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 02:22:49,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 15: [2022-11-26 02:22:49,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:22:49,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 02:22:49,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 13: [2022-11-26 02:22:49,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:22:49,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 02:22:49,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-26 02:22:49,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:22:49,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:22:49,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 02:22:49,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
7: [2022-11-26 02:22:49,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 02:22:49,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 3: [2022-11-26 02:22:49,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:22:49,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 02:22:49,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 12: [2022-11-26 02:22:49,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:22:49,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 02:22:49,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-26 02:22:49,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:22:49,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 02:22:49,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-26 02:22:49,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 02:22:49,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 02:22:49,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-26 02:22:49,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:22:49,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:22:49,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 02:22:49,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 02:22:49,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-26 02:22:49,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 8: [2022-11-26 02:22:49,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:22:49,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 02:22:49,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 9: [2022-11-26 02:22:49,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 02:22:49,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 02:22:49,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-26 02:22:49,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:22:49,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 02:22:49,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 3: [2022-11-26 02:22:49,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:22:49,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 02:22:49,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-26 02:22:49,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:22:49,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 02:22:49,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 02:22:49,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 02:22:49,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-26 02:22:49,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 15: [2022-11-26 02:22:49,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:22:49,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:22:49,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 02:22:49,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-26 02:22:49,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:22:49,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 02:22:49,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 5: [2022-11-26 02:22:49,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 02:22:49,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 02:22:49,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 5: [2022-11-26 02:22:49,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:22:49,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:22:49,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 02:22:49,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 02:22:49,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 5: [2022-11-26 02:22:49,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-26 02:22:49,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:22:49,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 02:22:49,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 12: [2022-11-26 02:22:49,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 02:22:49,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 02:22:49,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-26 02:22:49,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:22:49,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:22:49,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 8: [2022-11-26 02:22:49,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 02:22:49,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-26 02:22:49,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 13: [2022-11-26 02:22:49,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:22:49,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:22:49,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 7: [2022-11-26 02:22:49,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
13: [2022-11-26 02:22:49,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 02:22:49,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 13: [2022-11-26 02:22:49,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-26 02:22:49,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 02:22:49,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-26 02:22:49,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:22:49,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 12: [2022-11-26 02:22:49,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:22:49,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 12: [2022-11-26 02:22:49,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 02:22:49,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
15: [2022-11-26 02:22:49,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 02:22:49,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 15: [2022-11-26 02:22:49,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:22:49,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 02:22:49,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-26 02:22:49,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:22:49,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:22:49,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:22:49,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 02:22:49,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 02:22:49,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
4: [2022-11-26 02:22:49,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 02:22:49,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-26 02:22:49,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 13: [2022-11-26 02:22:49,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:22:49,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 02:22:49,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 3: [2022-11-26 02:22:49,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:22:49,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 02:22:49,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 9: [2022-11-26 02:22:49,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:22:49,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 02:22:49,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
8: [2022-11-26 02:22:49,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:22:49,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 02:22:49,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 15: [2022-11-26 02:22:49,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:22:49,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:22:49,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 02:22:49,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 11: [2022-11-26 02:22:49,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:22:49,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 02:22:49,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-26 02:22:49,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:22:49,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
6: [2022-11-26 02:22:49,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 11: [2022-11-26 02:22:49,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 6: [2022-11-26 02:22:49,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 11: [2022-11-26 02:22:49,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 11: [2022-11-26 02:22:49,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:22:49,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 02:22:49,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 9: [2022-11-26 02:22:49,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:22:49,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 02:22:49,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-26 02:22:49,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 02:22:49,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 02:22:49,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-26 02:22:49,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:22:49,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 02:22:49,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 9: [2022-11-26 02:22:49,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:22:49,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 02:22:49,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-26 02:22:49,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:22:49,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 02:22:49,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-26 02:22:49,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
3: [2022-11-26 02:22:49,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:22:49,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 2: [2022-11-26 02:22:49,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 3: [2022-11-26 02:22:49,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 2: [2022-11-26 02:22:49,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-26 02:22:49,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:22:49,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 02:22:49,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 9: [2022-11-26 02:22:49,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:22:49,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 02:22:49,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-26 02:22:49,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 02:22:49,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 02:22:49,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 4: [2022-11-26 02:22:49,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:22:49,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 02:22:49,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 8: [2022-11-26 02:22:49,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:22:49,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 02:22:49,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 3: [2022-11-26 02:22:49,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:22:49,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 02:22:49,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 11: [2022-11-26 02:22:49,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 02:22:49,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 02:22:49,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 8: [2022-11-26 02:22:49,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:22:49,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:22:49,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 02:22:49,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 8: [2022-11-26 02:22:49,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 02:22:49,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-26 02:22:49,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:22:49,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 02:22:49,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
15: [2022-11-26 02:22:49,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 02:22:49,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 10: [2022-11-26 02:22:49,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:22:49,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 02:22:49,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 10: [2022-11-26 02:22:49,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:22:49,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 02:22:49,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 10: [2022-11-26 02:22:49,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:22:49,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 02:22:49,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 13: [2022-11-26 02:22:49,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 02:22:49,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 02:22:49,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 13: [2022-11-26 02:22:49,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:22:49,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 02:22:49,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 13: [2022-11-26 02:22:49,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:22:49,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 02:22:49,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 6: [2022-11-26 02:22:49,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:22:49,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 02:22:49,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 5: [2022-11-26 02:22:49,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 02:22:49,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 02:22:49,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 14: [2022-11-26 02:22:49,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:22:49,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 02:22:49,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 7: [2022-11-26 02:22:49,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:22:49,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 02:22:49,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 0: [2022-11-26 02:22:49,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 02:22:49,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 12: [2022-11-26 02:22:49,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 02:22:49,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 02:22:49,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 10: [2022-11-26 02:22:49,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:22:49,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 02:22:49,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-26 02:22:49,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:22:49,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 02:22:49,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 10: [2022-11-26 02:22:49,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:22:49,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 02:22:49,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 10: [2022-11-26 02:22:49,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 02:22:49,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 02:22:49,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 10: [2022-11-26 02:22:49,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:22:49,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 02:22:49,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 10: [2022-11-26 02:22:49,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:22:49,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 02:22:49,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-26 02:22:49,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:22:49,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 02:22:49,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 11: [2022-11-26 02:22:50,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 02:22:50,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 02:22:50,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-26 02:22:50,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:22:50,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 02:22:50,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 14: [2022-11-26 02:22:50,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:22:50,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 02:22:50,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 14: [2022-11-26 02:22:50,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:22:50,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 02:22:50,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 1: [2022-11-26 02:22:50,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 02:22:50,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 02:22:50,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 14: [2022-11-26 02:22:50,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:22:50,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 02:22:50,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 14: [2022-11-26 02:22:50,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:22:50,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step22000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 02:22:50,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step22000 is ready now! 
0: successfully saved checkpoint at iteration 22000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3815.76
15: iteration 22010/ 125429 | consumed samples: 5634560 | consumed tokens: 11539578880 | elapsed time per iteration (s): 1.44 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.143506E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.811 | TFLOPs: 29.38 |
15: iteration 22020/ 125429 | consumed samples: 5637120 | consumed tokens: 11544821760 | elapsed time per iteration (s): 1.03 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.142319E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.038 | TFLOPs: 41.16 |
15: iteration 22030/ 125429 | consumed samples: 5639680 | consumed tokens: 11550064640 | elapsed time per iteration (s): 1.03 | learning rate: 1.879E-04 | global batch size: 256 | lm loss: 2.148853E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.162 | TFLOPs: 41.01 |
15: iteration 22040/ 125429 | consumed samples: 5642240 | consumed tokens: 11555307520 | elapsed time per iteration (s): 1.04 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.110970E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.172 | TFLOPs: 40.85 |
15: iteration 22050/ 125429 | consumed samples: 5644800 | consumed tokens: 11560550400 | elapsed time per iteration (s): 1.04 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.131649E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.110 | TFLOPs: 40.84 |
15: iteration 22060/ 125429 | consumed samples: 5647360 | consumed tokens: 11565793280 | elapsed time per iteration (s): 1.04 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.146931E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.415 | TFLOPs: 40.72 |
15: iteration 22070/ 125429 | consumed samples: 5649920 | consumed tokens: 11571036160 | elapsed time per iteration (s): 1.06 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.239380E+00 | grad norm: 0.498 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.502 | TFLOPs: 39.91 |
15: iteration 22080/ 125429 | consumed samples: 5652480 | consumed tokens: 11576279040 | elapsed time per iteration (s): 1.05 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.193028E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.787 | TFLOPs: 40.12 |
15: iteration 22090/ 125429 | consumed samples: 5655040 | consumed tokens: 11581521920 | elapsed time per iteration (s): 1.07 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.178136E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.816 | TFLOPs: 39.63 |
15: iteration 22100/ 125429 | consumed samples: 5657600 | consumed tokens: 11586764800 | elapsed time per iteration (s): 1.04 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.165952E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.186 | TFLOPs: 40.68 |
15: iteration 22110/ 125429 | consumed samples: 5660160 | consumed tokens: 11592007680 | elapsed time per iteration (s): 1.07 | learning rate: 1.878E-04 | global batch size: 256 | lm loss: 2.164281E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.936 | TFLOPs: 39.65 |
15: iteration 22120/ 125429 | consumed samples: 5662720 | consumed tokens: 11597250560 | elapsed time per iteration (s): 1.02 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.154331E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.155 | TFLOPs: 41.51 |
15: iteration 22130/ 125429 | consumed samples: 5665280 | consumed tokens: 11602493440 | elapsed time per iteration (s): 1.11 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.163284E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.842 | TFLOPs: 37.98 |
15: iteration 22140/ 125429 | consumed samples: 5667840 | consumed tokens: 11607736320 | elapsed time per iteration (s): 1.02 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.111747E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.380 | TFLOPs: 41.54 |
15: iteration 22150/ 125429 | consumed samples: 5670400 | consumed tokens: 11612979200 | elapsed time per iteration (s): 1.06 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.135390E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.603 | TFLOPs: 39.76 |
15: iteration 22160/ 125429 | consumed samples: 5672960 | consumed tokens: 11618222080 | elapsed time per iteration (s): 1.02 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.167751E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.402 | TFLOPs: 41.55 |
15: iteration 22170/ 125429 | consumed samples: 5675520 | consumed tokens: 11623464960 | elapsed time per iteration (s): 1.07 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.116382E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 239.029 | TFLOPs: 39.50 | 15: iteration 22180/ 125429 | consumed samples: 5678080 | consumed tokens: 11628707840 | elapsed time per iteration (s): 1.05 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.152301E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.807 | TFLOPs: 40.29 | 15: iteration 22190/ 125429 | consumed samples: 5680640 | consumed tokens: 11633950720 | elapsed time per iteration (s): 1.11 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.163912E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.120 | TFLOPs: 38.19 | 15: iteration 22200/ 125429 | consumed samples: 5683200 | consumed tokens: 11639193600 | elapsed time per iteration (s): 1.07 | learning rate: 1.877E-04 | global batch size: 256 | lm loss: 2.139377E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.341 | TFLOPs: 39.39 | 15: iteration 22210/ 125429 | consumed samples: 5685760 | consumed tokens: 11644436480 | elapsed time per iteration (s): 1.07 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.156211E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.157 | TFLOPs: 39.69 | 15: iteration 22220/ 125429 | consumed samples: 5688320 | consumed tokens: 11649679360 | elapsed time per iteration (s): 1.07 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.137085E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.493 | TFLOPs: 39.41 | 15: iteration 22230/ 125429 | consumed samples: 5690880 | consumed tokens: 11654922240 | elapsed time per iteration (s): 1.06 | learning rate: 1.876E-04 | global batch size: 
256 | lm loss: 2.162547E+00 | grad norm: 0.628 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.978 | TFLOPs: 39.82 | 15: iteration 22240/ 125429 | consumed samples: 5693440 | consumed tokens: 11660165120 | elapsed time per iteration (s): 1.06 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.163538E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.690 | TFLOPs: 39.78 | 15: iteration 22250/ 125429 | consumed samples: 5696000 | consumed tokens: 11665408000 | elapsed time per iteration (s): 1.06 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.110435E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.871 | TFLOPs: 39.97 | 15: iteration 22260/ 125429 | consumed samples: 5698560 | consumed tokens: 11670650880 | elapsed time per iteration (s): 1.05 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.148575E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.502 | TFLOPs: 40.41 | 15: iteration 22270/ 125429 | consumed samples: 5701120 | consumed tokens: 11675893760 | elapsed time per iteration (s): 1.05 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.165853E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.874 | TFLOPs: 40.30 | 15: iteration 22280/ 125429 | consumed samples: 5703680 | consumed tokens: 11681136640 | elapsed time per iteration (s): 1.14 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.162026E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.277 | TFLOPs: 37.23 | 15: iteration 22290/ 125429 | consumed samples: 5706240 | consumed 
tokens: 11686379520 | elapsed time per iteration (s): 1.06 | learning rate: 1.876E-04 | global batch size: 256 | lm loss: 2.183591E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.413 | TFLOPs: 39.90 | 15: iteration 22300/ 125429 | consumed samples: 5708800 | consumed tokens: 11691622400 | elapsed time per iteration (s): 1.04 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.125704E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.313 | TFLOPs: 40.54 | 15: iteration 22310/ 125429 | consumed samples: 5711360 | consumed tokens: 11696865280 | elapsed time per iteration (s): 1.04 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.140130E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.855 | TFLOPs: 40.79 | 15: iteration 22320/ 125429 | consumed samples: 5713920 | consumed tokens: 11702108160 | elapsed time per iteration (s): 1.04 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.122755E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.934 | TFLOPs: 40.81 | 15: iteration 22330/ 125429 | consumed samples: 5716480 | consumed tokens: 11707351040 | elapsed time per iteration (s): 1.03 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.148856E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.671 | TFLOPs: 40.93 | 15: iteration 22340/ 125429 | consumed samples: 5719040 | consumed tokens: 11712593920 | elapsed time per iteration (s): 1.08 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.127472E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 237.315 | TFLOPs: 39.22 | 15: iteration 22350/ 125429 | consumed samples: 5721600 | consumed tokens: 11717836800 | elapsed time per iteration (s): 1.06 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.097424E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.507 | TFLOPs: 39.91 | 15: iteration 22360/ 125429 | consumed samples: 5724160 | consumed tokens: 11723079680 | elapsed time per iteration (s): 1.03 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.137980E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.685 | TFLOPs: 41.26 | 15: iteration 22370/ 125429 | consumed samples: 5726720 | consumed tokens: 11728322560 | elapsed time per iteration (s): 1.04 | learning rate: 1.875E-04 | global batch size: 256 | lm loss: 2.134526E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.280 | TFLOPs: 40.86 | 15: iteration 22380/ 125429 | consumed samples: 5729280 | consumed tokens: 11733565440 | elapsed time per iteration (s): 1.07 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.125749E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.345 | TFLOPs: 39.39 | 15: iteration 22390/ 125429 | consumed samples: 5731840 | consumed tokens: 11738808320 | elapsed time per iteration (s): 1.05 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.092149E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.219 | TFLOPs: 40.36 | 15: iteration 22400/ 125429 | consumed samples: 5734400 | consumed tokens: 11744051200 | elapsed time per iteration (s): 1.03 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.131025E+00 | grad norm: 
0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.606 | TFLOPs: 41.08 | 15: iteration 22410/ 125429 | consumed samples: 5736960 | consumed tokens: 11749294080 | elapsed time per iteration (s): 1.05 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.132589E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.541 | TFLOPs: 40.41 | 15: iteration 22420/ 125429 | consumed samples: 5739520 | consumed tokens: 11754536960 | elapsed time per iteration (s): 1.05 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.109226E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.808 | TFLOPs: 40.46 | 15: iteration 22430/ 125429 | consumed samples: 5742080 | consumed tokens: 11759779840 | elapsed time per iteration (s): 1.06 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.107056E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.618 | TFLOPs: 39.76 | 15: iteration 22440/ 125429 | consumed samples: 5744640 | consumed tokens: 11765022720 | elapsed time per iteration (s): 1.08 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.138797E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.473 | TFLOPs: 39.24 | 15: iteration 22450/ 125429 | consumed samples: 5747200 | consumed tokens: 11770265600 | elapsed time per iteration (s): 1.05 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.151795E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.691 | TFLOPs: 40.44 | 15: iteration 22460/ 125429 | consumed samples: 5749760 | consumed tokens: 11775508480 | elapsed time per 
iteration (s): 1.04 | learning rate: 1.874E-04 | global batch size: 256 | lm loss: 2.151418E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.558 | TFLOPs: 40.75 | 15: iteration 22470/ 125429 | consumed samples: 5752320 | consumed tokens: 11780751360 | elapsed time per iteration (s): 1.05 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.153292E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.968 | TFLOPs: 40.15 | 15: iteration 22480/ 125429 | consumed samples: 5754880 | consumed tokens: 11785994240 | elapsed time per iteration (s): 1.07 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.128541E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.594 | TFLOPs: 39.43 | 15: iteration 22490/ 125429 | consumed samples: 5757440 | consumed tokens: 11791237120 | elapsed time per iteration (s): 1.04 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.118546E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.107 | TFLOPs: 40.67 | 15: iteration 22500/ 125429 | consumed samples: 5760000 | consumed tokens: 11796480000 | elapsed time per iteration (s): 1.04 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.139354E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.088 | TFLOPs: 40.50 | 15: iteration 22510/ 125429 | consumed samples: 5762560 | consumed tokens: 11801722880 | elapsed time per iteration (s): 2.91 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.148242E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 87.958 | TFLOPs: 14.54 | 15: 
iteration 22520/ 125429 | consumed samples: 5765120 | consumed tokens: 11806965760 | elapsed time per iteration (s): 1.05 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.150276E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.419 | TFLOPs: 40.23 | 15: iteration 22530/ 125429 | consumed samples: 5767680 | consumed tokens: 11812208640 | elapsed time per iteration (s): 1.05 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.152660E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.179 | TFLOPs: 40.35 | 15: iteration 22540/ 125429 | consumed samples: 5770240 | consumed tokens: 11817451520 | elapsed time per iteration (s): 1.07 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.135564E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.574 | TFLOPs: 39.59 | 15: iteration 22550/ 125429 | consumed samples: 5772800 | consumed tokens: 11822694400 | elapsed time per iteration (s): 1.11 | learning rate: 1.873E-04 | global batch size: 256 | lm loss: 2.172633E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.359 | TFLOPs: 38.07 | 15: iteration 22560/ 125429 | consumed samples: 5775360 | consumed tokens: 11827937280 | elapsed time per iteration (s): 1.05 | learning rate: 1.872E-04 | global batch size: 256 | lm loss: 2.128907E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.632 | TFLOPs: 40.26 | 15: iteration 22570/ 125429 | consumed samples: 5777920 | consumed tokens: 11833180160 | elapsed time per iteration (s): 1.08 | learning rate: 1.872E-04 | global batch size: 256 | lm loss: 2.130240E+00 | grad norm: 0.144 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.001 | TFLOPs: 39.33 | 15: iteration 22580/ 125429 | consumed samples: 5780480 | consumed tokens: 11838423040 | elapsed time per iteration (s): 1.06 | learning rate: 1.872E-04 | global batch size: 256 | lm loss: 2.114531E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.553 | TFLOPs: 39.75 | 15: iteration 22590/ 125429 | consumed samples: 5783040 | consumed tokens: 11843665920 | elapsed time per iteration (s): 1.03 | learning rate: 1.872E-04 | global batch size: 256 | lm loss: 2.132929E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.677 | TFLOPs: 40.93 | 15: iteration 22600/ 125429 | consumed samples: 5785600 | consumed tokens: 11848908800 | elapsed time per iteration (s): 1.05 | learning rate: 1.872E-04 | global batch size: 256 | lm loss: 2.140938E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.780 | TFLOPs: 40.45 | 15: iteration 22610/ 125429 | consumed samples: 5788160 | consumed tokens: 11854151680 | elapsed time per iteration (s): 1.11 | learning rate: 1.872E-04 | global batch size: 256 | lm loss: 2.116542E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.109 | TFLOPs: 38.03 | 15: iteration 22620/ 125429 | consumed samples: 5790720 | consumed tokens: 11859394560 | elapsed time per iteration (s): 1.10 | learning rate: 1.872E-04 | global batch size: 256 | lm loss: 2.129807E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.127 | TFLOPs: 38.53 | 15: iteration 22630/ 125429 | consumed samples: 5793280 | consumed tokens: 11864637440 | elapsed time per iteration (s): 1.04 | learning rate: 
1.872E-04 | global batch size: 256 | lm loss: 2.158381E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.933 | TFLOPs: 40.64 | 15: iteration 22640/ 125429 | consumed samples: 5795840 | consumed tokens: 11869880320 | elapsed time per iteration (s): 1.04 | learning rate: 1.871E-04 | global batch size: 256 | lm loss: 2.124737E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.232 | TFLOPs: 40.69 | 15: iteration 22650/ 125429 | consumed samples: 5798400 | consumed tokens: 11875123200 | elapsed time per iteration (s): 1.04 | learning rate: 1.871E-04 | global batch size: 256 | lm loss: 2.132195E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.306 | TFLOPs: 40.54 | 15: iteration 22660/ 125429 | consumed samples: 5800960 | consumed tokens: 11880366080 | elapsed time per iteration (s): 1.05 | learning rate: 1.871E-04 | global batch size: 256 | lm loss: 2.132547E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.210 | TFLOPs: 40.36 | 15: iteration 22670/ 125429 | consumed samples: 5803520 | consumed tokens: 11885608960 | elapsed time per iteration (s): 1.04 | learning rate: 1.871E-04 | global batch size: 256 | lm loss: 2.129959E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.414 | TFLOPs: 40.72 | 15: iteration 22680/ 125429 | consumed samples: 5806080 | consumed tokens: 11890851840 | elapsed time per iteration (s): 1.12 | learning rate: 1.871E-04 | global batch size: 256 | lm loss: 2.153802E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.116 | TFLOPs: 37.70 | 15: iteration 22690/ 125429 | consumed 
samples: 5808640 | consumed tokens: 11896094720 | elapsed time per iteration (s): 1.04 | learning rate: 1.871E-04 | global batch size: 256 | lm loss: 2.164051E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.353 | TFLOPs: 40.55 | 15: iteration 22700/ 125429 | consumed samples: 5811200 | consumed tokens: 11901337600 | elapsed time per iteration (s): 1.02 | learning rate: 1.871E-04 | global batch size: 256 | lm loss: 2.137278E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.449 | TFLOPs: 41.55 | 15: iteration 22710/ 125429 | consumed samples: 5813760 | consumed tokens: 11906580480 | elapsed time per iteration (s): 1.09 | learning rate: 1.871E-04 | global batch size: 256 | lm loss: 2.159098E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.875 | TFLOPs: 38.98 | 15: iteration 22720/ 125429 | consumed samples: 5816320 | consumed tokens: 11911823360 | elapsed time per iteration (s): 1.05 | learning rate: 1.871E-04 | global batch size: 256 | lm loss: 2.122377E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.906 | TFLOPs: 40.31 | 15: iteration 22730/ 125429 | consumed samples: 5818880 | consumed tokens: 11917066240 | elapsed time per iteration (s): 1.03 | learning rate: 1.870E-04 | global batch size: 256 | lm loss: 2.117213E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.845 | TFLOPs: 41.12 | 15: iteration 22740/ 125429 | consumed samples: 5821440 | consumed tokens: 11922309120 | elapsed time per iteration (s): 1.06 | learning rate: 1.870E-04 | global batch size: 256 | lm loss: 2.120672E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 240.868 | TFLOPs: 39.81 | 15: iteration 22750/ 125429 | consumed samples: 5824000 | consumed tokens: 11927552000 | elapsed time per iteration (s): 1.07 | learning rate: 1.870E-04 | global batch size: 256 | lm loss: 2.107732E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.488 | TFLOPs: 39.58 | 15: iteration 22760/ 125429 | consumed samples: 5826560 | consumed tokens: 11932794880 | elapsed time per iteration (s): 1.04 | learning rate: 1.870E-04 | global batch size: 256 | lm loss: 2.169830E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.463 | TFLOPs: 40.56 | 15: iteration 22770/ 125429 | consumed samples: 5829120 | consumed tokens: 11938037760 | elapsed time per iteration (s): 1.03 | learning rate: 1.870E-04 | global batch size: 256 | lm loss: 2.147658E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.434 | TFLOPs: 41.06 | 15: iteration 22780/ 125429 | consumed samples: 5831680 | consumed tokens: 11943280640 | elapsed time per iteration (s): 1.05 | learning rate: 1.870E-04 | global batch size: 256 | lm loss: 2.146958E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.437 | TFLOPs: 40.23 | 15: iteration 22790/ 125429 | consumed samples: 5834240 | consumed tokens: 11948523520 | elapsed time per iteration (s): 1.10 | learning rate: 1.870E-04 | global batch size: 256 | lm loss: 2.152386E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.733 | TFLOPs: 38.30 | 15: iteration 22800/ 125429 | consumed samples: 5836800 | consumed tokens: 11953766400 | elapsed time per iteration (s): 1.03 | learning rate: 1.870E-04 | global batch size: 256 | lm 
loss: 2.135101E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.522 | TFLOPs: 41.07 | 15: iteration 22810/ 125429 | consumed samples: 5839360 | consumed tokens: 11959009280 | elapsed time per iteration (s): 1.02 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.140553E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.112 | TFLOPs: 41.33 | 15: iteration 22820/ 125429 | consumed samples: 5841920 | consumed tokens: 11964252160 | elapsed time per iteration (s): 1.07 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.138282E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.271 | TFLOPs: 39.54 | 15: iteration 22830/ 125429 | consumed samples: 5844480 | consumed tokens: 11969495040 | elapsed time per iteration (s): 1.03 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.142656E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.722 | TFLOPs: 41.10 | 15: iteration 22840/ 125429 | consumed samples: 5847040 | consumed tokens: 11974737920 | elapsed time per iteration (s): 1.06 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.119212E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.365 | TFLOPs: 40.05 | 15: iteration 22850/ 125429 | consumed samples: 5849600 | consumed tokens: 11979980800 | elapsed time per iteration (s): 1.02 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.143827E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.339 | TFLOPs: 41.37 | 15: iteration 22860/ 125429 | consumed samples: 5852160 | consumed tokens: 
11985223680 | elapsed time per iteration (s): 1.07 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.117755E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.339 | TFLOPs: 39.72 | 15: iteration 22870/ 125429 | consumed samples: 5854720 | consumed tokens: 11990466560 | elapsed time per iteration (s): 1.10 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.146990E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.951 | TFLOPs: 38.50 | 15: iteration 22880/ 125429 | consumed samples: 5857280 | consumed tokens: 11995709440 | elapsed time per iteration (s): 1.08 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.156743E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.442 | TFLOPs: 39.07 | 15: iteration 22890/ 125429 | consumed samples: 5859840 | consumed tokens: 12000952320 | elapsed time per iteration (s): 1.12 | learning rate: 1.869E-04 | global batch size: 256 | lm loss: 2.130894E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.475 | TFLOPs: 37.92 | 15: iteration 22900/ 125429 | consumed samples: 5862400 | consumed tokens: 12006195200 | elapsed time per iteration (s): 1.07 | learning rate: 1.868E-04 | global batch size: 256 | lm loss: 2.154361E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.573 | TFLOPs: 39.43 | 15: iteration 22910/ 125429 | consumed samples: 5864960 | consumed tokens: 12011438080 | elapsed time per iteration (s): 1.06 | learning rate: 1.868E-04 | global batch size: 256 | lm loss: 2.129135E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
241.444 | TFLOPs: 39.90 | 15: iteration 22920/ 125429 | consumed samples: 5867520 | consumed tokens: 12016680960 | elapsed time per iteration (s): 1.06 | learning rate: 1.868E-04 | global batch size: 256 | lm loss: 2.142851E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.550 | TFLOPs: 39.75 | 15: iteration 22930/ 125429 | consumed samples: 5870080 | consumed tokens: 12021923840 | elapsed time per iteration (s): 1.06 | learning rate: 1.868E-04 | global batch size: 256 | lm loss: 2.134771E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.038 | TFLOPs: 40.00 | 15: iteration 22940/ 125429 | consumed samples: 5872640 | consumed tokens: 12027166720 | elapsed time per iteration (s): 1.03 | learning rate: 1.868E-04 | global batch size: 256 | lm loss: 2.098437E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.557 | TFLOPs: 41.24 | 15: iteration 22950/ 125429 | consumed samples: 5875200 | consumed tokens: 12032409600 | elapsed time per iteration (s): 1.04 | learning rate: 1.868E-04 | global batch size: 256 | lm loss: 2.123899E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.027 | TFLOPs: 40.49 | 15: iteration 22960/ 125429 | consumed samples: 5877760 | consumed tokens: 12037652480 | elapsed time per iteration (s): 1.07 | learning rate: 1.868E-04 | global batch size: 256 | lm loss: 2.146771E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.026 | TFLOPs: 39.50 | 15: iteration 22970/ 125429 | consumed samples: 5880320 | consumed tokens: 12042895360 | elapsed time per iteration (s): 1.05 | learning rate: 1.868E-04 | global batch size: 256 | lm loss: 2.109311E+00 | grad norm: 0.138 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.845 | TFLOPs: 40.13 |
15: iteration 22980/ 125429 | consumed samples: 5882880 | consumed tokens: 12048138240 | elapsed time per iteration (s): 1.06 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.118461E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.526 | TFLOPs: 40.08 |
15: iteration 22990/ 125429 | consumed samples: 5885440 | consumed tokens: 12053381120 | elapsed time per iteration (s): 1.11 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.123663E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.852 | TFLOPs: 38.15 |
15: iteration 23000/ 125429 | consumed samples: 5888000 | consumed tokens: 12058624000 | elapsed time per iteration (s): 1.05 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.129981E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.389 | TFLOPs: 40.39 |
15: -------------------------------------------------------------------------------------------
15: valid loss at iteration 23000 | lm loss value: 2.089360E+00 | lm loss PPL: 8.079741E+00 |
15: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 23000 to checkpoints_1b5
0: [2022-11-26 02:40:45,707] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step23000 is begin to save!
0: [2022-11-26 02:40:45,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_01-model_00-model_states.pt...
0: [2022-11-26 02:40:46,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_01-model_00-model_states.pt.
0: [2022-11-26 02:40:46,092] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_03-model_00-model_states.pt... 0: [2022-11-26 02:40:46,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_03-model_00-model_states.pt. 0: [2022-11-26 02:40:46,258] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_04-model_00-model_states.pt... 0: [2022-11-26 02:40:46,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_04-model_00-model_states.pt. 0: [2022-11-26 02:40:46,426] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_05-model_00-model_states.pt... 0: [2022-11-26 02:40:46,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_05-model_00-model_states.pt. 0: [2022-11-26 02:40:46,589] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_06-model_00-model_states.pt... 0: [2022-11-26 02:40:46,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_06-model_00-model_states.pt. 0: [2022-11-26 02:40:46,749] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_07-model_00-model_states.pt... 0: [2022-11-26 02:40:46,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_07-model_00-model_states.pt. 0: [2022-11-26 02:40:46,911] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_08-model_00-model_states.pt... 0: [2022-11-26 02:40:47,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_08-model_00-model_states.pt. 
0: [2022-11-26 02:40:47,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_09-model_00-model_states.pt... 0: [2022-11-26 02:40:47,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_09-model_00-model_states.pt. 0: [2022-11-26 02:40:47,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_10-model_00-model_states.pt... 0: [2022-11-26 02:40:47,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_10-model_00-model_states.pt. 0: [2022-11-26 02:40:47,391] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_11-model_00-model_states.pt... 0: [2022-11-26 02:40:47,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_11-model_00-model_states.pt. 0: [2022-11-26 02:40:47,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_12-model_00-model_states.pt... 0: [2022-11-26 02:40:47,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_12-model_00-model_states.pt. 0: [2022-11-26 02:40:47,701] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_13-model_00-model_states.pt... 0: [2022-11-26 02:40:47,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_13-model_00-model_states.pt. 0: [2022-11-26 02:40:47,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_14-model_00-model_states.pt... 0: [2022-11-26 02:40:48,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_14-model_00-model_states.pt. 
0: [2022-11-26 02:40:48,014] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_15-model_00-model_states.pt... 0: [2022-11-26 02:40:48,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_15-model_00-model_states.pt. 0: [2022-11-26 02:40:48,168] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_16-model_00-model_states.pt... 0: [2022-11-26 02:40:48,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_16-model_00-model_states.pt. 0: [2022-11-26 02:40:48,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_17-model_00-model_states.pt... 0: [2022-11-26 02:40:48,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_17-model_00-model_states.pt. 0: [2022-11-26 02:40:48,478] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_18-model_00-model_states.pt... 0: [2022-11-26 02:40:48,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_18-model_00-model_states.pt. 0: [2022-11-26 02:40:48,632] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_19-model_00-model_states.pt... 0: [2022-11-26 02:40:48,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_19-model_00-model_states.pt. 0: [2022-11-26 02:40:48,787] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_20-model_00-model_states.pt... 0: [2022-11-26 02:40:48,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_20-model_00-model_states.pt. 
0: [2022-11-26 02:40:48,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_21-model_00-model_states.pt... 0: [2022-11-26 02:40:49,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_21-model_00-model_states.pt. 0: [2022-11-26 02:40:49,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_22-model_00-model_states.pt... 0: [2022-11-26 02:40:49,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_22-model_00-model_states.pt. 0: [2022-11-26 02:40:49,249] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_23-model_00-model_states.pt... 0: [2022-11-26 02:40:49,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_23-model_00-model_states.pt. 0: [2022-11-26 02:40:49,402] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_24-model_00-model_states.pt... 0: [2022-11-26 02:40:49,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_24-model_00-model_states.pt. 0: [2022-11-26 02:40:49,555] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_25-model_00-model_states.pt... 0: [2022-11-26 02:40:49,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_25-model_00-model_states.pt. 0: [2022-11-26 02:40:49,708] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_26-model_00-model_states.pt... 0: [2022-11-26 02:40:49,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_26-model_00-model_states.pt. 
0: [2022-11-26 02:40:49,861] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_27-model_00-model_states.pt... 0: [2022-11-26 02:40:50,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_27-model_00-model_states.pt. 0: [2022-11-26 02:40:50,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_28-model_00-model_states.pt... 0: [2022-11-26 02:40:50,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_28-model_00-model_states.pt. 0: [2022-11-26 02:40:50,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_29-model_00-model_states.pt... 0: [2022-11-26 02:40:50,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_29-model_00-model_states.pt. 0: [2022-11-26 02:40:50,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_30-model_00-model_states.pt... 0: [2022-11-26 02:40:50,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_30-model_00-model_states.pt. 0: [2022-11-26 02:40:50,476] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/layer_32-model_00-model_states.pt... 0: [2022-11-26 02:40:50,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/layer_32-model_00-model_states.pt. 0: [2022-11-26 02:40:50,482] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step23000/mp_rank_00_model_states.pt 0: [2022-11-26 02:40:50,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/mp_rank_00_model_states.pt... 
0: [2022-11-26 02:40:50,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/mp_rank_00_model_states.pt. 0: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
5: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
4: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
3: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
11: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
1: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
9: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
2: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
11: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
13: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:40:50,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step23000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:40:50,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:40:50,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 02:40:50,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
6: [2022-11-26 02:40:50,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:40:50,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 02:40:50,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-26 02:40:50,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:40:50,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 02:40:50,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 10: [2022-11-26 02:40:50,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:40:50,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 02:40:50,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 15: [2022-11-26 02:40:50,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:40:50,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:40:50,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 02:40:50,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 02:40:50,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 0: [2022-11-26 02:40:50,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:40:50,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:40:50,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:40:50,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 02:40:50,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 02:40:50,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 10: [2022-11-26 02:40:50,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 15: [2022-11-26 02:40:50,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 02:40:50,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 15: [2022-11-26 02:40:50,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 02:40:50,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 02:40:50,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 15: [2022-11-26 02:40:50,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:40:50,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 02:40:50,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 15: [2022-11-26 02:40:50,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:40:50,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 02:40:50,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 10: [2022-11-26 02:40:50,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:40:50,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 02:40:50,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-26 02:40:50,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 02:40:50,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 02:40:50,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 10: [2022-11-26 02:40:50,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:40:50,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 02:40:50,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 10: [2022-11-26 02:40:50,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:40:50,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 02:40:50,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-26 02:40:50,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:40:50,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:40:50,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
6: [2022-11-26 02:40:50,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 02:40:50,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 3: [2022-11-26 02:40:50,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 02:40:50,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:40:50,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:40:50,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-26 02:40:50,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-26 02:40:50,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-26 02:40:50,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 10: [2022-11-26 02:40:50,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 3: [2022-11-26 02:40:50,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 10: [2022-11-26 02:40:50,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
7: [2022-11-26 02:40:50,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:40:50,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 02:40:50,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 0: [2022-11-26 02:40:50,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:40:50,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:40:50,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 02:40:50,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 02:40:50,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 0: [2022-11-26 02:40:50,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-26 02:40:50,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:40:50,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 02:40:50,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
7: [2022-11-26 02:40:50,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:40:50,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 6: [2022-11-26 02:40:50,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:40:50,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-26 02:40:50,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 02:40:50,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 15: [2022-11-26 02:40:50,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:40:50,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:40:50,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 02:40:50,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-26 02:40:50,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 02:40:50,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 02:40:50,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-26 02:40:50,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:40:50,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 02:40:50,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-26 02:40:50,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:40:50,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 02:40:50,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 0: [2022-11-26 02:40:50,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:40:50,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 02:40:50,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-26 02:40:50,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 02:40:50,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:40:50,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 02:40:50,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-26 02:40:50,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 02:40:50,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-26 02:40:50,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:40:50,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 02:40:50,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-26 02:40:50,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:40:50,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 02:40:50,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 7: [2022-11-26 02:40:50,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 02:40:50,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 02:40:50,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-26 02:40:50,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:40:50,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 02:40:50,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 10: [2022-11-26 02:40:50,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:40:50,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 02:40:50,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 0: [2022-11-26 02:40:50,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:40:50,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 02:40:50,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-26 02:40:50,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 02:40:50,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 02:40:50,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-26 02:40:50,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:40:50,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 02:40:50,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 3: [2022-11-26 02:40:50,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:40:50,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 02:40:50,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 14: [2022-11-26 02:40:50,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:40:50,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 02:40:50,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 14: [2022-11-26 02:40:50,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 02:40:50,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 02:40:50,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 14: [2022-11-26 02:40:50,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:40:50,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 02:40:50,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 14: [2022-11-26 02:40:50,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:40:50,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 02:40:50,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-26 02:40:50,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:40:50,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:40:50,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:40:50,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 02:40:50,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:40:50,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 02:40:50,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 02:40:50,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 02:40:50,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 02:40:50,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 02:40:50,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-26 02:40:50,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-26 02:40:50,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-26 02:40:50,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 2: [2022-11-26 02:40:50,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 6: [2022-11-26 02:40:50,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 02:40:50,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 02:40:50,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-26 02:40:50,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:40:50,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 02:40:50,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-26 02:40:50,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:40:50,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 02:40:50,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 9: [2022-11-26 02:40:50,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:40:50,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 02:40:50,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 9: [2022-11-26 02:40:50,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 02:40:50,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 02:40:50,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 9: [2022-11-26 02:40:50,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:40:50,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 02:40:50,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 15: [2022-11-26 02:40:50,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 02:40:50,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 15: [2022-11-26 02:40:50,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:40:50,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 02:40:50,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 15: [2022-11-26 02:40:50,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 02:40:50,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 02:40:50,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 15: [2022-11-26 02:40:50,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:40:50,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 02:40:50,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 9: [2022-11-26 02:40:50,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:40:50,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 02:40:50,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-26 02:40:50,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:40:50,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:40:50,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 02:40:50,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
13: [2022-11-26 02:40:50,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:40:50,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 02:40:50,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 13: [2022-11-26 02:40:50,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:40:50,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 02:40:50,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 13: [2022-11-26 02:40:50,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:40:50,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:40:50,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 02:40:50,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 02:40:50,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 13: [2022-11-26 02:40:50,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
13: [2022-11-26 02:40:50,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:40:50,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 02:40:50,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 14: [2022-11-26 02:40:50,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:40:50,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 02:40:50,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 14: [2022-11-26 02:40:50,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:40:50,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:40:50,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 02:40:50,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 02:40:50,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 02:40:50,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 02:40:50,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 14: [2022-11-26 02:40:50,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 14: [2022-11-26 02:40:50,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 13: [2022-11-26 02:40:50,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:40:50,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:40:50,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 02:40:50,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 0: [2022-11-26 02:40:50,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 02:40:50,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 02:40:50,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 0: [2022-11-26 02:40:50,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:40:50,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 02:40:50,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 11: [2022-11-26 02:40:50,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 02:40:50,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 11: [2022-11-26 02:40:50,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:40:50,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 02:40:50,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 11: [2022-11-26 02:40:50,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 02:40:50,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 02:40:50,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 11: [2022-11-26 02:40:50,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:40:50,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 02:40:50,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 11: [2022-11-26 02:40:50,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:40:50,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 02:40:50,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 11: [2022-11-26 02:40:50,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:40:50,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 02:40:50,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 11: [2022-11-26 02:40:50,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 02:40:50,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:40:50,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 02:40:50,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 02:40:50,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 11: [2022-11-26 02:40:50,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 13: [2022-11-26 02:40:50,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 02:40:50,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 13: [2022-11-26 02:40:50,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:40:50,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 02:40:50,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-26 02:40:50,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 02:40:50,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
1: [2022-11-26 02:40:50,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:40:50,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 02:40:50,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-26 02:40:50,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:40:50,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 02:40:50,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-26 02:40:50,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:40:50,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:40:50,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:40:50,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 02:40:50,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
1: [2022-11-26 02:40:50,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 02:40:50,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 02:40:50,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 1: [2022-11-26 02:40:50,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 9: [2022-11-26 02:40:50,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:40:50,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:40:50,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:40:50,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 02:40:50,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:40:50,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 02:40:50,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 02:40:50,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
9: [2022-11-26 02:40:50,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 02:40:50,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 9: [2022-11-26 02:40:50,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 9: [2022-11-26 02:40:50,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 0: [2022-11-26 02:40:50,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 02:40:50,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-26 02:40:50,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:40:50,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 02:40:50,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-26 02:40:50,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:40:50,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 02:40:50,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-26 02:40:50,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 02:40:50,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 02:40:50,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-26 02:40:50,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:40:50,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 02:40:50,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 8: [2022-11-26 02:40:50,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:40:50,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:40:50,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:40:50,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 02:40:50,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 02:40:50,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 02:40:50,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 02:40:50,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 8: [2022-11-26 02:40:50,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 8: [2022-11-26 02:40:50,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 02:40:50,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 8: [2022-11-26 02:40:50,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 8: [2022-11-26 02:40:50,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:40:50,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 02:40:50,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 8: [2022-11-26 02:40:50,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 02:40:50,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 02:40:50,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 8: [2022-11-26 02:40:50,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:40:50,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 02:40:50,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 8: [2022-11-26 02:40:50,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:40:50,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 02:40:50,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-26 02:40:50,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:40:50,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 02:40:50,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-26 02:40:50,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 02:40:50,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:40:50,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 02:40:50,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 02:40:50,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-26 02:40:50,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 4: [2022-11-26 02:40:50,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:40:50,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 02:40:50,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 5: [2022-11-26 02:40:50,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:40:50,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:40:50,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:40:50,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 02:40:50,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 02:40:50,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 02:40:50,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 02:40:50,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 5: [2022-11-26 02:40:50,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 02:40:50,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 5: [2022-11-26 02:40:50,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 5: [2022-11-26 02:40:50,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 5: [2022-11-26 02:40:50,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:40:50,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:40:50,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:40:50,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 02:40:50,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 02:40:50,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 02:40:50,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 02:40:50,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 02:40:50,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 5: [2022-11-26 02:40:50,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 5: [2022-11-26 02:40:50,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 5: [2022-11-26 02:40:50,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 12: [2022-11-26 02:40:50,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:40:50,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:40:50,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:40:50,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 02:40:50,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:40:50,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:40:50,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:40:50,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:40:50,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 02:40:50,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 02:40:50,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 02:40:50,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 02:40:50,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 02:40:50,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 02:40:50,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 
02:40:50,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step23000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 02:40:50,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 12: [2022-11-26 02:40:50,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 12: [2022-11-26 02:40:50,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 12: [2022-11-26 02:40:50,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 12: [2022-11-26 02:40:50,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 12: [2022-11-26 02:40:50,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 12: [2022-11-26 02:40:50,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 12: [2022-11-26 02:40:50,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step23000 is ready now! 
0: successfully saved checkpoint at iteration 23000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 5364.83 15: iteration 23010/ 125429 | consumed samples: 5890560 | consumed tokens: 12063866880 | elapsed time per iteration (s): 1.64 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.139739E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 155.635 | TFLOPs: 25.72 | 15: iteration 23020/ 125429 | consumed samples: 5893120 | consumed tokens: 12069109760 | elapsed time per iteration (s): 1.03 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.124243E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.900 | TFLOPs: 40.97 | 15: iteration 23030/ 125429 | consumed samples: 5895680 | consumed tokens: 12074352640 | elapsed time per iteration (s): 1.09 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.124907E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.925 | TFLOPs: 38.99 | 15: iteration 23040/ 125429 | consumed samples: 5898240 | consumed tokens: 12079595520 | elapsed time per iteration (s): 1.05 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.126572E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.930 | TFLOPs: 40.15 | 15: iteration 23050/ 125429 | consumed samples: 5900800 | consumed tokens: 12084838400 | elapsed time per iteration (s): 1.08 | learning rate: 1.867E-04 | global batch size: 256 | lm loss: 2.106159E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.869 | TFLOPs: 39.31 | 15: iteration 23060/ 125429 | consumed samples: 5903360 | consumed tokens: 12090081280 | elapsed time per iteration (s): 1.06 | learning 
rate: 1.866E-04 | global batch size: 256 | lm loss: 2.152645E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.298 | TFLOPs: 40.04 | 15: iteration 23070/ 125429 | consumed samples: 5905920 | consumed tokens: 12095324160 | elapsed time per iteration (s): 1.07 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 2.124243E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.447 | TFLOPs: 39.41 | 15: iteration 23080/ 125429 | consumed samples: 5908480 | consumed tokens: 12100567040 | elapsed time per iteration (s): 1.03 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 2.152061E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.624 | TFLOPs: 41.09 | 15: iteration 23090/ 125429 | consumed samples: 5911040 | consumed tokens: 12105809920 | elapsed time per iteration (s): 1.03 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 2.106113E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.834 | TFLOPs: 40.96 | 15: iteration 23100/ 125429 | consumed samples: 5913600 | consumed tokens: 12111052800 | elapsed time per iteration (s): 1.04 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 2.122203E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.772 | TFLOPs: 40.62 | 15: iteration 23110/ 125429 | consumed samples: 5916160 | consumed tokens: 12116295680 | elapsed time per iteration (s): 1.06 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 2.119747E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.485 | TFLOPs: 40.07 | 15: iteration 23120/ 125429 | 
consumed samples: 5918720 | consumed tokens: 12121538560 | elapsed time per iteration (s): 1.04 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 2.120528E+00 | grad norm: 0.204 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.976 | TFLOPs: 40.65 | 15: iteration 23130/ 125429 | consumed samples: 5921280 | consumed tokens: 12126781440 | elapsed time per iteration (s): 1.04 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 2.202932E+00 | grad norm: 0.476 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.241 | TFLOPs: 40.86 | 15: iteration 23140/ 125429 | consumed samples: 5923840 | consumed tokens: 12132024320 | elapsed time per iteration (s): 1.06 | learning rate: 1.866E-04 | global batch size: 256 | lm loss: 2.165096E+00 | grad norm: 0.924 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.981 | TFLOPs: 39.99 | 15: iteration 23150/ 125429 | consumed samples: 5926400 | consumed tokens: 12137267200 | elapsed time per iteration (s): 1.04 | learning rate: 1.865E-04 | global batch size: 256 | lm loss: 2.182362E+00 | grad norm: 0.249 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.556 | TFLOPs: 40.58 | 15: iteration 23160/ 125429 | consumed samples: 5928960 | consumed tokens: 12142510080 | elapsed time per iteration (s): 1.04 | learning rate: 1.865E-04 | global batch size: 256 | lm loss: 2.135638E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.333 | TFLOPs: 40.87 | 15: iteration 23170/ 125429 | consumed samples: 5931520 | consumed tokens: 12147752960 | elapsed time per iteration (s): 1.10 | learning rate: 1.865E-04 | global batch size: 256 | lm loss: 2.175735E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 232.385 | TFLOPs: 38.40 | 15: iteration 23180/ 125429 | consumed samples: 5934080 | consumed tokens: 12152995840 | elapsed time per iteration (s): 1.10 | learning rate: 1.865E-04 | global batch size: 256 | lm loss: 2.156891E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.571 | TFLOPs: 38.43 | 15: iteration 23190/ 125429 | consumed samples: 5936640 | consumed tokens: 12158238720 | elapsed time per iteration (s): 1.03 | learning rate: 1.865E-04 | global batch size: 256 | lm loss: 2.174395E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.548 | TFLOPs: 41.07 | 15: iteration 23200/ 125429 | consumed samples: 5939200 | consumed tokens: 12163481600 | elapsed time per iteration (s): 1.03 | learning rate: 1.865E-04 | global batch size: 256 | lm loss: 2.148481E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.604 | TFLOPs: 41.08 | 15: iteration 23210/ 125429 | consumed samples: 5941760 | consumed tokens: 12168724480 | elapsed time per iteration (s): 1.02 | learning rate: 1.865E-04 | global batch size: 256 | lm loss: 2.128714E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.799 | TFLOPs: 41.45 | 15: iteration 23220/ 125429 | consumed samples: 5944320 | consumed tokens: 12173967360 | elapsed time per iteration (s): 1.04 | learning rate: 1.865E-04 | global batch size: 256 | lm loss: 2.139890E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.778 | TFLOPs: 40.62 | 15: iteration 23230/ 125429 | consumed samples: 5946880 | consumed tokens: 12179210240 | elapsed time per iteration (s): 1.05 | learning rate: 1.864E-04 | global batch size: 
256 | lm loss: 2.131211E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.828 | TFLOPs: 40.46 | 15: iteration 23240/ 125429 | consumed samples: 5949440 | consumed tokens: 12184453120 | elapsed time per iteration (s): 1.08 | learning rate: 1.864E-04 | global batch size: 256 | lm loss: 2.133929E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.053 | TFLOPs: 39.34 | 15: iteration 23250/ 125429 | consumed samples: 5952000 | consumed tokens: 12189696000 | elapsed time per iteration (s): 1.05 | learning rate: 1.864E-04 | global batch size: 256 | lm loss: 2.153966E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.126 | TFLOPs: 40.18 | 15: iteration 23260/ 125429 | consumed samples: 5954560 | consumed tokens: 12194938880 | elapsed time per iteration (s): 1.04 | learning rate: 1.864E-04 | global batch size: 256 | lm loss: 2.127451E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.676 | TFLOPs: 40.60 | 15: iteration 23270/ 125429 | consumed samples: 5957120 | consumed tokens: 12200181760 | elapsed time per iteration (s): 1.04 | learning rate: 1.864E-04 | global batch size: 256 | lm loss: 2.155904E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.790 | TFLOPs: 40.78 | 15: iteration 23280/ 125429 | consumed samples: 5959680 | consumed tokens: 12205424640 | elapsed time per iteration (s): 1.10 | learning rate: 1.864E-04 | global batch size: 256 | lm loss: 2.136898E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.609 | TFLOPs: 38.61 | 15: iteration 23290/ 125429 | consumed samples: 5962240 | consumed 
tokens: 12210667520 | elapsed time per iteration (s): 1.07 | learning rate: 1.864E-04 | global batch size: 256 | lm loss: 2.132392E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.861 | TFLOPs: 39.64 | 15: iteration 23300/ 125429 | consumed samples: 5964800 | consumed tokens: 12215910400 | elapsed time per iteration (s): 1.04 | learning rate: 1.864E-04 | global batch size: 256 | lm loss: 2.153432E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.482 | TFLOPs: 40.57 | 15: iteration 23310/ 125429 | consumed samples: 5967360 | consumed tokens: 12221153280 | elapsed time per iteration (s): 1.09 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.136446E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.559 | TFLOPs: 38.76 | 15: iteration 23320/ 125429 | consumed samples: 5969920 | consumed tokens: 12226396160 | elapsed time per iteration (s): 1.05 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.146384E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.513 | TFLOPs: 40.41 | 15: iteration 23330/ 125429 | consumed samples: 5972480 | consumed tokens: 12231639040 | elapsed time per iteration (s): 1.21 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.119741E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 211.261 | TFLOPs: 34.91 | 15: iteration 23340/ 125429 | consumed samples: 5975040 | consumed tokens: 12236881920 | elapsed time per iteration (s): 1.14 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.143051E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 223.751 | TFLOPs: 36.98 | 15: iteration 23350/ 125429 | consumed samples: 5977600 | consumed tokens: 12242124800 | elapsed time per iteration (s): 1.07 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.139735E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.234 | TFLOPs: 39.70 | 15: iteration 23360/ 125429 | consumed samples: 5980160 | consumed tokens: 12247367680 | elapsed time per iteration (s): 1.12 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.143440E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.957 | TFLOPs: 37.67 | 15: iteration 23370/ 125429 | consumed samples: 5982720 | consumed tokens: 12252610560 | elapsed time per iteration (s): 1.03 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.124986E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.401 | TFLOPs: 40.88 | 15: iteration 23380/ 125429 | consumed samples: 5985280 | consumed tokens: 12257853440 | elapsed time per iteration (s): 1.11 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.146045E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.778 | TFLOPs: 37.97 | 15: iteration 23390/ 125429 | consumed samples: 5987840 | consumed tokens: 12263096320 | elapsed time per iteration (s): 1.05 | learning rate: 1.863E-04 | global batch size: 256 | lm loss: 2.134385E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.540 | TFLOPs: 40.41 | 15: iteration 23400/ 125429 | consumed samples: 5990400 | consumed tokens: 12268339200 | elapsed time per iteration (s): 1.06 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.103165E+00 | grad norm: 
0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.384 | TFLOPs: 39.73 | 15: iteration 23410/ 125429 | consumed samples: 5992960 | consumed tokens: 12273582080 | elapsed time per iteration (s): 1.06 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.127499E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.177 | TFLOPs: 39.86 | 15: iteration 23420/ 125429 | consumed samples: 5995520 | consumed tokens: 12278824960 | elapsed time per iteration (s): 1.05 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.139075E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.070 | TFLOPs: 40.33 | 15: iteration 23430/ 125429 | consumed samples: 5998080 | consumed tokens: 12284067840 | elapsed time per iteration (s): 1.04 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.115825E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.974 | TFLOPs: 40.65 | 15: iteration 23440/ 125429 | consumed samples: 6000640 | consumed tokens: 12289310720 | elapsed time per iteration (s): 1.06 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.137510E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.494 | TFLOPs: 39.91 | 15: iteration 23450/ 125429 | consumed samples: 6003200 | consumed tokens: 12294553600 | elapsed time per iteration (s): 1.07 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.118253E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.314 | TFLOPs: 39.55 | 15: iteration 23460/ 125429 | consumed samples: 6005760 | consumed tokens: 12299796480 | elapsed time per 
iteration (s): 1.09 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.118848E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.286 | TFLOPs: 38.88 | 15: iteration 23470/ 125429 | consumed samples: 6008320 | consumed tokens: 12305039360 | elapsed time per iteration (s): 1.05 | learning rate: 1.862E-04 | global batch size: 256 | lm loss: 2.148673E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.286 | TFLOPs: 40.37 | 15: iteration 23480/ 125429 | consumed samples: 6010880 | consumed tokens: 12310282240 | elapsed time per iteration (s): 1.05 | learning rate: 1.861E-04 | global batch size: 256 | lm loss: 2.146569E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.856 | TFLOPs: 40.13 | 15: iteration 23490/ 125429 | consumed samples: 6013440 | consumed tokens: 12315525120 | elapsed time per iteration (s): 1.06 | learning rate: 1.861E-04 | global batch size: 256 | lm loss: 2.135475E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.487 | TFLOPs: 40.07 | 15: iteration 23500/ 125429 | consumed samples: 6016000 | consumed tokens: 12320768000 | elapsed time per iteration (s): 1.04 | learning rate: 1.861E-04 | global batch size: 256 | lm loss: 2.141293E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.724 | TFLOPs: 40.61 | 15: iteration 23510/ 125429 | consumed samples: 6018560 | consumed tokens: 12326010880 | elapsed time per iteration (s): 1.06 | learning rate: 1.861E-04 | global batch size: 256 | lm loss: 2.104710E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.081 | TFLOPs: 39.84 | 15: 
iteration 23520/ 125429 | consumed samples: 6021120 | consumed tokens: 12331253760 | elapsed time per iteration (s): 1.06 | learning rate: 1.861E-04 | global batch size: 256 | lm loss: 2.151796E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.726 | TFLOPs: 39.95 | 15: iteration 23530/ 125429 | consumed samples: 6023680 | consumed tokens: 12336496640 | elapsed time per iteration (s): 1.06 | learning rate: 1.861E-04 | global batch size: 256 | lm loss: 2.135795E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.590 | TFLOPs: 39.92 | 15: iteration 23540/ 125429 | consumed samples: 6026240 | consumed tokens: 12341739520 | elapsed time per iteration (s): 1.04 | learning rate: 1.861E-04 | global batch size: 256 | lm loss: 2.210249E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.261 | TFLOPs: 40.86 | 15: iteration 23550/ 125429 | consumed samples: 6028800 | consumed tokens: 12346982400 | elapsed time per iteration (s): 1.07 | learning rate: 1.861E-04 | global batch size: 256 | lm loss: 2.165037E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.204 | TFLOPs: 39.37 | 15: iteration 23560/ 125429 | consumed samples: 6031360 | consumed tokens: 12352225280 | elapsed time per iteration (s): 1.03 | learning rate: 1.860E-04 | global batch size: 256 | lm loss: 2.154750E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.181 | TFLOPs: 41.01 | 15: iteration 23570/ 125429 | consumed samples: 6033920 | consumed tokens: 12357468160 | elapsed time per iteration (s): 1.05 | learning rate: 1.860E-04 | global batch size: 256 | lm loss: 2.125286E+00 | grad norm: 0.143 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.199 | TFLOPs: 40.19 | 15: iteration 23580/ 125429 | consumed samples: 6036480 | consumed tokens: 12362711040 | elapsed time per iteration (s): 1.08 | learning rate: 1.860E-04 | global batch size: 256 | lm loss: 2.119661E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.818 | TFLOPs: 39.30 | 15: iteration 23590/ 125429 | consumed samples: 6039040 | consumed tokens: 12367953920 | elapsed time per iteration (s): 1.05 | learning rate: 1.860E-04 | global batch size: 256 | lm loss: 2.104158E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.921 | TFLOPs: 40.31 | 15: iteration 23600/ 125429 | consumed samples: 6041600 | consumed tokens: 12373196800 | elapsed time per iteration (s): 1.05 | learning rate: 1.860E-04 | global batch size: 256 | lm loss: 2.132176E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.827 | TFLOPs: 40.29 | 15: iteration 23610/ 125429 | consumed samples: 6044160 | consumed tokens: 12378439680 | elapsed time per iteration (s): 1.04 | learning rate: 1.860E-04 | global batch size: 256 | lm loss: 2.143285E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.026 | TFLOPs: 40.49 | 15: iteration 23620/ 125429 | consumed samples: 6046720 | consumed tokens: 12383682560 | elapsed time per iteration (s): 1.04 | learning rate: 1.860E-04 | global batch size: 256 | lm loss: 2.153885E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.213 | TFLOPs: 40.85 | 15: iteration 23630/ 125429 | consumed samples: 6049280 | consumed tokens: 12388925440 | elapsed time per iteration (s): 1.06 | learning rate: 
1.860E-04 | global batch size: 256 | lm loss: 2.110964E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.882 | TFLOPs: 39.81 | 15: iteration 23640/ 125429 | consumed samples: 6051840 | consumed tokens: 12394168320 | elapsed time per iteration (s): 1.05 | learning rate: 1.859E-04 | global batch size: 256 | lm loss: 2.127634E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.254 | TFLOPs: 40.20 | 15: iteration 23650/ 125429 | consumed samples: 6054400 | consumed tokens: 12399411200 | elapsed time per iteration (s): 1.05 | learning rate: 1.859E-04 | global batch size: 256 | lm loss: 2.095768E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.472 | TFLOPs: 40.24 | 15: iteration 23660/ 125429 | consumed samples: 6056960 | consumed tokens: 12404654080 | elapsed time per iteration (s): 1.04 | learning rate: 1.859E-04 | global batch size: 256 | lm loss: 2.126293E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.466 | TFLOPs: 40.57 | 15: iteration 23670/ 125429 | consumed samples: 6059520 | consumed tokens: 12409896960 | elapsed time per iteration (s): 1.03 | learning rate: 1.859E-04 | global batch size: 256 | lm loss: 2.126058E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.489 | TFLOPs: 40.90 | 15: iteration 23680/ 125429 | consumed samples: 6062080 | consumed tokens: 12415139840 | elapsed time per iteration (s): 1.04 | learning rate: 1.859E-04 | global batch size: 256 | lm loss: 2.119245E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.532 | TFLOPs: 40.74 | 15: iteration 23690/ 125429 | consumed 
samples: 6064640 | consumed tokens: 12420382720 | elapsed time per iteration (s): 1.08 | learning rate: 1.859E-04 | global batch size: 256 | lm loss: 2.137228E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.842 | TFLOPs: 39.31 | 15: iteration 23700/ 125429 | consumed samples: 6067200 | consumed tokens: 12425625600 | elapsed time per iteration (s): 1.02 | learning rate: 1.859E-04 | global batch size: 256 | lm loss: 2.136540E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.868 | TFLOPs: 41.46 | 15: iteration 23710/ 125429 | consumed samples: 6069760 | consumed tokens: 12430868480 | elapsed time per iteration (s): 1.03 | learning rate: 1.859E-04 | global batch size: 256 | lm loss: 2.126811E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.521 | TFLOPs: 40.90 | 15: iteration 23720/ 125429 | consumed samples: 6072320 | consumed tokens: 12436111360 | elapsed time per iteration (s): 1.04 | learning rate: 1.858E-04 | global batch size: 256 | lm loss: 2.122483E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.261 | TFLOPs: 40.86 | 15: iteration 23730/ 125429 | consumed samples: 6074880 | consumed tokens: 12441354240 | elapsed time per iteration (s): 1.04 | learning rate: 1.858E-04 | global batch size: 256 | lm loss: 2.146635E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.714 | TFLOPs: 40.77 | 15: iteration 23740/ 125429 | consumed samples: 6077440 | consumed tokens: 12446597120 | elapsed time per iteration (s): 1.03 | learning rate: 1.858E-04 | global batch size: 256 | lm loss: 2.091969E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 247.456 | TFLOPs: 40.89 | 15: iteration 23750/ 125429 | consumed samples: 6080000 | consumed tokens: 12451840000 | elapsed time per iteration (s): 1.05 | learning rate: 1.858E-04 | global batch size: 256 | lm loss: 2.102116E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.699 | TFLOPs: 40.44 | 15: iteration 23760/ 125429 | consumed samples: 6082560 | consumed tokens: 12457082880 | elapsed time per iteration (s): 1.03 | learning rate: 1.858E-04 | global batch size: 256 | lm loss: 2.122607E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.993 | TFLOPs: 40.98 | 15: iteration 23770/ 125429 | consumed samples: 6085120 | consumed tokens: 12462325760 | elapsed time per iteration (s): 1.06 | learning rate: 1.858E-04 | global batch size: 256 | lm loss: 2.145345E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.589 | TFLOPs: 40.09 | 15: iteration 23780/ 125429 | consumed samples: 6087680 | consumed tokens: 12467568640 | elapsed time per iteration (s): 1.07 | learning rate: 1.858E-04 | global batch size: 256 | lm loss: 2.149017E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.826 | TFLOPs: 39.63 | 15: iteration 23790/ 125429 | consumed samples: 6090240 | consumed tokens: 12472811520 | elapsed time per iteration (s): 1.06 | learning rate: 1.858E-04 | global batch size: 256 | lm loss: 2.153703E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.815 | TFLOPs: 39.80 | 15: iteration 23800/ 125429 | consumed samples: 6092800 | consumed tokens: 12478054400 | elapsed time per iteration (s): 1.08 | learning rate: 1.858E-04 | global batch size: 256 | lm 
loss: 2.128516E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.377 | TFLOPs: 39.06 | 15: iteration 23810/ 125429 | consumed samples: 6095360 | consumed tokens: 12483297280 | elapsed time per iteration (s): 1.07 | learning rate: 1.857E-04 | global batch size: 256 | lm loss: 2.111777E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.371 | TFLOPs: 39.72 | 15: iteration 23820/ 125429 | consumed samples: 6097920 | consumed tokens: 12488540160 | elapsed time per iteration (s): 1.04 | learning rate: 1.857E-04 | global batch size: 256 | lm loss: 2.118016E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.463 | TFLOPs: 40.56 | 15: iteration 23830/ 125429 | consumed samples: 6100480 | consumed tokens: 12493783040 | elapsed time per iteration (s): 1.08 | learning rate: 1.857E-04 | global batch size: 256 | lm loss: 2.126149E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.487 | TFLOPs: 39.25 | 15: iteration 23840/ 125429 | consumed samples: 6103040 | consumed tokens: 12499025920 | elapsed time per iteration (s): 1.03 | learning rate: 1.857E-04 | global batch size: 256 | lm loss: 2.098311E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.309 | TFLOPs: 41.20 | 15: iteration 23850/ 125429 | consumed samples: 6105600 | consumed tokens: 12504268800 | elapsed time per iteration (s): 1.03 | learning rate: 1.857E-04 | global batch size: 256 | lm loss: 2.134941E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.624 | TFLOPs: 40.92 | 15: iteration 23860/ 125429 | consumed samples: 6108160 | consumed tokens: 
12509511680 | elapsed time per iteration (s): 1.50 | learning rate: 1.857E-04 | global batch size: 256 | lm loss: 2.109418E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 171.163 | TFLOPs: 28.29 | 15: iteration 23870/ 125429 | consumed samples: 6110720 | consumed tokens: 12514754560 | elapsed time per iteration (s): 1.04 | learning rate: 1.857E-04 | global batch size: 256 | lm loss: 2.129158E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.138 | TFLOPs: 40.84 | 15: iteration 23880/ 125429 | consumed samples: 6113280 | consumed tokens: 12519997440 | elapsed time per iteration (s): 1.03 | learning rate: 1.857E-04 | global batch size: 256 | lm loss: 2.156466E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.780 | TFLOPs: 40.95 | 15: iteration 23890/ 125429 | consumed samples: 6115840 | consumed tokens: 12525240320 | elapsed time per iteration (s): 1.05 | learning rate: 1.856E-04 | global batch size: 256 | lm loss: 2.135188E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.090 | TFLOPs: 40.34 | 15: iteration 23900/ 125429 | consumed samples: 6118400 | consumed tokens: 12530483200 | elapsed time per iteration (s): 1.03 | learning rate: 1.856E-04 | global batch size: 256 | lm loss: 2.165368E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.901 | TFLOPs: 41.13 | 15: iteration 23910/ 125429 | consumed samples: 6120960 | consumed tokens: 12535726080 | elapsed time per iteration (s): 1.04 | learning rate: 1.856E-04 | global batch size: 256 | lm loss: 2.121445E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
246.789 | TFLOPs: 40.78 | 15: iteration 23920/ 125429 | consumed samples: 6123520 | consumed tokens: 12540968960 | elapsed time per iteration (s): 1.07 | learning rate: 1.856E-04 | global batch size: 256 | lm loss: 2.109111E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.315 | TFLOPs: 39.71 | 15: iteration 23930/ 125429 | consumed samples: 6126080 | consumed tokens: 12546211840 | elapsed time per iteration (s): 1.04 | learning rate: 1.856E-04 | global batch size: 256 | lm loss: 2.143936E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.634 | TFLOPs: 40.76 | 15: iteration 23940/ 125429 | consumed samples: 6128640 | consumed tokens: 12551454720 | elapsed time per iteration (s): 1.06 | learning rate: 1.856E-04 | global batch size: 256 | lm loss: 2.123692E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.615 | TFLOPs: 39.93 | 15: iteration 23950/ 125429 | consumed samples: 6131200 | consumed tokens: 12556697600 | elapsed time per iteration (s): 1.07 | learning rate: 1.856E-04 | global batch size: 256 | lm loss: 2.123125E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.582 | TFLOPs: 39.59 | 15: iteration 23960/ 125429 | consumed samples: 6133760 | consumed tokens: 12561940480 | elapsed time per iteration (s): 1.04 | learning rate: 1.856E-04 | global batch size: 256 | lm loss: 2.103727E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.631 | TFLOPs: 40.76 | 15: iteration 23970/ 125429 | consumed samples: 6136320 | consumed tokens: 12567183360 | elapsed time per iteration (s): 1.07 | learning rate: 1.855E-04 | global batch size: 256 | lm loss: 2.137015E+00 | grad norm: 0.140 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.789 | TFLOPs: 39.46 |
15: iteration 23980/ 125429 | consumed samples: 6138880 | consumed tokens: 12572426240 | elapsed time per iteration (s): 1.05 | learning rate: 1.855E-04 | global batch size: 256 | lm loss: 2.135807E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.063 | TFLOPs: 40.33 |
15: iteration 23990/ 125429 | consumed samples: 6141440 | consumed tokens: 12577669120 | elapsed time per iteration (s): 1.05 | learning rate: 1.855E-04 | global batch size: 256 | lm loss: 2.134398E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.835 | TFLOPs: 40.13 |
0: [2022-11-26 02:58:31,302] [INFO] [logging.py:68:log_dist] [Rank 0] step=24000, skipped=0, lr=[0.00018550472707307334, 0.00018550472707307334, 0.00018550472707307334], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
15: iteration 24000/ 125429 | consumed samples: 6144000 | consumed tokens: 12582912000 | elapsed time per iteration (s): 1.04 | learning rate: 1.855E-04 | global batch size: 256 | lm loss: 2.122988E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.982 | TFLOPs: 40.49 |
0: steps: 24000 loss: 2.2016 iter time (s): 1.065 samples/sec: 240.321
15: -------------------------------------------------------------------------------------------
15: valid loss at iteration 24000 | lm loss value: 2.076908E+00 | lm loss PPL: 7.979756E+00 |
15: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 24000 to checkpoints_1b5
0: [2022-11-26 02:58:31,692] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step24000 is begin to save!
0: [2022-11-26 02:58:31,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_01-model_00-model_states.pt...
0: [2022-11-26 02:58:31,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_01-model_00-model_states.pt.
0: [2022-11-26 02:58:31,970] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_03-model_00-model_states.pt...
0: [2022-11-26 02:58:32,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_03-model_00-model_states.pt.
0: [2022-11-26 02:58:32,089] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_04-model_00-model_states.pt...
0: [2022-11-26 02:58:32,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_04-model_00-model_states.pt.
0: [2022-11-26 02:58:32,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_05-model_00-model_states.pt...
0: [2022-11-26 02:58:32,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_05-model_00-model_states.pt.
0: [2022-11-26 02:58:32,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_06-model_00-model_states.pt...
0: [2022-11-26 02:58:32,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_06-model_00-model_states.pt.
0: [2022-11-26 02:58:32,442] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_07-model_00-model_states.pt...
0: [2022-11-26 02:58:32,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_07-model_00-model_states.pt.
0: [2022-11-26 02:58:32,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_08-model_00-model_states.pt... 0: [2022-11-26 02:58:32,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_08-model_00-model_states.pt. 0: [2022-11-26 02:58:32,674] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_09-model_00-model_states.pt... 0: [2022-11-26 02:58:32,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_09-model_00-model_states.pt. 0: [2022-11-26 02:58:32,795] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_10-model_00-model_states.pt... 0: [2022-11-26 02:58:32,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_10-model_00-model_states.pt. 0: [2022-11-26 02:58:32,906] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_11-model_00-model_states.pt... 0: [2022-11-26 02:58:33,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_11-model_00-model_states.pt. 0: [2022-11-26 02:58:33,019] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_12-model_00-model_states.pt... 0: [2022-11-26 02:58:33,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_12-model_00-model_states.pt. 0: [2022-11-26 02:58:33,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_13-model_00-model_states.pt... 0: [2022-11-26 02:58:33,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 02:58:33,247] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_14-model_00-model_states.pt... 0: [2022-11-26 02:58:33,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_14-model_00-model_states.pt. 0: [2022-11-26 02:58:33,361] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_15-model_00-model_states.pt... 0: [2022-11-26 02:58:33,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_15-model_00-model_states.pt. 0: [2022-11-26 02:58:33,479] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_16-model_00-model_states.pt... 0: [2022-11-26 02:58:33,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_16-model_00-model_states.pt. 0: [2022-11-26 02:58:33,594] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_17-model_00-model_states.pt... 0: [2022-11-26 02:58:33,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_17-model_00-model_states.pt. 0: [2022-11-26 02:58:33,710] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_18-model_00-model_states.pt... 0: [2022-11-26 02:58:33,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_18-model_00-model_states.pt. 0: [2022-11-26 02:58:33,823] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_19-model_00-model_states.pt... 0: [2022-11-26 02:58:33,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 02:58:33,936] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_20-model_00-model_states.pt... 0: [2022-11-26 02:58:34,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_20-model_00-model_states.pt. 0: [2022-11-26 02:58:34,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_21-model_00-model_states.pt... 0: [2022-11-26 02:58:34,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_21-model_00-model_states.pt. 0: [2022-11-26 02:58:34,153] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_22-model_00-model_states.pt... 0: [2022-11-26 02:58:34,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_22-model_00-model_states.pt. 0: [2022-11-26 02:58:34,259] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_23-model_00-model_states.pt... 0: [2022-11-26 02:58:34,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_23-model_00-model_states.pt. 0: [2022-11-26 02:58:34,368] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_24-model_00-model_states.pt... 0: [2022-11-26 02:58:34,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_24-model_00-model_states.pt. 0: [2022-11-26 02:58:34,474] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_25-model_00-model_states.pt... 0: [2022-11-26 02:58:34,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 02:58:34,590] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_26-model_00-model_states.pt... 0: [2022-11-26 02:58:34,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_26-model_00-model_states.pt. 0: [2022-11-26 02:58:34,699] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_27-model_00-model_states.pt... 0: [2022-11-26 02:58:34,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_27-model_00-model_states.pt. 0: [2022-11-26 02:58:34,807] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_28-model_00-model_states.pt... 0: [2022-11-26 02:58:34,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_28-model_00-model_states.pt. 0: [2022-11-26 02:58:34,917] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_29-model_00-model_states.pt... 0: [2022-11-26 02:58:35,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_29-model_00-model_states.pt. 0: [2022-11-26 02:58:35,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_30-model_00-model_states.pt... 0: [2022-11-26 02:58:35,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_30-model_00-model_states.pt. 0: [2022-11-26 02:58:35,126] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/layer_32-model_00-model_states.pt... 0: [2022-11-26 02:58:35,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 02:58:35,132] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step24000/mp_rank_00_model_states.pt 0: [2022-11-26 02:58:35,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/mp_rank_00_model_states.pt... 0: [2022-11-26 02:58:35,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/mp_rank_00_model_states.pt. 0: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
5: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
15: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
14: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
7: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
13: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
3: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
11: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
4: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 15: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 02:58:35,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step24000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 0: [2022-11-26 02:58:35,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:58:35,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 02:58:35,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 02:58:35,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-26 02:58:35,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:58:35,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:58:35,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 02:58:35,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 02:58:35,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-26 02:58:35,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-26 02:58:35,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:58:35,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 02:58:35,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-26 02:58:35,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
8: [2022-11-26 02:58:35,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:58:35,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 8: [2022-11-26 02:58:35,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 02:58:35,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-26 02:58:35,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-26 02:58:35,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:58:35,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 02:58:35,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-26 02:58:35,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:58:35,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 02:58:35,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-26 02:58:35,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 02:58:35,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 02:58:35,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 10: [2022-11-26 02:58:35,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:58:35,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 02:58:35,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 9: [2022-11-26 02:58:35,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:58:35,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:58:35,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 02:58:35,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 12: [2022-11-26 02:58:35,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:58:35,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 02:58:35,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 02:58:35,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 12: [2022-11-26 02:58:35,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 02:58:35,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-26 02:58:35,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:58:35,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 02:58:35,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-26 02:58:35,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 02:58:35,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 02:58:35,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-26 02:58:35,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:58:35,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 02:58:35,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 02:58:35,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 12: [2022-11-26 02:58:35,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:58:35,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 02:58:35,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 9: [2022-11-26 02:58:35,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 02:58:35,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 9: [2022-11-26 02:58:35,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:58:35,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 02:58:35,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 9: [2022-11-26 02:58:35,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:58:35,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
9: [2022-11-26 02:58:35,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 02:58:35,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 12: [2022-11-26 02:58:35,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 02:58:35,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 8: [2022-11-26 02:58:35,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:58:35,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 02:58:35,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 9: [2022-11-26 02:58:35,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:58:35,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 02:58:35,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-26 02:58:35,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 02:58:35,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 02:58:35,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-26 02:58:35,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:58:35,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 02:58:35,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-26 02:58:35,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:58:35,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:58:35,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:58:35,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 02:58:35,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 02:58:35,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 02:58:35,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
7: [2022-11-26 02:58:35,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-26 02:58:35,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-26 02:58:35,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:58:35,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 02:58:35,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-26 02:58:35,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:58:35,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 02:58:35,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-26 02:58:35,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:58:35,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 02:58:35,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-26 02:58:35,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 02:58:35,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 5: [2022-11-26 02:58:35,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:58:35,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-26 02:58:35,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:58:35,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 5: [2022-11-26 02:58:35,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 3: [2022-11-26 02:58:35,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 5: [2022-11-26 02:58:35,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-26 02:58:35,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:58:35,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 02:58:35,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 10: [2022-11-26 02:58:35,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 02:58:35,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 02:58:35,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 10: [2022-11-26 02:58:35,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:58:35,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 02:58:35,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-26 02:58:35,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 02:58:35,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:58:35,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-26 02:58:35,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 02:58:35,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-26 02:58:35,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:58:35,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 02:58:35,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 02:58:35,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 02:58:35,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:58:35,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-26 02:58:35,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-26 02:58:35,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 02:58:35,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-26 02:58:35,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:58:35,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 02:58:35,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 11: [2022-11-26 02:58:35,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:58:35,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
3: [2022-11-26 02:58:35,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:58:35,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 02:58:35,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-26 02:58:35,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:58:35,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 02:58:35,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-26 02:58:35,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:58:35,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 02:58:35,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-26 02:58:35,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:58:35,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 02:58:35,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
4: [2022-11-26 02:58:35,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:58:35,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 02:58:35,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-26 02:58:35,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:58:35,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 02:58:35,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 10: [2022-11-26 02:58:35,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:58:35,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:58:35,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 02:58:35,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 02:58:35,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 10: [2022-11-26 02:58:35,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
4: [2022-11-26 02:58:35,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:58:35,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 02:58:35,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-26 02:58:35,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:58:35,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 02:58:35,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 02:58:35,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 02:58:35,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 4: [2022-11-26 02:58:35,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 9: [2022-11-26 02:58:35,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:58:35,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 02:58:35,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
9: [2022-11-26 02:58:35,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:58:35,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 02:58:35,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 9: [2022-11-26 02:58:35,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 02:58:35,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 02:58:35,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 10: [2022-11-26 02:58:35,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:58:35,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
10: [2022-11-26 02:58:35,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 11: [2022-11-26 02:58:35,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 02:58:35,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 02:58:35,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 02:58:35,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 11: [2022-11-26 02:58:35,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 11: [2022-11-26 02:58:35,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:58:35,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:58:35,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 11: [2022-11-26 02:58:35,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 02:58:35,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 11: [2022-11-26 02:58:35,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
11: [2022-11-26 02:58:35,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 02:58:35,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 10: [2022-11-26 02:58:35,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:58:35,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 9: [2022-11-26 02:58:35,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 10: [2022-11-26 02:58:35,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 9: [2022-11-26 02:58:35,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 02:58:35,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 12: [2022-11-26 02:58:35,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:58:35,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 02:58:35,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 12: [2022-11-26 02:58:35,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 02:58:35,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 02:58:35,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 12: [2022-11-26 02:58:35,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:58:35,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 02:58:35,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 12: [2022-11-26 02:58:35,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 02:58:35,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 02:58:35,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-26 02:58:35,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:58:35,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:58:35,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 02:58:35,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 02:58:35,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 02:58:35,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-26 02:58:35,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 02:58:35,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-26 02:58:35,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 11: [2022-11-26 02:58:35,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:58:35,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:58:35,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:58:35,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:58:35,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 02:58:35,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 02:58:35,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 02:58:35,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 02:58:35,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 02:58:35,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-26 02:58:35,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-26 02:58:35,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-26 02:58:35,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-26 02:58:35,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:58:35,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:58:35,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 02:58:35,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 02:58:35,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 02:58:35,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 02:58:35,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 02:58:35,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 02:58:35,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-26 02:58:35,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-26 02:58:35,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 2: [2022-11-26 02:58:35,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 11: [2022-11-26 02:58:35,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 02:58:35,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 11: [2022-11-26 02:58:35,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 02:58:35,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 02:58:35,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 11: [2022-11-26 02:58:35,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 02:58:35,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 02:58:35,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 7: [2022-11-26 02:58:35,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 02:58:35,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 02:58:35,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-26 02:58:35,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:58:35,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 02:58:35,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-26 02:58:35,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 02:58:35,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 02:58:35,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-26 02:58:35,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:58:35,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 02:58:35,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-26 02:58:35,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:58:35,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:58:35,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 02:58:35,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 02:58:35,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 02:58:35,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 02:58:35,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
6: [2022-11-26 02:58:35,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-26 02:58:35,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-26 02:58:35,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:58:35,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 02:58:35,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-26 02:58:35,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:58:35,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 02:58:35,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 3: [2022-11-26 02:58:35,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 02:58:35,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 02:58:35,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 6: [2022-11-26 02:58:35,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 02:58:35,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 02:58:35,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-26 02:58:35,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 02:58:35,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: [2022-11-26 02:58:35,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 02:58:35,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 02:58:35,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 8: [2022-11-26 02:58:35,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:58:35,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:58:35,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:58:35,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-26 02:58:35,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 02:58:35,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 02:58:35,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 02:58:35,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 02:58:35,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 8: [2022-11-26 02:58:35,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 8: [2022-11-26 02:58:35,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 8: [2022-11-26 02:58:35,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 8: [2022-11-26 02:58:35,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 02:58:35,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 02:58:35,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 02:58:35,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 02:58:35,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 8: [2022-11-26 02:58:35,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-26 02:58:35,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:58:35,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 02:58:35,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 1: [2022-11-26 02:58:35,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 02:58:35,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 02:58:35,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 15: [2022-11-26 02:58:35,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:58:35,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 02:58:35,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:58:35,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:58:35,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:58:35,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 02:58:35,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 02:58:35,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 02:58:35,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 15: [2022-11-26 02:58:35,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 02:58:35,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 15: [2022-11-26 02:58:35,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-26 02:58:35,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 02:58:35,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:58:35,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 15: [2022-11-26 02:58:35,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 02:58:35,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 15: [2022-11-26 02:58:35,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 02:58:35,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 15: [2022-11-26 02:58:35,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 02:58:35,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 02:58:35,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 15: [2022-11-26 02:58:35,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 15: [2022-11-26 02:58:35,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
14: [2022-11-26 02:58:35,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:58:35,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:58:35,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:58:35,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:58:35,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 02:58:35,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 02:58:35,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 02:58:35,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 02:58:35,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 14: [2022-11-26 02:58:35,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 14: [2022-11-26 02:58:35,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 14: [2022-11-26 02:58:35,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
14: [2022-11-26 02:58:35,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:58:35,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 02:58:35,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 14: [2022-11-26 02:58:35,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:58:35,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:58:35,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 02:58:35,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 02:58:35,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 14: [2022-11-26 02:58:35,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 02:58:35,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 02:58:35,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 14: [2022-11-26 02:58:35,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
13: [2022-11-26 02:58:35,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:58:35,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:58:35,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:58:35,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:58:35,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:58:35,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:58:35,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 02:58:35,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 02:58:35,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 02:58:35,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 02:58:35,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 02:58:35,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 02:58:35,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 13: [2022-11-26 02:58:35,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 02:58:35,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 13: [2022-11-26 02:58:35,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 02:58:35,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 02:58:35,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step24000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 02:58:35,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 13: [2022-11-26 02:58:35,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 13: [2022-11-26 02:58:35,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 
13: [2022-11-26 02:58:35,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 13: [2022-11-26 02:58:35,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 13: [2022-11-26 02:58:35,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step24000 is ready now! 0: successfully saved checkpoint at iteration 24000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3898.53 15: iteration 24010/ 125429 | consumed samples: 6146560 | consumed tokens: 12588154880 | elapsed time per iteration (s): 1.49 | learning rate: 1.855E-04 | global batch size: 256 | lm loss: 2.108290E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 172.311 | TFLOPs: 28.48 | 15: iteration 24020/ 125429 | consumed samples: 6149120 | consumed tokens: 12593397760 | elapsed time per iteration (s): 1.04 | learning rate: 1.855E-04 | global batch size: 256 | lm loss: 2.124347E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.872 | TFLOPs: 40.63 | 15: iteration 24030/ 125429 | consumed samples: 6151680 | consumed tokens: 12598640640 | elapsed time per iteration (s): 1.03 | learning rate: 1.855E-04 | global batch size: 256 | lm loss: 2.141505E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.084 | TFLOPs: 41.00 | 15: iteration 24040/ 125429 | consumed samples: 6154240 | consumed tokens: 12603883520 | elapsed time per iteration (s): 1.03 | learning rate: 1.855E-04 | global batch size: 256 | lm loss: 2.120503E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.652 | TFLOPs: 41.09 | 15: iteration 24050/ 125429 | consumed samples: 6156800 | consumed tokens: 12609126400 | elapsed time per 
iteration (s): 1.04 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.136082E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.638 | TFLOPs: 40.76 | 15: iteration 24060/ 125429 | consumed samples: 6159360 | consumed tokens: 12614369280 | elapsed time per iteration (s): 1.03 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.131402E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.483 | TFLOPs: 40.90 | 15: iteration 24070/ 125429 | consumed samples: 6161920 | consumed tokens: 12619612160 | elapsed time per iteration (s): 1.04 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.115895E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.633 | TFLOPs: 40.76 | 15: iteration 24080/ 125429 | consumed samples: 6164480 | consumed tokens: 12624855040 | elapsed time per iteration (s): 1.04 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.141480E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.464 | TFLOPs: 40.73 | 15: iteration 24090/ 125429 | consumed samples: 6167040 | consumed tokens: 12630097920 | elapsed time per iteration (s): 1.06 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.119211E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.735 | TFLOPs: 39.78 | 15: iteration 24100/ 125429 | consumed samples: 6169600 | consumed tokens: 12635340800 | elapsed time per iteration (s): 1.03 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.097719E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.048 | TFLOPs: 41.16 | 15: 
iteration 24110/ 125429 | consumed samples: 6172160 | consumed tokens: 12640583680 | elapsed time per iteration (s): 1.08 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.135136E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.363 | TFLOPs: 39.23 | 15: iteration 24120/ 125429 | consumed samples: 6174720 | consumed tokens: 12645826560 | elapsed time per iteration (s): 1.03 | learning rate: 1.854E-04 | global batch size: 256 | lm loss: 2.135828E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.075 | TFLOPs: 41.00 | 15: iteration 24130/ 125429 | consumed samples: 6177280 | consumed tokens: 12651069440 | elapsed time per iteration (s): 1.03 | learning rate: 1.853E-04 | global batch size: 256 | lm loss: 2.130817E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.955 | TFLOPs: 40.98 | 15: iteration 24140/ 125429 | consumed samples: 6179840 | consumed tokens: 12656312320 | elapsed time per iteration (s): 1.16 | learning rate: 1.853E-04 | global batch size: 256 | lm loss: 2.100084E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.162 | TFLOPs: 36.55 | 15: iteration 24150/ 125429 | consumed samples: 6182400 | consumed tokens: 12661555200 | elapsed time per iteration (s): 1.11 | learning rate: 1.853E-04 | global batch size: 256 | lm loss: 2.127214E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.462 | TFLOPs: 38.09 | 15: iteration 24160/ 125429 | consumed samples: 6184960 | consumed tokens: 12666798080 | elapsed time per iteration (s): 1.05 | learning rate: 1.853E-04 | global batch size: 256 | lm loss: 2.133706E+00 | grad norm: 0.137 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.013 | TFLOPs: 40.33 | 15: iteration 24170/ 125429 | consumed samples: 6187520 | consumed tokens: 12672040960 | elapsed time per iteration (s): 1.04 | learning rate: 1.853E-04 | global batch size: 256 | lm loss: 2.120938E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.036 | TFLOPs: 40.49 | 15: iteration 24180/ 125429 | consumed samples: 6190080 | consumed tokens: 12677283840 | elapsed time per iteration (s): 1.07 | learning rate: 1.853E-04 | global batch size: 256 | lm loss: 2.130097E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.691 | TFLOPs: 39.61 | 15: iteration 24190/ 125429 | consumed samples: 6192640 | consumed tokens: 12682526720 | elapsed time per iteration (s): 1.06 | learning rate: 1.853E-04 | global batch size: 256 | lm loss: 2.119446E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.418 | TFLOPs: 39.73 | 15: iteration 24200/ 125429 | consumed samples: 6195200 | consumed tokens: 12687769600 | elapsed time per iteration (s): 1.06 | learning rate: 1.853E-04 | global batch size: 256 | lm loss: 2.138952E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.301 | TFLOPs: 40.04 | 15: iteration 24210/ 125429 | consumed samples: 6197760 | consumed tokens: 12693012480 | elapsed time per iteration (s): 1.05 | learning rate: 1.852E-04 | global batch size: 256 | lm loss: 2.123643E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.148 | TFLOPs: 40.35 | 15: iteration 24220/ 125429 | consumed samples: 6200320 | consumed tokens: 12698255360 | elapsed time per iteration (s): 1.06 | learning rate: 
1.852E-04 | global batch size: 256 | lm loss: 2.099430E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.349 | TFLOPs: 39.88 | 15: iteration 24230/ 125429 | consumed samples: 6202880 | consumed tokens: 12703498240 | elapsed time per iteration (s): 1.05 | learning rate: 1.852E-04 | global batch size: 256 | lm loss: 2.115826E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.721 | TFLOPs: 40.44 | 15: iteration 24240/ 125429 | consumed samples: 6205440 | consumed tokens: 12708741120 | elapsed time per iteration (s): 1.03 | learning rate: 1.852E-04 | global batch size: 256 | lm loss: 2.124851E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.514 | TFLOPs: 41.23 | 15: iteration 24250/ 125429 | consumed samples: 6208000 | consumed tokens: 12713984000 | elapsed time per iteration (s): 1.04 | learning rate: 1.852E-04 | global batch size: 256 | lm loss: 2.130300E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.081 | TFLOPs: 40.67 | 15: iteration 24260/ 125429 | consumed samples: 6210560 | consumed tokens: 12719226880 | elapsed time per iteration (s): 1.08 | learning rate: 1.852E-04 | global batch size: 256 | lm loss: 2.119660E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.048 | TFLOPs: 39.34 | 15: iteration 24270/ 125429 | consumed samples: 6213120 | consumed tokens: 12724469760 | elapsed time per iteration (s): 1.06 | learning rate: 1.852E-04 | global batch size: 256 | lm loss: 2.141114E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.696 | TFLOPs: 39.94 | 15: iteration 24280/ 125429 | consumed 
samples: 6215680 | consumed tokens: 12729712640 | elapsed time per iteration (s): 1.04 | learning rate: 1.852E-04 | global batch size: 256 | lm loss: 2.128720E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.405 | TFLOPs: 40.72 | 15: iteration 24290/ 125429 | consumed samples: 6218240 | consumed tokens: 12734955520 | elapsed time per iteration (s): 1.07 | learning rate: 1.851E-04 | global batch size: 256 | lm loss: 2.094067E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.985 | TFLOPs: 39.49 | 15: iteration 24300/ 125429 | consumed samples: 6220800 | consumed tokens: 12740198400 | elapsed time per iteration (s): 1.08 | learning rate: 1.851E-04 | global batch size: 256 | lm loss: 2.115273E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.969 | TFLOPs: 39.33 | 15: iteration 24310/ 125429 | consumed samples: 6223360 | consumed tokens: 12745441280 | elapsed time per iteration (s): 1.11 | learning rate: 1.851E-04 | global batch size: 256 | lm loss: 2.111570E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.601 | TFLOPs: 37.94 | 15: iteration 24320/ 125429 | consumed samples: 6225920 | consumed tokens: 12750684160 | elapsed time per iteration (s): 1.08 | learning rate: 1.851E-04 | global batch size: 256 | lm loss: 2.124058E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.740 | TFLOPs: 39.29 | 15: iteration 24330/ 125429 | consumed samples: 6228480 | consumed tokens: 12755927040 | elapsed time per iteration (s): 1.09 | learning rate: 1.851E-04 | global batch size: 256 | lm loss: 2.122451E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 234.574 | TFLOPs: 38.77 | 15: iteration 24340/ 125429 | consumed samples: 6231040 | consumed tokens: 12761169920 | elapsed time per iteration (s): 1.05 | learning rate: 1.851E-04 | global batch size: 256 | lm loss: 2.132457E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.736 | TFLOPs: 40.11 | 15: iteration 24350/ 125429 | consumed samples: 6233600 | consumed tokens: 12766412800 | elapsed time per iteration (s): 1.05 | learning rate: 1.851E-04 | global batch size: 256 | lm loss: 2.127122E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.177 | TFLOPs: 40.35 | 15: iteration 24360/ 125429 | consumed samples: 6236160 | consumed tokens: 12771655680 | elapsed time per iteration (s): 1.07 | learning rate: 1.851E-04 | global batch size: 256 | lm loss: 2.118116E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.599 | TFLOPs: 39.43 | 15: iteration 24370/ 125429 | consumed samples: 6238720 | consumed tokens: 12776898560 | elapsed time per iteration (s): 1.13 | learning rate: 1.850E-04 | global batch size: 256 | lm loss: 2.128885E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.166 | TFLOPs: 37.54 | 15: iteration 24380/ 125429 | consumed samples: 6241280 | consumed tokens: 12782141440 | elapsed time per iteration (s): 1.05 | learning rate: 1.850E-04 | global batch size: 256 | lm loss: 2.094133E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.140 | TFLOPs: 40.18 | 15: iteration 24390/ 125429 | consumed samples: 6243840 | consumed tokens: 12787384320 | elapsed time per iteration (s): 1.03 | learning rate: 1.850E-04 | global batch size: 256 | lm 
loss: 2.100075E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.013 | TFLOPs: 40.99 | 15: iteration 24400/ 125429 | consumed samples: 6246400 | consumed tokens: 12792627200 | elapsed time per iteration (s): 1.04 | learning rate: 1.850E-04 | global batch size: 256 | lm loss: 2.138486E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.124 | TFLOPs: 40.84 | 15: iteration 24410/ 125429 | consumed samples: 6248960 | consumed tokens: 12797870080 | elapsed time per iteration (s): 1.08 | learning rate: 1.850E-04 | global batch size: 256 | lm loss: 2.129251E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.843 | TFLOPs: 39.31 | 15: iteration 24420/ 125429 | consumed samples: 6251520 | consumed tokens: 12803112960 | elapsed time per iteration (s): 1.05 | learning rate: 1.850E-04 | global batch size: 256 | lm loss: 2.138396E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.302 | TFLOPs: 40.21 | 15: iteration 24430/ 125429 | consumed samples: 6254080 | consumed tokens: 12808355840 | elapsed time per iteration (s): 1.07 | learning rate: 1.850E-04 | global batch size: 256 | lm loss: 2.139264E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.102 | TFLOPs: 39.51 | 15: iteration 24440/ 125429 | consumed samples: 6256640 | consumed tokens: 12813598720 | elapsed time per iteration (s): 1.06 | learning rate: 1.850E-04 | global batch size: 256 | lm loss: 2.127157E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.762 | TFLOPs: 39.79 | 15: iteration 24450/ 125429 | consumed samples: 6259200 | consumed tokens: 
12818841600 | elapsed time per iteration (s): 1.02 | learning rate: 1.849E-04 | global batch size: 256 | lm loss: 2.107201E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.462 | TFLOPs: 41.39 | 15: iteration 24460/ 125429 | consumed samples: 6261760 | consumed tokens: 12824084480 | elapsed time per iteration (s): 1.08 | learning rate: 1.849E-04 | global batch size: 256 | lm loss: 2.150051E+00 | grad norm: 0.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.859 | TFLOPs: 39.31 | 15: iteration 24470/ 125429 | consumed samples: 6264320 | consumed tokens: 12829327360 | elapsed time per iteration (s): 1.04 | learning rate: 1.849E-04 | global batch size: 256 | lm loss: 2.129766E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.038 | TFLOPs: 40.49 | 15: iteration 24480/ 125429 | consumed samples: 6266880 | consumed tokens: 12834570240 | elapsed time per iteration (s): 1.08 | learning rate: 1.849E-04 | global batch size: 256 | lm loss: 2.148626E+00 | grad norm: 0.282 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.666 | TFLOPs: 39.28 | 15: iteration 24490/ 125429 | consumed samples: 6269440 | consumed tokens: 12839813120 | elapsed time per iteration (s): 1.09 | learning rate: 1.849E-04 | global batch size: 256 | lm loss: 2.139974E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.958 | TFLOPs: 38.66 | 15: iteration 24500/ 125429 | consumed samples: 6272000 | consumed tokens: 12845056000 | elapsed time per iteration (s): 1.05 | learning rate: 1.849E-04 | global batch size: 256 | lm loss: 2.142843E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
243.554 | TFLOPs: 40.25 | 15: iteration 24510/ 125429 | consumed samples: 6274560 | consumed tokens: 12850298880 | elapsed time per iteration (s): 1.06 | learning rate: 1.849E-04 | global batch size: 256 | lm loss: 2.151397E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.955 | TFLOPs: 39.82 | 15: iteration 24520/ 125429 | consumed samples: 6277120 | consumed tokens: 12855541760 | elapsed time per iteration (s): 1.05 | learning rate: 1.849E-04 | global batch size: 256 | lm loss: 2.165289E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.763 | TFLOPs: 40.28 | 15: iteration 24530/ 125429 | consumed samples: 6279680 | consumed tokens: 12860784640 | elapsed time per iteration (s): 1.06 | learning rate: 1.848E-04 | global batch size: 256 | lm loss: 2.129048E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.452 | TFLOPs: 39.74 | 15: iteration 24540/ 125429 | consumed samples: 6282240 | consumed tokens: 12866027520 | elapsed time per iteration (s): 1.04 | learning rate: 1.848E-04 | global batch size: 256 | lm loss: 2.148070E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.403 | TFLOPs: 40.72 | 15: iteration 24550/ 125429 | consumed samples: 6284800 | consumed tokens: 12871270400 | elapsed time per iteration (s): 1.04 | learning rate: 1.848E-04 | global batch size: 256 | lm loss: 2.128422E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.280 | TFLOPs: 40.53 | 15: iteration 24560/ 125429 | consumed samples: 6287360 | consumed tokens: 12876513280 | elapsed time per iteration (s): 1.06 | learning rate: 1.848E-04 | global batch size: 256 | lm loss: 2.135114E+00 | grad norm: 0.133 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.857 | TFLOPs: 39.80 | 15: iteration 24570/ 125429 | consumed samples: 6289920 | consumed tokens: 12881756160 | elapsed time per iteration (s): 1.03 | learning rate: 1.848E-04 | global batch size: 256 | lm loss: 2.155273E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.654 | TFLOPs: 40.93 | 15: iteration 24580/ 125429 | consumed samples: 6292480 | consumed tokens: 12886999040 | elapsed time per iteration (s): 1.03 | learning rate: 1.848E-04 | global batch size: 256 | lm loss: 2.146416E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.890 | TFLOPs: 41.13 | 15: iteration 24590/ 125429 | consumed samples: 6295040 | consumed tokens: 12892241920 | elapsed time per iteration (s): 1.83 | learning rate: 1.848E-04 | global batch size: 256 | lm loss: 2.084473E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 139.991 | TFLOPs: 23.13 | 15: iteration 24600/ 125429 | consumed samples: 6297600 | consumed tokens: 12897484800 | elapsed time per iteration (s): 1.05 | learning rate: 1.848E-04 | global batch size: 256 | lm loss: 2.121712E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.943 | TFLOPs: 40.31 | 15: iteration 24610/ 125429 | consumed samples: 6300160 | consumed tokens: 12902727680 | elapsed time per iteration (s): 1.08 | learning rate: 1.847E-04 | global batch size: 256 | lm loss: 2.130456E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.143 | TFLOPs: 39.19 | 15: iteration 24620/ 125429 | consumed samples: 6302720 | consumed tokens: 12907970560 | elapsed time per iteration (s): 
1.05 | learning rate: 1.847E-04 | global batch size: 256 | lm loss: 2.160686E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.642 | TFLOPs: 40.43 | 15: iteration 24630/ 125429 | consumed samples: 6305280 | consumed tokens: 12913213440 | elapsed time per iteration (s): 1.06 | learning rate: 1.847E-04 | global batch size: 256 | lm loss: 2.122985E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.770 | TFLOPs: 39.95 | 15: iteration 24640/ 125429 | consumed samples: 6307840 | consumed tokens: 12918456320 | elapsed time per iteration (s): 1.08 | learning rate: 1.847E-04 | global batch size: 256 | lm loss: 2.148622E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.045 | TFLOPs: 39.01 | 15: iteration 24650/ 125429 | consumed samples: 6310400 | consumed tokens: 12923699200 | elapsed time per iteration (s): 1.06 | learning rate: 1.847E-04 | global batch size: 256 | lm loss: 2.137367E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.843 | TFLOPs: 39.97 | 15: iteration 24660/ 125429 | consumed samples: 6312960 | consumed tokens: 12928942080 | elapsed time per iteration (s): 1.07 | learning rate: 1.847E-04 | global batch size: 256 | lm loss: 2.132362E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.910 | TFLOPs: 39.65 | 15: iteration 24670/ 125429 | consumed samples: 6315520 | consumed tokens: 12934184960 | elapsed time per iteration (s): 1.07 | learning rate: 1.847E-04 | global batch size: 256 | lm loss: 2.134949E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.534 | TFLOPs: 39.58 | 15: iteration 24680/ 
125429 | consumed samples: 6318080 | consumed tokens: 12939427840 | elapsed time per iteration (s): 1.04 | learning rate: 1.847E-04 | global batch size: 256 | lm loss: 2.119561E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.173 | TFLOPs: 40.68 | 15: iteration 24690/ 125429 | consumed samples: 6320640 | consumed tokens: 12944670720 | elapsed time per iteration (s): 1.05 | learning rate: 1.846E-04 | global batch size: 256 | lm loss: 2.154914E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.130 | TFLOPs: 40.34 | 15: iteration 24700/ 125429 | consumed samples: 6323200 | consumed tokens: 12949913600 | elapsed time per iteration (s): 1.04 | learning rate: 1.846E-04 | global batch size: 256 | lm loss: 2.145822E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.321 | TFLOPs: 40.54 | 15: iteration 24710/ 125429 | consumed samples: 6325760 | consumed tokens: 12955156480 | elapsed time per iteration (s): 1.09 | learning rate: 1.846E-04 | global batch size: 256 | lm loss: 2.075213E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.589 | TFLOPs: 38.93 | 15: iteration 24720/ 125429 | consumed samples: 6328320 | consumed tokens: 12960399360 | elapsed time per iteration (s): 1.05 | learning rate: 1.846E-04 | global batch size: 256 | lm loss: 2.125572E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.938 | TFLOPs: 40.15 | 15: iteration 24730/ 125429 | consumed samples: 6330880 | consumed tokens: 12965642240 | elapsed time per iteration (s): 1.05 | learning rate: 1.846E-04 | global batch size: 256 | lm loss: 2.104238E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | samples per second: 243.207 | TFLOPs: 40.19 | 15: iteration 24740/ 125429 | consumed samples: 6333440 | consumed tokens: 12970885120 | elapsed time per iteration (s): 1.09 | learning rate: 1.846E-04 | global batch size: 256 | lm loss: 2.120926E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.603 | TFLOPs: 38.77 | 15: iteration 24750/ 125429 | consumed samples: 6336000 | consumed tokens: 12976128000 | elapsed time per iteration (s): 1.04 | learning rate: 1.846E-04 | global batch size: 256 | lm loss: 2.104421E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.847 | TFLOPs: 40.79 | 15: iteration 24760/ 125429 | consumed samples: 6338560 | consumed tokens: 12981370880 | elapsed time per iteration (s): 1.08 | learning rate: 1.845E-04 | global batch size: 256 | lm loss: 2.104421E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.466 | TFLOPs: 39.24 | 15: iteration 24770/ 125429 | consumed samples: 6341120 | consumed tokens: 12986613760 | elapsed time per iteration (s): 1.05 | learning rate: 1.845E-04 | global batch size: 256 | lm loss: 2.132469E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.684 | TFLOPs: 40.27 | 15: iteration 24780/ 125429 | consumed samples: 6343680 | consumed tokens: 12991856640 | elapsed time per iteration (s): 1.06 | learning rate: 1.845E-04 | global batch size: 256 | lm loss: 2.123588E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.320 | TFLOPs: 40.05 | 15: iteration 24790/ 125429 | consumed samples: 6346240 | consumed tokens: 12997099520 | elapsed time per iteration (s): 1.03 | learning rate: 1.845E-04 | global batch 
size: 256 | lm loss: 2.125804E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.588 | TFLOPs: 41.25 | 15: iteration 24800/ 125429 | consumed samples: 6348800 | consumed tokens: 13002342400 | elapsed time per iteration (s): 1.04 | learning rate: 1.845E-04 | global batch size: 256 | lm loss: 2.122690E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.842 | TFLOPs: 40.63 | 15: iteration 24810/ 125429 | consumed samples: 6351360 | consumed tokens: 13007585280 | elapsed time per iteration (s): 1.06 | learning rate: 1.845E-04 | global batch size: 256 | lm loss: 2.142918E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.411 | TFLOPs: 39.89 | 15: iteration 24820/ 125429 | consumed samples: 6353920 | consumed tokens: 13012828160 | elapsed time per iteration (s): 1.03 | learning rate: 1.845E-04 | global batch size: 256 | lm loss: 2.131661E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.041 | TFLOPs: 40.99 | 15: iteration 24830/ 125429 | consumed samples: 6356480 | consumed tokens: 13018071040 | elapsed time per iteration (s): 1.04 | learning rate: 1.845E-04 | global batch size: 256 | lm loss: 2.133616E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.388 | TFLOPs: 40.55 | 15: iteration 24840/ 125429 | consumed samples: 6359040 | consumed tokens: 13023313920 | elapsed time per iteration (s): 1.03 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 2.105607E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.963 | TFLOPs: 41.14 | 15: iteration 24850/ 125429 | consumed samples: 6361600 | consumed 
tokens: 13028556800 | elapsed time per iteration (s): 1.03 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 2.132196E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.115 | TFLOPs: 41.00 | 15: iteration 24860/ 125429 | consumed samples: 6364160 | consumed tokens: 13033799680 | elapsed time per iteration (s): 1.04 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 2.082463E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.988 | TFLOPs: 40.49 | 15: iteration 24870/ 125429 | consumed samples: 6366720 | consumed tokens: 13039042560 | elapsed time per iteration (s): 1.03 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 2.122797E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.239 | TFLOPs: 41.19 | 15: iteration 24880/ 125429 | consumed samples: 6369280 | consumed tokens: 13044285440 | elapsed time per iteration (s): 1.04 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 2.111501E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.321 | TFLOPs: 40.54 | 15: iteration 24890/ 125429 | consumed samples: 6371840 | consumed tokens: 13049528320 | elapsed time per iteration (s): 1.03 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 2.123290E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.888 | TFLOPs: 40.97 | 15: iteration 24900/ 125429 | consumed samples: 6374400 | consumed tokens: 13054771200 | elapsed time per iteration (s): 1.04 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 2.100904E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 246.500 | TFLOPs: 40.74 | 15: iteration 24910/ 125429 | consumed samples: 6376960 | consumed tokens: 13060014080 | elapsed time per iteration (s): 1.03 | learning rate: 1.844E-04 | global batch size: 256 | lm loss: 2.099792E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.564 | TFLOPs: 41.08 | 15: iteration 24920/ 125429 | consumed samples: 6379520 | consumed tokens: 13065256960 | elapsed time per iteration (s): 1.02 | learning rate: 1.843E-04 | global batch size: 256 | lm loss: 2.117566E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.763 | TFLOPs: 41.44 | 15: iteration 24930/ 125429 | consumed samples: 6382080 | consumed tokens: 13070499840 | elapsed time per iteration (s): 1.06 | learning rate: 1.843E-04 | global batch size: 256 | lm loss: 2.118926E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.619 | TFLOPs: 39.76 | 15: iteration 24940/ 125429 | consumed samples: 6384640 | consumed tokens: 13075742720 | elapsed time per iteration (s): 1.06 | learning rate: 1.843E-04 | global batch size: 256 | lm loss: 2.111070E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.233 | TFLOPs: 39.87 | 15: iteration 24950/ 125429 | consumed samples: 6387200 | consumed tokens: 13080985600 | elapsed time per iteration (s): 1.03 | learning rate: 1.843E-04 | global batch size: 256 | lm loss: 2.120182E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.430 | TFLOPs: 40.89 | 15: iteration 24960/ 125429 | consumed samples: 6389760 | consumed tokens: 13086228480 | elapsed time per iteration (s): 1.07 | learning rate: 1.843E-04 | global batch size: 256 | lm loss: 2.107586E+00 | grad norm: 
0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.564 | TFLOPs: 39.42 | 15: iteration 24970/ 125429 | consumed samples: 6392320 | consumed tokens: 13091471360 | elapsed time per iteration (s): 1.07 | learning rate: 1.843E-04 | global batch size: 256 | lm loss: 2.100879E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.490 | TFLOPs: 39.58 | 15: iteration 24980/ 125429 | consumed samples: 6394880 | consumed tokens: 13096714240 | elapsed time per iteration (s): 1.05 | learning rate: 1.843E-04 | global batch size: 256 | lm loss: 2.101954E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.730 | TFLOPs: 40.11 | 15: iteration 24990/ 125429 | consumed samples: 6397440 | consumed tokens: 13101957120 | elapsed time per iteration (s): 1.08 | learning rate: 1.843E-04 | global batch size: 256 | lm loss: 2.119850E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.957 | TFLOPs: 39.32 | 15: iteration 25000/ 125429 | consumed samples: 6400000 | consumed tokens: 13107200000 | elapsed time per iteration (s): 1.05 | learning rate: 1.842E-04 | global batch size: 256 | lm loss: 2.119380E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.000 | TFLOPs: 40.32 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 25000 | lm loss value: 2.083086E+00 | lm loss PPL: 8.029213E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 25000 to checkpoints_1b5 0: [2022-11-26 03:16:19,150] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step25000 is begin to 
save! 0: [2022-11-26 03:16:19,159] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_01-model_00-model_states.pt... 0: [2022-11-26 03:16:19,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_01-model_00-model_states.pt. 0: [2022-11-26 03:16:19,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_03-model_00-model_states.pt... 0: [2022-11-26 03:16:19,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_03-model_00-model_states.pt. 0: [2022-11-26 03:16:19,506] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_04-model_00-model_states.pt... 0: [2022-11-26 03:16:19,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_04-model_00-model_states.pt. 0: [2022-11-26 03:16:19,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_05-model_00-model_states.pt... 0: [2022-11-26 03:16:19,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_05-model_00-model_states.pt. 0: [2022-11-26 03:16:19,726] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_06-model_00-model_states.pt... 0: [2022-11-26 03:16:19,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_06-model_00-model_states.pt. 0: [2022-11-26 03:16:19,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_07-model_00-model_states.pt... 0: [2022-11-26 03:16:19,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 03:16:19,936] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_08-model_00-model_states.pt... 0: [2022-11-26 03:16:20,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_08-model_00-model_states.pt. 0: [2022-11-26 03:16:20,043] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_09-model_00-model_states.pt... 0: [2022-11-26 03:16:20,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_09-model_00-model_states.pt. 0: [2022-11-26 03:16:20,152] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_10-model_00-model_states.pt... 0: [2022-11-26 03:16:20,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_10-model_00-model_states.pt. 0: [2022-11-26 03:16:20,260] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_11-model_00-model_states.pt... 0: [2022-11-26 03:16:20,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_11-model_00-model_states.pt. 0: [2022-11-26 03:16:20,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_12-model_00-model_states.pt... 0: [2022-11-26 03:16:20,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_12-model_00-model_states.pt. 0: [2022-11-26 03:16:20,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_13-model_00-model_states.pt... 0: [2022-11-26 03:16:20,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 03:16:20,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_14-model_00-model_states.pt... 0: [2022-11-26 03:16:20,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_14-model_00-model_states.pt. 0: [2022-11-26 03:16:20,682] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_15-model_00-model_states.pt... 0: [2022-11-26 03:16:20,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_15-model_00-model_states.pt. 0: [2022-11-26 03:16:20,791] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_16-model_00-model_states.pt... 0: [2022-11-26 03:16:20,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_16-model_00-model_states.pt. 0: [2022-11-26 03:16:20,898] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_17-model_00-model_states.pt... 0: [2022-11-26 03:16:21,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_17-model_00-model_states.pt. 0: [2022-11-26 03:16:21,003] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_18-model_00-model_states.pt... 0: [2022-11-26 03:16:21,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_18-model_00-model_states.pt. 0: [2022-11-26 03:16:21,113] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_19-model_00-model_states.pt... 0: [2022-11-26 03:16:21,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 03:16:21,218] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_20-model_00-model_states.pt... 0: [2022-11-26 03:16:21,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_20-model_00-model_states.pt. 0: [2022-11-26 03:16:21,320] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_21-model_00-model_states.pt... 0: [2022-11-26 03:16:21,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_21-model_00-model_states.pt. 0: [2022-11-26 03:16:21,430] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_22-model_00-model_states.pt... 0: [2022-11-26 03:16:21,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_22-model_00-model_states.pt. 0: [2022-11-26 03:16:21,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_23-model_00-model_states.pt... 0: [2022-11-26 03:16:21,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_23-model_00-model_states.pt. 0: [2022-11-26 03:16:21,638] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_24-model_00-model_states.pt... 0: [2022-11-26 03:16:21,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_24-model_00-model_states.pt. 0: [2022-11-26 03:16:21,744] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_25-model_00-model_states.pt... 0: [2022-11-26 03:16:21,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 03:16:21,851] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_26-model_00-model_states.pt... 0: [2022-11-26 03:16:21,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_26-model_00-model_states.pt. 0: [2022-11-26 03:16:21,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_27-model_00-model_states.pt... 0: [2022-11-26 03:16:22,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_27-model_00-model_states.pt. 0: [2022-11-26 03:16:22,062] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_28-model_00-model_states.pt... 0: [2022-11-26 03:16:22,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_28-model_00-model_states.pt. 0: [2022-11-26 03:16:22,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_29-model_00-model_states.pt... 0: [2022-11-26 03:16:22,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_29-model_00-model_states.pt. 0: [2022-11-26 03:16:22,273] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_30-model_00-model_states.pt... 0: [2022-11-26 03:16:22,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_30-model_00-model_states.pt. 0: [2022-11-26 03:16:22,379] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/layer_32-model_00-model_states.pt... 0: [2022-11-26 03:16:22,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 03:16:22,385] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step25000/mp_rank_00_model_states.pt 0: [2022-11-26 03:16:22,385] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/mp_rank_00_model_states.pt... 0: [2022-11-26 03:16:22,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/mp_rank_00_model_states.pt. 0: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
5: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
9: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
6: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
2: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
12: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
11: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
10: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
7: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
12: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:16:22,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step25000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:16:22,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 03:16:22,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 03:16:22,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-26 03:16:22,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:16:22,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 03:16:22,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-26 03:16:22,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:16:22,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 03:16:22,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 15: [2022-11-26 03:16:22,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:16:22,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:16:22,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 03:16:22,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
8: [2022-11-26 03:16:22,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:16:22,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 03:16:22,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 13: [2022-11-26 03:16:22,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:16:22,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:16:22,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 03:16:22,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 14: [2022-11-26 03:16:22,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:16:22,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 03:16:22,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 9: [2022-11-26 03:16:22,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:16:22,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
9: [2022-11-26 03:16:22,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 03:16:22,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 13: [2022-11-26 03:16:22,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 03:16:22,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 13: [2022-11-26 03:16:22,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:16:22,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 03:16:22,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 15: [2022-11-26 03:16:22,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 03:16:22,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 15: [2022-11-26 03:16:22,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:16:22,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 03:16:22,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
15: [2022-11-26 03:16:22,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:16:22,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 03:16:22,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 15: [2022-11-26 03:16:22,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:16:22,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:16:22,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 9: [2022-11-26 03:16:22,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 12: [2022-11-26 03:16:22,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:16:22,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 9: [2022-11-26 03:16:22,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 12: [2022-11-26 03:16:22,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 03:16:22,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
15: [2022-11-26 03:16:22,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:16:22,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:16:22,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 03:16:22,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-26 03:16:22,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:16:22,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:16:22,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 03:16:22,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-26 03:16:22,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:16:22,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 03:16:22,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-26 03:16:22,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 03:16:22,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 03:16:22,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 10: [2022-11-26 03:16:22,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:16:22,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 03:16:22,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 12: [2022-11-26 03:16:22,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:16:22,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 03:16:22,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 12: [2022-11-26 03:16:22,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:16:22,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
12: [2022-11-26 03:16:22,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 8: [2022-11-26 03:16:22,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 12: [2022-11-26 03:16:22,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 8: [2022-11-26 03:16:22,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 12: [2022-11-26 03:16:22,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:16:22,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 03:16:22,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 11: [2022-11-26 03:16:22,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:16:22,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:16:22,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 03:16:22,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 03:16:22,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
11: [2022-11-26 03:16:22,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 10: [2022-11-26 03:16:22,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:16:22,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 03:16:22,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 9: [2022-11-26 03:16:22,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:16:22,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:16:22,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 0: [2022-11-26 03:16:22,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:16:22,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 03:16:22,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 9: [2022-11-26 03:16:22,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
0: [2022-11-26 03:16:22,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 03:16:22,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 12: [2022-11-26 03:16:22,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:16:22,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 03:16:22,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 14: [2022-11-26 03:16:22,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:16:22,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:16:22,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 03:16:22,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:16:22,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:16:22,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
14: [2022-11-26 03:16:22,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 03:16:22,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 03:16:22,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 03:16:22,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 14: [2022-11-26 03:16:22,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 14: [2022-11-26 03:16:22,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 10: [2022-11-26 03:16:22,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:16:22,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:16:22,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 14: [2022-11-26 03:16:22,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 10: [2022-11-26 03:16:22,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 14: [2022-11-26 03:16:22,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
11: [2022-11-26 03:16:22,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:16:22,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:16:22,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 03:16:22,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-26 03:16:22,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:16:22,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:16:22,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 03:16:22,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 03:16:22,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-26 03:16:22,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 13: [2022-11-26 03:16:22,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:16:22,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
8: [2022-11-26 03:16:22,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:16:22,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 03:16:22,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 8: [2022-11-26 03:16:22,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 13: [2022-11-26 03:16:22,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 13: [2022-11-26 03:16:22,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 8: [2022-11-26 03:16:22,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 13: [2022-11-26 03:16:22,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:16:22,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:16:22,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
13: [2022-11-26 03:16:22,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 03:16:22,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 8: [2022-11-26 03:16:22,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 13: [2022-11-26 03:16:22,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 13: [2022-11-26 03:16:22,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 8: [2022-11-26 03:16:22,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 15: [2022-11-26 03:16:22,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 03:16:22,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 15: [2022-11-26 03:16:22,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:16:22,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 03:16:22,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 15: [2022-11-26 03:16:22,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 03:16:22,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 03:16:22,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 15: [2022-11-26 03:16:22,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:16:22,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 03:16:22,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 10: [2022-11-26 03:16:22,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:16:22,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 03:16:22,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-26 03:16:22,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:16:22,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 03:16:22,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-26 03:16:22,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 03:16:22,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 03:16:22,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 8: [2022-11-26 03:16:22,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:16:22,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 03:16:22,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 9: [2022-11-26 03:16:22,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:16:22,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 03:16:22,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 9: [2022-11-26 03:16:22,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:16:22,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 03:16:22,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
11: [2022-11-26 03:16:22,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 3: [2022-11-26 03:16:22,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:16:22,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-26 03:16:22,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 11: [2022-11-26 03:16:22,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:16:22,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 3: [2022-11-26 03:16:22,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 11: [2022-11-26 03:16:22,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 11: [2022-11-26 03:16:22,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:16:22,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 03:16:22,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 12: [2022-11-26 03:16:22,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 03:16:22,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 03:16:22,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 12: [2022-11-26 03:16:22,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:16:22,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:16:22,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 03:16:22,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 03:16:22,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 12: [2022-11-26 03:16:22,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 14: [2022-11-26 03:16:22,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:16:22,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 3: [2022-11-26 03:16:22,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:16:22,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
3: [2022-11-26 03:16:22,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 03:16:22,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-26 03:16:22,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:16:22,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:16:22,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 03:16:22,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 03:16:22,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 3: [2022-11-26 03:16:22,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 9: [2022-11-26 03:16:22,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:16:22,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 03:16:22,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 11: [2022-11-26 03:16:22,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
13: [2022-11-26 03:16:22,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:16:22,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:16:22,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:16:22,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 03:16:22,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 8: [2022-11-26 03:16:22,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 13: [2022-11-26 03:16:22,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 13: [2022-11-26 03:16:22,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 8: [2022-11-26 03:16:22,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-26 03:16:22,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:16:22,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 03:16:22,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 03:16:22,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-26 03:16:22,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 03:16:22,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 8: [2022-11-26 03:16:22,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:16:22,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 03:16:22,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 10: [2022-11-26 03:16:22,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:16:22,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 03:16:22,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 10: [2022-11-26 03:16:22,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:16:22,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 03:16:22,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
10: [2022-11-26 03:16:22,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:16:22,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 03:16:22,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-26 03:16:22,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 03:16:22,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-26 03:16:22,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:16:22,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:16:22,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 03:16:22,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:16:22,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 03:16:22,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
4: [2022-11-26 03:16:22,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 03:16:22,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-26 03:16:22,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-26 03:16:22,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:16:22,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 03:16:22,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 4: [2022-11-26 03:16:22,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:16:22,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 03:16:22,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 11: [2022-11-26 03:16:22,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 03:16:22,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 11: [2022-11-26 03:16:22,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 03:16:22,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 03:16:22,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 9: [2022-11-26 03:16:22,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:16:22,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 03:16:22,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-26 03:16:22,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:16:22,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:16:22,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:16:22,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 03:16:22,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 03:16:22,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 03:16:22,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 03:16:22,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 03:16:22,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-26 03:16:22,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-26 03:16:22,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-26 03:16:22,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-26 03:16:22,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:16:22,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:16:22,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:16:22,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 03:16:22,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 03:16:22,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 03:16:22,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 03:16:22,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 03:16:22,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-26 03:16:22,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-26 03:16:22,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-26 03:16:22,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-26 03:16:22,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:16:22,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:16:22,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 03:16:22,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 03:16:22,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 03:16:22,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 03:16:22,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-26 03:16:22,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-26 03:16:22,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 7: [2022-11-26 03:16:22,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:16:22,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 03:16:22,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-26 03:16:22,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:16:22,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 03:16:22,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
5: [2022-11-26 03:16:22,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:16:22,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 03:16:22,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:16:22,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-26 03:16:22,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 03:16:22,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 5: [2022-11-26 03:16:22,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:16:22,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 03:16:22,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 0: [2022-11-26 03:16:22,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 03:16:22,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-26 03:16:22,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 03:16:22,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:16:22,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:16:22,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:16:22,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:16:22,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:16:22,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 03:16:22,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:16:22,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 03:16:22,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 03:16:22,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 03:16:22,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-26 03:16:22,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
2: [2022-11-26 03:16:22,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 03:16:22,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 03:16:22,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 03:16:22,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-26 03:16:22,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-26 03:16:22,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-26 03:16:22,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-26 03:16:22,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 2: [2022-11-26 03:16:22,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:16:22,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 03:16:22,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-26 03:16:22,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:16:22,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 03:16:22,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:16:22,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:16:22,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:16:22,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:16:22,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:16:22,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 03:16:22,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 03:16:22,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 03:16:22,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 03:16:22,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 03:16:22,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 03:16:22,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 03:16:22,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 03:16:22,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-26 03:16:22,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 03:16:22,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-26 03:16:22,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-26 03:16:22,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
1: [2022-11-26 03:16:22,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-26 03:16:22,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-26 03:16:22,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 1: [2022-11-26 03:16:22,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 6: [2022-11-26 03:16:22,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:16:22,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:16:22,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:16:22,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:16:22,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 03:16:22,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:16:22,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:16:22,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 03:16:22,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:16:22,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 03:16:22,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 6: [2022-11-26 03:16:22,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 03:16:22,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 03:16:22,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 03:16:22,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 03:16:22,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 6: [2022-11-26 03:16:22,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 03:16:22,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step25000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 03:16:22,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 6: [2022-11-26 03:16:22,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now! 
6: [2022-11-26 03:16:22,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now!
6: [2022-11-26 03:16:22,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now!
6: [2022-11-26 03:16:22,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now!
6: [2022-11-26 03:16:22,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step25000 is ready now!
0: successfully saved checkpoint at iteration 25000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3741.63
15: iteration 25010/ 125429 | consumed samples: 6402560 | consumed tokens: 13112442880 | elapsed time per iteration (s): 1.43 | learning rate: 1.842E-04 | global batch size: 256 | lm loss: 2.107944E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 179.172 | TFLOPs: 29.61 |
15: iteration 25020/ 125429 | consumed samples: 6405120 | consumed tokens: 13117685760 | elapsed time per iteration (s): 1.06 | learning rate: 1.842E-04 | global batch size: 256 | lm loss: 2.114430E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.028 | TFLOPs: 40.00 |
15: iteration 25030/ 125429 | consumed samples: 6407680 | consumed tokens: 13122928640 | elapsed time per iteration (s): 1.05 | learning rate: 1.842E-04 | global batch size: 256 | lm loss: 2.116418E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.324 | TFLOPs: 40.38 |
15: iteration 25040/ 125429 | consumed samples: 6410240 | consumed tokens: 13128171520 | elapsed time per iteration (s): 1.04 | learning rate: 1.842E-04 | global batch size: 256 | lm loss: 2.135729E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.173 | TFLOPs: 40.85 |
15: iteration 25050/ 125429 | consumed samples: 6412800 | consumed tokens: 13133414400 | elapsed time per iteration (s): 1.06 | learning rate: 1.842E-04 | global batch size: 256 | lm loss: 2.140296E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.426 | TFLOPs: 39.73 |
15: iteration 25060/ 125429 | consumed samples: 6415360 | consumed tokens: 13138657280 | elapsed time per iteration (s): 1.02 | learning rate: 1.842E-04 | global batch size: 256 | lm loss: 2.128647E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.546 | TFLOPs: 41.40 |
15: iteration 25070/ 125429 | consumed samples: 6417920 | consumed tokens: 13143900160 | elapsed time per iteration (s): 1.03 | learning rate: 1.842E-04 | global batch size: 256 | lm loss: 2.187430E+00 | grad norm: 6.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.244 | TFLOPs: 41.02 |
15: iteration 25080/ 125429 | consumed samples: 6420480 | consumed tokens: 13149143040 | elapsed time per iteration (s): 1.10 | learning rate: 1.841E-04 | global batch size: 256 | lm loss: 2.838750E+00 | grad norm: 0.773 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.694 | TFLOPs: 38.45 |
15: iteration 25090/ 125429 | consumed samples: 6423040 | consumed tokens: 13154385920 | elapsed time per iteration (s): 1.04 | learning rate: 1.841E-04 | global batch size: 256 | lm loss: 2.203713E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.992 | TFLOPs: 40.49 |
15: iteration 25100/ 125429 | consumed samples: 6425600 | consumed tokens: 13159628800 | elapsed time per iteration (s): 1.03 | learning rate: 1.841E-04 | global batch size: 256 | lm loss: 2.167751E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.719 | TFLOPs: 40.94 |
15: iteration 25110/ 125429 | consumed samples: 6428160 | consumed tokens: 13164871680 | elapsed time per iteration (s): 1.08 | learning rate: 1.841E-04 | global batch size: 256 | lm loss: 2.146320E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.403 | TFLOPs: 39.07 |
15: iteration 25120/ 125429 | consumed samples: 6430720 | consumed tokens: 13170114560 | elapsed time per iteration (s): 1.03 | learning rate: 1.841E-04 | global batch size: 256 | lm loss: 2.127820E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.647 | TFLOPs: 40.93 |
15: iteration 25130/ 125429 | consumed samples: 6433280 | consumed tokens: 13175357440 | elapsed time per iteration (s): 1.05 | learning rate: 1.841E-04 | global batch size: 256 | lm loss: 2.135852E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.672 | TFLOPs: 40.43 |
15: iteration 25140/ 125429 | consumed samples: 6435840 | consumed tokens: 13180600320 | elapsed time per iteration (s): 1.05 | learning rate: 1.841E-04 | global batch size: 256 | lm loss: 2.129211E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.653 | TFLOPs: 40.27 |
15: iteration 25150/ 125429 | consumed samples: 6438400 | consumed tokens: 13185843200 | elapsed time per iteration (s): 1.04 | learning rate: 1.840E-04 | global batch size: 256 | lm loss: 2.144076E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.373 | TFLOPs: 40.71 |
15: iteration 25160/ 125429 | consumed samples: 6440960 | consumed tokens: 13191086080 | elapsed time per iteration (s): 1.05 | learning rate: 1.840E-04 | global batch size: 256 | lm loss: 2.143922E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.600 | TFLOPs: 40.42 |
15: iteration 25170/ 125429 | consumed samples: 6443520 | consumed tokens: 13196328960 | elapsed time per iteration (s): 1.05 | learning rate: 1.840E-04 | global batch size: 256 | lm loss: 2.130585E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.732 | TFLOPs: 40.11 |
15: iteration 25180/ 125429 | consumed samples: 6446080 | consumed tokens: 13201571840 | elapsed time per iteration (s): 1.03 | learning rate: 1.840E-04 | global batch size: 256 | lm loss: 2.122349E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.947 | TFLOPs: 40.98 |
15: iteration 25190/ 125429 | consumed samples: 6448640 | consumed tokens: 13206814720 | elapsed time per iteration (s): 1.03 | learning rate: 1.840E-04 | global batch size: 256 | lm loss: 2.141952E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.903 | TFLOPs: 40.97 |
15: iteration 25200/ 125429 | consumed samples: 6451200 | consumed tokens: 13212057600 | elapsed time per iteration (s): 1.07 | learning rate: 1.840E-04 | global batch size: 256 | lm loss: 2.128514E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.666 | TFLOPs: 39.61 |
15: iteration 25210/ 125429 | consumed samples: 6453760 | consumed tokens: 13217300480 | elapsed time per iteration (s): 1.06 | learning rate: 1.840E-04 | global batch size: 256 | lm loss: 2.138694E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.747 | TFLOPs: 39.79 |
15: iteration 25220/ 125429 | consumed samples: 6456320 | consumed tokens: 13222543360 | elapsed time per iteration (s): 1.04 | learning rate: 1.840E-04 | global batch size: 256 | lm loss: 2.296928E+00 | grad norm: 1.057 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.619 | TFLOPs: 40.76 |
15: iteration 25230/ 125429 | consumed samples: 6458880 | consumed tokens: 13227786240 | elapsed time per iteration (s): 1.05 | learning rate: 1.839E-04 | global batch size: 256 | lm loss: 2.180644E+00 | grad norm: 0.200 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.644 | TFLOPs: 40.26 |
15: iteration 25240/ 125429 | consumed samples: 6461440 | consumed tokens: 13233029120 | elapsed time per iteration (s): 1.03 | learning rate: 1.839E-04 | global batch size: 256 | lm loss: 2.165024E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.850 | TFLOPs: 41.12 |
15: iteration 25250/ 125429 | consumed samples: 6464000 | consumed tokens: 13238272000 | elapsed time per iteration (s): 1.07 | learning rate: 1.839E-04 | global batch size: 256 | lm loss: 2.136854E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.937 | TFLOPs: 39.65 |
15: iteration 25260/ 125429 | consumed samples: 6466560 | consumed tokens: 13243514880 | elapsed time per iteration (s): 1.04 | learning rate: 1.839E-04 | global batch size: 256 | lm loss: 2.134925E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.121 | TFLOPs: 40.51 |
15: iteration 25270/ 125429 | consumed samples: 6469120 | consumed tokens: 13248757760 | elapsed time per iteration (s): 1.05 | learning rate: 1.839E-04 | global batch size: 256 | lm loss: 2.153051E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.972 | TFLOPs: 40.32 |
15: iteration 25280/ 125429 | consumed samples: 6471680 | consumed tokens: 13254000640 | elapsed time per iteration (s): 1.02 | learning rate: 1.839E-04 | global batch size: 256 | lm loss: 2.155932E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.949 | TFLOPs: 41.31 |
15: iteration 25290/ 125429 | consumed samples: 6474240 | consumed tokens: 13259243520 | elapsed time per iteration (s): 1.04 | learning rate: 1.839E-04 | global batch size: 256 | lm loss: 2.080091E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.657 | TFLOPs: 40.76 |
15: iteration 25300/ 125429 | consumed samples: 6476800 | consumed tokens: 13264486400 | elapsed time per iteration (s): 1.02 | learning rate: 1.839E-04 | global batch size: 256 | lm loss: 2.140214E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.929 | TFLOPs: 41.47 |
15: iteration 25310/ 125429 | consumed samples: 6479360 | consumed tokens: 13269729280 | elapsed time per iteration (s): 1.03 | learning rate: 1.838E-04 | global batch size: 256 | lm loss: 2.119681E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.091 | TFLOPs: 41.00 |
15: iteration 25320/ 125429 | consumed samples: 6481920 | consumed tokens: 13274972160 | elapsed time per iteration (s): 1.05 | learning rate: 1.838E-04 | global batch size: 256 | lm loss: 2.125146E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.284 | TFLOPs: 40.20 |
15: iteration 25330/ 125429 | consumed samples: 6484480 | consumed tokens: 13280215040 | elapsed time per iteration (s): 1.05 | learning rate: 1.838E-04 | global batch size: 256 | lm loss: 2.124413E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.963 | TFLOPs: 40.32 |
15: iteration 25340/ 125429 | consumed samples: 6487040 | consumed tokens: 13285457920 | elapsed time per iteration (s): 1.13 | learning rate: 1.838E-04 | global batch size: 256 | lm loss: 2.128864E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.014 | TFLOPs: 37.52 |
15: iteration 25350/ 125429 | consumed samples: 6489600 | consumed tokens: 13290700800 | elapsed time per iteration (s): 1.04 | learning rate: 1.838E-04 | global batch size: 256 | lm loss: 2.117952E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.993 | TFLOPs: 40.82 |
15: iteration 25360/ 125429 | consumed samples: 6492160 | consumed tokens: 13295943680 | elapsed time per iteration (s): 1.04 | learning rate: 1.838E-04 | global batch size: 256 | lm loss: 2.116300E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.176 | TFLOPs: 40.68 |
15: iteration 25370/ 125429 | consumed samples: 6494720 | consumed tokens: 13301186560 | elapsed time per iteration (s): 1.09 | learning rate: 1.838E-04 | global batch size: 256 | lm loss: 2.108427E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.747 | TFLOPs: 38.96 |
15: iteration 25380/ 125429 | consumed samples: 6497280 | consumed tokens: 13306429440 | elapsed time per iteration (s): 1.08 | learning rate: 1.837E-04 | global batch size: 256 | lm loss: 2.137561E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.592 | TFLOPs: 39.26 |
15: iteration 25390/ 125429 | consumed samples: 6499840 | consumed tokens: 13311672320 | elapsed time per iteration (s): 1.04 | learning rate: 1.837E-04 | global batch size: 256 | lm loss: 2.156029E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.744 | TFLOPs: 40.61 |
15: iteration 25400/ 125429 | consumed samples: 6502400 | consumed tokens: 13316915200 | elapsed time per iteration (s): 1.04 | learning rate: 1.837E-04 | global batch size: 256 | lm loss: 2.133992E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.217 | TFLOPs: 40.52 |
15: iteration 25410/ 125429 | consumed samples: 6504960 | consumed tokens: 13322158080 | elapsed time per iteration (s): 1.07 | learning rate: 1.837E-04 | global batch size: 256 | lm loss: 2.095297E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.942 | TFLOPs: 39.49 |
15: iteration 25420/ 125429 | consumed samples: 6507520 | consumed tokens: 13327400960 | elapsed time per iteration (s): 1.03 | learning rate: 1.837E-04 | global batch size: 256 | lm loss: 2.105207E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.647 | TFLOPs: 40.93 |
15: iteration 25430/ 125429 | consumed samples: 6510080 | consumed tokens: 13332643840 | elapsed time per iteration (s): 1.07 | learning rate: 1.837E-04 | global batch size: 256 | lm loss: 2.141641E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.536 | TFLOPs: 39.59 |
15: iteration 25440/ 125429 | consumed samples: 6512640 | consumed tokens: 13337886720 | elapsed time per iteration (s): 1.05 | learning rate: 1.837E-04 | global batch size: 256 | lm loss: 2.110751E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per
second: 243.164 | TFLOPs: 40.18 | 15: iteration 25450/ 125429 | consumed samples: 6515200 | consumed tokens: 13343129600 | elapsed time per iteration (s): 1.05 | learning rate: 1.837E-04 | global batch size: 256 | lm loss: 2.116617E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.638 | TFLOPs: 40.26 | 15: iteration 25460/ 125429 | consumed samples: 6517760 | consumed tokens: 13348372480 | elapsed time per iteration (s): 1.03 | learning rate: 1.836E-04 | global batch size: 256 | lm loss: 2.142514E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.496 | TFLOPs: 41.07 | 15: iteration 25470/ 125429 | consumed samples: 6520320 | consumed tokens: 13353615360 | elapsed time per iteration (s): 1.07 | learning rate: 1.836E-04 | global batch size: 256 | lm loss: 2.129232E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.399 | TFLOPs: 39.56 | 15: iteration 25480/ 125429 | consumed samples: 6522880 | consumed tokens: 13358858240 | elapsed time per iteration (s): 1.07 | learning rate: 1.836E-04 | global batch size: 256 | lm loss: 2.127795E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.679 | TFLOPs: 39.44 | 15: iteration 25490/ 125429 | consumed samples: 6525440 | consumed tokens: 13364101120 | elapsed time per iteration (s): 1.03 | learning rate: 1.836E-04 | global batch size: 256 | lm loss: 2.097875E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.641 | TFLOPs: 41.09 | 15: iteration 25500/ 125429 | consumed samples: 6528000 | consumed tokens: 13369344000 | elapsed time per iteration (s): 1.04 | learning rate: 1.836E-04 | global batch size: 256 | lm loss: 2.120527E+00 | grad norm: 
0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.675 | TFLOPs: 40.77 | 15: iteration 25510/ 125429 | consumed samples: 6530560 | consumed tokens: 13374586880 | elapsed time per iteration (s): 1.04 | learning rate: 1.836E-04 | global batch size: 256 | lm loss: 2.090278E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.299 | TFLOPs: 40.87 | 15: iteration 25520/ 125429 | consumed samples: 6533120 | consumed tokens: 13379829760 | elapsed time per iteration (s): 1.05 | learning rate: 1.836E-04 | global batch size: 256 | lm loss: 2.108945E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.181 | TFLOPs: 40.35 | 15: iteration 25530/ 125429 | consumed samples: 6535680 | consumed tokens: 13385072640 | elapsed time per iteration (s): 1.03 | learning rate: 1.836E-04 | global batch size: 256 | lm loss: 2.115510E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.212 | TFLOPs: 41.02 | 15: iteration 25540/ 125429 | consumed samples: 6538240 | consumed tokens: 13390315520 | elapsed time per iteration (s): 1.05 | learning rate: 1.835E-04 | global batch size: 256 | lm loss: 2.113953E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.085 | TFLOPs: 40.34 | 15: iteration 25550/ 125429 | consumed samples: 6540800 | consumed tokens: 13395558400 | elapsed time per iteration (s): 1.03 | learning rate: 1.835E-04 | global batch size: 256 | lm loss: 2.114088E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.114 | TFLOPs: 41.00 | 15: iteration 25560/ 125429 | consumed samples: 6543360 | consumed tokens: 13400801280 | elapsed time per 
iteration (s): 1.04 | learning rate: 1.835E-04 | global batch size: 256 | lm loss: 2.126593E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.924 | TFLOPs: 40.81 | 15: iteration 25570/ 125429 | consumed samples: 6545920 | consumed tokens: 13406044160 | elapsed time per iteration (s): 1.05 | learning rate: 1.835E-04 | global batch size: 256 | lm loss: 2.110173E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.719 | TFLOPs: 40.44 | 15: iteration 25580/ 125429 | consumed samples: 6548480 | consumed tokens: 13411287040 | elapsed time per iteration (s): 1.05 | learning rate: 1.835E-04 | global batch size: 256 | lm loss: 2.095975E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.902 | TFLOPs: 40.47 | 15: iteration 25590/ 125429 | consumed samples: 6551040 | consumed tokens: 13416529920 | elapsed time per iteration (s): 1.05 | learning rate: 1.835E-04 | global batch size: 256 | lm loss: 2.103199E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.953 | TFLOPs: 40.15 | 15: iteration 25600/ 125429 | consumed samples: 6553600 | consumed tokens: 13421772800 | elapsed time per iteration (s): 1.05 | learning rate: 1.835E-04 | global batch size: 256 | lm loss: 2.104113E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.263 | TFLOPs: 40.37 | 15: iteration 25610/ 125429 | consumed samples: 6556160 | consumed tokens: 13427015680 | elapsed time per iteration (s): 1.06 | learning rate: 1.834E-04 | global batch size: 256 | lm loss: 2.118715E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.388 | TFLOPs: 39.73 | 15: 
iteration 25620/ 125429 | consumed samples: 6558720 | consumed tokens: 13432258560 | elapsed time per iteration (s): 1.07 | learning rate: 1.834E-04 | global batch size: 256 | lm loss: 2.143866E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.628 | TFLOPs: 39.60 | 15: iteration 25630/ 125429 | consumed samples: 6561280 | consumed tokens: 13437501440 | elapsed time per iteration (s): 1.03 | learning rate: 1.834E-04 | global batch size: 256 | lm loss: 2.138225E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.123 | TFLOPs: 41.17 | 15: iteration 25640/ 125429 | consumed samples: 6563840 | consumed tokens: 13442744320 | elapsed time per iteration (s): 1.05 | learning rate: 1.834E-04 | global batch size: 256 | lm loss: 2.119616E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.071 | TFLOPs: 40.33 | 15: iteration 25650/ 125429 | consumed samples: 6566400 | consumed tokens: 13447987200 | elapsed time per iteration (s): 1.03 | learning rate: 1.834E-04 | global batch size: 256 | lm loss: 2.134989E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.000 | TFLOPs: 41.15 | 15: iteration 25660/ 125429 | consumed samples: 6568960 | consumed tokens: 13453230080 | elapsed time per iteration (s): 1.04 | learning rate: 1.834E-04 | global batch size: 256 | lm loss: 2.135222E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.846 | TFLOPs: 40.63 | 15: iteration 25670/ 125429 | consumed samples: 6571520 | consumed tokens: 13458472960 | elapsed time per iteration (s): 1.04 | learning rate: 1.834E-04 | global batch size: 256 | lm loss: 2.126006E+00 | grad norm: 0.136 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.513 | TFLOPs: 40.74 | 15: iteration 25680/ 125429 | consumed samples: 6574080 | consumed tokens: 13463715840 | elapsed time per iteration (s): 1.06 | learning rate: 1.834E-04 | global batch size: 256 | lm loss: 2.109348E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.982 | TFLOPs: 39.99 | 15: iteration 25690/ 125429 | consumed samples: 6576640 | consumed tokens: 13468958720 | elapsed time per iteration (s): 1.02 | learning rate: 1.833E-04 | global batch size: 256 | lm loss: 2.119382E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.893 | TFLOPs: 41.30 | 15: iteration 25700/ 125429 | consumed samples: 6579200 | consumed tokens: 13474201600 | elapsed time per iteration (s): 1.07 | learning rate: 1.833E-04 | global batch size: 256 | lm loss: 2.136138E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.529 | TFLOPs: 39.42 | 15: iteration 25710/ 125429 | consumed samples: 6581760 | consumed tokens: 13479444480 | elapsed time per iteration (s): 1.05 | learning rate: 1.833E-04 | global batch size: 256 | lm loss: 2.093846E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.846 | TFLOPs: 40.30 | 15: iteration 25720/ 125429 | consumed samples: 6584320 | consumed tokens: 13484687360 | elapsed time per iteration (s): 1.05 | learning rate: 1.833E-04 | global batch size: 256 | lm loss: 2.117026E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.608 | TFLOPs: 40.42 | 15: iteration 25730/ 125429 | consumed samples: 6586880 | consumed tokens: 13489930240 | elapsed time per iteration (s): 1.03 | learning rate: 
1.833E-04 | global batch size: 256 | lm loss: 2.114793E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.209 | TFLOPs: 41.02 | 15: iteration 25740/ 125429 | consumed samples: 6589440 | consumed tokens: 13495173120 | elapsed time per iteration (s): 1.04 | learning rate: 1.833E-04 | global batch size: 256 | lm loss: 2.119028E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.010 | TFLOPs: 40.82 | 15: iteration 25750/ 125429 | consumed samples: 6592000 | consumed tokens: 13500416000 | elapsed time per iteration (s): 1.03 | learning rate: 1.833E-04 | global batch size: 256 | lm loss: 2.121656E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.823 | TFLOPs: 40.95 | 15: iteration 25760/ 125429 | consumed samples: 6594560 | consumed tokens: 13505658880 | elapsed time per iteration (s): 1.08 | learning rate: 1.832E-04 | global batch size: 256 | lm loss: 2.146324E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.768 | TFLOPs: 39.13 | 15: iteration 25770/ 125429 | consumed samples: 6597120 | consumed tokens: 13510901760 | elapsed time per iteration (s): 1.04 | learning rate: 1.832E-04 | global batch size: 256 | lm loss: 2.120406E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.069 | TFLOPs: 40.66 | 15: iteration 25780/ 125429 | consumed samples: 6599680 | consumed tokens: 13516144640 | elapsed time per iteration (s): 1.05 | learning rate: 1.832E-04 | global batch size: 256 | lm loss: 2.107700E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.687 | TFLOPs: 40.27 | 15: iteration 25790/ 125429 | consumed 
samples: 6602240 | consumed tokens: 13521387520 | elapsed time per iteration (s): 1.03 | learning rate: 1.832E-04 | global batch size: 256 | lm loss: 2.138597E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.466 | TFLOPs: 40.90 | 15: iteration 25800/ 125429 | consumed samples: 6604800 | consumed tokens: 13526630400 | elapsed time per iteration (s): 1.03 | learning rate: 1.832E-04 | global batch size: 256 | lm loss: 2.114920E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.366 | TFLOPs: 41.04 | 15: iteration 25810/ 125429 | consumed samples: 6607360 | consumed tokens: 13531873280 | elapsed time per iteration (s): 1.08 | learning rate: 1.832E-04 | global batch size: 256 | lm loss: 2.137096E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.827 | TFLOPs: 39.30 | 15: iteration 25820/ 125429 | consumed samples: 6609920 | consumed tokens: 13537116160 | elapsed time per iteration (s): 1.07 | learning rate: 1.832E-04 | global batch size: 256 | lm loss: 2.133250E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.218 | TFLOPs: 39.53 | 15: iteration 25830/ 125429 | consumed samples: 6612480 | consumed tokens: 13542359040 | elapsed time per iteration (s): 1.04 | learning rate: 1.832E-04 | global batch size: 256 | lm loss: 2.119539E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.212 | TFLOPs: 40.69 | 15: iteration 25840/ 125429 | consumed samples: 6615040 | consumed tokens: 13547601920 | elapsed time per iteration (s): 1.04 | learning rate: 1.831E-04 | global batch size: 256 | lm loss: 2.115781E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 246.498 | TFLOPs: 40.74 | 15: iteration 25850/ 125429 | consumed samples: 6617600 | consumed tokens: 13552844800 | elapsed time per iteration (s): 1.05 | learning rate: 1.831E-04 | global batch size: 256 | lm loss: 2.135161E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.461 | TFLOPs: 40.23 | 15: iteration 25860/ 125429 | consumed samples: 6620160 | consumed tokens: 13558087680 | elapsed time per iteration (s): 1.03 | learning rate: 1.831E-04 | global batch size: 256 | lm loss: 2.129449E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.391 | TFLOPs: 41.05 | 15: iteration 25870/ 125429 | consumed samples: 6622720 | consumed tokens: 13563330560 | elapsed time per iteration (s): 1.05 | learning rate: 1.831E-04 | global batch size: 256 | lm loss: 2.117177E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.431 | TFLOPs: 40.23 | 15: iteration 25880/ 125429 | consumed samples: 6625280 | consumed tokens: 13568573440 | elapsed time per iteration (s): 3.35 | learning rate: 1.831E-04 | global batch size: 256 | lm loss: 2.094686E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 76.371 | TFLOPs: 12.62 | 15: iteration 25890/ 125429 | consumed samples: 6627840 | consumed tokens: 13573816320 | elapsed time per iteration (s): 1.02 | learning rate: 1.831E-04 | global batch size: 256 | lm loss: 2.101286E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.544 | TFLOPs: 41.40 | 15: iteration 25900/ 125429 | consumed samples: 6630400 | consumed tokens: 13579059200 | elapsed time per iteration (s): 1.05 | learning rate: 1.831E-04 | global batch size: 256 | lm loss: 
2.126290E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.879 | TFLOPs: 40.14 | 15: iteration 25910/ 125429 | consumed samples: 6632960 | consumed tokens: 13584302080 | elapsed time per iteration (s): 1.05 | learning rate: 1.831E-04 | global batch size: 256 | lm loss: 2.123317E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.004 | TFLOPs: 40.16 | 15: iteration 25920/ 125429 | consumed samples: 6635520 | consumed tokens: 13589544960 | elapsed time per iteration (s): 1.05 | learning rate: 1.830E-04 | global batch size: 256 | lm loss: 2.109417E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.565 | TFLOPs: 40.25 | 15: iteration 25930/ 125429 | consumed samples: 6638080 | consumed tokens: 13594787840 | elapsed time per iteration (s): 1.09 | learning rate: 1.830E-04 | global batch size: 256 | lm loss: 2.124242E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.055 | TFLOPs: 38.68 | 15: iteration 25940/ 125429 | consumed samples: 6640640 | consumed tokens: 13600030720 | elapsed time per iteration (s): 1.06 | learning rate: 1.830E-04 | global batch size: 256 | lm loss: 2.105964E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.193 | TFLOPs: 40.02 | 15: iteration 25950/ 125429 | consumed samples: 6643200 | consumed tokens: 13605273600 | elapsed time per iteration (s): 1.03 | learning rate: 1.830E-04 | global batch size: 256 | lm loss: 2.125187E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.817 | TFLOPs: 41.12 | 15: iteration 25960/ 125429 | consumed samples: 6645760 | consumed tokens: 13610516480 | 
elapsed time per iteration (s): 1.06 | learning rate: 1.830E-04 | global batch size: 256 | lm loss: 2.110481E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.091 | TFLOPs: 40.01 | 15: iteration 25970/ 125429 | consumed samples: 6648320 | consumed tokens: 13615759360 | elapsed time per iteration (s): 1.03 | learning rate: 1.830E-04 | global batch size: 256 | lm loss: 2.127853E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.049 | TFLOPs: 41.16 | 15: iteration 25980/ 125429 | consumed samples: 6650880 | consumed tokens: 13621002240 | elapsed time per iteration (s): 1.09 | learning rate: 1.830E-04 | global batch size: 256 | lm loss: 2.098660E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.794 | TFLOPs: 38.97 | 15: iteration 25990/ 125429 | consumed samples: 6653440 | consumed tokens: 13626245120 | elapsed time per iteration (s): 1.06 | learning rate: 1.829E-04 | global batch size: 256 | lm loss: 2.129873E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.515 | TFLOPs: 39.91 | 0: [2022-11-26 03:34:14,809] [INFO] [logging.py:68:log_dist] [Rank 0] step=26000, skipped=0, lr=[0.00018293078433159502, 0.00018293078433159502, 0.00018293078433159502], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 26000/ 125429 | consumed samples: 6656000 | consumed tokens: 13631488000 | elapsed time per iteration (s): 1.10 | learning rate: 1.829E-04 | global batch size: 256 | lm loss: 2.136197E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.529 | TFLOPs: 38.59 | 0: steps: 26000 loss: 2.1722 iter time (s): 1.065 samples/sec: 240.419 15: 
------------------------------------------------------------------------------------------- 15: valid loss at iteration 26000 | lm loss value: 2.039505E+00 | lm loss PPL: 7.686801E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 26000 to checkpoints_1b5 0: [2022-11-26 03:34:15,261] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step26000 is begin to save! 0: [2022-11-26 03:34:15,268] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_01-model_00-model_states.pt... 0: [2022-11-26 03:34:15,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_01-model_00-model_states.pt. 0: [2022-11-26 03:34:15,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_03-model_00-model_states.pt... 0: [2022-11-26 03:34:15,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_03-model_00-model_states.pt. 0: [2022-11-26 03:34:15,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_04-model_00-model_states.pt... 0: [2022-11-26 03:34:15,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_04-model_00-model_states.pt. 0: [2022-11-26 03:34:15,776] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_05-model_00-model_states.pt... 0: [2022-11-26 03:34:15,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_05-model_00-model_states.pt. 0: [2022-11-26 03:34:15,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_06-model_00-model_states.pt... 
0: [2022-11-26 03:34:15,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_06-model_00-model_states.pt. 0: [2022-11-26 03:34:15,990] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_07-model_00-model_states.pt... 0: [2022-11-26 03:34:16,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_07-model_00-model_states.pt. 0: [2022-11-26 03:34:16,093] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_08-model_00-model_states.pt... 0: [2022-11-26 03:34:16,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_08-model_00-model_states.pt. 0: [2022-11-26 03:34:16,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_09-model_00-model_states.pt... 0: [2022-11-26 03:34:16,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_09-model_00-model_states.pt. 0: [2022-11-26 03:34:16,317] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_10-model_00-model_states.pt... 0: [2022-11-26 03:34:16,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_10-model_00-model_states.pt. 0: [2022-11-26 03:34:16,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_11-model_00-model_states.pt... 0: [2022-11-26 03:34:16,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_11-model_00-model_states.pt. 0: [2022-11-26 03:34:16,526] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_12-model_00-model_states.pt... 
0: [2022-11-26 03:34:16,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_12-model_00-model_states.pt. 0: [2022-11-26 03:34:16,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_13-model_00-model_states.pt... 0: [2022-11-26 03:34:16,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_13-model_00-model_states.pt. 0: [2022-11-26 03:34:16,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_14-model_00-model_states.pt... 0: [2022-11-26 03:34:16,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_14-model_00-model_states.pt. 0: [2022-11-26 03:34:16,847] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_15-model_00-model_states.pt... 0: [2022-11-26 03:34:16,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_15-model_00-model_states.pt. 0: [2022-11-26 03:34:16,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_16-model_00-model_states.pt... 0: [2022-11-26 03:34:17,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_16-model_00-model_states.pt. 0: [2022-11-26 03:34:17,060] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_17-model_00-model_states.pt... 0: [2022-11-26 03:34:17,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_17-model_00-model_states.pt. 0: [2022-11-26 03:34:17,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_18-model_00-model_states.pt... 
0: [2022-11-26 03:34:17,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_18-model_00-model_states.pt. 0: [2022-11-26 03:34:17,271] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_19-model_00-model_states.pt... 0: [2022-11-26 03:34:17,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_19-model_00-model_states.pt. 0: [2022-11-26 03:34:17,383] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_20-model_00-model_states.pt... 0: [2022-11-26 03:34:17,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_20-model_00-model_states.pt. 0: [2022-11-26 03:34:17,495] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_21-model_00-model_states.pt... 0: [2022-11-26 03:34:17,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_21-model_00-model_states.pt. 0: [2022-11-26 03:34:17,606] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_22-model_00-model_states.pt... 0: [2022-11-26 03:34:17,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_22-model_00-model_states.pt. 0: [2022-11-26 03:34:17,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_23-model_00-model_states.pt... 0: [2022-11-26 03:34:17,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_23-model_00-model_states.pt. 0: [2022-11-26 03:34:17,833] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_24-model_00-model_states.pt... 
0: [2022-11-26 03:34:17,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_24-model_00-model_states.pt. 0: [2022-11-26 03:34:17,951] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_25-model_00-model_states.pt... 0: [2022-11-26 03:34:18,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_25-model_00-model_states.pt. 0: [2022-11-26 03:34:18,076] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_26-model_00-model_states.pt... 0: [2022-11-26 03:34:18,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_26-model_00-model_states.pt. 0: [2022-11-26 03:34:18,188] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_27-model_00-model_states.pt... 0: [2022-11-26 03:34:18,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_27-model_00-model_states.pt. 0: [2022-11-26 03:34:18,298] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_28-model_00-model_states.pt... 0: [2022-11-26 03:34:18,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_28-model_00-model_states.pt. 0: [2022-11-26 03:34:18,411] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_29-model_00-model_states.pt... 0: [2022-11-26 03:34:18,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_29-model_00-model_states.pt. 0: [2022-11-26 03:34:18,523] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_30-model_00-model_states.pt... 
0: [2022-11-26 03:34:18,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_30-model_00-model_states.pt. 0: [2022-11-26 03:34:18,632] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/layer_32-model_00-model_states.pt... 0: [2022-11-26 03:34:18,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/layer_32-model_00-model_states.pt. 0: [2022-11-26 03:34:18,639] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step26000/mp_rank_00_model_states.pt 0: [2022-11-26 03:34:18,639] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/mp_rank_00_model_states.pt... 0: [2022-11-26 03:34:18,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/mp_rank_00_model_states.pt. 0: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
12: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
8: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
10: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
14: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
6: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
11: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
15: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
13: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
3: [2022-11-26 03:34:18,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step26000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:34:18,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:34:18,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 03:34:18,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 5: [2022-11-26 03:34:18,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:34:18,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 03:34:18,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-26 03:34:18,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:34:18,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 03:34:18,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-26 03:34:18,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 03:34:18,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 03:34:18,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 15: [2022-11-26 03:34:18,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:34:18,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:34:18,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:34:18,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 03:34:18,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 03:34:18,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 4: [2022-11-26 03:34:18,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 4: [2022-11-26 03:34:18,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:34:18,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 3: [2022-11-26 03:34:18,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 03:34:18,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 03:34:18,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 4: [2022-11-26 03:34:18,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-26 03:34:18,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:34:18,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 03:34:18,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-26 03:34:18,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:34:18,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 03:34:18,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-26 03:34:18,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:34:18,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 03:34:18,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
14: [2022-11-26 03:34:18,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:34:18,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 03:34:18,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 14: [2022-11-26 03:34:18,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:34:18,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 03:34:18,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 14: [2022-11-26 03:34:18,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:34:18,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 03:34:18,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-26 03:34:18,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:34:18,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 03:34:18,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 03:34:18,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-26 03:34:18,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:34:18,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 03:34:18,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-26 03:34:18,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:34:18,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 03:34:18,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-26 03:34:18,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:34:18,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 03:34:18,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 11: [2022-11-26 03:34:18,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 03:34:18,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 03:34:18,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-26 03:34:18,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:34:18,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 03:34:18,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-26 03:34:18,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:34:18,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:34:18,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 03:34:18,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 03:34:18,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-26 03:34:18,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-26 03:34:18,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 03:34:18,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 03:34:18,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-26 03:34:18,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:34:18,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 03:34:18,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-26 03:34:18,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:34:18,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 03:34:18,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-26 03:34:18,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:34:18,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 03:34:18,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-26 03:34:18,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 03:34:18,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 03:34:18,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-26 03:34:18,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:34:18,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 03:34:18,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-26 03:34:18,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:34:18,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 03:34:18,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-26 03:34:18,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:34:18,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 03:34:18,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-26 03:34:18,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 03:34:18,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 03:34:18,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-26 03:34:18,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:34:18,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:34:18,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:34:18,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:34:18,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 03:34:18,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 03:34:18,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-26 03:34:18,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
3: [2022-11-26 03:34:18,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 03:34:18,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 03:34:18,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-26 03:34:18,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-26 03:34:18,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:34:18,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:34:18,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 03:34:18,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 03:34:18,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-26 03:34:18,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-26 03:34:18,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 03:34:18,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 03:34:18,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-26 03:34:18,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:34:18,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 03:34:18,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 13: [2022-11-26 03:34:18,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:34:18,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 03:34:18,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 13: [2022-11-26 03:34:18,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:34:18,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 03:34:18,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 13: [2022-11-26 03:34:18,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 03:34:18,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 15: [2022-11-26 03:34:18,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 13: [2022-11-26 03:34:18,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 15: [2022-11-26 03:34:18,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 15: [2022-11-26 03:34:18,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:34:18,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 03:34:18,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 15: [2022-11-26 03:34:18,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:34:18,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 03:34:18,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 15: [2022-11-26 03:34:18,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 03:34:18,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 03:34:18,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 15: [2022-11-26 03:34:18,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:34:18,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 03:34:18,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 15: [2022-11-26 03:34:18,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:34:18,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 03:34:18,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 15: [2022-11-26 03:34:18,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:34:18,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:34:18,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 03:34:18,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 03:34:18,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 03:34:18,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:34:18,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:34:18,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 5: [2022-11-26 03:34:18,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 5: [2022-11-26 03:34:18,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 03:34:18,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 5: [2022-11-26 03:34:18,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 03:34:18,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-26 03:34:18,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 03:34:18,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 03:34:18,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-26 03:34:18,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:34:18,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 03:34:18,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 5: [2022-11-26 03:34:18,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:34:18,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:34:18,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:34:18,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 03:34:18,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 03:34:18,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 03:34:18,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
5: [2022-11-26 03:34:18,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 5: [2022-11-26 03:34:18,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-26 03:34:18,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:34:18,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 03:34:18,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: [2022-11-26 03:34:18,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:34:18,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 14: [2022-11-26 03:34:18,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:34:18,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 0: [2022-11-26 03:34:18,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 14: [2022-11-26 03:34:18,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 7: [2022-11-26 03:34:18,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 03:34:18,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 03:34:18,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 14: [2022-11-26 03:34:18,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:34:18,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 03:34:18,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 3: [2022-11-26 03:34:18,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:34:18,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 03:34:18,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 1: [2022-11-26 03:34:18,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:34:18,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 03:34:18,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-26 03:34:18,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 03:34:18,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 03:34:18,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 11: [2022-11-26 03:34:18,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:34:18,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:34:18,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:34:18,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 03:34:18,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 03:34:18,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 9: [2022-11-26 03:34:18,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 9: [2022-11-26 03:34:18,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:34:18,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 03:34:18,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
1: [2022-11-26 03:34:18,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:34:18,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 03:34:18,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-26 03:34:18,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:34:18,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:34:18,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 9: [2022-11-26 03:34:18,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 2: [2022-11-26 03:34:18,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 9: [2022-11-26 03:34:18,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 9: [2022-11-26 03:34:18,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:34:18,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 03:34:18,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
13: [2022-11-26 03:34:18,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:34:18,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:34:18,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 03:34:18,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 13: [2022-11-26 03:34:18,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 03:34:18,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 13: [2022-11-26 03:34:18,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:34:18,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 03:34:18,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 13: [2022-11-26 03:34:18,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:34:18,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 03:34:18,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
13: [2022-11-26 03:34:18,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:34:18,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 03:34:18,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 14: [2022-11-26 03:34:18,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:34:18,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 03:34:18,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:34:18,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 14: [2022-11-26 03:34:18,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 03:34:18,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 14: [2022-11-26 03:34:18,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:34:18,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 03:34:18,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
11: [2022-11-26 03:34:18,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 03:34:18,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 11: [2022-11-26 03:34:18,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:34:18,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 03:34:18,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 11: [2022-11-26 03:34:18,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:34:18,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 03:34:18,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-26 03:34:18,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:34:18,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 03:34:18,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 13: [2022-11-26 03:34:18,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 03:34:18,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 03:34:18,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 11: [2022-11-26 03:34:18,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:34:18,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:34:18,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 03:34:18,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 03:34:18,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 11: [2022-11-26 03:34:18,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 11: [2022-11-26 03:34:18,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:34:18,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 03:34:18,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 11: [2022-11-26 03:34:18,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 03:34:18,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 03:34:18,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-26 03:34:18,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:34:18,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 03:34:18,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-26 03:34:18,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:34:18,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 03:34:18,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 2: [2022-11-26 03:34:18,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:34:18,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 03:34:18,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 12: [2022-11-26 03:34:18,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 03:34:18,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:34:18,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:34:18,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:34:18,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:34:18,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:34:18,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:34:18,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 03:34:18,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 03:34:18,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 03:34:18,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 03:34:18,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 03:34:18,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 03:34:18,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 03:34:18,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 12: [2022-11-26 03:34:18,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 12: [2022-11-26 03:34:18,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 12: [2022-11-26 03:34:18,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 03:34:18,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 03:34:18,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
12: [2022-11-26 03:34:18,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 12: [2022-11-26 03:34:18,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 12: [2022-11-26 03:34:18,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 12: [2022-11-26 03:34:18,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 4: [2022-11-26 03:34:18,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:34:18,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:34:18,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 03:34:18,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 03:34:18,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 4: [2022-11-26 03:34:18,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 4: [2022-11-26 03:34:18,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:34:18,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 03:34:18,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
4: [2022-11-26 03:34:18,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:34:18,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 03:34:18,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 4: [2022-11-26 03:34:18,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:34:18,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 03:34:18,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 15: [2022-11-26 03:34:18,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 03:34:18,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:34:18,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 15: [2022-11-26 03:34:18,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 03:34:18,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 6: [2022-11-26 03:34:18,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 03:34:18,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 03:34:18,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 9: [2022-11-26 03:34:18,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:34:18,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 03:34:18,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 9: [2022-11-26 03:34:18,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:34:18,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 03:34:18,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 9: [2022-11-26 03:34:18,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:34:18,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 03:34:18,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 10: [2022-11-26 03:34:18,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 03:34:18,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:34:18,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 03:34:18,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 03:34:18,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 10: [2022-11-26 03:34:18,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 10: [2022-11-26 03:34:18,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:34:18,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 03:34:18,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 10: [2022-11-26 03:34:18,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:34:18,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 03:34:18,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 10: [2022-11-26 03:34:18,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 03:34:18,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:34:18,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:34:18,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:34:18,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 03:34:18,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 03:34:18,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 03:34:18,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 03:34:18,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 10: [2022-11-26 03:34:18,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 10: [2022-11-26 03:34:18,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 10: [2022-11-26 03:34:18,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
0: [2022-11-26 03:34:19,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 03:34:19,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 8: [2022-11-26 03:34:19,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:34:19,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:34:19,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:34:19,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:34:19,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 03:34:19,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
8: [2022-11-26 03:34:19,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 03:34:19,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 03:34:19,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 03:34:19,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:34:19,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 8: [2022-11-26 03:34:19,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 8: [2022-11-26 03:34:19,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 8: [2022-11-26 03:34:19,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 03:34:19,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 0: successfully saved checkpoint at iteration 26000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3828.16 8: [2022-11-26 03:34:19,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:34:19,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 03:34:19,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 03:34:19,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:34:19,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 03:34:19,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 8: [2022-11-26 03:34:19,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step26000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 03:34:19,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 8: [2022-11-26 03:34:19,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step26000 is ready now! 
15: iteration 26010/ 125429 | consumed samples: 6658560 | consumed tokens: 13636730880 | elapsed time per iteration (s): 1.43 | learning rate: 1.829E-04 | global batch size: 256 | lm loss: 2.100821E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 178.477 | TFLOPs: 29.49 | 15: iteration 26020/ 125429 | consumed samples: 6661120 | consumed tokens: 13641973760 | elapsed time per iteration (s): 1.09 | learning rate: 1.829E-04 | global batch size: 256 | lm loss: 2.118583E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.433 | TFLOPs: 38.74 | 15: iteration 26030/ 125429 | consumed samples: 6663680 | consumed tokens: 13647216640 | elapsed time per iteration (s): 1.05 | learning rate: 1.829E-04 | global batch size: 256 | lm loss: 2.143570E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.802 | TFLOPs: 40.46 | 15: iteration 26040/ 125429 | consumed samples: 6666240 | consumed tokens: 13652459520 | elapsed time per iteration (s): 1.06 | learning rate: 1.829E-04 | global batch size: 256 | lm loss: 2.102504E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.161 | TFLOPs: 40.02 | 15: iteration 26050/ 125429 | consumed samples: 6668800 | consumed tokens: 13657702400 | elapsed time per iteration (s): 1.06 | learning rate: 1.829E-04 | global batch size: 256 | lm loss: 2.115682E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.308 | TFLOPs: 39.88 | 15: iteration 26060/ 125429 | consumed samples: 6671360 | consumed tokens: 13662945280 | elapsed time per iteration (s): 1.05 | learning rate: 1.829E-04 | global batch size: 256 | lm loss: 2.145904E+00 | grad norm: 0.132 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.725 | TFLOPs: 40.28 | 15: iteration 26070/ 125429 | consumed samples: 6673920 | consumed tokens: 13668188160 | elapsed time per iteration (s): 1.08 | learning rate: 1.828E-04 | global batch size: 256 | lm loss: 2.108965E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.474 | TFLOPs: 39.08 | 15: iteration 26080/ 125429 | consumed samples: 6676480 | consumed tokens: 13673431040 | elapsed time per iteration (s): 1.06 | learning rate: 1.828E-04 | global batch size: 256 | lm loss: 2.117371E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.826 | TFLOPs: 39.96 | 15: iteration 26090/ 125429 | consumed samples: 6679040 | consumed tokens: 13678673920 | elapsed time per iteration (s): 1.09 | learning rate: 1.828E-04 | global batch size: 256 | lm loss: 2.097567E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.531 | TFLOPs: 38.76 | 15: iteration 26100/ 125429 | consumed samples: 6681600 | consumed tokens: 13683916800 | elapsed time per iteration (s): 1.03 | learning rate: 1.828E-04 | global batch size: 256 | lm loss: 2.121210E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.909 | TFLOPs: 40.97 | 15: iteration 26110/ 125429 | consumed samples: 6684160 | consumed tokens: 13689159680 | elapsed time per iteration (s): 1.04 | learning rate: 1.828E-04 | global batch size: 256 | lm loss: 2.122779E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.050 | TFLOPs: 40.50 | 15: iteration 26120/ 125429 | consumed samples: 6686720 | consumed tokens: 13694402560 | elapsed time per iteration (s): 1.04 | learning rate: 
1.828E-04 | global batch size: 256 | lm loss: 2.110279E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.671 | TFLOPs: 40.76 | 15: iteration 26130/ 125429 | consumed samples: 6689280 | consumed tokens: 13699645440 | elapsed time per iteration (s): 1.03 | learning rate: 1.828E-04 | global batch size: 256 | lm loss: 2.109310E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.913 | TFLOPs: 41.13 | 15: iteration 26140/ 125429 | consumed samples: 6691840 | consumed tokens: 13704888320 | elapsed time per iteration (s): 1.05 | learning rate: 1.827E-04 | global batch size: 256 | lm loss: 2.106789E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.112 | TFLOPs: 40.34 | 15: iteration 26150/ 125429 | consumed samples: 6694400 | consumed tokens: 13710131200 | elapsed time per iteration (s): 1.05 | learning rate: 1.827E-04 | global batch size: 256 | lm loss: 2.099691E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.796 | TFLOPs: 40.45 | 15: iteration 26160/ 125429 | consumed samples: 6696960 | consumed tokens: 13715374080 | elapsed time per iteration (s): 1.04 | learning rate: 1.827E-04 | global batch size: 256 | lm loss: 2.079520E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.172 | TFLOPs: 40.85 | 15: iteration 26170/ 125429 | consumed samples: 6699520 | consumed tokens: 13720616960 | elapsed time per iteration (s): 1.07 | learning rate: 1.827E-04 | global batch size: 256 | lm loss: 2.112425E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.239 | TFLOPs: 39.54 | 15: iteration 26180/ 125429 | consumed 
samples: 6702080 | consumed tokens: 13725859840 | elapsed time per iteration (s): 1.05 | learning rate: 1.827E-04 | global batch size: 256 | lm loss: 2.109650E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.467 | TFLOPs: 40.23 | 15: iteration 26190/ 125429 | consumed samples: 6704640 | consumed tokens: 13731102720 | elapsed time per iteration (s): 1.05 | learning rate: 1.827E-04 | global batch size: 256 | lm loss: 2.113158E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.742 | TFLOPs: 40.45 | 15: iteration 26200/ 125429 | consumed samples: 6707200 | consumed tokens: 13736345600 | elapsed time per iteration (s): 1.04 | learning rate: 1.827E-04 | global batch size: 256 | lm loss: 2.119479E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.272 | TFLOPs: 40.86 | 15: iteration 26210/ 125429 | consumed samples: 6709760 | consumed tokens: 13741588480 | elapsed time per iteration (s): 1.06 | learning rate: 1.826E-04 | global batch size: 256 | lm loss: 2.118917E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.597 | TFLOPs: 39.76 | 15: iteration 26220/ 125429 | consumed samples: 6712320 | consumed tokens: 13746831360 | elapsed time per iteration (s): 1.03 | learning rate: 1.826E-04 | global batch size: 256 | lm loss: 2.139302E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.680 | TFLOPs: 41.10 | 15: iteration 26230/ 125429 | consumed samples: 6714880 | consumed tokens: 13752074240 | elapsed time per iteration (s): 1.07 | learning rate: 1.826E-04 | global batch size: 256 | lm loss: 2.102080E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 239.459 | TFLOPs: 39.57 | 15: iteration 26240/ 125429 | consumed samples: 6717440 | consumed tokens: 13757317120 | elapsed time per iteration (s): 1.08 | learning rate: 1.826E-04 | global batch size: 256 | lm loss: 2.100050E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.388 | TFLOPs: 39.06 | 15: iteration 26250/ 125429 | consumed samples: 6720000 | consumed tokens: 13762560000 | elapsed time per iteration (s): 1.05 | learning rate: 1.826E-04 | global batch size: 256 | lm loss: 2.099505E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.565 | TFLOPs: 40.25 | 15: iteration 26260/ 125429 | consumed samples: 6722560 | consumed tokens: 13767802880 | elapsed time per iteration (s): 2.45 | learning rate: 1.826E-04 | global batch size: 256 | lm loss: 2.102029E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.695 | TFLOPs: 17.30 | 15: iteration 26270/ 125429 | consumed samples: 6725120 | consumed tokens: 13773045760 | elapsed time per iteration (s): 1.05 | learning rate: 1.826E-04 | global batch size: 256 | lm loss: 2.105526E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.074 | TFLOPs: 40.34 | 15: iteration 26280/ 125429 | consumed samples: 6727680 | consumed tokens: 13778288640 | elapsed time per iteration (s): 1.05 | learning rate: 1.826E-04 | global batch size: 256 | lm loss: 2.111688E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.985 | TFLOPs: 40.32 | 15: iteration 26290/ 125429 | consumed samples: 6730240 | consumed tokens: 13783531520 | elapsed time per iteration (s): 1.07 | learning rate: 1.825E-04 | global batch size: 256 | lm 
loss: 2.130585E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.956 | TFLOPs: 39.49 | 15: iteration 26300/ 125429 | consumed samples: 6732800 | consumed tokens: 13788774400 | elapsed time per iteration (s): 1.06 | learning rate: 1.825E-04 | global batch size: 256 | lm loss: 2.115579E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.549 | TFLOPs: 40.08 | 15: iteration 26310/ 125429 | consumed samples: 6735360 | consumed tokens: 13794017280 | elapsed time per iteration (s): 1.05 | learning rate: 1.825E-04 | global batch size: 256 | lm loss: 2.120744E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.580 | TFLOPs: 40.42 | 15: iteration 26320/ 125429 | consumed samples: 6737920 | consumed tokens: 13799260160 | elapsed time per iteration (s): 1.05 | learning rate: 1.825E-04 | global batch size: 256 | lm loss: 2.122669E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.715 | TFLOPs: 40.28 | 15: iteration 26330/ 125429 | consumed samples: 6740480 | consumed tokens: 13804503040 | elapsed time per iteration (s): 1.11 | learning rate: 1.825E-04 | global batch size: 256 | lm loss: 2.146373E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.399 | TFLOPs: 38.24 | 15: iteration 26340/ 125429 | consumed samples: 6743040 | consumed tokens: 13809745920 | elapsed time per iteration (s): 1.10 | learning rate: 1.825E-04 | global batch size: 256 | lm loss: 2.130251E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.263 | TFLOPs: 38.38 | 15: iteration 26350/ 125429 | consumed samples: 6745600 | consumed tokens: 
13814988800 | elapsed time per iteration (s): 1.03 | learning rate: 1.825E-04 | global batch size: 256 | lm loss: 2.113016E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.748 | TFLOPs: 41.27 | 15: iteration 26360/ 125429 | consumed samples: 6748160 | consumed tokens: 13820231680 | elapsed time per iteration (s): 7.54 | learning rate: 1.824E-04 | global batch size: 256 | lm loss: 2.081801E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 33.935 | TFLOPs: 5.61 | 15: iteration 26370/ 125429 | consumed samples: 6750720 | consumed tokens: 13825474560 | elapsed time per iteration (s): 1.05 | learning rate: 1.824E-04 | global batch size: 256 | lm loss: 2.119996E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.381 | TFLOPs: 40.22 | 15: iteration 26380/ 125429 | consumed samples: 6753280 | consumed tokens: 13830717440 | elapsed time per iteration (s): 1.06 | learning rate: 1.824E-04 | global batch size: 256 | lm loss: 2.128878E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.052 | TFLOPs: 40.00 | 15: iteration 26390/ 125429 | consumed samples: 6755840 | consumed tokens: 13835960320 | elapsed time per iteration (s): 1.04 | learning rate: 1.824E-04 | global batch size: 256 | lm loss: 2.140005E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.331 | TFLOPs: 40.87 | 15: iteration 26400/ 125429 | consumed samples: 6758400 | consumed tokens: 13841203200 | elapsed time per iteration (s): 1.02 | learning rate: 1.824E-04 | global batch size: 256 | lm loss: 2.101064E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
249.923 | TFLOPs: 41.30 | 15: iteration 26410/ 125429 | consumed samples: 6760960 | consumed tokens: 13846446080 | elapsed time per iteration (s): 1.04 | learning rate: 1.824E-04 | global batch size: 256 | lm loss: 2.137013E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.602 | TFLOPs: 40.75 | 15: iteration 26420/ 125429 | consumed samples: 6763520 | consumed tokens: 13851688960 | elapsed time per iteration (s): 1.03 | learning rate: 1.824E-04 | global batch size: 256 | lm loss: 2.101601E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.464 | TFLOPs: 40.90 | 15: iteration 26430/ 125429 | consumed samples: 6766080 | consumed tokens: 13856931840 | elapsed time per iteration (s): 1.04 | learning rate: 1.824E-04 | global batch size: 256 | lm loss: 2.069405E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.291 | TFLOPs: 40.70 | 15: iteration 26440/ 125429 | consumed samples: 6768640 | consumed tokens: 13862174720 | elapsed time per iteration (s): 1.12 | learning rate: 1.823E-04 | global batch size: 256 | lm loss: 2.165540E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.411 | TFLOPs: 37.75 | 15: iteration 26450/ 125429 | consumed samples: 6771200 | consumed tokens: 13867417600 | elapsed time per iteration (s): 1.07 | learning rate: 1.823E-04 | global batch size: 256 | lm loss: 2.102932E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.317 | TFLOPs: 39.55 | 15: iteration 26460/ 125429 | consumed samples: 6773760 | consumed tokens: 13872660480 | elapsed time per iteration (s): 1.07 | learning rate: 1.823E-04 | global batch size: 256 | lm loss: 2.115471E+00 | grad norm: 0.133 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.651 | TFLOPs: 39.60 | 15: iteration 26470/ 125429 | consumed samples: 6776320 | consumed tokens: 13877903360 | elapsed time per iteration (s): 1.06 | learning rate: 1.823E-04 | global batch size: 256 | lm loss: 2.095522E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.180 | TFLOPs: 39.86 | 15: iteration 26480/ 125429 | consumed samples: 6778880 | consumed tokens: 13883146240 | elapsed time per iteration (s): 1.49 | learning rate: 1.823E-04 | global batch size: 256 | lm loss: 2.104650E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 171.975 | TFLOPs: 28.42 | 15: iteration 26490/ 125429 | consumed samples: 6781440 | consumed tokens: 13888389120 | elapsed time per iteration (s): 1.02 | learning rate: 1.823E-04 | global batch size: 256 | lm loss: 2.097619E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.445 | TFLOPs: 41.39 | 15: iteration 26500/ 125429 | consumed samples: 6784000 | consumed tokens: 13893632000 | elapsed time per iteration (s): 1.07 | learning rate: 1.823E-04 | global batch size: 256 | lm loss: 2.113928E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.044 | TFLOPs: 39.50 | 15: iteration 26510/ 125429 | consumed samples: 6786560 | consumed tokens: 13898874880 | elapsed time per iteration (s): 1.09 | learning rate: 1.822E-04 | global batch size: 256 | lm loss: 2.119959E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.252 | TFLOPs: 38.88 | 15: iteration 26520/ 125429 | consumed samples: 6789120 | consumed tokens: 13904117760 | elapsed time per iteration (s): 
1.08 | learning rate: 1.822E-04 | global batch size: 256 | lm loss: 2.116124E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.098 | TFLOPs: 39.18 | 15: iteration 26530/ 125429 | consumed samples: 6791680 | consumed tokens: 13909360640 | elapsed time per iteration (s): 1.08 | learning rate: 1.822E-04 | global batch size: 256 | lm loss: 2.115475E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.874 | TFLOPs: 39.31 | 15: iteration 26540/ 125429 | consumed samples: 6794240 | consumed tokens: 13914603520 | elapsed time per iteration (s): 1.06 | learning rate: 1.822E-04 | global batch size: 256 | lm loss: 2.098178E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.117 | TFLOPs: 40.01 | 15: iteration 26550/ 125429 | consumed samples: 6796800 | consumed tokens: 13919846400 | elapsed time per iteration (s): 1.05 | learning rate: 1.822E-04 | global batch size: 256 | lm loss: 2.128131E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.880 | TFLOPs: 40.47 | 15: iteration 26560/ 125429 | consumed samples: 6799360 | consumed tokens: 13925089280 | elapsed time per iteration (s): 1.09 | learning rate: 1.822E-04 | global batch size: 256 | lm loss: 2.109254E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.848 | TFLOPs: 38.98 | 15: iteration 26570/ 125429 | consumed samples: 6801920 | consumed tokens: 13930332160 | elapsed time per iteration (s): 1.03 | learning rate: 1.822E-04 | global batch size: 256 | lm loss: 2.077453E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.034 | TFLOPs: 40.99 | 15: iteration 26580/ 
125429 | consumed samples: 6804480 | consumed tokens: 13935575040 | elapsed time per iteration (s): 1.07 | learning rate: 1.821E-04 | global batch size: 256 | lm loss: 2.119624E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.884 | TFLOPs: 39.64 | 15: iteration 26590/ 125429 | consumed samples: 6807040 | consumed tokens: 13940817920 | elapsed time per iteration (s): 1.05 | learning rate: 1.821E-04 | global batch size: 256 | lm loss: 2.095820E+00 | grad norm: 0.225 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.589 | TFLOPs: 40.42 | 15: iteration 26600/ 125429 | consumed samples: 6809600 | consumed tokens: 13946060800 | elapsed time per iteration (s): 1.05 | learning rate: 1.821E-04 | global batch size: 256 | lm loss: 2.110104E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.219 | TFLOPs: 40.36 | 15: iteration 26610/ 125429 | consumed samples: 6812160 | consumed tokens: 13951303680 | elapsed time per iteration (s): 1.06 | learning rate: 1.821E-04 | global batch size: 256 | lm loss: 2.108776E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.939 | TFLOPs: 39.82 | 15: iteration 26620/ 125429 | consumed samples: 6814720 | consumed tokens: 13956546560 | elapsed time per iteration (s): 1.05 | learning rate: 1.821E-04 | global batch size: 256 | lm loss: 2.114982E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.582 | TFLOPs: 40.25 | 15: iteration 26630/ 125429 | consumed samples: 6817280 | consumed tokens: 13961789440 | elapsed time per iteration (s): 1.03 | learning rate: 1.821E-04 | global batch size: 256 | lm loss: 2.109737E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | samples per second: 247.909 | TFLOPs: 40.97 | 15: iteration 26640/ 125429 | consumed samples: 6819840 | consumed tokens: 13967032320 | elapsed time per iteration (s): 1.04 | learning rate: 1.821E-04 | global batch size: 256 | lm loss: 2.122629E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.254 | TFLOPs: 40.53 | 15: iteration 26650/ 125429 | consumed samples: 6822400 | consumed tokens: 13972275200 | elapsed time per iteration (s): 1.04 | learning rate: 1.821E-04 | global batch size: 256 | lm loss: 2.097222E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.941 | TFLOPs: 40.64 | 15: iteration 26660/ 125429 | consumed samples: 6824960 | consumed tokens: 13977518080 | elapsed time per iteration (s): 1.07 | learning rate: 1.820E-04 | global batch size: 256 | lm loss: 2.093467E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.581 | TFLOPs: 39.43 | 15: iteration 26670/ 125429 | consumed samples: 6827520 | consumed tokens: 13982760960 | elapsed time per iteration (s): 1.02 | learning rate: 1.820E-04 | global batch size: 256 | lm loss: 2.129411E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.948 | TFLOPs: 41.47 | 15: iteration 26680/ 125429 | consumed samples: 6830080 | consumed tokens: 13988003840 | elapsed time per iteration (s): 1.04 | learning rate: 1.820E-04 | global batch size: 256 | lm loss: 2.139704E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.652 | TFLOPs: 40.60 | 15: iteration 26690/ 125429 | consumed samples: 6832640 | consumed tokens: 13993246720 | elapsed time per iteration (s): 1.10 | learning rate: 1.820E-04 | global batch 
size: 256 | lm loss: 2.085039E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.413 | TFLOPs: 38.57 | 15: iteration 26700/ 125429 | consumed samples: 6835200 | consumed tokens: 13998489600 | elapsed time per iteration (s): 1.06 | learning rate: 1.820E-04 | global batch size: 256 | lm loss: 2.125010E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.377 | TFLOPs: 39.89 | 15: iteration 26710/ 125429 | consumed samples: 6837760 | consumed tokens: 14003732480 | elapsed time per iteration (s): 1.04 | learning rate: 1.820E-04 | global batch size: 256 | lm loss: 2.097229E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.486 | TFLOPs: 40.57 | 15: iteration 26720/ 125429 | consumed samples: 6840320 | consumed tokens: 14008975360 | elapsed time per iteration (s): 1.08 | learning rate: 1.820E-04 | global batch size: 256 | lm loss: 2.116960E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.319 | TFLOPs: 39.05 | 15: iteration 26730/ 125429 | consumed samples: 6842880 | consumed tokens: 14014218240 | elapsed time per iteration (s): 1.08 | learning rate: 1.819E-04 | global batch size: 256 | lm loss: 2.109590E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.467 | TFLOPs: 39.24 | 15: iteration 26740/ 125429 | consumed samples: 6845440 | consumed tokens: 14019461120 | elapsed time per iteration (s): 1.17 | learning rate: 1.819E-04 | global batch size: 256 | lm loss: 2.099019E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 219.324 | TFLOPs: 36.24 | 15: iteration 26750/ 125429 | consumed samples: 6848000 | consumed 
tokens: 14024704000 | elapsed time per iteration (s): 1.04 | learning rate: 1.819E-04 | global batch size: 256 | lm loss: 2.118737E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.441 | TFLOPs: 40.73 | 15: iteration 26760/ 125429 | consumed samples: 6850560 | consumed tokens: 14029946880 | elapsed time per iteration (s): 1.05 | learning rate: 1.819E-04 | global batch size: 256 | lm loss: 2.102906E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.637 | TFLOPs: 40.26 | 15: iteration 26770/ 125429 | consumed samples: 6853120 | consumed tokens: 14035189760 | elapsed time per iteration (s): 1.07 | learning rate: 1.819E-04 | global batch size: 256 | lm loss: 2.128045E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.176 | TFLOPs: 39.36 | 15: iteration 26780/ 125429 | consumed samples: 6855680 | consumed tokens: 14040432640 | elapsed time per iteration (s): 1.08 | learning rate: 1.819E-04 | global batch size: 256 | lm loss: 2.120576E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.069 | TFLOPs: 39.34 | 15: iteration 26790/ 125429 | consumed samples: 6858240 | consumed tokens: 14045675520 | elapsed time per iteration (s): 1.06 | learning rate: 1.819E-04 | global batch size: 256 | lm loss: 2.090173E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.464 | TFLOPs: 40.07 | 15: iteration 26800/ 125429 | consumed samples: 6860800 | consumed tokens: 14050918400 | elapsed time per iteration (s): 1.04 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 2.109725E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 245.860 | TFLOPs: 40.63 | 15: iteration 26810/ 125429 | consumed samples: 6863360 | consumed tokens: 14056161280 | elapsed time per iteration (s): 1.02 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 2.084086E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.272 | TFLOPs: 41.36 | 15: iteration 26820/ 125429 | consumed samples: 6865920 | consumed tokens: 14061404160 | elapsed time per iteration (s): 1.06 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 2.105766E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.391 | TFLOPs: 39.89 | 15: iteration 26830/ 125429 | consumed samples: 6868480 | consumed tokens: 14066647040 | elapsed time per iteration (s): 1.06 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 2.075956E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.250 | TFLOPs: 40.03 | 15: iteration 26840/ 125429 | consumed samples: 6871040 | consumed tokens: 14071889920 | elapsed time per iteration (s): 1.05 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 2.097861E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.678 | TFLOPs: 40.27 | 15: iteration 26850/ 125429 | consumed samples: 6873600 | consumed tokens: 14077132800 | elapsed time per iteration (s): 1.12 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 2.114225E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.007 | TFLOPs: 37.68 | 15: iteration 26860/ 125429 | consumed samples: 6876160 | consumed tokens: 14082375680 | elapsed time per iteration (s): 1.04 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 2.100097E+00 | grad norm: 
0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.572 | TFLOPs: 40.58 | 15: iteration 26870/ 125429 | consumed samples: 6878720 | consumed tokens: 14087618560 | elapsed time per iteration (s): 1.03 | learning rate: 1.818E-04 | global batch size: 256 | lm loss: 2.104811E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.940 | TFLOPs: 41.14 | 15: iteration 26880/ 125429 | consumed samples: 6881280 | consumed tokens: 14092861440 | elapsed time per iteration (s): 1.14 | learning rate: 1.817E-04 | global batch size: 256 | lm loss: 2.108178E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.473 | TFLOPs: 37.26 | 15: iteration 26890/ 125429 | consumed samples: 6883840 | consumed tokens: 14098104320 | elapsed time per iteration (s): 1.16 | learning rate: 1.817E-04 | global batch size: 256 | lm loss: 2.098105E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.140 | TFLOPs: 36.55 | 15: iteration 26900/ 125429 | consumed samples: 6886400 | consumed tokens: 14103347200 | elapsed time per iteration (s): 1.05 | learning rate: 1.817E-04 | global batch size: 256 | lm loss: 2.121669E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.455 | TFLOPs: 40.40 | 15: iteration 26910/ 125429 | consumed samples: 6888960 | consumed tokens: 14108590080 | elapsed time per iteration (s): 1.14 | learning rate: 1.817E-04 | global batch size: 256 | lm loss: 2.081588E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 223.622 | TFLOPs: 36.96 | 15: iteration 26920/ 125429 | consumed samples: 6891520 | consumed tokens: 14113832960 | elapsed time per 
iteration (s): 1.21 | learning rate: 1.817E-04 | global batch size: 256 | lm loss: 2.118220E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 212.137 | TFLOPs: 35.06 | 15: iteration 26930/ 125429 | consumed samples: 6894080 | consumed tokens: 14119075840 | elapsed time per iteration (s): 1.09 | learning rate: 1.817E-04 | global batch size: 256 | lm loss: 2.110377E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.448 | TFLOPs: 38.74 | 15: iteration 26940/ 125429 | consumed samples: 6896640 | consumed tokens: 14124318720 | elapsed time per iteration (s): 1.10 | learning rate: 1.817E-04 | global batch size: 256 | lm loss: 2.100908E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.737 | TFLOPs: 38.63 | 15: iteration 26950/ 125429 | consumed samples: 6899200 | consumed tokens: 14129561600 | elapsed time per iteration (s): 1.09 | learning rate: 1.816E-04 | global batch size: 256 | lm loss: 2.108236E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.366 | TFLOPs: 38.90 | 15: iteration 26960/ 125429 | consumed samples: 6901760 | consumed tokens: 14134804480 | elapsed time per iteration (s): 1.04 | learning rate: 1.816E-04 | global batch size: 256 | lm loss: 2.100816E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.168 | TFLOPs: 40.85 | 15: iteration 26970/ 125429 | consumed samples: 6904320 | consumed tokens: 14140047360 | elapsed time per iteration (s): 1.06 | learning rate: 1.816E-04 | global batch size: 256 | lm loss: 2.091519E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.442 | TFLOPs: 40.07 | 15: 
iteration 26980/ 125429 | consumed samples: 6906880 | consumed tokens: 14145290240 | elapsed time per iteration (s): 1.05 | learning rate: 1.816E-04 | global batch size: 256 | lm loss: 2.111018E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.488 | TFLOPs: 40.24 | 15: iteration 26990/ 125429 | consumed samples: 6909440 | consumed tokens: 14150533120 | elapsed time per iteration (s): 1.04 | learning rate: 1.816E-04 | global batch size: 256 | lm loss: 2.124475E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.069 | TFLOPs: 40.50 | 15: iteration 27000/ 125429 | consumed samples: 6912000 | consumed tokens: 14155776000 | elapsed time per iteration (s): 1.03 | learning rate: 1.816E-04 | global batch size: 256 | lm loss: 2.114259E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.538 | TFLOPs: 40.91 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 27000 | lm loss value: 2.083663E+00 | lm loss PPL: 8.033847E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 27000 to checkpoints_1b5 0: [2022-11-26 03:53:24,507] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step27000 is begin to save! 0: [2022-11-26 03:53:24,517] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_01-model_00-model_states.pt... 0: [2022-11-26 03:53:24,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_01-model_00-model_states.pt. 0: [2022-11-26 03:53:24,777] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_03-model_00-model_states.pt... 
0: [2022-11-26 03:53:24,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_03-model_00-model_states.pt. 0: [2022-11-26 03:53:24,896] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_04-model_00-model_states.pt... 0: [2022-11-26 03:53:25,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_04-model_00-model_states.pt. 0: [2022-11-26 03:53:25,019] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_05-model_00-model_states.pt... 0: [2022-11-26 03:53:25,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_05-model_00-model_states.pt. 0: [2022-11-26 03:53:25,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_06-model_00-model_states.pt... 0: [2022-11-26 03:53:25,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_06-model_00-model_states.pt. 0: [2022-11-26 03:53:25,258] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_07-model_00-model_states.pt... 0: [2022-11-26 03:53:25,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_07-model_00-model_states.pt. 0: [2022-11-26 03:53:25,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_08-model_00-model_states.pt... 0: [2022-11-26 03:53:25,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_08-model_00-model_states.pt. 0: [2022-11-26 03:53:25,489] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_09-model_00-model_states.pt... 
0: [2022-11-26 03:53:25,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_09-model_00-model_states.pt. 0: [2022-11-26 03:53:25,608] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_10-model_00-model_states.pt... 0: [2022-11-26 03:53:25,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_10-model_00-model_states.pt. 0: [2022-11-26 03:53:25,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_11-model_00-model_states.pt... 0: [2022-11-26 03:53:25,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_11-model_00-model_states.pt. 0: [2022-11-26 03:53:25,841] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_12-model_00-model_states.pt... 0: [2022-11-26 03:53:25,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_12-model_00-model_states.pt. 0: [2022-11-26 03:53:25,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_13-model_00-model_states.pt... 0: [2022-11-26 03:53:26,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_13-model_00-model_states.pt. 0: [2022-11-26 03:53:26,073] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_14-model_00-model_states.pt... 0: [2022-11-26 03:53:26,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_14-model_00-model_states.pt. 0: [2022-11-26 03:53:26,187] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_15-model_00-model_states.pt... 
0: [2022-11-26 03:53:26,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_15-model_00-model_states.pt. 0: [2022-11-26 03:53:26,301] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_16-model_00-model_states.pt... 0: [2022-11-26 03:53:26,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_16-model_00-model_states.pt. 0: [2022-11-26 03:53:26,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_17-model_00-model_states.pt... 0: [2022-11-26 03:53:26,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_17-model_00-model_states.pt. 0: [2022-11-26 03:53:26,529] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_18-model_00-model_states.pt... 0: [2022-11-26 03:53:26,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_18-model_00-model_states.pt. 0: [2022-11-26 03:53:26,641] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_19-model_00-model_states.pt... 0: [2022-11-26 03:53:26,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_19-model_00-model_states.pt. 0: [2022-11-26 03:53:26,757] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_20-model_00-model_states.pt... 0: [2022-11-26 03:53:26,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_20-model_00-model_states.pt. 0: [2022-11-26 03:53:26,871] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_21-model_00-model_states.pt... 
0: [2022-11-26 03:53:26,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_21-model_00-model_states.pt. 0: [2022-11-26 03:53:26,983] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_22-model_00-model_states.pt... 0: [2022-11-26 03:53:27,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_22-model_00-model_states.pt. 0: [2022-11-26 03:53:27,100] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_23-model_00-model_states.pt... 0: [2022-11-26 03:53:27,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_23-model_00-model_states.pt. 0: [2022-11-26 03:53:27,212] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_24-model_00-model_states.pt... 0: [2022-11-26 03:53:27,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_24-model_00-model_states.pt. 0: [2022-11-26 03:53:27,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_25-model_00-model_states.pt... 0: [2022-11-26 03:53:27,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_25-model_00-model_states.pt. 0: [2022-11-26 03:53:27,438] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_26-model_00-model_states.pt... 0: [2022-11-26 03:53:27,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_26-model_00-model_states.pt. 0: [2022-11-26 03:53:27,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_27-model_00-model_states.pt... 
0: [2022-11-26 03:53:27,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_27-model_00-model_states.pt. 0: [2022-11-26 03:53:27,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_28-model_00-model_states.pt... 0: [2022-11-26 03:53:27,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_28-model_00-model_states.pt. 0: [2022-11-26 03:53:27,773] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_29-model_00-model_states.pt... 0: [2022-11-26 03:53:27,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_29-model_00-model_states.pt. 0: [2022-11-26 03:53:27,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_30-model_00-model_states.pt... 0: [2022-11-26 03:53:27,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_30-model_00-model_states.pt. 0: [2022-11-26 03:53:27,996] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/layer_32-model_00-model_states.pt... 0: [2022-11-26 03:53:28,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/layer_32-model_00-model_states.pt. 0: [2022-11-26 03:53:28,001] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step27000/mp_rank_00_model_states.pt 0: [2022-11-26 03:53:28,001] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/mp_rank_00_model_states.pt... 0: [2022-11-26 03:53:28,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/mp_rank_00_model_states.pt. 
0: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
15: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
6: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
11: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
10: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
14: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
13: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 5: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 1: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 15: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
8: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 03:53:28,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step27000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 0: [2022-11-26 03:53:28,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:53:28,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 03:53:28,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 8: [2022-11-26 03:53:28,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-26 03:53:28,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 03:53:28,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 03:53:28,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:53:28,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 03:53:28,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 03:53:28,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:53:28,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 03:53:28,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 03:53:28,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:53:28,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 03:53:28,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 03:53:28,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 03:53:28,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 03:53:28,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 03:53:28,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:53:28,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:53:28,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 03:53:28,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 03:53:28,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 03:53:28,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 03:53:28,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:53:28,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 03:53:28,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 6: [2022-11-26 03:53:28,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 13: [2022-11-26 03:53:28,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 03:53:28,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 03:53:28,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:53:28,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 03:53:28,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 03:53:28,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:53:28,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 03:53:28,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 8: [2022-11-26 03:53:28,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 03:53:28,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 03:53:28,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 8: [2022-11-26 03:53:28,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:53:28,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 03:53:28,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 11: [2022-11-26 03:53:28,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:53:28,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:53:28,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:53:28,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 03:53:28,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 03:53:28,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:53:28,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 03:53:28,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 03:53:28,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 03:53:28,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:53:28,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 03:53:28,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 13: [2022-11-26 03:53:28,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:53:28,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:53:28,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 03:53:28,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 13: [2022-11-26 03:53:28,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 03:53:28,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 13: [2022-11-26 03:53:28,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 03:53:28,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:53:28,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 03:53:28,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 03:53:28,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 13: [2022-11-26 03:53:28,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 03:53:28,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:53:28,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 03:53:28,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 03:53:28,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:53:28,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 03:53:28,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 03:53:28,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 03:53:28,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 03:53:28,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 8: [2022-11-26 03:53:28,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:53:28,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:53:28,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 6: [2022-11-26 03:53:28,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 8: [2022-11-26 03:53:28,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 03:53:28,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 03:53:28,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:53:28,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 03:53:28,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 03:53:28,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 03:53:28,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 03:53:28,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 03:53:28,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:53:28,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 03:53:28,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 03:53:28,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:53:28,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 03:53:28,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 03:53:28,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:53:28,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 03:53:28,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 03:53:28,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 03:53:28,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 6: [2022-11-26 03:53:28,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:53:28,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:53:28,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 03:53:28,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 03:53:28,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 03:53:28,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 03:53:28,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 13: [2022-11-26 03:53:28,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:53:28,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 03:53:28,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 03:53:28,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 03:53:28,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 03:53:28,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 8: [2022-11-26 03:53:28,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:53:28,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 03:53:28,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 6: [2022-11-26 03:53:28,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 03:53:28,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 03:53:28,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 03:53:28,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:53:28,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 03:53:28,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 03:53:28,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 03:53:28,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 03:53:28,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 13: [2022-11-26 03:53:28,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 03:53:28,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 03:53:28,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 03:53:28,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:53:28,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 03:53:28,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 13: [2022-11-26 03:53:28,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 03:53:28,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 03:53:28,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 03:53:28,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:53:28,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 03:53:28,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 8: [2022-11-26 03:53:28,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:53:28,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 03:53:28,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 7: [2022-11-26 03:53:28,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:53:28,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 8: [2022-11-26 03:53:28,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 7: [2022-11-26 03:53:28,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
8: [2022-11-26 03:53:28,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 03:53:28,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 03:53:28,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:53:28,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:53:28,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 3: [2022-11-26 03:53:28,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 2: [2022-11-26 03:53:28,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 03:53:28,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 03:53:28,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 03:53:28,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 03:53:28,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 03:53:28,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 03:53:28,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 03:53:28,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 03:53:28,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:53:28,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 03:53:28,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 4: [2022-11-26 03:53:28,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 03:53:28,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 03:53:28,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 03:53:28,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:53:28,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 03:53:28,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 03:53:28,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 03:53:28,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 03:53:28,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 03:53:28,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:53:28,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 03:53:28,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 12: [2022-11-26 03:53:28,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:53:28,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:53:28,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:53:28,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:53:28,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:53:28,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:53:28,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 03:53:28,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 03:53:28,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 03:53:28,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 03:53:28,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 03:53:28,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 03:53:28,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 12: [2022-11-26 03:53:28,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 12: [2022-11-26 03:53:28,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 12: [2022-11-26 03:53:28,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 12: [2022-11-26 03:53:28,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 03:53:28,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
12: [2022-11-26 03:53:28,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 03:53:28,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 03:53:28,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:53:28,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 03:53:28,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:53:28,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 03:53:28,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 12: [2022-11-26 03:53:28,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 03:53:28,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:53:28,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 03:53:28,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 03:53:28,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 03:53:28,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 03:53:28,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 03:53:28,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 03:53:28,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 3: [2022-11-26 03:53:28,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 03:53:28,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 03:53:28,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:53:28,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 03:53:28,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 03:53:28,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 03:53:28,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 03:53:28,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 03:53:28,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 03:53:28,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 1: [2022-11-26 03:53:28,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 12: [2022-11-26 03:53:28,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 03:53:28,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 03:53:28,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 8: [2022-11-26 03:53:28,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 03:53:28,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 03:53:28,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
11: [2022-11-26 03:53:28,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 03:53:28,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 03:53:28,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 11: [2022-11-26 03:53:28,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 11: [2022-11-26 03:53:28,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:53:28,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 03:53:28,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 11: [2022-11-26 03:53:28,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:53:28,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 03:53:28,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 11: [2022-11-26 03:53:28,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 03:53:28,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 03:53:28,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 11: [2022-11-26 03:53:28,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:53:28,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 03:53:28,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 11: [2022-11-26 03:53:28,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 03:53:28,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 03:53:28,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 0: [2022-11-26 03:53:28,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 03:53:28,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 11: [2022-11-26 03:53:28,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:53:28,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 03:53:28,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:53:28,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 03:53:28,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 03:53:28,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 03:53:28,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 03:53:28,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:53:28,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 03:53:28,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 03:53:28,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:53:28,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 03:53:28,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 03:53:28,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 03:53:28,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 03:53:28,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 2: [2022-11-26 03:53:28,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 03:53:28,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 03:53:28,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 11: [2022-11-26 03:53:28,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 03:53:28,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 14: [2022-11-26 03:53:28,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:53:28,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:53:28,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 03:53:28,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 03:53:28,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
14: [2022-11-26 03:53:28,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 14: [2022-11-26 03:53:28,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:53:28,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:53:28,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:53:28,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 03:53:28,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 03:53:28,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 03:53:28,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 14: [2022-11-26 03:53:28,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 14: [2022-11-26 03:53:28,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 14: [2022-11-26 03:53:28,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 03:53:28,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 03:53:28,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 14: [2022-11-26 03:53:28,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:53:28,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 03:53:28,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 03:53:28,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 03:53:28,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 14: [2022-11-26 03:53:28,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 10: [2022-11-26 03:53:28,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:53:28,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:53:28,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:53:28,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 03:53:28,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:53:28,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:53:28,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 03:53:28,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 03:53:28,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 03:53:28,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 03:53:28,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 03:53:28,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 03:53:28,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 10: [2022-11-26 03:53:28,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 10: [2022-11-26 03:53:28,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 10: [2022-11-26 03:53:28,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
10: [2022-11-26 03:53:28,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 10: [2022-11-26 03:53:28,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 10: [2022-11-26 03:53:28,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:53:28,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 03:53:28,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 03:53:28,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 03:53:28,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 10: [2022-11-26 03:53:28,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 9: [2022-11-26 03:53:28,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:53:28,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:53:28,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 03:53:28,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 03:53:28,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 03:53:28,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:53:28,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 9: [2022-11-26 03:53:28,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 03:53:28,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 9: [2022-11-26 03:53:28,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 03:53:28,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 9: [2022-11-26 03:53:28,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 9: [2022-11-26 03:53:28,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:53:28,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 03:53:28,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 03:53:28,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:53:28,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 9: [2022-11-26 03:53:28,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 03:53:28,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 03:53:28,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 9: [2022-11-26 03:53:28,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 9: [2022-11-26 03:53:28,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 03:53:28,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 03:53:28,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 15: [2022-11-26 03:53:28,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:53:28,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 03:53:28,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 03:53:28,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:53:28,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:53:28,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 03:53:28,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 15: [2022-11-26 03:53:28,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 03:53:28,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 15: [2022-11-26 03:53:28,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 03:53:28,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 15: [2022-11-26 03:53:28,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 15: [2022-11-26 03:53:28,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-26 03:53:28,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 03:53:28,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 15: [2022-11-26 03:53:28,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:53:28,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 03:53:28,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 15: [2022-11-26 03:53:28,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:53:28,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 03:53:28,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 15: [2022-11-26 03:53:28,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 03:53:28,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 03:53:28,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 03:53:28,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 03:53:28,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:53:28,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:53:28,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:53:28,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:53:28,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:53:28,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 03:53:28,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 03:53:28,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 03:53:28,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 03:53:28,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
5: [2022-11-26 03:53:28,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 03:53:28,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 03:53:28,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 03:53:28,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 03:53:28,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 03:53:28,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 03:53:28,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 03:53:28,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:53:28,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 03:53:28,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 03:53:28,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step27000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 03:53:28,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 5: [2022-11-26 03:53:28,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step27000 is ready now! 
0: successfully saved checkpoint at iteration 27000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 4081.16 15: iteration 27010/ 125429 | consumed samples: 6914560 | consumed tokens: 14161018880 | elapsed time per iteration (s): 1.53 | learning rate: 1.816E-04 | global batch size: 256 | lm loss: 2.114559E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 167.082 | TFLOPs: 27.61 | 15: iteration 27020/ 125429 | consumed samples: 6917120 | consumed tokens: 14166261760 | elapsed time per iteration (s): 1.05 | learning rate: 1.815E-04 | global batch size: 256 | lm loss: 2.084100E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.363 | TFLOPs: 40.38 | 15: iteration 27030/ 125429 | consumed samples: 6919680 | consumed tokens: 14171504640 | elapsed time per iteration (s): 1.06 | learning rate: 1.815E-04 | global batch size: 256 | lm loss: 2.090219E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.895 | TFLOPs: 39.81 | 15: iteration 27040/ 125429 | consumed samples: 6922240 | consumed tokens: 14176747520 | elapsed time per iteration (s): 1.06 | learning rate: 1.815E-04 | global batch size: 256 | lm loss: 2.093991E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.087 | TFLOPs: 40.01 | 15: iteration 27050/ 125429 | consumed samples: 6924800 | consumed tokens: 14181990400 | elapsed time per iteration (s): 1.10 | learning rate: 1.815E-04 | global batch size: 256 | lm loss: 2.137727E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.172 | TFLOPs: 38.37 | 15: iteration 27060/ 125429 | consumed samples: 6927360 | consumed tokens: 14187233280 | elapsed time per iteration (s): 1.04 | learning 
rate: 1.815E-04 | global batch size: 256 | lm loss: 2.124285E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.560 | TFLOPs: 40.75 | 15: iteration 27070/ 125429 | consumed samples: 6929920 | consumed tokens: 14192476160 | elapsed time per iteration (s): 1.05 | learning rate: 1.815E-04 | global batch size: 256 | lm loss: 2.104696E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.608 | TFLOPs: 40.26 | 15: iteration 27080/ 125429 | consumed samples: 6932480 | consumed tokens: 14197719040 | elapsed time per iteration (s): 1.08 | learning rate: 1.815E-04 | global batch size: 256 | lm loss: 2.096560E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.123 | TFLOPs: 39.02 | 15: iteration 27090/ 125429 | consumed samples: 6935040 | consumed tokens: 14202961920 | elapsed time per iteration (s): 1.04 | learning rate: 1.814E-04 | global batch size: 256 | lm loss: 2.107293E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.488 | TFLOPs: 40.73 | 15: iteration 27100/ 125429 | consumed samples: 6937600 | consumed tokens: 14208204800 | elapsed time per iteration (s): 1.04 | learning rate: 1.814E-04 | global batch size: 256 | lm loss: 2.092427E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.310 | TFLOPs: 40.87 | 15: iteration 27110/ 125429 | consumed samples: 6940160 | consumed tokens: 14213447680 | elapsed time per iteration (s): 1.07 | learning rate: 1.814E-04 | global batch size: 256 | lm loss: 2.123885E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.505 | TFLOPs: 39.58 | 15: iteration 27120/ 125429 | 
consumed samples: 6942720 | consumed tokens: 14218690560 | elapsed time per iteration (s): 1.07 | learning rate: 1.814E-04 | global batch size: 256 | lm loss: 2.097732E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.761 | TFLOPs: 39.46 | 15: iteration 27130/ 125429 | consumed samples: 6945280 | consumed tokens: 14223933440 | elapsed time per iteration (s): 1.09 | learning rate: 1.814E-04 | global batch size: 256 | lm loss: 2.115013E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.625 | TFLOPs: 38.94 | 15: iteration 27140/ 125429 | consumed samples: 6947840 | consumed tokens: 14229176320 | elapsed time per iteration (s): 1.08 | learning rate: 1.814E-04 | global batch size: 256 | lm loss: 2.072887E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.099 | TFLOPs: 39.02 | 15: iteration 27150/ 125429 | consumed samples: 6950400 | consumed tokens: 14234419200 | elapsed time per iteration (s): 1.02 | learning rate: 1.814E-04 | global batch size: 256 | lm loss: 2.104744E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.332 | TFLOPs: 41.53 | 15: iteration 27160/ 125429 | consumed samples: 6952960 | consumed tokens: 14239662080 | elapsed time per iteration (s): 1.05 | learning rate: 1.814E-04 | global batch size: 256 | lm loss: 2.129792E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.793 | TFLOPs: 40.29 | 15: iteration 27170/ 125429 | consumed samples: 6955520 | consumed tokens: 14244904960 | elapsed time per iteration (s): 1.05 | learning rate: 1.813E-04 | global batch size: 256 | lm loss: 2.101190E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 243.750 | TFLOPs: 40.28 | 15: iteration 27180/ 125429 | consumed samples: 6958080 | consumed tokens: 14250147840 | elapsed time per iteration (s): 1.03 | learning rate: 1.813E-04 | global batch size: 256 | lm loss: 2.102651E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.236 | TFLOPs: 41.02 | 15: iteration 27190/ 125429 | consumed samples: 6960640 | consumed tokens: 14255390720 | elapsed time per iteration (s): 1.05 | learning rate: 1.813E-04 | global batch size: 256 | lm loss: 2.108849E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.784 | TFLOPs: 40.45 | 15: iteration 27200/ 125429 | consumed samples: 6963200 | consumed tokens: 14260633600 | elapsed time per iteration (s): 2.53 | learning rate: 1.813E-04 | global batch size: 256 | lm loss: 2.112349E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 101.252 | TFLOPs: 16.73 | 15: iteration 27210/ 125429 | consumed samples: 6965760 | consumed tokens: 14265876480 | elapsed time per iteration (s): 1.05 | learning rate: 1.813E-04 | global batch size: 256 | lm loss: 2.111966E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.172 | TFLOPs: 40.35 | 15: iteration 27220/ 125429 | consumed samples: 6968320 | consumed tokens: 14271119360 | elapsed time per iteration (s): 1.04 | learning rate: 1.813E-04 | global batch size: 256 | lm loss: 2.107601E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.301 | TFLOPs: 40.70 | 15: iteration 27230/ 125429 | consumed samples: 6970880 | consumed tokens: 14276362240 | elapsed time per iteration (s): 1.07 | learning rate: 1.813E-04 | global batch size: 
256 | lm loss: 2.109790E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.105 | TFLOPs: 39.68 | 15: iteration 27240/ 125429 | consumed samples: 6973440 | consumed tokens: 14281605120 | elapsed time per iteration (s): 1.04 | learning rate: 1.812E-04 | global batch size: 256 | lm loss: 2.086471E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.128 | TFLOPs: 40.67 | 15: iteration 27250/ 125429 | consumed samples: 6976000 | consumed tokens: 14286848000 | elapsed time per iteration (s): 1.06 | learning rate: 1.812E-04 | global batch size: 256 | lm loss: 2.102530E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.977 | TFLOPs: 39.99 | 15: iteration 27260/ 125429 | consumed samples: 6978560 | consumed tokens: 14292090880 | elapsed time per iteration (s): 1.05 | learning rate: 1.812E-04 | global batch size: 256 | lm loss: 2.093909E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.714 | TFLOPs: 40.11 | 15: iteration 27270/ 125429 | consumed samples: 6981120 | consumed tokens: 14297333760 | elapsed time per iteration (s): 1.04 | learning rate: 1.812E-04 | global batch size: 256 | lm loss: 2.090558E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.160 | TFLOPs: 40.51 | 15: iteration 27280/ 125429 | consumed samples: 6983680 | consumed tokens: 14302576640 | elapsed time per iteration (s): 1.07 | learning rate: 1.812E-04 | global batch size: 256 | lm loss: 2.108681E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.215 | TFLOPs: 39.37 | 15: iteration 27290/ 125429 | consumed samples: 6986240 | consumed 
tokens: 14307819520 | elapsed time per iteration (s): 1.04 | learning rate: 1.812E-04 | global batch size: 256 | lm loss: 2.109815E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.422 | TFLOPs: 40.56 | 15: iteration 27300/ 125429 | consumed samples: 6988800 | consumed tokens: 14313062400 | elapsed time per iteration (s): 1.12 | learning rate: 1.812E-04 | global batch size: 256 | lm loss: 2.115671E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.592 | TFLOPs: 37.94 | 15: iteration 27310/ 125429 | consumed samples: 6991360 | consumed tokens: 14318305280 | elapsed time per iteration (s): 1.06 | learning rate: 1.811E-04 | global batch size: 256 | lm loss: 2.115564E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.717 | TFLOPs: 39.95 | 15: iteration 27320/ 125429 | consumed samples: 6993920 | consumed tokens: 14323548160 | elapsed time per iteration (s): 1.09 | learning rate: 1.811E-04 | global batch size: 256 | lm loss: 2.113905E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.450 | TFLOPs: 38.74 | 15: iteration 27330/ 125429 | consumed samples: 6996480 | consumed tokens: 14328791040 | elapsed time per iteration (s): 1.05 | learning rate: 1.811E-04 | global batch size: 256 | lm loss: 2.105799E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.892 | TFLOPs: 40.31 | 15: iteration 27340/ 125429 | consumed samples: 6999040 | consumed tokens: 14334033920 | elapsed time per iteration (s): 1.05 | learning rate: 1.811E-04 | global batch size: 256 | lm loss: 2.102074E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 244.422 | TFLOPs: 40.39 | 15: iteration 27350/ 125429 | consumed samples: 7001600 | consumed tokens: 14339276800 | elapsed time per iteration (s): 1.03 | learning rate: 1.811E-04 | global batch size: 256 | lm loss: 2.098612E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.040 | TFLOPs: 40.99 | 15: iteration 27360/ 125429 | consumed samples: 7004160 | consumed tokens: 14344519680 | elapsed time per iteration (s): 1.05 | learning rate: 1.811E-04 | global batch size: 256 | lm loss: 2.107229E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.572 | TFLOPs: 40.42 | 15: iteration 27370/ 125429 | consumed samples: 7006720 | consumed tokens: 14349762560 | elapsed time per iteration (s): 1.05 | learning rate: 1.811E-04 | global batch size: 256 | lm loss: 2.146003E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.290 | TFLOPs: 40.21 | 15: iteration 27380/ 125429 | consumed samples: 7009280 | consumed tokens: 14355005440 | elapsed time per iteration (s): 1.03 | learning rate: 1.810E-04 | global batch size: 256 | lm loss: 2.085820E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.734 | TFLOPs: 41.11 | 15: iteration 27390/ 125429 | consumed samples: 7011840 | consumed tokens: 14360248320 | elapsed time per iteration (s): 1.07 | learning rate: 1.810E-04 | global batch size: 256 | lm loss: 2.111229E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.508 | TFLOPs: 39.42 | 15: iteration 27400/ 125429 | consumed samples: 7014400 | consumed tokens: 14365491200 | elapsed time per iteration (s): 1.03 | learning rate: 1.810E-04 | global batch size: 256 | lm loss: 2.092394E+00 | grad norm: 
0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.042 | TFLOPs: 40.99 | 15: iteration 27410/ 125429 | consumed samples: 7016960 | consumed tokens: 14370734080 | elapsed time per iteration (s): 1.05 | learning rate: 1.810E-04 | global batch size: 256 | lm loss: 2.100185E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.258 | TFLOPs: 40.37 | 15: iteration 27420/ 125429 | consumed samples: 7019520 | consumed tokens: 14375976960 | elapsed time per iteration (s): 1.07 | learning rate: 1.810E-04 | global batch size: 256 | lm loss: 2.118489E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.852 | TFLOPs: 39.64 | 15: iteration 27430/ 125429 | consumed samples: 7022080 | consumed tokens: 14381219840 | elapsed time per iteration (s): 1.08 | learning rate: 1.810E-04 | global batch size: 256 | lm loss: 2.114343E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.198 | TFLOPs: 39.03 | 15: iteration 27440/ 125429 | consumed samples: 7024640 | consumed tokens: 14386462720 | elapsed time per iteration (s): 1.05 | learning rate: 1.810E-04 | global batch size: 256 | lm loss: 2.115802E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.953 | TFLOPs: 40.32 | 15: iteration 27450/ 125429 | consumed samples: 7027200 | consumed tokens: 14391705600 | elapsed time per iteration (s): 1.03 | learning rate: 1.809E-04 | global batch size: 256 | lm loss: 2.105161E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.896 | TFLOPs: 40.97 | 15: iteration 27460/ 125429 | consumed samples: 7029760 | consumed tokens: 14396948480 | elapsed time per 
iteration (s): 1.07 | learning rate: 1.809E-04 | global batch size: 256 | lm loss: 2.131000E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.895 | TFLOPs: 39.64 | 15: iteration 27470/ 125429 | consumed samples: 7032320 | consumed tokens: 14402191360 | elapsed time per iteration (s): 1.11 | learning rate: 1.809E-04 | global batch size: 256 | lm loss: 2.105819E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.603 | TFLOPs: 38.27 | 15: iteration 27480/ 125429 | consumed samples: 7034880 | consumed tokens: 14407434240 | elapsed time per iteration (s): 1.04 | learning rate: 1.809E-04 | global batch size: 256 | lm loss: 2.111657E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.465 | TFLOPs: 40.56 | 15: iteration 27490/ 125429 | consumed samples: 7037440 | consumed tokens: 14412677120 | elapsed time per iteration (s): 1.09 | learning rate: 1.809E-04 | global batch size: 256 | lm loss: 2.123654E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.005 | TFLOPs: 38.67 | 15: iteration 27500/ 125429 | consumed samples: 7040000 | consumed tokens: 14417920000 | elapsed time per iteration (s): 1.03 | learning rate: 1.809E-04 | global batch size: 256 | lm loss: 2.123857E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.359 | TFLOPs: 41.21 | 15: iteration 27510/ 125429 | consumed samples: 7042560 | consumed tokens: 14423162880 | elapsed time per iteration (s): 1.05 | learning rate: 1.809E-04 | global batch size: 256 | lm loss: 2.127937E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.657 | TFLOPs: 40.43 | 15: 
iteration 27520/ 125429 | consumed samples: 7045120 | consumed tokens: 14428405760 | elapsed time per iteration (s): 1.04 | learning rate: 1.808E-04 | global batch size: 256 | lm loss: 2.093773E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.836 | TFLOPs: 40.79 | 15: iteration 27530/ 125429 | consumed samples: 7047680 | consumed tokens: 14433648640 | elapsed time per iteration (s): 1.02 | learning rate: 1.808E-04 | global batch size: 256 | lm loss: 2.117788E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.225 | TFLOPs: 41.52 | 15: iteration 27540/ 125429 | consumed samples: 7050240 | consumed tokens: 14438891520 | elapsed time per iteration (s): 1.04 | learning rate: 1.808E-04 | global batch size: 256 | lm loss: 2.092837E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.176 | TFLOPs: 40.52 | 15: iteration 27550/ 125429 | consumed samples: 7052800 | consumed tokens: 14444134400 | elapsed time per iteration (s): 1.06 | learning rate: 1.808E-04 | global batch size: 256 | lm loss: 2.099553E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.179 | TFLOPs: 40.02 | 15: iteration 27560/ 125429 | consumed samples: 7055360 | consumed tokens: 14449377280 | elapsed time per iteration (s): 1.03 | learning rate: 1.808E-04 | global batch size: 256 | lm loss: 2.118100E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.440 | TFLOPs: 40.89 | 15: iteration 27570/ 125429 | consumed samples: 7057920 | consumed tokens: 14454620160 | elapsed time per iteration (s): 1.03 | learning rate: 1.808E-04 | global batch size: 256 | lm loss: 2.101952E+00 | grad norm: 0.140 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.985 | TFLOPs: 40.98 | 15: iteration 27580/ 125429 | consumed samples: 7060480 | consumed tokens: 14459863040 | elapsed time per iteration (s): 1.04 | learning rate: 1.808E-04 | global batch size: 256 | lm loss: 2.109438E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.230 | TFLOPs: 40.69 | 15: iteration 27590/ 125429 | consumed samples: 7063040 | consumed tokens: 14465105920 | elapsed time per iteration (s): 1.05 | learning rate: 1.808E-04 | global batch size: 256 | lm loss: 2.117535E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.072 | TFLOPs: 40.17 | 15: iteration 27600/ 125429 | consumed samples: 7065600 | consumed tokens: 14470348800 | elapsed time per iteration (s): 1.06 | learning rate: 1.807E-04 | global batch size: 256 | lm loss: 2.125053E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.184 | TFLOPs: 39.86 | 15: iteration 27610/ 125429 | consumed samples: 7068160 | consumed tokens: 14475591680 | elapsed time per iteration (s): 1.06 | learning rate: 1.807E-04 | global batch size: 256 | lm loss: 2.104797E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.308 | TFLOPs: 39.88 | 15: iteration 27620/ 125429 | consumed samples: 7070720 | consumed tokens: 14480834560 | elapsed time per iteration (s): 1.04 | learning rate: 1.807E-04 | global batch size: 256 | lm loss: 2.106372E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.034 | TFLOPs: 40.49 | 15: iteration 27630/ 125429 | consumed samples: 7073280 | consumed tokens: 14486077440 | elapsed time per iteration (s): 1.06 | learning rate: 
1.807E-04 | global batch size: 256 | lm loss: 2.117415E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.745 | TFLOPs: 39.95 | 15: iteration 27640/ 125429 | consumed samples: 7075840 | consumed tokens: 14491320320 | elapsed time per iteration (s): 1.03 | learning rate: 1.807E-04 | global batch size: 256 | lm loss: 2.116493E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.187 | TFLOPs: 41.18 | 15: iteration 27650/ 125429 | consumed samples: 7078400 | consumed tokens: 14496563200 | elapsed time per iteration (s): 1.06 | learning rate: 1.807E-04 | global batch size: 256 | lm loss: 2.110454E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.746 | TFLOPs: 39.95 | 15: iteration 27660/ 125429 | consumed samples: 7080960 | consumed tokens: 14501806080 | elapsed time per iteration (s): 1.09 | learning rate: 1.807E-04 | global batch size: 256 | lm loss: 2.103173E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.360 | TFLOPs: 38.89 | 15: iteration 27670/ 125429 | consumed samples: 7083520 | consumed tokens: 14507048960 | elapsed time per iteration (s): 1.02 | learning rate: 1.806E-04 | global batch size: 256 | lm loss: 2.123590E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.849 | TFLOPs: 41.29 | 15: iteration 27680/ 125429 | consumed samples: 7086080 | consumed tokens: 14512291840 | elapsed time per iteration (s): 1.04 | learning rate: 1.806E-04 | global batch size: 256 | lm loss: 2.118954E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.784 | TFLOPs: 40.62 | 15: iteration 27690/ 125429 | consumed 
samples: 7088640 | consumed tokens: 14517534720 | elapsed time per iteration (s): 1.04 | learning rate: 1.806E-04 | global batch size: 256 | lm loss: 2.127684E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.063 | TFLOPs: 40.50 | 15: iteration 27700/ 125429 | consumed samples: 7091200 | consumed tokens: 14522777600 | elapsed time per iteration (s): 1.05 | learning rate: 1.806E-04 | global batch size: 256 | lm loss: 2.096418E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.326 | TFLOPs: 40.21 | 15: iteration 27710/ 125429 | consumed samples: 7093760 | consumed tokens: 14528020480 | elapsed time per iteration (s): 1.03 | learning rate: 1.806E-04 | global batch size: 256 | lm loss: 2.137773E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.659 | TFLOPs: 41.26 | 15: iteration 27720/ 125429 | consumed samples: 7096320 | consumed tokens: 14533263360 | elapsed time per iteration (s): 1.06 | learning rate: 1.806E-04 | global batch size: 256 | lm loss: 2.124033E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.560 | TFLOPs: 40.08 | 15: iteration 27730/ 125429 | consumed samples: 7098880 | consumed tokens: 14538506240 | elapsed time per iteration (s): 1.07 | learning rate: 1.806E-04 | global batch size: 256 | lm loss: 2.107354E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.584 | TFLOPs: 39.43 | 15: iteration 27740/ 125429 | consumed samples: 7101440 | consumed tokens: 14543749120 | elapsed time per iteration (s): 1.03 | learning rate: 1.805E-04 | global batch size: 256 | lm loss: 2.112657E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 247.519 | TFLOPs: 40.90 | 15: iteration 27750/ 125429 | consumed samples: 7104000 | consumed tokens: 14548992000 | elapsed time per iteration (s): 1.05 | learning rate: 1.805E-04 | global batch size: 256 | lm loss: 2.125536E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.443 | TFLOPs: 40.23 | 15: iteration 27760/ 125429 | consumed samples: 7106560 | consumed tokens: 14554234880 | elapsed time per iteration (s): 1.04 | learning rate: 1.805E-04 | global batch size: 256 | lm loss: 2.112490E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.588 | TFLOPs: 40.75 | 15: iteration 27770/ 125429 | consumed samples: 7109120 | consumed tokens: 14559477760 | elapsed time per iteration (s): 1.04 | learning rate: 1.805E-04 | global batch size: 256 | lm loss: 2.086607E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.551 | TFLOPs: 40.74 | 15: iteration 27780/ 125429 | consumed samples: 7111680 | consumed tokens: 14564720640 | elapsed time per iteration (s): 1.02 | learning rate: 1.805E-04 | global batch size: 256 | lm loss: 2.138721E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.546 | TFLOPs: 41.40 | 15: iteration 27790/ 125429 | consumed samples: 7114240 | consumed tokens: 14569963520 | elapsed time per iteration (s): 1.06 | learning rate: 1.805E-04 | global batch size: 256 | lm loss: 2.098259E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.537 | TFLOPs: 40.08 | 15: iteration 27800/ 125429 | consumed samples: 7116800 | consumed tokens: 14575206400 | elapsed time per iteration (s): 1.07 | learning rate: 1.805E-04 | global batch size: 256 | lm 
loss: 2.096453E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.229 | TFLOPs: 39.70 | 15: iteration 27810/ 125429 | consumed samples: 7119360 | consumed tokens: 14580449280 | elapsed time per iteration (s): 1.05 | learning rate: 1.804E-04 | global batch size: 256 | lm loss: 2.092859E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.128 | TFLOPs: 40.34 | 15: iteration 27820/ 125429 | consumed samples: 7121920 | consumed tokens: 14585692160 | elapsed time per iteration (s): 1.03 | learning rate: 1.804E-04 | global batch size: 256 | lm loss: 2.095022E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.360 | TFLOPs: 41.04 | 15: iteration 27830/ 125429 | consumed samples: 7124480 | consumed tokens: 14590935040 | elapsed time per iteration (s): 1.23 | learning rate: 1.804E-04 | global batch size: 256 | lm loss: 2.116895E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 207.568 | TFLOPs: 34.30 | 15: iteration 27840/ 125429 | consumed samples: 7127040 | consumed tokens: 14596177920 | elapsed time per iteration (s): 1.09 | learning rate: 1.804E-04 | global batch size: 256 | lm loss: 2.120156E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.876 | TFLOPs: 38.98 | 15: iteration 27850/ 125429 | consumed samples: 7129600 | consumed tokens: 14601420800 | elapsed time per iteration (s): 1.06 | learning rate: 1.804E-04 | global batch size: 256 | lm loss: 2.127002E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.065 | TFLOPs: 39.84 | 15: iteration 27860/ 125429 | consumed samples: 7132160 | consumed tokens: 
14606663680 | elapsed time per iteration (s): 1.02 | learning rate: 1.804E-04 | global batch size: 256 | lm loss: 2.128809E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.320 | TFLOPs: 41.37 | 15: iteration 27870/ 125429 | consumed samples: 7134720 | consumed tokens: 14611906560 | elapsed time per iteration (s): 1.02 | learning rate: 1.804E-04 | global batch size: 256 | lm loss: 2.087455E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.791 | TFLOPs: 41.28 | 15: iteration 27880/ 125429 | consumed samples: 7137280 | consumed tokens: 14617149440 | elapsed time per iteration (s): 1.03 | learning rate: 1.803E-04 | global batch size: 256 | lm loss: 2.134215E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.425 | TFLOPs: 41.05 | 15: iteration 27890/ 125429 | consumed samples: 7139840 | consumed tokens: 14622392320 | elapsed time per iteration (s): 1.02 | learning rate: 1.803E-04 | global batch size: 256 | lm loss: 2.086389E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.256 | TFLOPs: 41.52 | 15: iteration 27900/ 125429 | consumed samples: 7142400 | consumed tokens: 14627635200 | elapsed time per iteration (s): 1.05 | learning rate: 1.803E-04 | global batch size: 256 | lm loss: 2.061632E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.881 | TFLOPs: 40.47 | 15: iteration 27910/ 125429 | consumed samples: 7144960 | consumed tokens: 14632878080 | elapsed time per iteration (s): 1.04 | learning rate: 1.803E-04 | global batch size: 256 | lm loss: 2.110065E+00 | grad norm: 0.305 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
246.750 | TFLOPs: 40.78 | 15: iteration 27920/ 125429 | consumed samples: 7147520 | consumed tokens: 14638120960 | elapsed time per iteration (s): 1.04 | learning rate: 1.803E-04 | global batch size: 256 | lm loss: 2.589616E+00 | grad norm: 7.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.307 | TFLOPs: 40.70 | 15: iteration 27930/ 125429 | consumed samples: 7150080 | consumed tokens: 14643363840 | elapsed time per iteration (s): 1.08 | learning rate: 1.803E-04 | global batch size: 256 | lm loss: 2.786237E+00 | grad norm: 1.223 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.383 | TFLOPs: 39.06 | 15: iteration 27940/ 125429 | consumed samples: 7152640 | consumed tokens: 14648606720 | elapsed time per iteration (s): 1.03 | learning rate: 1.803E-04 | global batch size: 256 | lm loss: 2.265464E+00 | grad norm: 0.463 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.870 | TFLOPs: 41.13 | 15: iteration 27950/ 125429 | consumed samples: 7155200 | consumed tokens: 14653849600 | elapsed time per iteration (s): 1.02 | learning rate: 1.802E-04 | global batch size: 256 | lm loss: 2.169634E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.874 | TFLOPs: 41.29 | 15: iteration 27960/ 125429 | consumed samples: 7157760 | consumed tokens: 14659092480 | elapsed time per iteration (s): 1.05 | learning rate: 1.802E-04 | global batch size: 256 | lm loss: 2.140409E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.323 | TFLOPs: 40.38 | 15: iteration 27970/ 125429 | consumed samples: 7160320 | consumed tokens: 14664335360 | elapsed time per iteration (s): 1.05 | learning rate: 1.802E-04 | global batch size: 256 | lm loss: 2.129817E+00 | grad norm: 0.150 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.060 | TFLOPs: 40.17 | 15: iteration 27980/ 125429 | consumed samples: 7162880 | consumed tokens: 14669578240 | elapsed time per iteration (s): 1.05 | learning rate: 1.802E-04 | global batch size: 256 | lm loss: 2.135649E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.986 | TFLOPs: 40.32 | 15: iteration 27990/ 125429 | consumed samples: 7165440 | consumed tokens: 14674821120 | elapsed time per iteration (s): 1.03 | learning rate: 1.802E-04 | global batch size: 256 | lm loss: 2.116836E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.588 | TFLOPs: 41.08 | 0: [2022-11-26 04:11:16,028] [INFO] [logging.py:68:log_dist] [Rank 0] step=28000, skipped=0, lr=[0.0001801701558125968, 0.0001801701558125968, 0.0001801701558125968], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 28000/ 125429 | consumed samples: 7168000 | consumed tokens: 14680064000 | elapsed time per iteration (s): 1.03 | learning rate: 1.802E-04 | global batch size: 256 | lm loss: 2.120044E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.522 | TFLOPs: 41.24 | 0: steps: 28000 loss: 2.1663 iter time (s): 1.104 samples/sec: 231.971 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 28000 | lm loss value: 2.128025E+00 | lm loss PPL: 8.398264E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 28000 to checkpoints_1b5 0: [2022-11-26 04:11:16,402] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step28000 is begin to save! 
0: [2022-11-26 04:11:16,411] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_01-model_00-model_states.pt... 0: [2022-11-26 04:11:16,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_01-model_00-model_states.pt. 0: [2022-11-26 04:11:16,646] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_03-model_00-model_states.pt... 0: [2022-11-26 04:11:16,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_03-model_00-model_states.pt. 0: [2022-11-26 04:11:16,758] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_04-model_00-model_states.pt... 0: [2022-11-26 04:11:16,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_04-model_00-model_states.pt. 0: [2022-11-26 04:11:16,856] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_05-model_00-model_states.pt... 0: [2022-11-26 04:11:16,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_05-model_00-model_states.pt. 0: [2022-11-26 04:11:16,961] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_06-model_00-model_states.pt... 0: [2022-11-26 04:11:17,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_06-model_00-model_states.pt. 0: [2022-11-26 04:11:17,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_07-model_00-model_states.pt... 0: [2022-11-26 04:11:17,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 04:11:17,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_08-model_00-model_states.pt... 0: [2022-11-26 04:11:17,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_08-model_00-model_states.pt. 0: [2022-11-26 04:11:17,273] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_09-model_00-model_states.pt... 0: [2022-11-26 04:11:17,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_09-model_00-model_states.pt. 0: [2022-11-26 04:11:17,375] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_10-model_00-model_states.pt... 0: [2022-11-26 04:11:17,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_10-model_00-model_states.pt. 0: [2022-11-26 04:11:17,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_11-model_00-model_states.pt... 0: [2022-11-26 04:11:17,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_11-model_00-model_states.pt. 0: [2022-11-26 04:11:17,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_12-model_00-model_states.pt... 0: [2022-11-26 04:11:17,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_12-model_00-model_states.pt. 0: [2022-11-26 04:11:17,684] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_13-model_00-model_states.pt... 0: [2022-11-26 04:11:17,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 04:11:17,786] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_14-model_00-model_states.pt... 0: [2022-11-26 04:11:17,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_14-model_00-model_states.pt. 0: [2022-11-26 04:11:17,888] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_15-model_00-model_states.pt... 0: [2022-11-26 04:11:17,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_15-model_00-model_states.pt. 0: [2022-11-26 04:11:17,991] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_16-model_00-model_states.pt... 0: [2022-11-26 04:11:18,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_16-model_00-model_states.pt. 0: [2022-11-26 04:11:18,091] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_17-model_00-model_states.pt... 0: [2022-11-26 04:11:18,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_17-model_00-model_states.pt. 0: [2022-11-26 04:11:18,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_18-model_00-model_states.pt... 0: [2022-11-26 04:11:18,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_18-model_00-model_states.pt. 0: [2022-11-26 04:11:18,296] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_19-model_00-model_states.pt... 0: [2022-11-26 04:11:18,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 04:11:18,399] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_20-model_00-model_states.pt... 0: [2022-11-26 04:11:18,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_20-model_00-model_states.pt. 0: [2022-11-26 04:11:18,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_21-model_00-model_states.pt... 0: [2022-11-26 04:11:18,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_21-model_00-model_states.pt. 0: [2022-11-26 04:11:18,603] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_22-model_00-model_states.pt... 0: [2022-11-26 04:11:18,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_22-model_00-model_states.pt. 0: [2022-11-26 04:11:18,709] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_23-model_00-model_states.pt... 0: [2022-11-26 04:11:18,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_23-model_00-model_states.pt. 0: [2022-11-26 04:11:18,808] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_24-model_00-model_states.pt... 0: [2022-11-26 04:11:18,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_24-model_00-model_states.pt. 0: [2022-11-26 04:11:18,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_25-model_00-model_states.pt... 0: [2022-11-26 04:11:19,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 04:11:19,011] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_26-model_00-model_states.pt... 0: [2022-11-26 04:11:19,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_26-model_00-model_states.pt. 0: [2022-11-26 04:11:19,115] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_27-model_00-model_states.pt... 0: [2022-11-26 04:11:19,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_27-model_00-model_states.pt. 0: [2022-11-26 04:11:19,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_28-model_00-model_states.pt... 0: [2022-11-26 04:11:19,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_28-model_00-model_states.pt. 0: [2022-11-26 04:11:19,319] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_29-model_00-model_states.pt... 0: [2022-11-26 04:11:19,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_29-model_00-model_states.pt. 0: [2022-11-26 04:11:19,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_30-model_00-model_states.pt... 0: [2022-11-26 04:11:19,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_30-model_00-model_states.pt. 0: [2022-11-26 04:11:19,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/layer_32-model_00-model_states.pt... 0: [2022-11-26 04:11:19,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 04:11:19,525] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step28000/mp_rank_00_model_states.pt 0: [2022-11-26 04:11:19,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/mp_rank_00_model_states.pt... 0: [2022-11-26 04:11:19,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/mp_rank_00_model_states.pt. 0: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
12: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
5: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
15: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
14: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
7: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
13: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
3: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
8: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
9: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
13: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:11:19,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step28000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:11:19,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
9: [2022-11-26 04:11:19,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:11:19,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 04:11:19,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 04:11:19,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:11:19,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:11:19,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:11:19,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 04:11:19,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 8: [2022-11-26 04:11:19,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:11:19,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 04:11:19,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
8: [2022-11-26 04:11:19,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 04:11:19,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 8: [2022-11-26 04:11:19,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:11:19,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 04:11:19,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 11: [2022-11-26 04:11:19,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:11:19,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 04:11:19,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 12: [2022-11-26 04:11:19,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:11:19,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 04:11:19,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 12: [2022-11-26 04:11:19,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 04:11:19,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 04:11:19,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 9: [2022-11-26 04:11:19,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:11:19,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:11:19,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 04:11:19,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 9: [2022-11-26 04:11:19,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:11:19,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 04:11:19,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 9: [2022-11-26 04:11:19,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:11:19,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 04:11:19,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
13: [2022-11-26 04:11:19,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:11:19,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:11:19,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:11:19,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:11:19,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 13: [2022-11-26 04:11:19,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 3: [2022-11-26 04:11:19,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 13: [2022-11-26 04:11:19,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 04:11:19,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 04:11:19,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 13: [2022-11-26 04:11:19,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 13: [2022-11-26 04:11:19,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
0: [2022-11-26 04:11:19,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:11:19,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 04:11:19,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 10: [2022-11-26 04:11:19,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:11:19,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:11:19,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 0: [2022-11-26 04:11:19,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 04:11:19,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 10: [2022-11-26 04:11:19,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 10: [2022-11-26 04:11:19,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:11:19,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 04:11:19,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
13: [2022-11-26 04:11:19,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:11:19,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 6: [2022-11-26 04:11:19,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:11:19,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:11:19,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:11:19,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 04:11:19,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 04:11:19,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 04:11:19,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 04:11:19,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 04:11:19,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 04:11:19,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
6: [2022-11-26 04:11:19,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:11:19,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 13: [2022-11-26 04:11:19,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:11:19,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 13: [2022-11-26 04:11:19,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 04:11:19,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 04:11:19,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:11:19,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 04:11:19,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 13: [2022-11-26 04:11:19,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:11:19,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 04:11:19,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
7: [2022-11-26 04:11:19,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:11:19,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 04:11:19,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 9: [2022-11-26 04:11:19,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:11:19,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 04:11:19,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 3: [2022-11-26 04:11:19,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:11:19,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:11:19,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 04:11:19,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 04:11:19,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 3: [2022-11-26 04:11:19,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
9: [2022-11-26 04:11:19,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:11:19,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 04:11:19,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 12: [2022-11-26 04:11:19,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:11:19,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 04:11:19,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 10: [2022-11-26 04:11:19,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:11:19,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 04:11:19,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 10: [2022-11-26 04:11:19,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:11:19,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 04:11:19,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
12: [2022-11-26 04:11:19,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:11:19,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 04:11:19,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 11: [2022-11-26 04:11:19,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:11:19,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:11:19,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 04:11:19,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 8: [2022-11-26 04:11:19,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:11:19,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 04:11:19,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 4: [2022-11-26 04:11:19,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 04:11:19,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
4: [2022-11-26 04:11:19,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:11:19,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 04:11:19,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 10: [2022-11-26 04:11:19,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:11:19,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:11:19,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 04:11:19,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 10: [2022-11-26 04:11:19,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 4: [2022-11-26 04:11:19,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:11:19,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 04:11:19,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 10: [2022-11-26 04:11:19,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
11: [2022-11-26 04:11:19,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 04:11:19,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 04:11:19,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:11:19,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:11:19,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 04:11:19,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 15: [2022-11-26 04:11:19,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:11:19,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 04:11:19,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 15: [2022-11-26 04:11:19,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 04:11:19,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 04:11:19,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
7: [2022-11-26 04:11:19,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 15: [2022-11-26 04:11:19,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:11:19,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:11:19,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:11:19,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 04:11:19,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 04:11:19,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 04:11:19,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 15: [2022-11-26 04:11:19,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 15: [2022-11-26 04:11:19,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 8: [2022-11-26 04:11:19,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 04:11:19,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 04:11:19,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 9: [2022-11-26 04:11:19,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:11:19,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:11:19,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 12: [2022-11-26 04:11:19,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 04:11:19,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 9: [2022-11-26 04:11:19,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 04:11:19,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:11:19,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 04:11:19,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 04:11:19,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 04:11:19,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 04:11:19,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 4: [2022-11-26 04:11:19,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:11:19,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 15: [2022-11-26 04:11:19,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:11:19,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 11: [2022-11-26 04:11:19,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:11:19,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 04:11:19,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 8: [2022-11-26 04:11:19,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 04:11:19,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 04:11:19,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 10: [2022-11-26 04:11:19,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:11:19,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 7: [2022-11-26 04:11:19,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:11:19,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 10: [2022-11-26 04:11:19,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 04:11:19,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 11: [2022-11-26 04:11:19,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:11:19,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
11: [2022-11-26 04:11:19,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 8: [2022-11-26 04:11:19,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 11: [2022-11-26 04:11:19,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 8: [2022-11-26 04:11:19,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 8: [2022-11-26 04:11:19,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:11:19,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 7: [2022-11-26 04:11:19,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:11:19,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 04:11:19,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 04:11:19,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 10: [2022-11-26 04:11:19,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 04:11:19,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 04:11:19,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 04:11:19,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:11:19,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:11:19,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 04:11:19,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 04:11:19,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 04:11:19,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 12: [2022-11-26 04:11:19,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:11:19,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 04:11:19,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 04:11:19,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 04:11:19,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 11: [2022-11-26 04:11:19,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:11:19,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 11: [2022-11-26 04:11:19,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 04:11:19,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 12: [2022-11-26 04:11:19,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:11:19,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 04:11:19,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 04:11:19,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:11:19,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 04:11:19,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 11: [2022-11-26 04:11:19,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 04:11:19,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 04:11:19,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 10: [2022-11-26 04:11:19,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:11:19,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 04:11:19,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 3: [2022-11-26 04:11:19,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:11:19,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:11:19,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 04:11:19,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 04:11:19,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 3: [2022-11-26 04:11:19,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 04:11:19,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 04:11:19,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 04:11:19,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 4: [2022-11-26 04:11:19,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:11:19,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 6: [2022-11-26 04:11:19,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:11:19,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 6: [2022-11-26 04:11:19,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 04:11:19,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 11: [2022-11-26 04:11:19,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:11:19,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 04:11:19,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
15: [2022-11-26 04:11:19,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 04:11:19,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 15: [2022-11-26 04:11:19,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:11:19,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 04:11:19,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 15: [2022-11-26 04:11:19,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:11:19,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 04:11:19,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 3: [2022-11-26 04:11:19,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:11:19,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 04:11:19,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 04:11:19,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 04:11:19,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 3: [2022-11-26 04:11:19,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 7: [2022-11-26 04:11:19,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:11:19,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 04:11:19,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 4: [2022-11-26 04:11:19,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:11:19,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:11:19,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 04:11:19,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
4: [2022-11-26 04:11:19,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 04:11:19,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 8: [2022-11-26 04:11:19,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:11:19,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 04:11:19,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 13: [2022-11-26 04:11:19,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:11:19,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 04:11:19,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 13: [2022-11-26 04:11:19,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:11:19,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 04:11:19,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 12: [2022-11-26 04:11:19,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 04:11:19,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 04:11:19,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 1: [2022-11-26 04:11:19,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:11:19,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:11:19,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:11:19,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:11:19,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 04:11:19,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 04:11:19,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 04:11:19,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 1: [2022-11-26 04:11:19,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 04:11:19,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
1: [2022-11-26 04:11:19,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 1: [2022-11-26 04:11:19,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 1: [2022-11-26 04:11:19,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:11:19,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 04:11:19,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 1: [2022-11-26 04:11:19,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:11:19,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:11:19,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 04:11:19,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 04:11:19,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 1: [2022-11-26 04:11:19,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 1: [2022-11-26 04:11:19,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 04:11:19,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 04:11:19,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 04:11:19,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:11:19,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:11:19,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:11:19,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:11:19,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:11:19,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:11:19,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 04:11:19,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 04:11:19,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 04:11:19,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 04:11:19,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 04:11:19,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 04:11:19,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 04:11:19,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 04:11:19,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:11:19,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 04:11:19,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 04:11:19,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 04:11:19,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
5: [2022-11-26 04:11:19,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 04:11:19,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 04:11:19,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 5: [2022-11-26 04:11:19,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 04:11:19,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 04:11:19,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:11:19,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 04:11:19,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 04:11:19,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:11:19,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 04:11:19,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 14: [2022-11-26 04:11:19,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:11:19,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 04:11:19,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:11:19,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:11:19,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:11:19,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:11:19,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:11:19,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 04:11:19,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 04:11:19,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 04:11:19,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 04:11:19,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 04:11:19,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 04:11:19,884] 
[INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 04:11:19,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:11:19,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 14: [2022-11-26 04:11:19,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 14: [2022-11-26 04:11:19,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 14: [2022-11-26 04:11:19,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 14: [2022-11-26 04:11:19,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 14: [2022-11-26 04:11:19,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 14: [2022-11-26 04:11:19,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 14: [2022-11-26 04:11:19,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 04:11:19,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 0: [2022-11-26 04:11:19,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 04:11:19,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 04:11:19,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 04:11:19,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:11:19,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:11:19,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 04:11:19,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 04:11:19,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 04:11:19,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 04:11:19,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 04:11:19,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 04:11:19,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:11:19,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 04:11:19,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 04:11:19,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 04:11:19,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 04:11:19,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 04:11:19,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:11:19,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 04:11:19,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 04:11:19,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:11:19,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 04:11:19,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 2: [2022-11-26 04:11:19,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:11:19,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step28000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 04:11:19,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step28000 is ready now! 
0: successfully saved checkpoint at iteration 28000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3590.82 15: iteration 28010/ 125429 | consumed samples: 7170560 | consumed tokens: 14685306880 | elapsed time per iteration (s): 1.43 | learning rate: 1.802E-04 | global batch size: 256 | lm loss: 2.141376E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 178.515 | TFLOPs: 29.50 | 15: iteration 28020/ 125429 | consumed samples: 7173120 | consumed tokens: 14690549760 | elapsed time per iteration (s): 1.05 | learning rate: 1.801E-04 | global batch size: 256 | lm loss: 2.089328E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.400 | TFLOPs: 40.39 | 15: iteration 28030/ 125429 | consumed samples: 7175680 | consumed tokens: 14695792640 | elapsed time per iteration (s): 1.04 | learning rate: 1.801E-04 | global batch size: 256 | lm loss: 2.125288E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.800 | TFLOPs: 40.62 | 15: iteration 28040/ 125429 | consumed samples: 7178240 | consumed tokens: 14701035520 | elapsed time per iteration (s): 1.03 | learning rate: 1.801E-04 | global batch size: 256 | lm loss: 2.121102E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.426 | TFLOPs: 41.05 | 15: iteration 28050/ 125429 | consumed samples: 7180800 | consumed tokens: 14706278400 | elapsed time per iteration (s): 1.03 | learning rate: 1.801E-04 | global batch size: 256 | lm loss: 2.091641E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.650 | TFLOPs: 41.26 | 15: iteration 28060/ 125429 | consumed samples: 7183360 | consumed tokens: 14711521280 | elapsed time per iteration (s): 1.08 | learning 
rate: 1.801E-04 | global batch size: 256 | lm loss: 2.102500E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.908 | TFLOPs: 39.32 | 15: iteration 28070/ 125429 | consumed samples: 7185920 | consumed tokens: 14716764160 | elapsed time per iteration (s): 1.03 | learning rate: 1.801E-04 | global batch size: 256 | lm loss: 2.141679E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.491 | TFLOPs: 41.07 | 15: iteration 28080/ 125429 | consumed samples: 7188480 | consumed tokens: 14722007040 | elapsed time per iteration (s): 1.07 | learning rate: 1.801E-04 | global batch size: 256 | lm loss: 2.106196E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.375 | TFLOPs: 39.56 | 15: iteration 28090/ 125429 | consumed samples: 7191040 | consumed tokens: 14727249920 | elapsed time per iteration (s): 1.03 | learning rate: 1.800E-04 | global batch size: 256 | lm loss: 2.109347E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.160 | TFLOPs: 41.18 | 15: iteration 28100/ 125429 | consumed samples: 7193600 | consumed tokens: 14732492800 | elapsed time per iteration (s): 1.13 | learning rate: 1.800E-04 | global batch size: 256 | lm loss: 2.126089E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.699 | TFLOPs: 37.46 | 15: iteration 28110/ 125429 | consumed samples: 7196160 | consumed tokens: 14737735680 | elapsed time per iteration (s): 1.10 | learning rate: 1.800E-04 | global batch size: 256 | lm loss: 2.109897E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.183 | TFLOPs: 38.37 | 15: iteration 28120/ 125429 | 
consumed samples: 7198720 | consumed tokens: 14742978560 | elapsed time per iteration (s): 1.06 | learning rate: 1.800E-04 | global batch size: 256 | lm loss: 2.107143E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.578 | TFLOPs: 40.09 | 15: iteration 28130/ 125429 | consumed samples: 7201280 | consumed tokens: 14748221440 | elapsed time per iteration (s): 1.04 | learning rate: 1.800E-04 | global batch size: 256 | lm loss: 2.133505E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.648 | TFLOPs: 40.60 | 15: iteration 28140/ 125429 | consumed samples: 7203840 | consumed tokens: 14753464320 | elapsed time per iteration (s): 1.04 | learning rate: 1.800E-04 | global batch size: 256 | lm loss: 2.122307E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.316 | TFLOPs: 40.71 | 15: iteration 28150/ 125429 | consumed samples: 7206400 | consumed tokens: 14758707200 | elapsed time per iteration (s): 1.04 | learning rate: 1.800E-04 | global batch size: 256 | lm loss: 2.116822E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.035 | TFLOPs: 40.66 | 15: iteration 28160/ 125429 | consumed samples: 7208960 | consumed tokens: 14763950080 | elapsed time per iteration (s): 1.05 | learning rate: 1.799E-04 | global batch size: 256 | lm loss: 2.086664E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.021 | TFLOPs: 40.33 | 15: iteration 28170/ 125429 | consumed samples: 7211520 | consumed tokens: 14769192960 | elapsed time per iteration (s): 1.03 | learning rate: 1.799E-04 | global batch size: 256 | lm loss: 2.102126E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 248.146 | TFLOPs: 41.01 | 15: iteration 28180/ 125429 | consumed samples: 7214080 | consumed tokens: 14774435840 | elapsed time per iteration (s): 1.03 | learning rate: 1.799E-04 | global batch size: 256 | lm loss: 2.088872E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.187 | TFLOPs: 41.01 | 15: iteration 28190/ 125429 | consumed samples: 7216640 | consumed tokens: 14779678720 | elapsed time per iteration (s): 1.04 | learning rate: 1.799E-04 | global batch size: 256 | lm loss: 2.110443E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.748 | TFLOPs: 40.78 | 15: iteration 28200/ 125429 | consumed samples: 7219200 | consumed tokens: 14784921600 | elapsed time per iteration (s): 1.08 | learning rate: 1.799E-04 | global batch size: 256 | lm loss: 2.075525E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.132 | TFLOPs: 39.35 | 15: iteration 28210/ 125429 | consumed samples: 7221760 | consumed tokens: 14790164480 | elapsed time per iteration (s): 1.08 | learning rate: 1.799E-04 | global batch size: 256 | lm loss: 2.091373E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.145 | TFLOPs: 39.02 | 15: iteration 28220/ 125429 | consumed samples: 7224320 | consumed tokens: 14795407360 | elapsed time per iteration (s): 1.05 | learning rate: 1.799E-04 | global batch size: 256 | lm loss: 2.123273E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.544 | TFLOPs: 40.25 | 15: iteration 28230/ 125429 | consumed samples: 7226880 | consumed tokens: 14800650240 | elapsed time per iteration (s): 1.05 | learning rate: 1.798E-04 | global batch size: 
256 | lm loss: 2.094686E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.658 | TFLOPs: 40.43 | 15: iteration 28240/ 125429 | consumed samples: 7229440 | consumed tokens: 14805893120 | elapsed time per iteration (s): 1.02 | learning rate: 1.798E-04 | global batch size: 256 | lm loss: 2.095957E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.433 | TFLOPs: 41.39 | 15: iteration 28250/ 125429 | consumed samples: 7232000 | consumed tokens: 14811136000 | elapsed time per iteration (s): 1.06 | learning rate: 1.798E-04 | global batch size: 256 | lm loss: 2.104182E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.085 | TFLOPs: 40.01 | 15: iteration 28260/ 125429 | consumed samples: 7234560 | consumed tokens: 14816378880 | elapsed time per iteration (s): 1.06 | learning rate: 1.798E-04 | global batch size: 256 | lm loss: 2.099113E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.492 | TFLOPs: 39.91 | 15: iteration 28270/ 125429 | consumed samples: 7237120 | consumed tokens: 14821621760 | elapsed time per iteration (s): 1.03 | learning rate: 1.798E-04 | global batch size: 256 | lm loss: 2.088274E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.121 | TFLOPs: 41.17 | 15: iteration 28280/ 125429 | consumed samples: 7239680 | consumed tokens: 14826864640 | elapsed time per iteration (s): 1.06 | learning rate: 1.798E-04 | global batch size: 256 | lm loss: 2.110562E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.644 | TFLOPs: 40.10 | 15: iteration 28290/ 125429 | consumed samples: 7242240 | consumed 
tokens: 14832107520 | elapsed time per iteration (s): 1.03 | learning rate: 1.798E-04 | global batch size: 256 | lm loss: 2.117065E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.752 | TFLOPs: 41.11 | 15: iteration 28300/ 125429 | consumed samples: 7244800 | consumed tokens: 14837350400 | elapsed time per iteration (s): 1.04 | learning rate: 1.797E-04 | global batch size: 256 | lm loss: 2.099165E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.069 | TFLOPs: 40.83 | 15: iteration 28310/ 125429 | consumed samples: 7247360 | consumed tokens: 14842593280 | elapsed time per iteration (s): 1.02 | learning rate: 1.797E-04 | global batch size: 256 | lm loss: 2.130144E+00 | grad norm: 0.259 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.210 | TFLOPs: 41.35 | 15: iteration 28320/ 125429 | consumed samples: 7249920 | consumed tokens: 14847836160 | elapsed time per iteration (s): 1.03 | learning rate: 1.797E-04 | global batch size: 256 | lm loss: 2.156564E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.480 | TFLOPs: 41.23 | 15: iteration 28330/ 125429 | consumed samples: 7252480 | consumed tokens: 14853079040 | elapsed time per iteration (s): 1.03 | learning rate: 1.797E-04 | global batch size: 256 | lm loss: 2.127564E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.365 | TFLOPs: 40.88 | 15: iteration 28340/ 125429 | consumed samples: 7255040 | consumed tokens: 14858321920 | elapsed time per iteration (s): 1.04 | learning rate: 1.797E-04 | global batch size: 256 | lm loss: 2.110826E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 245.140 | TFLOPs: 40.51 | 15: iteration 28350/ 125429 | consumed samples: 7257600 | consumed tokens: 14863564800 | elapsed time per iteration (s): 1.06 | learning rate: 1.797E-04 | global batch size: 256 | lm loss: 2.093399E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.352 | TFLOPs: 40.05 | 15: iteration 28360/ 125429 | consumed samples: 7260160 | consumed tokens: 14868807680 | elapsed time per iteration (s): 1.04 | learning rate: 1.797E-04 | global batch size: 256 | lm loss: 2.113353E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.982 | TFLOPs: 40.49 | 15: iteration 28370/ 125429 | consumed samples: 7262720 | consumed tokens: 14874050560 | elapsed time per iteration (s): 1.09 | learning rate: 1.796E-04 | global batch size: 256 | lm loss: 2.120353E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.606 | TFLOPs: 38.77 | 15: iteration 28380/ 125429 | consumed samples: 7265280 | consumed tokens: 14879293440 | elapsed time per iteration (s): 1.05 | learning rate: 1.796E-04 | global batch size: 256 | lm loss: 2.086377E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.131 | TFLOPs: 40.18 | 15: iteration 28390/ 125429 | consumed samples: 7267840 | consumed tokens: 14884536320 | elapsed time per iteration (s): 1.07 | learning rate: 1.796E-04 | global batch size: 256 | lm loss: 2.098196E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.686 | TFLOPs: 39.61 | 15: iteration 28400/ 125429 | consumed samples: 7270400 | consumed tokens: 14889779200 | elapsed time per iteration (s): 1.03 | learning rate: 1.796E-04 | global batch size: 256 | lm loss: 2.123686E+00 | grad norm: 
0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.888 | TFLOPs: 40.97 | 15: iteration 28410/ 125429 | consumed samples: 7272960 | consumed tokens: 14895022080 | elapsed time per iteration (s): 1.03 | learning rate: 1.796E-04 | global batch size: 256 | lm loss: 2.102688E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.107 | TFLOPs: 41.17 | 15: iteration 28420/ 125429 | consumed samples: 7275520 | consumed tokens: 14900264960 | elapsed time per iteration (s): 1.07 | learning rate: 1.796E-04 | global batch size: 256 | lm loss: 2.130328E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.620 | TFLOPs: 39.43 | 15: iteration 28430/ 125429 | consumed samples: 7278080 | consumed tokens: 14905507840 | elapsed time per iteration (s): 1.05 | learning rate: 1.796E-04 | global batch size: 256 | lm loss: 2.079810E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.316 | TFLOPs: 40.21 | 15: iteration 28440/ 125429 | consumed samples: 7280640 | consumed tokens: 14910750720 | elapsed time per iteration (s): 1.03 | learning rate: 1.795E-04 | global batch size: 256 | lm loss: 2.094146E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.336 | TFLOPs: 41.04 | 15: iteration 28450/ 125429 | consumed samples: 7283200 | consumed tokens: 14915993600 | elapsed time per iteration (s): 1.05 | learning rate: 1.795E-04 | global batch size: 256 | lm loss: 2.079548E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.392 | TFLOPs: 40.22 | 15: iteration 28460/ 125429 | consumed samples: 7285760 | consumed tokens: 14921236480 | elapsed time per 
iteration (s): 1.05 | learning rate: 1.795E-04 | global batch size: 256 | lm loss: 2.099893E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.477 | TFLOPs: 40.40 | 15: iteration 28470/ 125429 | consumed samples: 7288320 | consumed tokens: 14926479360 | elapsed time per iteration (s): 1.05 | learning rate: 1.795E-04 | global batch size: 256 | lm loss: 2.066477E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.070 | TFLOPs: 40.17 | 15: iteration 28480/ 125429 | consumed samples: 7290880 | consumed tokens: 14931722240 | elapsed time per iteration (s): 1.07 | learning rate: 1.795E-04 | global batch size: 256 | lm loss: 2.108781E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.829 | TFLOPs: 39.47 | 15: iteration 28490/ 125429 | consumed samples: 7293440 | consumed tokens: 14936965120 | elapsed time per iteration (s): 1.03 | learning rate: 1.795E-04 | global batch size: 256 | lm loss: 2.106778E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.678 | TFLOPs: 41.10 | 15: iteration 28500/ 125429 | consumed samples: 7296000 | consumed tokens: 14942208000 | elapsed time per iteration (s): 1.07 | learning rate: 1.795E-04 | global batch size: 256 | lm loss: 2.114496E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.757 | TFLOPs: 39.62 | 15: iteration 28510/ 125429 | consumed samples: 7298560 | consumed tokens: 14947450880 | elapsed time per iteration (s): 1.03 | learning rate: 1.794E-04 | global batch size: 256 | lm loss: 2.109249E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.693 | TFLOPs: 41.26 | 15: 
iteration 28520/ 125429 | consumed samples: 7301120 | consumed tokens: 14952693760 | elapsed time per iteration (s): 1.05 | learning rate: 1.794E-04 | global batch size: 256 | lm loss: 2.133848E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.794 | TFLOPs: 40.12 | 15: iteration 28530/ 125429 | consumed samples: 7303680 | consumed tokens: 14957936640 | elapsed time per iteration (s): 1.04 | learning rate: 1.794E-04 | global batch size: 256 | lm loss: 2.096453E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.674 | TFLOPs: 40.60 | 15: iteration 28540/ 125429 | consumed samples: 7306240 | consumed tokens: 14963179520 | elapsed time per iteration (s): 1.03 | learning rate: 1.794E-04 | global batch size: 256 | lm loss: 2.117260E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.720 | TFLOPs: 41.10 | 15: iteration 28550/ 125429 | consumed samples: 7308800 | consumed tokens: 14968422400 | elapsed time per iteration (s): 1.06 | learning rate: 1.794E-04 | global batch size: 256 | lm loss: 2.113021E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.575 | TFLOPs: 39.92 | 15: iteration 28560/ 125429 | consumed samples: 7311360 | consumed tokens: 14973665280 | elapsed time per iteration (s): 1.04 | learning rate: 1.794E-04 | global batch size: 256 | lm loss: 2.111694E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.786 | TFLOPs: 40.78 | 15: iteration 28570/ 125429 | consumed samples: 7313920 | consumed tokens: 14978908160 | elapsed time per iteration (s): 1.08 | learning rate: 1.794E-04 | global batch size: 256 | lm loss: 2.086056E+00 | grad norm: 0.139 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.148 | TFLOPs: 39.03 | 15: iteration 28580/ 125429 | consumed samples: 7316480 | consumed tokens: 14984151040 | elapsed time per iteration (s): 1.05 | learning rate: 1.793E-04 | global batch size: 256 | lm loss: 2.108265E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.856 | TFLOPs: 40.46 | 15: iteration 28590/ 125429 | consumed samples: 7319040 | consumed tokens: 14989393920 | elapsed time per iteration (s): 1.08 | learning rate: 1.793E-04 | global batch size: 256 | lm loss: 2.098419E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.958 | TFLOPs: 39.16 | 15: iteration 28600/ 125429 | consumed samples: 7321600 | consumed tokens: 14994636800 | elapsed time per iteration (s): 1.02 | learning rate: 1.793E-04 | global batch size: 256 | lm loss: 2.118036E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.825 | TFLOPs: 41.45 | 15: iteration 28610/ 125429 | consumed samples: 7324160 | consumed tokens: 14999879680 | elapsed time per iteration (s): 1.08 | learning rate: 1.793E-04 | global batch size: 256 | lm loss: 2.103966E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.478 | TFLOPs: 39.08 | 15: iteration 28620/ 125429 | consumed samples: 7326720 | consumed tokens: 15005122560 | elapsed time per iteration (s): 1.06 | learning rate: 1.793E-04 | global batch size: 256 | lm loss: 2.100052E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.618 | TFLOPs: 39.76 | 15: iteration 28630/ 125429 | consumed samples: 7329280 | consumed tokens: 15010365440 | elapsed time per iteration (s): 1.08 | learning rate: 
1.793E-04 | global batch size: 256 | lm loss: 2.102524E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.948 | TFLOPs: 39.32 | 15: iteration 28640/ 125429 | consumed samples: 7331840 | consumed tokens: 15015608320 | elapsed time per iteration (s): 1.04 | learning rate: 1.792E-04 | global batch size: 256 | lm loss: 2.115388E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.862 | TFLOPs: 40.63 | 15: iteration 28650/ 125429 | consumed samples: 7334400 | consumed tokens: 15020851200 | elapsed time per iteration (s): 1.05 | learning rate: 1.792E-04 | global batch size: 256 | lm loss: 2.089507E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.305 | TFLOPs: 40.37 | 15: iteration 28660/ 125429 | consumed samples: 7336960 | consumed tokens: 15026094080 | elapsed time per iteration (s): 1.06 | learning rate: 1.792E-04 | global batch size: 256 | lm loss: 2.118946E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.537 | TFLOPs: 40.08 | 15: iteration 28670/ 125429 | consumed samples: 7339520 | consumed tokens: 15031336960 | elapsed time per iteration (s): 1.03 | learning rate: 1.792E-04 | global batch size: 256 | lm loss: 2.100996E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.040 | TFLOPs: 41.16 | 15: iteration 28680/ 125429 | consumed samples: 7342080 | consumed tokens: 15036579840 | elapsed time per iteration (s): 1.04 | learning rate: 1.792E-04 | global batch size: 256 | lm loss: 2.111712E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.171 | TFLOPs: 40.52 | 15: iteration 28690/ 125429 | consumed 
samples: 7344640 | consumed tokens: 15041822720 | elapsed time per iteration (s): 1.04 | learning rate: 1.792E-04 | global batch size: 256 | lm loss: 2.090906E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.369 | TFLOPs: 40.71 | 15: iteration 28700/ 125429 | consumed samples: 7347200 | consumed tokens: 15047065600 | elapsed time per iteration (s): 1.04 | learning rate: 1.792E-04 | global batch size: 256 | lm loss: 2.117532E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.284 | TFLOPs: 40.87 | 15: iteration 28710/ 125429 | consumed samples: 7349760 | consumed tokens: 15052308480 | elapsed time per iteration (s): 1.03 | learning rate: 1.791E-04 | global batch size: 256 | lm loss: 2.090689E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.240 | TFLOPs: 41.19 | 15: iteration 28720/ 125429 | consumed samples: 7352320 | consumed tokens: 15057551360 | elapsed time per iteration (s): 1.03 | learning rate: 1.791E-04 | global batch size: 256 | lm loss: 2.111246E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.176 | TFLOPs: 41.01 | 15: iteration 28730/ 125429 | consumed samples: 7354880 | consumed tokens: 15062794240 | elapsed time per iteration (s): 1.08 | learning rate: 1.791E-04 | global batch size: 256 | lm loss: 2.121685E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.475 | TFLOPs: 39.24 | 15: iteration 28740/ 125429 | consumed samples: 7357440 | consumed tokens: 15068037120 | elapsed time per iteration (s): 1.04 | learning rate: 1.791E-04 | global batch size: 256 | lm loss: 2.130564E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 245.431 | TFLOPs: 40.56 | 15: iteration 28750/ 125429 | consumed samples: 7360000 | consumed tokens: 15073280000 | elapsed time per iteration (s): 1.02 | learning rate: 1.791E-04 | global batch size: 256 | lm loss: 2.125281E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.810 | TFLOPs: 41.45 | 15: iteration 28760/ 125429 | consumed samples: 7362560 | consumed tokens: 15078522880 | elapsed time per iteration (s): 1.07 | learning rate: 1.791E-04 | global batch size: 256 | lm loss: 2.108389E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.292 | TFLOPs: 39.71 | 15: iteration 28770/ 125429 | consumed samples: 7365120 | consumed tokens: 15083765760 | elapsed time per iteration (s): 1.04 | learning rate: 1.791E-04 | global batch size: 256 | lm loss: 2.104042E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.857 | TFLOPs: 40.63 | 15: iteration 28780/ 125429 | consumed samples: 7367680 | consumed tokens: 15089008640 | elapsed time per iteration (s): 1.05 | learning rate: 1.790E-04 | global batch size: 256 | lm loss: 2.110548E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.089 | TFLOPs: 40.34 | 15: iteration 28790/ 125429 | consumed samples: 7370240 | consumed tokens: 15094251520 | elapsed time per iteration (s): 1.04 | learning rate: 1.790E-04 | global batch size: 256 | lm loss: 2.100785E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.651 | TFLOPs: 40.76 | 15: iteration 28800/ 125429 | consumed samples: 7372800 | consumed tokens: 15099494400 | elapsed time per iteration (s): 1.03 | learning rate: 1.790E-04 | global batch size: 256 | lm 
loss: 2.106426E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.539 | TFLOPs: 40.91 | 15: iteration 28810/ 125429 | consumed samples: 7375360 | consumed tokens: 15104737280 | elapsed time per iteration (s): 1.07 | learning rate: 1.790E-04 | global batch size: 256 | lm loss: 2.092596E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.158 | TFLOPs: 39.69 | 15: iteration 28820/ 125429 | consumed samples: 7377920 | consumed tokens: 15109980160 | elapsed time per iteration (s): 1.04 | learning rate: 1.790E-04 | global batch size: 256 | lm loss: 2.081631E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.923 | TFLOPs: 40.64 | 15: iteration 28830/ 125429 | consumed samples: 7380480 | consumed tokens: 15115223040 | elapsed time per iteration (s): 1.04 | learning rate: 1.790E-04 | global batch size: 256 | lm loss: 2.108693E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.884 | TFLOPs: 40.63 | 15: iteration 28840/ 125429 | consumed samples: 7383040 | consumed tokens: 15120465920 | elapsed time per iteration (s): 1.09 | learning rate: 1.790E-04 | global batch size: 256 | lm loss: 2.086043E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.659 | TFLOPs: 38.94 | 15: iteration 28850/ 125429 | consumed samples: 7385600 | consumed tokens: 15125708800 | elapsed time per iteration (s): 1.09 | learning rate: 1.789E-04 | global batch size: 256 | lm loss: 2.114319E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.994 | TFLOPs: 38.67 | 15: iteration 28860/ 125429 | consumed samples: 7388160 | consumed tokens: 
15130951680 | elapsed time per iteration (s): 1.05 | learning rate: 1.789E-04 | global batch size: 256 | lm loss: 2.098762E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.802 | TFLOPs: 40.12 | 15: iteration 28870/ 125429 | consumed samples: 7390720 | consumed tokens: 15136194560 | elapsed time per iteration (s): 1.13 | learning rate: 1.789E-04 | global batch size: 256 | lm loss: 2.096677E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.569 | TFLOPs: 37.44 | 15: iteration 28880/ 125429 | consumed samples: 7393280 | consumed tokens: 15141437440 | elapsed time per iteration (s): 1.03 | learning rate: 1.789E-04 | global batch size: 256 | lm loss: 2.109917E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.870 | TFLOPs: 41.13 | 15: iteration 28890/ 125429 | consumed samples: 7395840 | consumed tokens: 15146680320 | elapsed time per iteration (s): 1.04 | learning rate: 1.789E-04 | global batch size: 256 | lm loss: 2.109717E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.652 | TFLOPs: 40.60 | 15: iteration 28900/ 125429 | consumed samples: 7398400 | consumed tokens: 15151923200 | elapsed time per iteration (s): 1.03 | learning rate: 1.789E-04 | global batch size: 256 | lm loss: 2.105301E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.642 | TFLOPs: 40.92 | 15: iteration 28910/ 125429 | consumed samples: 7400960 | consumed tokens: 15157166080 | elapsed time per iteration (s): 1.06 | learning rate: 1.789E-04 | global batch size: 256 | lm loss: 2.110628E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
242.070 | TFLOPs: 40.00 | 15: iteration 28920/ 125429 | consumed samples: 7403520 | consumed tokens: 15162408960 | elapsed time per iteration (s): 1.06 | learning rate: 1.788E-04 | global batch size: 256 | lm loss: 2.119038E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.327 | TFLOPs: 39.88 | 15: iteration 28930/ 125429 | consumed samples: 7406080 | consumed tokens: 15167651840 | elapsed time per iteration (s): 1.09 | learning rate: 1.788E-04 | global batch size: 256 | lm loss: 2.100702E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.171 | TFLOPs: 38.86 | 15: iteration 28940/ 125429 | consumed samples: 7408640 | consumed tokens: 15172894720 | elapsed time per iteration (s): 1.09 | learning rate: 1.788E-04 | global batch size: 256 | lm loss: 2.122678E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.914 | TFLOPs: 38.82 | 15: iteration 28950/ 125429 | consumed samples: 7411200 | consumed tokens: 15178137600 | elapsed time per iteration (s): 1.05 | learning rate: 1.788E-04 | global batch size: 256 | lm loss: 2.120676E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.183 | TFLOPs: 40.35 | 15: iteration 28960/ 125429 | consumed samples: 7413760 | consumed tokens: 15183380480 | elapsed time per iteration (s): 1.03 | learning rate: 1.788E-04 | global batch size: 256 | lm loss: 2.075788E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.648 | TFLOPs: 40.93 | 15: iteration 28970/ 125429 | consumed samples: 7416320 | consumed tokens: 15188623360 | elapsed time per iteration (s): 1.05 | learning rate: 1.788E-04 | global batch size: 256 | lm loss: 2.136416E+00 | grad norm: 0.129 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.276 | TFLOPs: 40.37 | 15: iteration 28980/ 125429 | consumed samples: 7418880 | consumed tokens: 15193866240 | elapsed time per iteration (s): 1.05 | learning rate: 1.788E-04 | global batch size: 256 | lm loss: 2.130546E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.972 | TFLOPs: 40.32 | 15: iteration 28990/ 125429 | consumed samples: 7421440 | consumed tokens: 15199109120 | elapsed time per iteration (s): 1.09 | learning rate: 1.787E-04 | global batch size: 256 | lm loss: 2.109649E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.872 | TFLOPs: 38.98 | 15: iteration 29000/ 125429 | consumed samples: 7424000 | consumed tokens: 15204352000 | elapsed time per iteration (s): 1.07 | learning rate: 1.787E-04 | global batch size: 256 | lm loss: 2.087017E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.765 | TFLOPs: 39.46 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 29000 | lm loss value: 2.160081E+00 | lm loss PPL: 8.671843E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 29000 to checkpoints_1b5 0: [2022-11-26 04:28:51,743] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step29000 is begin to save! 0: [2022-11-26 04:28:51,752] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_01-model_00-model_states.pt... 0: [2022-11-26 04:28:51,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_01-model_00-model_states.pt. 
0: [2022-11-26 04:28:51,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_03-model_00-model_states.pt... 0: [2022-11-26 04:28:52,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_03-model_00-model_states.pt. 0: [2022-11-26 04:28:52,104] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_04-model_00-model_states.pt... 0: [2022-11-26 04:28:52,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_04-model_00-model_states.pt. 0: [2022-11-26 04:28:52,217] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_05-model_00-model_states.pt... 0: [2022-11-26 04:28:52,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_05-model_00-model_states.pt. 0: [2022-11-26 04:28:52,329] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_06-model_00-model_states.pt... 0: [2022-11-26 04:28:52,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_06-model_00-model_states.pt. 0: [2022-11-26 04:28:52,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_07-model_00-model_states.pt... 0: [2022-11-26 04:28:52,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_07-model_00-model_states.pt. 0: [2022-11-26 04:28:52,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_08-model_00-model_states.pt... 0: [2022-11-26 04:28:52,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_08-model_00-model_states.pt. 
0: [2022-11-26 04:28:52,653] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_09-model_00-model_states.pt... 0: [2022-11-26 04:28:52,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_09-model_00-model_states.pt. 0: [2022-11-26 04:28:52,759] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_10-model_00-model_states.pt... 0: [2022-11-26 04:28:52,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_10-model_00-model_states.pt. 0: [2022-11-26 04:28:52,870] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_11-model_00-model_states.pt... 0: [2022-11-26 04:28:52,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_11-model_00-model_states.pt. 0: [2022-11-26 04:28:52,977] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_12-model_00-model_states.pt... 0: [2022-11-26 04:28:53,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_12-model_00-model_states.pt. 0: [2022-11-26 04:28:53,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_13-model_00-model_states.pt... 0: [2022-11-26 04:28:53,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_13-model_00-model_states.pt. 0: [2022-11-26 04:28:53,192] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_14-model_00-model_states.pt... 0: [2022-11-26 04:28:53,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_14-model_00-model_states.pt. 
0: [2022-11-26 04:28:53,298] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_15-model_00-model_states.pt... 0: [2022-11-26 04:28:53,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_15-model_00-model_states.pt. 0: [2022-11-26 04:28:53,402] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_16-model_00-model_states.pt... 0: [2022-11-26 04:28:53,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_16-model_00-model_states.pt. 0: [2022-11-26 04:28:53,511] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_17-model_00-model_states.pt... 0: [2022-11-26 04:28:53,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_17-model_00-model_states.pt. 0: [2022-11-26 04:28:53,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_18-model_00-model_states.pt... 0: [2022-11-26 04:28:53,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_18-model_00-model_states.pt. 0: [2022-11-26 04:28:53,717] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_19-model_00-model_states.pt... 0: [2022-11-26 04:28:53,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_19-model_00-model_states.pt. 0: [2022-11-26 04:28:53,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_20-model_00-model_states.pt... 0: [2022-11-26 04:28:53,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_20-model_00-model_states.pt. 
0: [2022-11-26 04:28:53,935] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_21-model_00-model_states.pt... 0: [2022-11-26 04:28:54,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_21-model_00-model_states.pt. 0: [2022-11-26 04:28:54,040] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_22-model_00-model_states.pt... 0: [2022-11-26 04:28:54,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_22-model_00-model_states.pt. 0: [2022-11-26 04:28:54,146] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_23-model_00-model_states.pt... 0: [2022-11-26 04:28:54,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_23-model_00-model_states.pt. 0: [2022-11-26 04:28:54,253] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_24-model_00-model_states.pt... 0: [2022-11-26 04:28:54,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_24-model_00-model_states.pt. 0: [2022-11-26 04:28:54,359] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_25-model_00-model_states.pt... 0: [2022-11-26 04:28:54,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_25-model_00-model_states.pt. 0: [2022-11-26 04:28:54,466] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_26-model_00-model_states.pt... 0: [2022-11-26 04:28:54,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_26-model_00-model_states.pt. 
0: [2022-11-26 04:28:54,574] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_27-model_00-model_states.pt... 0: [2022-11-26 04:28:54,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_27-model_00-model_states.pt. 0: [2022-11-26 04:28:54,678] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_28-model_00-model_states.pt... 0: [2022-11-26 04:28:54,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_28-model_00-model_states.pt. 0: [2022-11-26 04:28:54,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_29-model_00-model_states.pt... 0: [2022-11-26 04:28:54,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_29-model_00-model_states.pt. 0: [2022-11-26 04:28:54,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_30-model_00-model_states.pt... 0: [2022-11-26 04:28:54,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_30-model_00-model_states.pt. 0: [2022-11-26 04:28:54,996] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/layer_32-model_00-model_states.pt... 0: [2022-11-26 04:28:55,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/layer_32-model_00-model_states.pt. 0: [2022-11-26 04:28:55,002] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step29000/mp_rank_00_model_states.pt 0: [2022-11-26 04:28:55,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/mp_rank_00_model_states.pt... 
0: [2022-11-26 04:28:55,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/mp_rank_00_model_states.pt. 0: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
0: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:28:55,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:28:55,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
15: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
9: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
6: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
12: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
5: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
6: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
11: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:28:55,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:28:55,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:28:55,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:28:55,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:28:55,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:28:55,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:28:55,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step29000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:28:55,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:28:55,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 04:28:55,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 04:28:55,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 04:28:55,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:28:55,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 04:28:55,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 04:28:55,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:28:55,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 04:28:55,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 15: [2022-11-26 04:28:55,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:28:55,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 9: [2022-11-26 04:28:55,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:28:55,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
9: [2022-11-26 04:28:55,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 04:28:55,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 9: [2022-11-26 04:28:55,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:28:55,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 04:28:55,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 15: [2022-11-26 04:28:55,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:28:55,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:28:55,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 04:28:55,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 04:28:55,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 15: [2022-11-26 04:28:55,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 13: [2022-11-26 04:28:55,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 04:28:55,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 04:28:55,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 04:28:55,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:28:55,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:28:55,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 04:28:55,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 04:28:55,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 04:28:55,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 04:28:55,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:28:55,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:28:55,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 04:28:55,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 04:28:55,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 04:28:55,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 04:28:55,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 04:28:55,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 12: [2022-11-26 04:28:55,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:28:55,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 12: [2022-11-26 04:28:55,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 04:28:55,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 15: [2022-11-26 04:28:55,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:28:55,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 04:28:55,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
2: [2022-11-26 04:28:55,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:28:55,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 04:28:55,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 9: [2022-11-26 04:28:55,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:28:55,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 04:28:55,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 12: [2022-11-26 04:28:55,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:28:55,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 04:28:55,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 12: [2022-11-26 04:28:55,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:28:55,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 04:28:55,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
3: [2022-11-26 04:28:55,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:28:55,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 04:28:55,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 9: [2022-11-26 04:28:55,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:28:55,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 04:28:55,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 04:28:55,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:28:55,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 04:28:55,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 10: [2022-11-26 04:28:55,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:28:55,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 04:28:55,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
3: [2022-11-26 04:28:55,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:28:55,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 04:28:55,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 04:28:55,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:28:55,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 04:28:55,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 04:28:55,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:28:55,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 04:28:55,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 04:28:55,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:28:55,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 04:28:55,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
12: [2022-11-26 04:28:55,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:28:55,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 04:28:55,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 04:28:55,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:28:55,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 04:28:55,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 04:28:55,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:28:55,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 04:28:55,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 11: [2022-11-26 04:28:55,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:28:55,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:28:55,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 04:28:55,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:28:55,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 04:28:55,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 04:28:55,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 04:28:55,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 14: [2022-11-26 04:28:55,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 14: [2022-11-26 04:28:55,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 14: [2022-11-26 04:28:55,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:28:55,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 04:28:55,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:28:55,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
14: [2022-11-26 04:28:55,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 04:28:55,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 10: [2022-11-26 04:28:55,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:28:55,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 04:28:55,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 15: [2022-11-26 04:28:55,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:28:55,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:28:55,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 04:28:55,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: [2022-11-26 04:28:55,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:28:55,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
0: [2022-11-26 04:28:55,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 04:28:55,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:28:55,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 0: [2022-11-26 04:28:55,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: [2022-11-26 04:28:55,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 7: [2022-11-26 04:28:55,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: [2022-11-26 04:28:55,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: [2022-11-26 04:28:55,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:28:55,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 04:28:55,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 04:28:55,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 04:28:55,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 04:28:55,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 04:28:55,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:28:55,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 04:28:55,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:28:55,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 04:28:55,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 04:28:55,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 10: [2022-11-26 04:28:55,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:28:55,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 10: [2022-11-26 04:28:55,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 15: [2022-11-26 04:28:55,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
15: [2022-11-26 04:28:55,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:28:55,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 15: [2022-11-26 04:28:55,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 04:28:55,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 15: [2022-11-26 04:28:55,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:28:55,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 04:28:55,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 12: [2022-11-26 04:28:55,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:28:55,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:28:55,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 12: [2022-11-26 04:28:55,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 04:28:55,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
10: [2022-11-26 04:28:55,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 04:28:55,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:28:55,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 04:28:55,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 9: [2022-11-26 04:28:55,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:28:55,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:28:55,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 04:28:55,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 04:28:55,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 9: [2022-11-26 04:28:55,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 12: [2022-11-26 04:28:55,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 04:28:55,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 04:28:55,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 04:28:55,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:28:55,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:28:55,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:28:55,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:28:55,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 04:28:55,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 04:28:55,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:28:55,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 04:28:55,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 04:28:55,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
4: [2022-11-26 04:28:55,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 04:28:55,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 04:28:55,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 04:28:55,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 04:28:55,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 9: [2022-11-26 04:28:55,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:28:55,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:28:55,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 9: [2022-11-26 04:28:55,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 14: [2022-11-26 04:28:55,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 9: [2022-11-26 04:28:55,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 15: [2022-11-26 04:28:55,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 04:28:55,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 9: [2022-11-26 04:28:55,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:28:55,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 9: [2022-11-26 04:28:55,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 04:28:55,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 5: [2022-11-26 04:28:55,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:28:55,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 04:28:55,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: [2022-11-26 04:28:55,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:28:55,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 04:28:55,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 13: [2022-11-26 04:28:55,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
8: [2022-11-26 04:28:55,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:28:55,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 04:28:55,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 13: [2022-11-26 04:28:55,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:28:55,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 04:28:55,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 13: [2022-11-26 04:28:55,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:28:55,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 04:28:55,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 13: [2022-11-26 04:28:55,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:28:55,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 04:28:55,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
8: [2022-11-26 04:28:55,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 04:28:55,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 8: [2022-11-26 04:28:55,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:28:55,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 04:28:55,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 8: [2022-11-26 04:28:55,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:28:55,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 04:28:55,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 8: [2022-11-26 04:28:55,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:28:55,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 04:28:55,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 04:28:55,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 04:28:55,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 04:28:55,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 04:28:55,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:28:55,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 04:28:55,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 12: [2022-11-26 04:28:55,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:28:55,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 04:28:55,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 04:28:55,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:28:55,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 04:28:55,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 10: [2022-11-26 04:28:55,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 04:28:55,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 04:28:55,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 10: [2022-11-26 04:28:55,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:28:55,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 04:28:55,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 10: [2022-11-26 04:28:55,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:28:55,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 04:28:55,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 04:28:55,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:28:55,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 04:28:55,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 3: [2022-11-26 04:28:55,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 04:28:55,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 04:28:55,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 04:28:55,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:28:55,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 04:28:55,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 12: [2022-11-26 04:28:55,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:28:55,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 04:28:55,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 04:28:55,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:28:55,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 04:28:55,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 10: [2022-11-26 04:28:55,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 04:28:55,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 14: [2022-11-26 04:28:55,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:28:55,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 14: [2022-11-26 04:28:55,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 04:28:55,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 14: [2022-11-26 04:28:55,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:28:55,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 04:28:55,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 4: [2022-11-26 04:28:55,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:28:55,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 04:28:55,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 04:28:55,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 04:28:55,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 04:28:55,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 11: [2022-11-26 04:28:55,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 04:28:55,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 11: [2022-11-26 04:28:55,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:28:55,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 04:28:55,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 11: [2022-11-26 04:28:55,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:28:55,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 04:28:55,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 11: [2022-11-26 04:28:55,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:28:55,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 04:28:55,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 04:28:55,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 04:28:55,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 11: [2022-11-26 04:28:55,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 11: [2022-11-26 04:28:55,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:28:55,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 04:28:55,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 2: [2022-11-26 04:28:55,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:28:55,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 04:28:55,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 13: [2022-11-26 04:28:55,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:28:55,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
13: [2022-11-26 04:28:55,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 04:28:55,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 13: [2022-11-26 04:28:55,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:28:55,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 04:28:55,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 8: [2022-11-26 04:28:55,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 04:28:55,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 8: [2022-11-26 04:28:55,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:28:55,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 04:28:55,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 8: [2022-11-26 04:28:55,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 04:28:55,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 04:28:55,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 8: [2022-11-26 04:28:55,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:28:55,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 04:28:55,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 11: [2022-11-26 04:28:55,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:28:55,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 04:28:55,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 11: [2022-11-26 04:28:55,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:28:55,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 04:28:55,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 04:28:55,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 04:28:55,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:28:55,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:28:55,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:28:55,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:28:55,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:28:55,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:28:55,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 04:28:55,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 04:28:55,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 04:28:55,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 04:28:55,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 04:28:55,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 04:28:55,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 04:28:55,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 04:28:55,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 04:28:55,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 04:28:55,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 04:28:55,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 04:28:55,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
1: [2022-11-26 04:28:55,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 04:28:55,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 04:28:55,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 1: [2022-11-26 04:28:55,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 13: [2022-11-26 04:28:55,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:28:55,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 04:28:55,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 04:28:55,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:28:55,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 04:28:55,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:28:55,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 7: [2022-11-26 04:28:55,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 04:28:55,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
0: [2022-11-26 04:28:55,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:28:55,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 04:28:55,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: [2022-11-26 04:28:55,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:28:55,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 04:28:55,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 04:28:55,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:28:55,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:28:55,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:28:55,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 04:28:55,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 04:28:55,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 04:28:55,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 04:28:55,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 04:28:55,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 04:28:55,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 04:28:55,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 04:28:55,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 04:28:55,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:28:55,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 04:28:55,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 04:28:55,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 04:28:55,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 04:28:55,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 04:28:55,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:28:55,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 04:28:55,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 6: [2022-11-26 04:28:55,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:28:55,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 04:28:55,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 0: [2022-11-26 04:28:55,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step29000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 04:28:55,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step29000 is ready now! 
0: successfully saved checkpoint at iteration 29000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3675.22 15: iteration 29010/ 125429 | consumed samples: 7426560 | consumed tokens: 15209594880 | elapsed time per iteration (s): 1.42 | learning rate: 1.787E-04 | global batch size: 256 | lm loss: 2.064002E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 180.466 | TFLOPs: 29.82 | 15: iteration 29020/ 125429 | consumed samples: 7429120 | consumed tokens: 15214837760 | elapsed time per iteration (s): 1.05 | learning rate: 1.787E-04 | global batch size: 256 | lm loss: 2.108573E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.842 | TFLOPs: 40.46 | 15: iteration 29030/ 125429 | consumed samples: 7431680 | consumed tokens: 15220080640 | elapsed time per iteration (s): 1.04 | learning rate: 1.787E-04 | global batch size: 256 | lm loss: 2.094256E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.244 | TFLOPs: 40.53 | 15: iteration 29040/ 125429 | consumed samples: 7434240 | consumed tokens: 15225323520 | elapsed time per iteration (s): 1.05 | learning rate: 1.787E-04 | global batch size: 256 | lm loss: 2.087463E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.126 | TFLOPs: 40.34 | 15: iteration 29050/ 125429 | consumed samples: 7436800 | consumed tokens: 15230566400 | elapsed time per iteration (s): 1.11 | learning rate: 1.786E-04 | global batch size: 256 | lm loss: 2.109431E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.406 | TFLOPs: 38.08 | 15: iteration 29060/ 125429 | consumed samples: 7439360 | consumed tokens: 15235809280 | elapsed time per iteration (s): 1.12 | learning 
rate: 1.786E-04 | global batch size: 256 | lm loss: 2.111696E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.101 | TFLOPs: 37.86 | 15: iteration 29070/ 125429 | consumed samples: 7441920 | consumed tokens: 15241052160 | elapsed time per iteration (s): 1.10 | learning rate: 1.786E-04 | global batch size: 256 | lm loss: 2.087704E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.304 | TFLOPs: 38.39 | 15: iteration 29080/ 125429 | consumed samples: 7444480 | consumed tokens: 15246295040 | elapsed time per iteration (s): 1.05 | learning rate: 1.786E-04 | global batch size: 256 | lm loss: 2.119616E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.453 | TFLOPs: 40.23 | 15: iteration 29090/ 125429 | consumed samples: 7447040 | consumed tokens: 15251537920 | elapsed time per iteration (s): 1.02 | learning rate: 1.786E-04 | global batch size: 256 | lm loss: 2.105943E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.627 | TFLOPs: 41.42 | 15: iteration 29100/ 125429 | consumed samples: 7449600 | consumed tokens: 15256780800 | elapsed time per iteration (s): 1.03 | learning rate: 1.786E-04 | global batch size: 256 | lm loss: 2.074810E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.561 | TFLOPs: 41.08 | 15: iteration 29110/ 125429 | consumed samples: 7452160 | consumed tokens: 15262023680 | elapsed time per iteration (s): 1.04 | learning rate: 1.786E-04 | global batch size: 256 | lm loss: 2.102863E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.565 | TFLOPs: 40.75 | 15: iteration 29120/ 125429 | 
consumed samples: 7454720 | consumed tokens: 15267266560 | elapsed time per iteration (s): 1.03 | learning rate: 1.785E-04 | global batch size: 256 | lm loss: 2.081681E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.596 | TFLOPs: 41.25 | 15: iteration 29130/ 125429 | consumed samples: 7457280 | consumed tokens: 15272509440 | elapsed time per iteration (s): 1.08 | learning rate: 1.785E-04 | global batch size: 256 | lm loss: 2.099681E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.077 | TFLOPs: 39.34 | 15: iteration 29140/ 125429 | consumed samples: 7459840 | consumed tokens: 15277752320 | elapsed time per iteration (s): 1.05 | learning rate: 1.785E-04 | global batch size: 256 | lm loss: 2.088345E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.254 | TFLOPs: 40.36 | 15: iteration 29150/ 125429 | consumed samples: 7462400 | consumed tokens: 15282995200 | elapsed time per iteration (s): 1.05 | learning rate: 1.785E-04 | global batch size: 256 | lm loss: 2.113693E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.625 | TFLOPs: 40.43 | 15: iteration 29160/ 125429 | consumed samples: 7464960 | consumed tokens: 15288238080 | elapsed time per iteration (s): 1.06 | learning rate: 1.785E-04 | global batch size: 256 | lm loss: 2.094074E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.084 | TFLOPs: 40.01 | 15: iteration 29170/ 125429 | consumed samples: 7467520 | consumed tokens: 15293480960 | elapsed time per iteration (s): 1.02 | learning rate: 1.785E-04 | global batch size: 256 | lm loss: 2.086955E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 251.040 | TFLOPs: 41.49 | 15: iteration 29180/ 125429 | consumed samples: 7470080 | consumed tokens: 15298723840 | elapsed time per iteration (s): 1.02 | learning rate: 1.785E-04 | global batch size: 256 | lm loss: 2.111037E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.577 | TFLOPs: 41.41 | 15: iteration 29190/ 125429 | consumed samples: 7472640 | consumed tokens: 15303966720 | elapsed time per iteration (s): 1.02 | learning rate: 1.784E-04 | global batch size: 256 | lm loss: 2.097838E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.738 | TFLOPs: 41.44 | 15: iteration 29200/ 125429 | consumed samples: 7475200 | consumed tokens: 15309209600 | elapsed time per iteration (s): 1.03 | learning rate: 1.784E-04 | global batch size: 256 | lm loss: 2.057191E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.457 | TFLOPs: 41.06 | 15: iteration 29210/ 125429 | consumed samples: 7477760 | consumed tokens: 15314452480 | elapsed time per iteration (s): 1.06 | learning rate: 1.784E-04 | global batch size: 256 | lm loss: 2.119886E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.456 | TFLOPs: 39.74 | 15: iteration 29220/ 125429 | consumed samples: 7480320 | consumed tokens: 15319695360 | elapsed time per iteration (s): 1.05 | learning rate: 1.784E-04 | global batch size: 256 | lm loss: 2.101518E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.726 | TFLOPs: 40.44 | 15: iteration 29230/ 125429 | consumed samples: 7482880 | consumed tokens: 15324938240 | elapsed time per iteration (s): 1.06 | learning rate: 1.784E-04 | global batch size: 
256 | lm loss: 2.073441E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.279 | TFLOPs: 40.04 | 15: iteration 29240/ 125429 | consumed samples: 7485440 | consumed tokens: 15330181120 | elapsed time per iteration (s): 1.05 | learning rate: 1.784E-04 | global batch size: 256 | lm loss: 2.094785E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.250 | TFLOPs: 40.20 | 15: iteration 29250/ 125429 | consumed samples: 7488000 | consumed tokens: 15335424000 | elapsed time per iteration (s): 1.06 | learning rate: 1.784E-04 | global batch size: 256 | lm loss: 2.075348E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.724 | TFLOPs: 39.78 | 15: iteration 29260/ 125429 | consumed samples: 7490560 | consumed tokens: 15340666880 | elapsed time per iteration (s): 1.07 | learning rate: 1.783E-04 | global batch size: 256 | lm loss: 2.089757E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.982 | TFLOPs: 39.49 | 15: iteration 29270/ 125429 | consumed samples: 7493120 | consumed tokens: 15345909760 | elapsed time per iteration (s): 1.07 | learning rate: 1.783E-04 | global batch size: 256 | lm loss: 2.101218E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.069 | TFLOPs: 39.51 | 15: iteration 29280/ 125429 | consumed samples: 7495680 | consumed tokens: 15351152640 | elapsed time per iteration (s): 1.07 | learning rate: 1.783E-04 | global batch size: 256 | lm loss: 2.109737E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.459 | TFLOPs: 39.57 | 15: iteration 29290/ 125429 | consumed samples: 7498240 | consumed 
tokens: 15356395520 | elapsed time per iteration (s): 1.11 | learning rate: 1.783E-04 | global batch size: 256 | lm loss: 2.092087E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.652 | TFLOPs: 38.12 | 15: iteration 29300/ 125429 | consumed samples: 7500800 | consumed tokens: 15361638400 | elapsed time per iteration (s): 1.02 | learning rate: 1.783E-04 | global batch size: 256 | lm loss: 2.105746E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.393 | TFLOPs: 41.38 | 15: iteration 29310/ 125429 | consumed samples: 7503360 | consumed tokens: 15366881280 | elapsed time per iteration (s): 1.08 | learning rate: 1.783E-04 | global batch size: 256 | lm loss: 2.074914E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.924 | TFLOPs: 39.32 | 15: iteration 29320/ 125429 | consumed samples: 7505920 | consumed tokens: 15372124160 | elapsed time per iteration (s): 1.04 | learning rate: 1.782E-04 | global batch size: 256 | lm loss: 2.121321E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.232 | TFLOPs: 40.86 | 15: iteration 29330/ 125429 | consumed samples: 7508480 | consumed tokens: 15377367040 | elapsed time per iteration (s): 1.06 | learning rate: 1.782E-04 | global batch size: 256 | lm loss: 2.095448E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.945 | TFLOPs: 39.82 | 15: iteration 29340/ 125429 | consumed samples: 7511040 | consumed tokens: 15382609920 | elapsed time per iteration (s): 1.04 | learning rate: 1.782E-04 | global batch size: 256 | lm loss: 2.074226E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 247.226 | TFLOPs: 40.86 | 15: iteration 29350/ 125429 | consumed samples: 7513600 | consumed tokens: 15387852800 | elapsed time per iteration (s): 1.07 | learning rate: 1.782E-04 | global batch size: 256 | lm loss: 2.079922E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.846 | TFLOPs: 39.64 | 15: iteration 29360/ 125429 | consumed samples: 7516160 | consumed tokens: 15393095680 | elapsed time per iteration (s): 1.04 | learning rate: 1.782E-04 | global batch size: 256 | lm loss: 2.111495E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.621 | TFLOPs: 40.76 | 15: iteration 29370/ 125429 | consumed samples: 7518720 | consumed tokens: 15398338560 | elapsed time per iteration (s): 1.03 | learning rate: 1.782E-04 | global batch size: 256 | lm loss: 2.089989E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.790 | TFLOPs: 40.95 | 15: iteration 29380/ 125429 | consumed samples: 7521280 | consumed tokens: 15403581440 | elapsed time per iteration (s): 1.03 | learning rate: 1.782E-04 | global batch size: 256 | lm loss: 2.113601E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.211 | TFLOPs: 41.18 | 15: iteration 29390/ 125429 | consumed samples: 7523840 | consumed tokens: 15408824320 | elapsed time per iteration (s): 1.04 | learning rate: 1.781E-04 | global batch size: 256 | lm loss: 2.095062E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.254 | TFLOPs: 40.53 | 15: iteration 29400/ 125429 | consumed samples: 7526400 | consumed tokens: 15414067200 | elapsed time per iteration (s): 1.09 | learning rate: 1.781E-04 | global batch size: 256 | lm loss: 2.121411E+00 | grad norm: 
0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.830 | TFLOPs: 38.81 | 15: iteration 29410/ 125429 | consumed samples: 7528960 | consumed tokens: 15419310080 | elapsed time per iteration (s): 1.07 | learning rate: 1.781E-04 | global batch size: 256 | lm loss: 2.055115E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.077 | TFLOPs: 39.67 | 15: iteration 29420/ 125429 | consumed samples: 7531520 | consumed tokens: 15424552960 | elapsed time per iteration (s): 1.04 | learning rate: 1.781E-04 | global batch size: 256 | lm loss: 2.060229E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.878 | TFLOPs: 40.63 | 15: iteration 29430/ 125429 | consumed samples: 7534080 | consumed tokens: 15429795840 | elapsed time per iteration (s): 1.11 | learning rate: 1.781E-04 | global batch size: 256 | lm loss: 2.118145E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.678 | TFLOPs: 38.12 | 15: iteration 29440/ 125429 | consumed samples: 7536640 | consumed tokens: 15435038720 | elapsed time per iteration (s): 1.06 | learning rate: 1.781E-04 | global batch size: 256 | lm loss: 2.100502E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.210 | TFLOPs: 39.86 | 15: iteration 29450/ 125429 | consumed samples: 7539200 | consumed tokens: 15440281600 | elapsed time per iteration (s): 1.05 | learning rate: 1.781E-04 | global batch size: 256 | lm loss: 2.092926E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.007 | TFLOPs: 40.32 | 15: iteration 29460/ 125429 | consumed samples: 7541760 | consumed tokens: 15445524480 | elapsed time per 
iteration (s): 1.12 | learning rate: 1.780E-04 | global batch size: 256 | lm loss: 2.120429E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.187 | TFLOPs: 37.71 | 15: iteration 29470/ 125429 | consumed samples: 7544320 | consumed tokens: 15450767360 | elapsed time per iteration (s): 1.06 | learning rate: 1.780E-04 | global batch size: 256 | lm loss: 2.109561E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.090 | TFLOPs: 40.01 | 15: iteration 29480/ 125429 | consumed samples: 7546880 | consumed tokens: 15456010240 | elapsed time per iteration (s): 1.05 | learning rate: 1.780E-04 | global batch size: 256 | lm loss: 2.112984E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.019 | TFLOPs: 40.33 | 15: iteration 29490/ 125429 | consumed samples: 7549440 | consumed tokens: 15461253120 | elapsed time per iteration (s): 1.03 | learning rate: 1.780E-04 | global batch size: 256 | lm loss: 2.089256E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.313 | TFLOPs: 41.20 | 15: iteration 29500/ 125429 | consumed samples: 7552000 | consumed tokens: 15466496000 | elapsed time per iteration (s): 1.07 | learning rate: 1.780E-04 | global batch size: 256 | lm loss: 2.115363E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.326 | TFLOPs: 39.55 | 15: iteration 29510/ 125429 | consumed samples: 7554560 | consumed tokens: 15471738880 | elapsed time per iteration (s): 1.04 | learning rate: 1.780E-04 | global batch size: 256 | lm loss: 2.111837E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.728 | TFLOPs: 40.77 | 15: 
iteration 29520/ 125429 | consumed samples: 7557120 | consumed tokens: 15476981760 | elapsed time per iteration (s): 1.05 | learning rate: 1.780E-04 | global batch size: 256 | lm loss: 2.108880E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.598 | TFLOPs: 40.42 | 15: iteration 29530/ 125429 | consumed samples: 7559680 | consumed tokens: 15482224640 | elapsed time per iteration (s): 1.03 | learning rate: 1.779E-04 | global batch size: 256 | lm loss: 2.081923E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.744 | TFLOPs: 41.11 | 15: iteration 29540/ 125429 | consumed samples: 7562240 | consumed tokens: 15487467520 | elapsed time per iteration (s): 1.03 | learning rate: 1.779E-04 | global batch size: 256 | lm loss: 2.111872E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.288 | TFLOPs: 41.20 | 15: iteration 29550/ 125429 | consumed samples: 7564800 | consumed tokens: 15492710400 | elapsed time per iteration (s): 1.08 | learning rate: 1.779E-04 | global batch size: 256 | lm loss: 2.077110E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.951 | TFLOPs: 39.32 | 15: iteration 29560/ 125429 | consumed samples: 7567360 | consumed tokens: 15497953280 | elapsed time per iteration (s): 1.04 | learning rate: 1.779E-04 | global batch size: 256 | lm loss: 2.087836E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.879 | TFLOPs: 40.63 | 15: iteration 29570/ 125429 | consumed samples: 7569920 | consumed tokens: 15503196160 | elapsed time per iteration (s): 1.03 | learning rate: 1.779E-04 | global batch size: 256 | lm loss: 2.087144E+00 | grad norm: 0.138 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.382 | TFLOPs: 41.05 | 15: iteration 29580/ 125429 | consumed samples: 7572480 | consumed tokens: 15508439040 | elapsed time per iteration (s): 1.06 | learning rate: 1.779E-04 | global batch size: 256 | lm loss: 2.108156E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.737 | TFLOPs: 39.95 | 15: iteration 29590/ 125429 | consumed samples: 7575040 | consumed tokens: 15513681920 | elapsed time per iteration (s): 1.04 | learning rate: 1.778E-04 | global batch size: 256 | lm loss: 2.081049E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.551 | TFLOPs: 40.58 | 15: iteration 29600/ 125429 | consumed samples: 7577600 | consumed tokens: 15518924800 | elapsed time per iteration (s): 1.11 | learning rate: 1.778E-04 | global batch size: 256 | lm loss: 2.122894E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.950 | TFLOPs: 38.00 | 15: iteration 29610/ 125429 | consumed samples: 7580160 | consumed tokens: 15524167680 | elapsed time per iteration (s): 1.09 | learning rate: 1.778E-04 | global batch size: 256 | lm loss: 2.079303E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.566 | TFLOPs: 38.93 | 15: iteration 29620/ 125429 | consumed samples: 7582720 | consumed tokens: 15529410560 | elapsed time per iteration (s): 1.11 | learning rate: 1.778E-04 | global batch size: 256 | lm loss: 2.100599E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.125 | TFLOPs: 38.03 | 15: iteration 29630/ 125429 | consumed samples: 7585280 | consumed tokens: 15534653440 | elapsed time per iteration (s): 1.06 | learning rate: 
1.778E-04 | global batch size: 256 | lm loss: 2.077132E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.610 | TFLOPs: 39.76 | 15: iteration 29640/ 125429 | consumed samples: 7587840 | consumed tokens: 15539896320 | elapsed time per iteration (s): 1.03 | learning rate: 1.778E-04 | global batch size: 256 | lm loss: 2.098126E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.429 | TFLOPs: 41.05 | 15: iteration 29650/ 125429 | consumed samples: 7590400 | consumed tokens: 15545139200 | elapsed time per iteration (s): 1.08 | learning rate: 1.778E-04 | global batch size: 256 | lm loss: 2.081989E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.233 | TFLOPs: 39.04 | 15: iteration 29660/ 125429 | consumed samples: 7592960 | consumed tokens: 15550382080 | elapsed time per iteration (s): 1.10 | learning rate: 1.777E-04 | global batch size: 256 | lm loss: 2.096924E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.092 | TFLOPs: 38.36 | 15: iteration 29670/ 125429 | consumed samples: 7595520 | consumed tokens: 15555624960 | elapsed time per iteration (s): 1.09 | learning rate: 1.777E-04 | global batch size: 256 | lm loss: 2.095640E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.564 | TFLOPs: 38.93 | 15: iteration 29680/ 125429 | consumed samples: 7598080 | consumed tokens: 15560867840 | elapsed time per iteration (s): 1.03 | learning rate: 1.777E-04 | global batch size: 256 | lm loss: 2.113414E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.278 | TFLOPs: 41.20 | 15: iteration 29690/ 125429 | consumed 
samples: 7600640 | consumed tokens: 15566110720 | elapsed time per iteration (s): 1.04 | learning rate: 1.777E-04 | global batch size: 256 | lm loss: 2.067001E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.236 | TFLOPs: 40.53 | 15: iteration 29700/ 125429 | consumed samples: 7603200 | consumed tokens: 15571353600 | elapsed time per iteration (s): 1.09 | learning rate: 1.777E-04 | global batch size: 256 | lm loss: 2.087791E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.237 | TFLOPs: 38.71 | 15: iteration 29710/ 125429 | consumed samples: 7605760 | consumed tokens: 15576596480 | elapsed time per iteration (s): 1.05 | learning rate: 1.777E-04 | global batch size: 256 | lm loss: 2.090518E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.736 | TFLOPs: 40.11 | 15: iteration 29720/ 125429 | consumed samples: 7608320 | consumed tokens: 15581839360 | elapsed time per iteration (s): 1.03 | learning rate: 1.777E-04 | global batch size: 256 | lm loss: 2.092488E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.892 | TFLOPs: 40.97 | 15: iteration 29730/ 125429 | consumed samples: 7610880 | consumed tokens: 15587082240 | elapsed time per iteration (s): 1.04 | learning rate: 1.776E-04 | global batch size: 256 | lm loss: 2.112822E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.160 | TFLOPs: 40.68 | 15: iteration 29740/ 125429 | consumed samples: 7613440 | consumed tokens: 15592325120 | elapsed time per iteration (s): 1.06 | learning rate: 1.776E-04 | global batch size: 256 | lm loss: 2.100557E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 242.248 | TFLOPs: 40.03 | 15: iteration 29750/ 125429 | consumed samples: 7616000 | consumed tokens: 15597568000 | elapsed time per iteration (s): 1.03 | learning rate: 1.776E-04 | global batch size: 256 | lm loss: 2.091945E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.527 | TFLOPs: 40.91 | 15: iteration 29760/ 125429 | consumed samples: 7618560 | consumed tokens: 15602810880 | elapsed time per iteration (s): 1.04 | learning rate: 1.776E-04 | global batch size: 256 | lm loss: 2.078627E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.534 | TFLOPs: 40.58 | 15: iteration 29770/ 125429 | consumed samples: 7621120 | consumed tokens: 15608053760 | elapsed time per iteration (s): 1.04 | learning rate: 1.776E-04 | global batch size: 256 | lm loss: 2.096548E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.307 | TFLOPs: 40.87 | 15: iteration 29780/ 125429 | consumed samples: 7623680 | consumed tokens: 15613296640 | elapsed time per iteration (s): 1.04 | learning rate: 1.776E-04 | global batch size: 256 | lm loss: 2.052346E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.168 | TFLOPs: 40.68 | 15: iteration 29790/ 125429 | consumed samples: 7626240 | consumed tokens: 15618539520 | elapsed time per iteration (s): 1.07 | learning rate: 1.775E-04 | global batch size: 256 | lm loss: 2.099220E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.274 | TFLOPs: 39.38 | 15: iteration 29800/ 125429 | consumed samples: 7628800 | consumed tokens: 15623782400 | elapsed time per iteration (s): 1.04 | learning rate: 1.775E-04 | global batch size: 256 | lm 
loss: 2.119087E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.735 | TFLOPs: 40.61 | 15: iteration 29810/ 125429 | consumed samples: 7631360 | consumed tokens: 15629025280 | elapsed time per iteration (s): 1.04 | learning rate: 1.775E-04 | global batch size: 256 | lm loss: 2.079995E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.996 | TFLOPs: 40.82 | 15: iteration 29820/ 125429 | consumed samples: 7633920 | consumed tokens: 15634268160 | elapsed time per iteration (s): 1.03 | learning rate: 1.775E-04 | global batch size: 256 | lm loss: 2.080748E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.371 | TFLOPs: 41.21 | 15: iteration 29830/ 125429 | consumed samples: 7636480 | consumed tokens: 15639511040 | elapsed time per iteration (s): 1.09 | learning rate: 1.775E-04 | global batch size: 256 | lm loss: 2.106889E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.851 | TFLOPs: 38.98 | 15: iteration 29840/ 125429 | consumed samples: 7639040 | consumed tokens: 15644753920 | elapsed time per iteration (s): 1.03 | learning rate: 1.775E-04 | global batch size: 256 | lm loss: 2.071904E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.422 | TFLOPs: 41.22 | 15: iteration 29850/ 125429 | consumed samples: 7641600 | consumed tokens: 15649996800 | elapsed time per iteration (s): 1.03 | learning rate: 1.775E-04 | global batch size: 256 | lm loss: 2.065887E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.009 | TFLOPs: 40.99 | 15: iteration 29860/ 125429 | consumed samples: 7644160 | consumed tokens: 
15655239680 | elapsed time per iteration (s): 1.03 | learning rate: 1.774E-04 | global batch size: 256 | lm loss: 2.082920E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.666 | TFLOPs: 41.26 | 15: iteration 29870/ 125429 | consumed samples: 7646720 | consumed tokens: 15660482560 | elapsed time per iteration (s): 1.06 | learning rate: 1.774E-04 | global batch size: 256 | lm loss: 2.074697E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.844 | TFLOPs: 39.97 | 15: iteration 29880/ 125429 | consumed samples: 7649280 | consumed tokens: 15665725440 | elapsed time per iteration (s): 1.03 | learning rate: 1.774E-04 | global batch size: 256 | lm loss: 2.092296E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.807 | TFLOPs: 40.95 | 15: iteration 29890/ 125429 | consumed samples: 7651840 | consumed tokens: 15670968320 | elapsed time per iteration (s): 1.05 | learning rate: 1.774E-04 | global batch size: 256 | lm loss: 2.092898E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.519 | TFLOPs: 40.24 | 15: iteration 29900/ 125429 | consumed samples: 7654400 | consumed tokens: 15676211200 | elapsed time per iteration (s): 1.03 | learning rate: 1.774E-04 | global batch size: 256 | lm loss: 2.111090E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.524 | TFLOPs: 40.91 | 15: iteration 29910/ 125429 | consumed samples: 7656960 | consumed tokens: 15681454080 | elapsed time per iteration (s): 1.04 | learning rate: 1.774E-04 | global batch size: 256 | lm loss: 2.087227E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
245.840 | TFLOPs: 40.63 | 15: iteration 29920/ 125429 | consumed samples: 7659520 | consumed tokens: 15686696960 | elapsed time per iteration (s): 1.04 | learning rate: 1.774E-04 | global batch size: 256 | lm loss: 2.095909E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.977 | TFLOPs: 40.65 | 15: iteration 29930/ 125429 | consumed samples: 7662080 | consumed tokens: 15691939840 | elapsed time per iteration (s): 1.04 | learning rate: 1.773E-04 | global batch size: 256 | lm loss: 2.078385E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.363 | TFLOPs: 40.55 | 15: iteration 29940/ 125429 | consumed samples: 7664640 | consumed tokens: 15697182720 | elapsed time per iteration (s): 1.03 | learning rate: 1.773E-04 | global batch size: 256 | lm loss: 2.073916E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.375 | TFLOPs: 41.21 | 15: iteration 29950/ 125429 | consumed samples: 7667200 | consumed tokens: 15702425600 | elapsed time per iteration (s): 1.06 | learning rate: 1.773E-04 | global batch size: 256 | lm loss: 2.097559E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.855 | TFLOPs: 39.80 | 15: iteration 29960/ 125429 | consumed samples: 7669760 | consumed tokens: 15707668480 | elapsed time per iteration (s): 1.05 | learning rate: 1.773E-04 | global batch size: 256 | lm loss: 2.092973E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.888 | TFLOPs: 40.30 | 15: iteration 29970/ 125429 | consumed samples: 7672320 | consumed tokens: 15712911360 | elapsed time per iteration (s): 1.02 | learning rate: 1.773E-04 | global batch size: 256 | lm loss: 2.078174E+00 | grad norm: 0.138 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.175 | TFLOPs: 41.34 |
15: iteration 29980/ 125429 | consumed samples: 7674880 | consumed tokens: 15718154240 | elapsed time per iteration (s): 1.04 | learning rate: 1.773E-04 | global batch size: 256 | lm loss: 2.083245E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.214 | TFLOPs: 40.85 |
15: iteration 29990/ 125429 | consumed samples: 7677440 | consumed tokens: 15723397120 | elapsed time per iteration (s): 1.06 | learning rate: 1.772E-04 | global batch size: 256 | lm loss: 2.100433E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.052 | TFLOPs: 39.84 |
0: [2022-11-26 04:46:28,234] [INFO] [logging.py:68:log_dist] [Rank 0] step=30000, skipped=0, lr=[0.00017722990808044156, 0.00017722990808044156, 0.00017722990808044156], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
15: iteration 30000/ 125429 | consumed samples: 7680000 | consumed tokens: 15728640000 | elapsed time per iteration (s): 1.04 | learning rate: 1.772E-04 | global batch size: 256 | lm loss: 2.106685E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.780 | TFLOPs: 40.62 |
0: steps: 30000 loss: 2.0819 iter time (s): 1.049 samples/sec: 243.953
15: -------------------------------------------------------------------------------------------
15: valid loss at iteration 30000 | lm loss value: 2.136947E+00 | lm loss PPL: 8.473526E+00 |
15: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 30000 to checkpoints_1b5
0: [2022-11-26 04:46:28,580] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step30000 is begin to save!
0: [2022-11-26 04:46:28,587] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_01-model_00-model_states.pt... 0: [2022-11-26 04:46:28,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_01-model_00-model_states.pt. 0: [2022-11-26 04:46:28,873] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_03-model_00-model_states.pt... 0: [2022-11-26 04:46:28,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_03-model_00-model_states.pt. 0: [2022-11-26 04:46:28,982] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_04-model_00-model_states.pt... 0: [2022-11-26 04:46:29,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_04-model_00-model_states.pt. 0: [2022-11-26 04:46:29,091] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_05-model_00-model_states.pt... 0: [2022-11-26 04:46:29,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_05-model_00-model_states.pt. 0: [2022-11-26 04:46:29,206] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_06-model_00-model_states.pt... 0: [2022-11-26 04:46:29,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_06-model_00-model_states.pt. 0: [2022-11-26 04:46:29,317] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_07-model_00-model_states.pt... 0: [2022-11-26 04:46:29,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 04:46:29,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_08-model_00-model_states.pt... 0: [2022-11-26 04:46:29,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_08-model_00-model_states.pt. 0: [2022-11-26 04:46:29,536] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_09-model_00-model_states.pt... 0: [2022-11-26 04:46:29,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_09-model_00-model_states.pt. 0: [2022-11-26 04:46:29,642] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_10-model_00-model_states.pt... 0: [2022-11-26 04:46:29,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_10-model_00-model_states.pt. 0: [2022-11-26 04:46:29,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_11-model_00-model_states.pt... 0: [2022-11-26 04:46:29,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_11-model_00-model_states.pt. 0: [2022-11-26 04:46:29,863] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_12-model_00-model_states.pt... 0: [2022-11-26 04:46:29,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_12-model_00-model_states.pt. 0: [2022-11-26 04:46:29,969] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_13-model_00-model_states.pt... 0: [2022-11-26 04:46:30,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 04:46:30,074] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_14-model_00-model_states.pt... 0: [2022-11-26 04:46:30,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_14-model_00-model_states.pt. 0: [2022-11-26 04:46:30,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_15-model_00-model_states.pt... 0: [2022-11-26 04:46:30,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_15-model_00-model_states.pt. 0: [2022-11-26 04:46:30,285] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_16-model_00-model_states.pt... 0: [2022-11-26 04:46:30,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_16-model_00-model_states.pt. 0: [2022-11-26 04:46:30,392] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_17-model_00-model_states.pt... 0: [2022-11-26 04:46:30,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_17-model_00-model_states.pt. 0: [2022-11-26 04:46:30,493] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_18-model_00-model_states.pt... 0: [2022-11-26 04:46:30,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_18-model_00-model_states.pt. 0: [2022-11-26 04:46:30,594] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_19-model_00-model_states.pt... 0: [2022-11-26 04:46:30,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 04:46:30,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_20-model_00-model_states.pt... 0: [2022-11-26 04:46:30,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_20-model_00-model_states.pt. 0: [2022-11-26 04:46:30,808] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_21-model_00-model_states.pt... 0: [2022-11-26 04:46:30,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_21-model_00-model_states.pt. 0: [2022-11-26 04:46:30,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_22-model_00-model_states.pt... 0: [2022-11-26 04:46:31,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_22-model_00-model_states.pt. 0: [2022-11-26 04:46:31,023] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_23-model_00-model_states.pt... 0: [2022-11-26 04:46:31,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_23-model_00-model_states.pt. 0: [2022-11-26 04:46:31,123] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_24-model_00-model_states.pt... 0: [2022-11-26 04:46:31,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_24-model_00-model_states.pt. 0: [2022-11-26 04:46:31,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_25-model_00-model_states.pt... 0: [2022-11-26 04:46:31,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 04:46:31,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_26-model_00-model_states.pt... 0: [2022-11-26 04:46:31,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_26-model_00-model_states.pt. 0: [2022-11-26 04:46:31,452] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_27-model_00-model_states.pt... 0: [2022-11-26 04:46:31,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_27-model_00-model_states.pt. 0: [2022-11-26 04:46:31,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_28-model_00-model_states.pt... 0: [2022-11-26 04:46:31,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_28-model_00-model_states.pt. 0: [2022-11-26 04:46:31,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_29-model_00-model_states.pt... 0: [2022-11-26 04:46:31,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_29-model_00-model_states.pt. 0: [2022-11-26 04:46:31,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_30-model_00-model_states.pt... 0: [2022-11-26 04:46:31,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_30-model_00-model_states.pt. 0: [2022-11-26 04:46:31,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/layer_32-model_00-model_states.pt... 0: [2022-11-26 04:46:31,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 04:46:31,870] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step30000/mp_rank_00_model_states.pt 0: [2022-11-26 04:46:31,870] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/mp_rank_00_model_states.pt... 0: [2022-11-26 04:46:31,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/mp_rank_00_model_states.pt. 0: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
10: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
12: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
11: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 0: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
15: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
14: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
2: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 7: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
7: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 1: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 15: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 9: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 2: [2022-11-26 04:46:31,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step30000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 12: [2022-11-26 04:46:32,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 04:46:32,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 04:46:32,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 15: [2022-11-26 04:46:32,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:46:32,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:46:32,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 04:46:32,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 04:46:32,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:46:32,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:46:32,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 04:46:32,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 9: [2022-11-26 04:46:32,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:46:32,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
9: [2022-11-26 04:46:32,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:46:32,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 04:46:32,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 04:46:32,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 9: [2022-11-26 04:46:32,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 04:46:32,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:46:32,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:46:32,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 0: [2022-11-26 04:46:32,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 04:46:32,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 12: [2022-11-26 04:46:32,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 04:46:32,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 04:46:32,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 04:46:32,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 04:46:32,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:46:32,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 04:46:32,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 11: [2022-11-26 04:46:32,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:46:32,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:46:32,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:46:32,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 04:46:32,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 04:46:32,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 5: [2022-11-26 04:46:32,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
9: [2022-11-26 04:46:32,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:46:32,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 04:46:32,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 9: [2022-11-26 04:46:32,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:46:32,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 04:46:32,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 10: [2022-11-26 04:46:32,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:46:32,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:46:32,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 04:46:32,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 10: [2022-11-26 04:46:32,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 04:46:32,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
6: [2022-11-26 04:46:32,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:46:32,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 04:46:32,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 6: [2022-11-26 04:46:32,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:46:32,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:46:32,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:46:32,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 04:46:32,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 04:46:32,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 6: [2022-11-26 04:46:32,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 10: [2022-11-26 04:46:32,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 04:46:32,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
12: [2022-11-26 04:46:32,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:46:32,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 04:46:32,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 10: [2022-11-26 04:46:32,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:46:32,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 04:46:32,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 9: [2022-11-26 04:46:32,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:46:32,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 04:46:32,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 12: [2022-11-26 04:46:32,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:46:32,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 04:46:32,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
5: [2022-11-26 04:46:32,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:46:32,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:46:32,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 04:46:32,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 04:46:32,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 5: [2022-11-26 04:46:32,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 04:46:32,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:46:32,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 04:46:32,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 5: [2022-11-26 04:46:32,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:46:32,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 0: [2022-11-26 04:46:32,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
5: [2022-11-26 04:46:32,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 04:46:32,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 04:46:32,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 04:46:32,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:46:32,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 04:46:32,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 9: [2022-11-26 04:46:32,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:46:32,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 04:46:32,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 04:46:32,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 04:46:32,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 04:46:32,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 04:46:32,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 04:46:32,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 04:46:32,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:46:32,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 5: [2022-11-26 04:46:32,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:46:32,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 5: [2022-11-26 04:46:32,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 04:46:32,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 9: [2022-11-26 04:46:32,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 04:46:32,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 04:46:32,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 14: [2022-11-26 04:46:32,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 04:46:32,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 04:46:32,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 14: [2022-11-26 04:46:32,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:46:32,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 04:46:32,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 14: [2022-11-26 04:46:32,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:46:32,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 04:46:32,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 04:46:32,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:46:32,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 04:46:32,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 7: [2022-11-26 04:46:32,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 04:46:32,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:46:32,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 04:46:32,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 04:46:32,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 7: [2022-11-26 04:46:32,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 12: [2022-11-26 04:46:32,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:46:32,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 04:46:32,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 04:46:32,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:46:32,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 04:46:32,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 7: [2022-11-26 04:46:32,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 04:46:32,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:46:32,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:46:32,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 04:46:32,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 04:46:32,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 12: [2022-11-26 04:46:32,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 7: [2022-11-26 04:46:32,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 12: [2022-11-26 04:46:32,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 04:46:32,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:46:32,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 04:46:32,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 04:46:32,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 04:46:32,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 04:46:32,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 04:46:32,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:46:32,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 04:46:32,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 10: [2022-11-26 04:46:32,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:46:32,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 04:46:32,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 04:46:32,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:46:32,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 04:46:32,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 9: [2022-11-26 04:46:32,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 04:46:32,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 04:46:32,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 14: [2022-11-26 04:46:32,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:46:32,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 04:46:32,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 04:46:32,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:46:32,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:46:32,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 7: [2022-11-26 04:46:32,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 2: [2022-11-26 04:46:32,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 7: [2022-11-26 04:46:32,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 10: [2022-11-26 04:46:32,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 04:46:32,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 04:46:32,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 04:46:32,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:46:32,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 04:46:32,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 04:46:32,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:46:32,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 04:46:32,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 7: [2022-11-26 04:46:32,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:46:32,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:46:32,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:46:32,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
7: [2022-11-26 04:46:32,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 6: [2022-11-26 04:46:32,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 7: [2022-11-26 04:46:32,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 04:46:32,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 6: [2022-11-26 04:46:32,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 04:46:32,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 10: [2022-11-26 04:46:32,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:46:32,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 3: [2022-11-26 04:46:32,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:46:32,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 04:46:32,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 04:46:32,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
3: [2022-11-26 04:46:32,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:46:32,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 04:46:32,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 12: [2022-11-26 04:46:32,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:46:32,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 04:46:32,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 12: [2022-11-26 04:46:32,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 04:46:32,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 04:46:32,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 2: [2022-11-26 04:46:32,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 04:46:32,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 04:46:32,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
1: [2022-11-26 04:46:32,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 04:46:32,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 04:46:32,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:46:32,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:46:32,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 04:46:32,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 04:46:32,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 04:46:32,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 04:46:32,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 04:46:32,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 04:46:32,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 04:46:32,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 04:46:32,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 1: [2022-11-26 04:46:32,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 3: [2022-11-26 04:46:32,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 04:46:32,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 04:46:32,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 04:46:32,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:46:32,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 04:46:32,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 04:46:32,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:46:32,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 04:46:32,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 04:46:32,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 04:46:32,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 04:46:32,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 04:46:32,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:46:32,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 04:46:32,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 04:46:32,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:46:32,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 04:46:32,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 04:46:32,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 04:46:32,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 04:46:32,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 04:46:32,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:46:32,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 04:46:32,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 04:46:32,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 04:46:32,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 4: [2022-11-26 04:46:32,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 04:46:32,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 04:46:32,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 04:46:32,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 04:46:32,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 04:46:32,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 04:46:32,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 15: [2022-11-26 04:46:32,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 04:46:32,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 15: [2022-11-26 04:46:32,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:46:32,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 04:46:32,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 15: [2022-11-26 04:46:32,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:46:32,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:46:32,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 04:46:32,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 04:46:32,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 04:46:32,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 04:46:32,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 15: [2022-11-26 04:46:32,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 15: [2022-11-26 04:46:32,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 15: [2022-11-26 04:46:32,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:46:32,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 04:46:32,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 15: [2022-11-26 04:46:32,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:46:32,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 04:46:32,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
15: [2022-11-26 04:46:32,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 04:46:32,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 04:46:32,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 13: [2022-11-26 04:46:32,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:46:32,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 04:46:32,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 13: [2022-11-26 04:46:32,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:46:32,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 04:46:32,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 13: [2022-11-26 04:46:32,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:46:32,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 04:46:32,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
6: [2022-11-26 04:46:32,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:46:32,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:46:32,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 04:46:32,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 13: [2022-11-26 04:46:32,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:46:32,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:46:32,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 13: [2022-11-26 04:46:32,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 04:46:32,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 04:46:32,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 13: [2022-11-26 04:46:32,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 13: [2022-11-26 04:46:32,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 04:46:32,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 6: [2022-11-26 04:46:32,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 13: [2022-11-26 04:46:32,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 13: [2022-11-26 04:46:32,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 04:46:32,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 04:46:32,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 7: [2022-11-26 04:46:32,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:46:32,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 6: [2022-11-26 04:46:32,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 7: [2022-11-26 04:46:32,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
11: [2022-11-26 04:46:32,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 6: [2022-11-26 04:46:32,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 11: [2022-11-26 04:46:32,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 11: [2022-11-26 04:46:32,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:46:32,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 04:46:32,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:46:32,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 11: [2022-11-26 04:46:32,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 04:46:32,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 6: [2022-11-26 04:46:32,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 11: [2022-11-26 04:46:32,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 04:46:32,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 04:46:32,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 11: [2022-11-26 04:46:32,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:46:32,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:46:32,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 04:46:32,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 04:46:32,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 11: [2022-11-26 04:46:32,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 11: [2022-11-26 04:46:32,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 04:46:32,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 04:46:32,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 11: [2022-11-26 04:46:32,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-26 04:46:32,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 04:46:32,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 6: [2022-11-26 04:46:32,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:46:32,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 04:46:32,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 6: [2022-11-26 04:46:32,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 04:46:32,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 04:46:32,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 10: [2022-11-26 04:46:32,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 04:46:32,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 04:46:32,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 5: [2022-11-26 04:46:32,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 04:46:32,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 04:46:32,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 04:46:32,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 04:46:32,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 5: [2022-11-26 04:46:32,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 14: [2022-11-26 04:46:32,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:46:32,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 04:46:32,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 14: [2022-11-26 04:46:32,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:46:32,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 04:46:32,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 14: [2022-11-26 04:46:32,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 04:46:32,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 04:46:32,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 14: [2022-11-26 04:46:32,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 04:46:32,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 04:46:32,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: [2022-11-26 04:46:32,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 04:46:32,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 0: successfully saved checkpoint at iteration 30000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3584.69 8: [2022-11-26 04:46:32,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:46:32,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:46:32,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:46:32,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 04:46:32,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:46:32,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:46:32,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:46:32,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 04:46:32,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 04:46:32,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 04:46:32,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 04:46:32,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 04:46:32,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 04:46:32,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 04:46:32,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 04:46:32,158] [INFO] 
[engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step30000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 04:46:32,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 8: [2022-11-26 04:46:32,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 8: [2022-11-26 04:46:32,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 8: [2022-11-26 04:46:32,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 8: [2022-11-26 04:46:32,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 8: [2022-11-26 04:46:32,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 8: [2022-11-26 04:46:32,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 8: [2022-11-26 04:46:32,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step30000 is ready now! 
15: iteration 30010/ 125429 | consumed samples: 7682560 | consumed tokens: 15733882880 | elapsed time per iteration (s): 1.48 | learning rate: 1.772E-04 | global batch size: 256 | lm loss: 2.084249E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 173.374 | TFLOPs: 28.65 | 15: iteration 30020/ 125429 | consumed samples: 7685120 | consumed tokens: 15739125760 | elapsed time per iteration (s): 1.02 | learning rate: 1.772E-04 | global batch size: 256 | lm loss: 2.104884E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.140 | TFLOPs: 41.34 | 15: iteration 30030/ 125429 | consumed samples: 7687680 | consumed tokens: 15744368640 | elapsed time per iteration (s): 1.06 | learning rate: 1.772E-04 | global batch size: 256 | lm loss: 2.105065E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.562 | TFLOPs: 39.75 | 15: iteration 30040/ 125429 | consumed samples: 7690240 | consumed tokens: 15749611520 | elapsed time per iteration (s): 1.04 | learning rate: 1.772E-04 | global batch size: 256 | lm loss: 2.081207E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.353 | TFLOPs: 40.55 | 15: iteration 30050/ 125429 | consumed samples: 7692800 | consumed tokens: 15754854400 | elapsed time per iteration (s): 1.03 | learning rate: 1.772E-04 | global batch size: 256 | lm loss: 2.101670E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.686 | TFLOPs: 41.26 | 15: iteration 30060/ 125429 | consumed samples: 7695360 | consumed tokens: 15760097280 | elapsed time per iteration (s): 1.02 | learning rate: 1.771E-04 | global batch size: 256 | lm loss: 2.105371E+00 | grad norm: 0.138 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.041 | TFLOPs: 41.32 | 15: iteration 30070/ 125429 | consumed samples: 7697920 | consumed tokens: 15765340160 | elapsed time per iteration (s): 1.03 | learning rate: 1.771E-04 | global batch size: 256 | lm loss: 2.097577E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.407 | TFLOPs: 41.05 | 15: iteration 30080/ 125429 | consumed samples: 7700480 | consumed tokens: 15770583040 | elapsed time per iteration (s): 1.05 | learning rate: 1.771E-04 | global batch size: 256 | lm loss: 2.108323E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.974 | TFLOPs: 40.48 | 15: iteration 30090/ 125429 | consumed samples: 7703040 | consumed tokens: 15775825920 | elapsed time per iteration (s): 1.03 | learning rate: 1.771E-04 | global batch size: 256 | lm loss: 2.083419E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.304 | TFLOPs: 41.20 | 15: iteration 30100/ 125429 | consumed samples: 7705600 | consumed tokens: 15781068800 | elapsed time per iteration (s): 1.07 | learning rate: 1.771E-04 | global batch size: 256 | lm loss: 2.076499E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.398 | TFLOPs: 39.40 | 15: iteration 30110/ 125429 | consumed samples: 7708160 | consumed tokens: 15786311680 | elapsed time per iteration (s): 1.03 | learning rate: 1.771E-04 | global batch size: 256 | lm loss: 2.105984E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.923 | TFLOPs: 41.14 | 15: iteration 30120/ 125429 | consumed samples: 7710720 | consumed tokens: 15791554560 | elapsed time per iteration (s): 1.03 | learning rate: 
1.770E-04 | global batch size: 256 | lm loss: 2.081815E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.443 | TFLOPs: 41.22 | 15: iteration 30130/ 125429 | consumed samples: 7713280 | consumed tokens: 15796797440 | elapsed time per iteration (s): 1.06 | learning rate: 1.770E-04 | global batch size: 256 | lm loss: 2.093941E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.753 | TFLOPs: 39.95 | 15: iteration 30140/ 125429 | consumed samples: 7715840 | consumed tokens: 15802040320 | elapsed time per iteration (s): 1.03 | learning rate: 1.770E-04 | global batch size: 256 | lm loss: 2.096038E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.855 | TFLOPs: 41.13 | 15: iteration 30150/ 125429 | consumed samples: 7718400 | consumed tokens: 15807283200 | elapsed time per iteration (s): 1.04 | learning rate: 1.770E-04 | global batch size: 256 | lm loss: 2.096029E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.111 | TFLOPs: 40.84 | 15: iteration 30160/ 125429 | consumed samples: 7720960 | consumed tokens: 15812526080 | elapsed time per iteration (s): 1.03 | learning rate: 1.770E-04 | global batch size: 256 | lm loss: 2.101960E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.278 | TFLOPs: 41.03 | 15: iteration 30170/ 125429 | consumed samples: 7723520 | consumed tokens: 15817768960 | elapsed time per iteration (s): 1.02 | learning rate: 1.770E-04 | global batch size: 256 | lm loss: 2.088143E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.521 | TFLOPs: 41.57 | 15: iteration 30180/ 125429 | consumed 
samples: 7726080 | consumed tokens: 15823011840 | elapsed time per iteration (s): 1.02 | learning rate: 1.770E-04 | global batch size: 256 | lm loss: 2.095832E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.258 | TFLOPs: 41.52 | 15: iteration 30190/ 125429 | consumed samples: 7728640 | consumed tokens: 15828254720 | elapsed time per iteration (s): 1.08 | learning rate: 1.769E-04 | global batch size: 256 | lm loss: 2.081384E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.234 | TFLOPs: 39.20 | 15: iteration 30200/ 125429 | consumed samples: 7731200 | consumed tokens: 15833497600 | elapsed time per iteration (s): 1.06 | learning rate: 1.769E-04 | global batch size: 256 | lm loss: 2.098901E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.006 | TFLOPs: 39.99 | 15: iteration 30210/ 125429 | consumed samples: 7733760 | consumed tokens: 15838740480 | elapsed time per iteration (s): 1.04 | learning rate: 1.769E-04 | global batch size: 256 | lm loss: 2.075135E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.764 | TFLOPs: 40.61 | 15: iteration 30220/ 125429 | consumed samples: 7736320 | consumed tokens: 15843983360 | elapsed time per iteration (s): 1.04 | learning rate: 1.769E-04 | global batch size: 256 | lm loss: 2.109459E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.363 | TFLOPs: 40.71 | 15: iteration 30230/ 125429 | consumed samples: 7738880 | consumed tokens: 15849226240 | elapsed time per iteration (s): 1.07 | learning rate: 1.769E-04 | global batch size: 256 | lm loss: 2.099597E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 239.216 | TFLOPs: 39.53 | 15: iteration 30240/ 125429 | consumed samples: 7741440 | consumed tokens: 15854469120 | elapsed time per iteration (s): 1.04 | learning rate: 1.769E-04 | global batch size: 256 | lm loss: 2.071525E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.932 | TFLOPs: 40.64 | 15: iteration 30250/ 125429 | consumed samples: 7744000 | consumed tokens: 15859712000 | elapsed time per iteration (s): 1.03 | learning rate: 1.769E-04 | global batch size: 256 | lm loss: 2.127185E+00 | grad norm: 0.253 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.601 | TFLOPs: 40.92 | 15: iteration 30260/ 125429 | consumed samples: 7746560 | consumed tokens: 15864954880 | elapsed time per iteration (s): 1.04 | learning rate: 1.768E-04 | global batch size: 256 | lm loss: 2.121598E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.473 | TFLOPs: 40.73 | 15: iteration 30270/ 125429 | consumed samples: 7749120 | consumed tokens: 15870197760 | elapsed time per iteration (s): 1.05 | learning rate: 1.768E-04 | global batch size: 256 | lm loss: 2.102514E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.804 | TFLOPs: 40.29 | 15: iteration 30280/ 125429 | consumed samples: 7751680 | consumed tokens: 15875440640 | elapsed time per iteration (s): 1.07 | learning rate: 1.768E-04 | global batch size: 256 | lm loss: 2.116405E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.745 | TFLOPs: 39.62 | 15: iteration 30290/ 125429 | consumed samples: 7754240 | consumed tokens: 15880683520 | elapsed time per iteration (s): 1.07 | learning rate: 1.768E-04 | global batch size: 256 | lm 
loss: 2.118075E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.783 | TFLOPs: 39.46 | 15: iteration 30300/ 125429 | consumed samples: 7756800 | consumed tokens: 15885926400 | elapsed time per iteration (s): 1.05 | learning rate: 1.768E-04 | global batch size: 256 | lm loss: 2.072839E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.851 | TFLOPs: 40.30 | 15: iteration 30310/ 125429 | consumed samples: 7759360 | consumed tokens: 15891169280 | elapsed time per iteration (s): 1.05 | learning rate: 1.768E-04 | global batch size: 256 | lm loss: 2.074200E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.845 | TFLOPs: 40.30 | 15: iteration 30320/ 125429 | consumed samples: 7761920 | consumed tokens: 15896412160 | elapsed time per iteration (s): 1.04 | learning rate: 1.767E-04 | global batch size: 256 | lm loss: 2.112569E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.205 | TFLOPs: 40.85 | 15: iteration 30330/ 125429 | consumed samples: 7764480 | consumed tokens: 15901655040 | elapsed time per iteration (s): 1.03 | learning rate: 1.767E-04 | global batch size: 256 | lm loss: 2.087006E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.361 | TFLOPs: 41.21 | 15: iteration 30340/ 125429 | consumed samples: 7767040 | consumed tokens: 15906897920 | elapsed time per iteration (s): 1.07 | learning rate: 1.767E-04 | global batch size: 256 | lm loss: 2.096375E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.244 | TFLOPs: 39.54 | 15: iteration 30350/ 125429 | consumed samples: 7769600 | consumed tokens: 
15912140800 | elapsed time per iteration (s): 1.04 | learning rate: 1.767E-04 | global batch size: 256 | lm loss: 2.074086E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.033 | TFLOPs: 40.49 | 15: iteration 30360/ 125429 | consumed samples: 7772160 | consumed tokens: 15917383680 | elapsed time per iteration (s): 1.02 | learning rate: 1.767E-04 | global batch size: 256 | lm loss: 2.109799E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.341 | TFLOPs: 41.37 | 15: iteration 30370/ 125429 | consumed samples: 7774720 | consumed tokens: 15922626560 | elapsed time per iteration (s): 1.06 | learning rate: 1.767E-04 | global batch size: 256 | lm loss: 2.090630E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.494 | TFLOPs: 39.91 | 15: iteration 30380/ 125429 | consumed samples: 7777280 | consumed tokens: 15927869440 | elapsed time per iteration (s): 1.03 | learning rate: 1.767E-04 | global batch size: 256 | lm loss: 2.115107E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.816 | TFLOPs: 40.95 | 15: iteration 30390/ 125429 | consumed samples: 7779840 | consumed tokens: 15933112320 | elapsed time per iteration (s): 1.04 | learning rate: 1.766E-04 | global batch size: 256 | lm loss: 2.115468E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.610 | TFLOPs: 40.75 | 15: iteration 30400/ 125429 | consumed samples: 7782400 | consumed tokens: 15938355200 | elapsed time per iteration (s): 1.05 | learning rate: 1.766E-04 | global batch size: 256 | lm loss: 2.101888E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
244.647 | TFLOPs: 40.43 | 15: iteration 30410/ 125429 | consumed samples: 7784960 | consumed tokens: 15943598080 | elapsed time per iteration (s): 1.03 | learning rate: 1.766E-04 | global batch size: 256 | lm loss: 2.076179E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.779 | TFLOPs: 41.11 | 15: iteration 30420/ 125429 | consumed samples: 7787520 | consumed tokens: 15948840960 | elapsed time per iteration (s): 1.13 | learning rate: 1.766E-04 | global batch size: 256 | lm loss: 2.083189E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.352 | TFLOPs: 37.57 | 15: iteration 30430/ 125429 | consumed samples: 7790080 | consumed tokens: 15954083840 | elapsed time per iteration (s): 1.03 | learning rate: 1.766E-04 | global batch size: 256 | lm loss: 2.076915E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.357 | TFLOPs: 40.88 | 15: iteration 30440/ 125429 | consumed samples: 7792640 | consumed tokens: 15959326720 | elapsed time per iteration (s): 1.07 | learning rate: 1.766E-04 | global batch size: 256 | lm loss: 2.110422E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.318 | TFLOPs: 39.71 | 15: iteration 30450/ 125429 | consumed samples: 7795200 | consumed tokens: 15964569600 | elapsed time per iteration (s): 1.03 | learning rate: 1.765E-04 | global batch size: 256 | lm loss: 2.074253E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.122 | TFLOPs: 41.17 | 15: iteration 30460/ 125429 | consumed samples: 7797760 | consumed tokens: 15969812480 | elapsed time per iteration (s): 1.05 | learning rate: 1.765E-04 | global batch size: 256 | lm loss: 2.070690E+00 | grad norm: 0.130 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.136 | TFLOPs: 40.35 | 15: iteration 30470/ 125429 | consumed samples: 7800320 | consumed tokens: 15975055360 | elapsed time per iteration (s): 1.02 | learning rate: 1.765E-04 | global batch size: 256 | lm loss: 2.101581E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.929 | TFLOPs: 41.30 | 15: iteration 30480/ 125429 | consumed samples: 7802880 | consumed tokens: 15980298240 | elapsed time per iteration (s): 1.04 | learning rate: 1.765E-04 | global batch size: 256 | lm loss: 2.085666E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.520 | TFLOPs: 40.74 | 15: iteration 30490/ 125429 | consumed samples: 7805440 | consumed tokens: 15985541120 | elapsed time per iteration (s): 1.04 | learning rate: 1.765E-04 | global batch size: 256 | lm loss: 2.099729E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.177 | TFLOPs: 40.52 | 15: iteration 30500/ 125429 | consumed samples: 7808000 | consumed tokens: 15990784000 | elapsed time per iteration (s): 1.02 | learning rate: 1.765E-04 | global batch size: 256 | lm loss: 2.074669E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.401 | TFLOPs: 41.55 | 15: iteration 30510/ 125429 | consumed samples: 7810560 | consumed tokens: 15996026880 | elapsed time per iteration (s): 1.03 | learning rate: 1.765E-04 | global batch size: 256 | lm loss: 2.097039E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.423 | TFLOPs: 41.05 | 15: iteration 30520/ 125429 | consumed samples: 7813120 | consumed tokens: 16001269760 | elapsed time per iteration (s): 
1.06 | learning rate: 1.764E-04 | global batch size: 256 | lm loss: 2.089404E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.056 | TFLOPs: 39.84 | 15: iteration 30530/ 125429 | consumed samples: 7815680 | consumed tokens: 16006512640 | elapsed time per iteration (s): 1.14 | learning rate: 1.764E-04 | global batch size: 256 | lm loss: 2.079829E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 224.889 | TFLOPs: 37.16 | 15: iteration 30540/ 125429 | consumed samples: 7818240 | consumed tokens: 16011755520 | elapsed time per iteration (s): 1.05 | learning rate: 1.764E-04 | global batch size: 256 | lm loss: 2.060592E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.627 | TFLOPs: 40.26 | 15: iteration 30550/ 125429 | consumed samples: 7820800 | consumed tokens: 16016998400 | elapsed time per iteration (s): 1.05 | learning rate: 1.764E-04 | global batch size: 256 | lm loss: 2.106391E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.172 | TFLOPs: 40.35 | 15: iteration 30560/ 125429 | consumed samples: 7823360 | consumed tokens: 16022241280 | elapsed time per iteration (s): 1.05 | learning rate: 1.764E-04 | global batch size: 256 | lm loss: 2.107101E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.950 | TFLOPs: 40.48 | 15: iteration 30570/ 125429 | consumed samples: 7825920 | consumed tokens: 16027484160 | elapsed time per iteration (s): 1.09 | learning rate: 1.764E-04 | global batch size: 256 | lm loss: 2.123920E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.973 | TFLOPs: 38.67 | 15: iteration 30580/ 
125429 | consumed samples: 7828480 | consumed tokens: 16032727040 | elapsed time per iteration (s): 1.06 | learning rate: 1.763E-04 | global batch size: 256 | lm loss: 2.054058E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.357 | TFLOPs: 40.05 | 15: iteration 30590/ 125429 | consumed samples: 7831040 | consumed tokens: 16037969920 | elapsed time per iteration (s): 1.39 | learning rate: 1.763E-04 | global batch size: 256 | lm loss: 2.098332E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 184.608 | TFLOPs: 30.51 | 15: iteration 30600/ 125429 | consumed samples: 7833600 | consumed tokens: 16043212800 | elapsed time per iteration (s): 1.04 | learning rate: 1.763E-04 | global batch size: 256 | lm loss: 2.096399E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.479 | TFLOPs: 40.57 | 15: iteration 30610/ 125429 | consumed samples: 7836160 | consumed tokens: 16048455680 | elapsed time per iteration (s): 1.03 | learning rate: 1.763E-04 | global batch size: 256 | lm loss: 2.088530E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.401 | TFLOPs: 40.88 | 15: iteration 30620/ 125429 | consumed samples: 7838720 | consumed tokens: 16053698560 | elapsed time per iteration (s): 1.13 | learning rate: 1.763E-04 | global batch size: 256 | lm loss: 2.092880E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.774 | TFLOPs: 37.31 | 15: iteration 30630/ 125429 | consumed samples: 7841280 | consumed tokens: 16058941440 | elapsed time per iteration (s): 1.10 | learning rate: 1.763E-04 | global batch size: 256 | lm loss: 2.073292E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | samples per second: 232.544 | TFLOPs: 38.43 | 15: iteration 30640/ 125429 | consumed samples: 7843840 | consumed tokens: 16064184320 | elapsed time per iteration (s): 1.03 | learning rate: 1.763E-04 | global batch size: 256 | lm loss: 2.066543E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.683 | TFLOPs: 40.93 | 15: iteration 30650/ 125429 | consumed samples: 7846400 | consumed tokens: 16069427200 | elapsed time per iteration (s): 1.06 | learning rate: 1.762E-04 | global batch size: 256 | lm loss: 2.098368E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.217 | TFLOPs: 40.03 | 15: iteration 30660/ 125429 | consumed samples: 7848960 | consumed tokens: 16074670080 | elapsed time per iteration (s): 1.06 | learning rate: 1.762E-04 | global batch size: 256 | lm loss: 2.089351E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.307 | TFLOPs: 39.88 | 15: iteration 30670/ 125429 | consumed samples: 7851520 | consumed tokens: 16079912960 | elapsed time per iteration (s): 1.06 | learning rate: 1.762E-04 | global batch size: 256 | lm loss: 2.101595E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.501 | TFLOPs: 40.08 | 15: iteration 30680/ 125429 | consumed samples: 7854080 | consumed tokens: 16085155840 | elapsed time per iteration (s): 1.05 | learning rate: 1.762E-04 | global batch size: 256 | lm loss: 2.133988E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.827 | TFLOPs: 40.46 | 15: iteration 30690/ 125429 | consumed samples: 7856640 | consumed tokens: 16090398720 | elapsed time per iteration (s): 1.06 | learning rate: 1.762E-04 | global batch 
size: 256 | lm loss: 2.110817E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.997 | TFLOPs: 39.83 | 15: iteration 30700/ 125429 | consumed samples: 7859200 | consumed tokens: 16095641600 | elapsed time per iteration (s): 1.03 | learning rate: 1.762E-04 | global batch size: 256 | lm loss: 2.103849E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.454 | TFLOPs: 41.06 | 15: iteration 30710/ 125429 | consumed samples: 7861760 | consumed tokens: 16100884480 | elapsed time per iteration (s): 1.06 | learning rate: 1.761E-04 | global batch size: 256 | lm loss: 2.086550E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.500 | TFLOPs: 39.74 | 15: iteration 30720/ 125429 | consumed samples: 7864320 | consumed tokens: 16106127360 | elapsed time per iteration (s): 1.05 | learning rate: 1.761E-04 | global batch size: 256 | lm loss: 2.084286E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.721 | TFLOPs: 40.28 | 15: iteration 30730/ 125429 | consumed samples: 7866880 | consumed tokens: 16111370240 | elapsed time per iteration (s): 1.04 | learning rate: 1.761E-04 | global batch size: 256 | lm loss: 2.089653E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.097 | TFLOPs: 40.83 | 15: iteration 30740/ 125429 | consumed samples: 7869440 | consumed tokens: 16116613120 | elapsed time per iteration (s): 1.06 | learning rate: 1.761E-04 | global batch size: 256 | lm loss: 2.073568E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.790 | TFLOPs: 39.96 | 15: iteration 30750/ 125429 | consumed samples: 7872000 | consumed 
tokens: 16121856000 | elapsed time per iteration (s): 1.10 | learning rate: 1.761E-04 | global batch size: 256 | lm loss: 2.086069E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.709 | TFLOPs: 38.46 | 15: iteration 30760/ 125429 | consumed samples: 7874560 | consumed tokens: 16127098880 | elapsed time per iteration (s): 1.04 | learning rate: 1.761E-04 | global batch size: 256 | lm loss: 2.081900E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.481 | TFLOPs: 40.57 | 15: iteration 30770/ 125429 | consumed samples: 7877120 | consumed tokens: 16132341760 | elapsed time per iteration (s): 1.10 | learning rate: 1.761E-04 | global batch size: 256 | lm loss: 2.065758E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.853 | TFLOPs: 38.48 | 15: iteration 30780/ 125429 | consumed samples: 7879680 | consumed tokens: 16137584640 | elapsed time per iteration (s): 1.06 | learning rate: 1.760E-04 | global batch size: 256 | lm loss: 2.093832E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.711 | TFLOPs: 39.94 | 15: iteration 30790/ 125429 | consumed samples: 7882240 | consumed tokens: 16142827520 | elapsed time per iteration (s): 1.04 | learning rate: 1.760E-04 | global batch size: 256 | lm loss: 2.048032E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.099 | TFLOPs: 40.67 | 15: iteration 30800/ 125429 | consumed samples: 7884800 | consumed tokens: 16148070400 | elapsed time per iteration (s): 1.07 | learning rate: 1.760E-04 | global batch size: 256 | lm loss: 2.093050E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 240.359 | TFLOPs: 39.72 | 15: iteration 30810/ 125429 | consumed samples: 7887360 | consumed tokens: 16153313280 | elapsed time per iteration (s): 1.08 | learning rate: 1.760E-04 | global batch size: 256 | lm loss: 2.071103E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.459 | TFLOPs: 39.08 | 15: iteration 30820/ 125429 | consumed samples: 7889920 | consumed tokens: 16158556160 | elapsed time per iteration (s): 1.03 | learning rate: 1.760E-04 | global batch size: 256 | lm loss: 2.085174E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.752 | TFLOPs: 41.11 | 15: iteration 30830/ 125429 | consumed samples: 7892480 | consumed tokens: 16163799040 | elapsed time per iteration (s): 1.06 | learning rate: 1.760E-04 | global batch size: 256 | lm loss: 2.072429E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.113 | TFLOPs: 39.85 | 15: iteration 30840/ 125429 | consumed samples: 7895040 | consumed tokens: 16169041920 | elapsed time per iteration (s): 1.05 | learning rate: 1.759E-04 | global batch size: 256 | lm loss: 2.104132E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.907 | TFLOPs: 40.47 | 15: iteration 30850/ 125429 | consumed samples: 7897600 | consumed tokens: 16174284800 | elapsed time per iteration (s): 1.13 | learning rate: 1.759E-04 | global batch size: 256 | lm loss: 2.093980E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.103 | TFLOPs: 37.37 | 15: iteration 30860/ 125429 | consumed samples: 7900160 | consumed tokens: 16179527680 | elapsed time per iteration (s): 1.02 | learning rate: 1.759E-04 | global batch size: 256 | lm loss: 2.081832E+00 | grad norm: 
0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.579 | TFLOPs: 41.41 | 15: iteration 30870/ 125429 | consumed samples: 7902720 | consumed tokens: 16184770560 | elapsed time per iteration (s): 1.05 | learning rate: 1.759E-04 | global batch size: 256 | lm loss: 2.105141E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.712 | TFLOPs: 40.11 | 15: iteration 30880/ 125429 | consumed samples: 7905280 | consumed tokens: 16190013440 | elapsed time per iteration (s): 1.04 | learning rate: 1.759E-04 | global batch size: 256 | lm loss: 2.071339E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.269 | TFLOPs: 40.70 | 15: iteration 30890/ 125429 | consumed samples: 7907840 | consumed tokens: 16195256320 | elapsed time per iteration (s): 1.04 | learning rate: 1.759E-04 | global batch size: 256 | lm loss: 2.089943E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.265 | TFLOPs: 40.53 | 15: iteration 30900/ 125429 | consumed samples: 7910400 | consumed tokens: 16200499200 | elapsed time per iteration (s): 1.22 | learning rate: 1.759E-04 | global batch size: 256 | lm loss: 2.072261E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 209.139 | TFLOPs: 34.56 | 15: iteration 30910/ 125429 | consumed samples: 7912960 | consumed tokens: 16205742080 | elapsed time per iteration (s): 1.06 | learning rate: 1.758E-04 | global batch size: 256 | lm loss: 2.072412E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.619 | TFLOPs: 40.09 | 15: iteration 30920/ 125429 | consumed samples: 7915520 | consumed tokens: 16210984960 | elapsed time per 
iteration (s): 1.06 | learning rate: 1.758E-04 | global batch size: 256 | lm loss: 2.097262E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.644 | TFLOPs: 39.93 | 15: iteration 30930/ 125429 | consumed samples: 7918080 | consumed tokens: 16216227840 | elapsed time per iteration (s): 1.04 | learning rate: 1.758E-04 | global batch size: 256 | lm loss: 2.081548E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.179 | TFLOPs: 40.68 | 15: iteration 30940/ 125429 | consumed samples: 7920640 | consumed tokens: 16221470720 | elapsed time per iteration (s): 1.03 | learning rate: 1.758E-04 | global batch size: 256 | lm loss: 2.093656E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.855 | TFLOPs: 41.13 | 15: iteration 30950/ 125429 | consumed samples: 7923200 | consumed tokens: 16226713600 | elapsed time per iteration (s): 1.04 | learning rate: 1.758E-04 | global batch size: 256 | lm loss: 2.069310E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.699 | TFLOPs: 40.77 | 15: iteration 30960/ 125429 | consumed samples: 7925760 | consumed tokens: 16231956480 | elapsed time per iteration (s): 1.04 | learning rate: 1.758E-04 | global batch size: 256 | lm loss: 2.103293E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.111 | TFLOPs: 40.67 | 15: iteration 30970/ 125429 | consumed samples: 7928320 | consumed tokens: 16237199360 | elapsed time per iteration (s): 1.04 | learning rate: 1.757E-04 | global batch size: 256 | lm loss: 2.096610E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.482 | TFLOPs: 40.57 | 15: 
iteration 30980/ 125429 | consumed samples: 7930880 | consumed tokens: 16242442240 | elapsed time per iteration (s): 1.04 | learning rate: 1.757E-04 | global batch size: 256 | lm loss: 2.053459E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.684 | TFLOPs: 40.60 | 15: iteration 30990/ 125429 | consumed samples: 7933440 | consumed tokens: 16247685120 | elapsed time per iteration (s): 1.03 | learning rate: 1.757E-04 | global batch size: 256 | lm loss: 2.084346E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.210 | TFLOPs: 41.02 | 15: iteration 31000/ 125429 | consumed samples: 7936000 | consumed tokens: 16252928000 | elapsed time per iteration (s): 1.03 | learning rate: 1.757E-04 | global batch size: 256 | lm loss: 2.091650E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.159 | TFLOPs: 41.18 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 31000 | lm loss value: 2.002763E+00 | lm loss PPL: 7.409502E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 31000 to checkpoints_1b5 0: [2022-11-26 05:04:07,877] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step31000 is begin to save! 0: [2022-11-26 05:04:07,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_01-model_00-model_states.pt... 0: [2022-11-26 05:04:08,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_01-model_00-model_states.pt. 0: [2022-11-26 05:04:08,146] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_03-model_00-model_states.pt... 
0: [2022-11-26 05:04:08,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_03-model_00-model_states.pt. 0: [2022-11-26 05:04:08,256] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_04-model_00-model_states.pt... 0: [2022-11-26 05:04:08,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_04-model_00-model_states.pt. 0: [2022-11-26 05:04:08,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_05-model_00-model_states.pt... 0: [2022-11-26 05:04:08,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_05-model_00-model_states.pt. 0: [2022-11-26 05:04:08,489] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_06-model_00-model_states.pt... 0: [2022-11-26 05:04:08,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_06-model_00-model_states.pt. 0: [2022-11-26 05:04:08,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_07-model_00-model_states.pt... 0: [2022-11-26 05:04:08,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_07-model_00-model_states.pt. 0: [2022-11-26 05:04:08,721] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_08-model_00-model_states.pt... 0: [2022-11-26 05:04:08,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_08-model_00-model_states.pt. 0: [2022-11-26 05:04:08,837] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_09-model_00-model_states.pt... 
0: [2022-11-26 05:04:08,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_09-model_00-model_states.pt. 0: [2022-11-26 05:04:08,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_10-model_00-model_states.pt... 0: [2022-11-26 05:04:09,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_10-model_00-model_states.pt. 0: [2022-11-26 05:04:09,070] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_11-model_00-model_states.pt... 0: [2022-11-26 05:04:09,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_11-model_00-model_states.pt. 0: [2022-11-26 05:04:09,187] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_12-model_00-model_states.pt... 0: [2022-11-26 05:04:09,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_12-model_00-model_states.pt. 0: [2022-11-26 05:04:09,304] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_13-model_00-model_states.pt... 0: [2022-11-26 05:04:09,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_13-model_00-model_states.pt. 0: [2022-11-26 05:04:09,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_14-model_00-model_states.pt... 0: [2022-11-26 05:04:09,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_14-model_00-model_states.pt. 0: [2022-11-26 05:04:09,536] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_15-model_00-model_states.pt... 
0: [2022-11-26 05:04:09,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_15-model_00-model_states.pt. 0: [2022-11-26 05:04:09,653] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_16-model_00-model_states.pt... 0: [2022-11-26 05:04:09,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_16-model_00-model_states.pt. 0: [2022-11-26 05:04:09,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_17-model_00-model_states.pt... 0: [2022-11-26 05:04:09,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_17-model_00-model_states.pt. 0: [2022-11-26 05:04:09,880] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_18-model_00-model_states.pt... 0: [2022-11-26 05:04:09,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_18-model_00-model_states.pt. 0: [2022-11-26 05:04:09,995] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_19-model_00-model_states.pt... 0: [2022-11-26 05:04:10,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_19-model_00-model_states.pt. 0: [2022-11-26 05:04:10,107] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_20-model_00-model_states.pt... 0: [2022-11-26 05:04:10,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_20-model_00-model_states.pt. 0: [2022-11-26 05:04:10,219] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_21-model_00-model_states.pt... 
0: [2022-11-26 05:04:10,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_21-model_00-model_states.pt. 0: [2022-11-26 05:04:10,332] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_22-model_00-model_states.pt... 0: [2022-11-26 05:04:10,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_22-model_00-model_states.pt. 0: [2022-11-26 05:04:10,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_23-model_00-model_states.pt... 0: [2022-11-26 05:04:10,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_23-model_00-model_states.pt. 0: [2022-11-26 05:04:10,557] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_24-model_00-model_states.pt... 0: [2022-11-26 05:04:10,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_24-model_00-model_states.pt. 0: [2022-11-26 05:04:10,670] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_25-model_00-model_states.pt... 0: [2022-11-26 05:04:10,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_25-model_00-model_states.pt. 0: [2022-11-26 05:04:10,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_26-model_00-model_states.pt... 0: [2022-11-26 05:04:10,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_26-model_00-model_states.pt. 0: [2022-11-26 05:04:10,896] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_27-model_00-model_states.pt... 
0: [2022-11-26 05:04:11,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_27-model_00-model_states.pt. 0: [2022-11-26 05:04:11,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_28-model_00-model_states.pt... 0: [2022-11-26 05:04:11,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_28-model_00-model_states.pt. 0: [2022-11-26 05:04:11,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_29-model_00-model_states.pt... 0: [2022-11-26 05:04:11,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_29-model_00-model_states.pt. 0: [2022-11-26 05:04:11,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_30-model_00-model_states.pt... 0: [2022-11-26 05:04:11,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_30-model_00-model_states.pt. 0: [2022-11-26 05:04:11,346] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/layer_32-model_00-model_states.pt... 0: [2022-11-26 05:04:11,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/layer_32-model_00-model_states.pt. 0: [2022-11-26 05:04:11,353] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step31000/mp_rank_00_model_states.pt 0: [2022-11-26 05:04:11,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/mp_rank_00_model_states.pt... 0: [2022-11-26 05:04:11,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/mp_rank_00_model_states.pt. 
0: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
12: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
8: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:04:11,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:04:11,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:04:11,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-26 05:04:11,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:04:11,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:04:11,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
2: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
11: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:04:11,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
5: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
13: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:04:11,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
5: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:04:11,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step31000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:04:11,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:04:11,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:04:11,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 05:04:11,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
2: [2022-11-26 05:04:11,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:04:11,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 05:04:11,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 6: [2022-11-26 05:04:11,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:04:11,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 05:04:11,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 4: [2022-11-26 05:04:11,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:04:11,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 05:04:11,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 3: [2022-11-26 05:04:11,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:04:11,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 05:04:11,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
4: [2022-11-26 05:04:11,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:04:11,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 1: [2022-11-26 05:04:11,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:04:11,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:04:11,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 05:04:11,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 05:04:11,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 05:04:11,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 05:04:11,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 05:04:11,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:04:11,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 05:04:11,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
3: [2022-11-26 05:04:11,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:04:11,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 05:04:11,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 13: [2022-11-26 05:04:11,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:04:11,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 05:04:11,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 4: [2022-11-26 05:04:11,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:04:11,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 05:04:11,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 05:04:11,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:04:11,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 05:04:11,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
11: [2022-11-26 05:04:11,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:04:11,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:04:11,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:04:11,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:04:11,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 05:04:11,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: [2022-11-26 05:04:11,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 05:04:11,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 05:04:11,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: [2022-11-26 05:04:11,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 4: [2022-11-26 05:04:11,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 05:04:11,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 05:04:11,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 11: [2022-11-26 05:04:11,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 05:04:11,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: [2022-11-26 05:04:11,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:04:11,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 05:04:11,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 05:04:11,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:04:11,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 05:04:11,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 05:04:11,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:04:11,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 05:04:11,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:04:11,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 05:04:11,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 05:04:11,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 05:04:11,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 05:04:11,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 05:04:11,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 11: [2022-11-26 05:04:11,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:04:11,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 05:04:11,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 05:04:11,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:04:11,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 05:04:11,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 05:04:11,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 13: [2022-11-26 05:04:11,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:04:11,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:04:11,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:04:11,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 05:04:11,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 4: [2022-11-26 05:04:11,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 13: [2022-11-26 05:04:11,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 13: [2022-11-26 05:04:11,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 4: [2022-11-26 05:04:11,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
1: [2022-11-26 05:04:11,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 05:04:11,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 3: [2022-11-26 05:04:11,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:04:11,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:04:11,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 05:04:11,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 05:04:11,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 3: [2022-11-26 05:04:11,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 4: [2022-11-26 05:04:11,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:04:11,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 05:04:11,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 3: [2022-11-26 05:04:11,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 05:04:11,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 05:04:11,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 05:04:11,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:04:11,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 05:04:11,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 05:04:11,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:04:11,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 05:04:11,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 05:04:11,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:04:11,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 05:04:11,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: [2022-11-26 05:04:11,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 05:04:11,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 05:04:11,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 3: [2022-11-26 05:04:11,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:04:11,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 05:04:11,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 05:04:11,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:04:11,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 05:04:11,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 2: [2022-11-26 05:04:11,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:04:11,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 05:04:11,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 05:04:11,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 05:04:11,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 05:04:11,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 8: [2022-11-26 05:04:11,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:04:11,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:04:11,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:04:11,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 11: [2022-11-26 05:04:11,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:04:11,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 05:04:11,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 05:04:11,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 8: [2022-11-26 05:04:11,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
11: [2022-11-26 05:04:11,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 8: [2022-11-26 05:04:11,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 11: [2022-11-26 05:04:11,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 05:04:11,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:04:11,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 05:04:11,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 05:04:11,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:04:11,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 05:04:11,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 3: [2022-11-26 05:04:11,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:04:11,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 05:04:11,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
4: [2022-11-26 05:04:11,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:04:11,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:04:11,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 3: [2022-11-26 05:04:11,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:04:11,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 4: [2022-11-26 05:04:11,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 3: [2022-11-26 05:04:11,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 1: [2022-11-26 05:04:11,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 3: [2022-11-26 05:04:11,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 8: [2022-11-26 05:04:11,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:04:11,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 05:04:11,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
6: [2022-11-26 05:04:11,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:04:11,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 05:04:11,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 13: [2022-11-26 05:04:11,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:04:11,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:04:11,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:04:11,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 05:04:11,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 8: [2022-11-26 05:04:11,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 05:04:11,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 13: [2022-11-26 05:04:11,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 13: [2022-11-26 05:04:11,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
6: [2022-11-26 05:04:11,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:04:11,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 05:04:11,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 6: [2022-11-26 05:04:11,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:04:11,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 05:04:11,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 6: [2022-11-26 05:04:11,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:04:11,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 05:04:11,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 6: [2022-11-26 05:04:11,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:04:11,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 05:04:11,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
4: [2022-11-26 05:04:11,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:04:11,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 05:04:11,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 13: [2022-11-26 05:04:11,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:04:11,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 05:04:11,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 13: [2022-11-26 05:04:11,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:04:11,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 05:04:11,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 0: [2022-11-26 05:04:11,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:04:11,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 05:04:11,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
8: [2022-11-26 05:04:11,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:04:11,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:04:11,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:04:11,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 05:04:11,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 05:04:11,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 8: [2022-11-26 05:04:11,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 05:04:11,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 8: [2022-11-26 05:04:11,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 13: [2022-11-26 05:04:11,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:04:11,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 05:04:11,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
2: [2022-11-26 05:04:11,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:04:11,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 05:04:11,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 11: [2022-11-26 05:04:11,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:04:11,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:04:11,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 05:04:11,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 05:04:11,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 11: [2022-11-26 05:04:11,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 11: [2022-11-26 05:04:11,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:04:11,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:04:11,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
6: [2022-11-26 05:04:11,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:04:11,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 05:04:11,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 6: [2022-11-26 05:04:11,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:04:11,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 05:04:11,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 7: [2022-11-26 05:04:11,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:04:11,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 05:04:11,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
11: [2022-11-26 05:04:11,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 05:04:11,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 05:04:11,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 05:04:11,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 11: [2022-11-26 05:04:11,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 11: [2022-11-26 05:04:11,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 05:04:11,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:04:11,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 05:04:11,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 1: [2022-11-26 05:04:11,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:04:11,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 05:04:11,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
12: [2022-11-26 05:04:11,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:04:11,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:04:11,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:04:11,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:04:11,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 05:04:11,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 05:04:11,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 05:04:11,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 12: [2022-11-26 05:04:11,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 12: [2022-11-26 05:04:11,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 05:04:11,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 12: [2022-11-26 05:04:11,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
12: [2022-11-26 05:04:11,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:04:11,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 05:04:11,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 10: [2022-11-26 05:04:11,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:04:11,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:04:11,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 05:04:11,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 10: [2022-11-26 05:04:11,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 05:04:11,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:04:11,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 10: [2022-11-26 05:04:11,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 05:04:11,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
10: [2022-11-26 05:04:11,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:04:11,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:04:11,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 05:04:11,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:04:11,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 05:04:11,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:04:11,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 10: [2022-11-26 05:04:11,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 05:04:11,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 10: [2022-11-26 05:04:11,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 05:04:11,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 10: [2022-11-26 05:04:11,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
10: [2022-11-26 05:04:11,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:04:11,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 05:04:11,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 12: [2022-11-26 05:04:11,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:04:11,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 05:04:11,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 12: [2022-11-26 05:04:11,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:04:11,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 05:04:11,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 12: [2022-11-26 05:04:11,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:04:11,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 05:04:11,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
9: [2022-11-26 05:04:11,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:04:11,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:04:11,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:04:11,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:04:11,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 05:04:11,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 05:04:11,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 05:04:11,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 05:04:11,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 9: [2022-11-26 05:04:11,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 9: [2022-11-26 05:04:11,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 9: [2022-11-26 05:04:11,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
9: [2022-11-26 05:04:11,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:04:11,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 05:04:11,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 9: [2022-11-26 05:04:11,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:04:11,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 05:04:11,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 9: [2022-11-26 05:04:11,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:04:11,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 05:04:11,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 9: [2022-11-26 05:04:11,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:04:11,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 05:04:11,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
14: [2022-11-26 05:04:11,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:04:11,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 05:04:11,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:04:11,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:04:11,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 14: [2022-11-26 05:04:11,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 05:04:11,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 05:04:11,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 14: [2022-11-26 05:04:11,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 14: [2022-11-26 05:04:11,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:04:11,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:04:11,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 05:04:11,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:04:11,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 05:04:11,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 05:04:11,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 05:04:11,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 05:04:11,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 14: [2022-11-26 05:04:11,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 14: [2022-11-26 05:04:11,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 14: [2022-11-26 05:04:11,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 14: [2022-11-26 05:04:11,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:04:11,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 05:04:11,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
0: [2022-11-26 05:04:11,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 05:04:11,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 05:04:11,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:04:11,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:04:11,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:04:11,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:04:11,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:04:11,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:04:11,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:04:11,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 05:04:11,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 05:04:11,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 05:04:11,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 05:04:11,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 05:04:11,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 05:04:11,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 05:04:11,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 05:04:11,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 05:04:11,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 05:04:11,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 05:04:11,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 05:04:11,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
5: [2022-11-26 05:04:11,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 05:04:11,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 05:04:11,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 5: [2022-11-26 05:04:11,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 15: [2022-11-26 05:04:11,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:04:11,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:04:11,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:04:11,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:04:11,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 05:04:11,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 05:04:11,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 05:04:11,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 05:04:11,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 15: [2022-11-26 05:04:11,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 15: [2022-11-26 05:04:11,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 15: [2022-11-26 05:04:11,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 05:04:11,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 05:04:11,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 15: [2022-11-26 05:04:11,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 15: [2022-11-26 05:04:11,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:04:11,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-26 05:04:11,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:04:11,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 05:04:11,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 05:04:11,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step31000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 05:04:11,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 15: [2022-11-26 05:04:11,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 15: [2022-11-26 05:04:11,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step31000 is ready now! 
0: successfully saved checkpoint at iteration 31000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3962.07 15: iteration 31010/ 125429 | consumed samples: 7938560 | consumed tokens: 16258170880 | elapsed time per iteration (s): 1.46 | learning rate: 1.757E-04 | global batch size: 256 | lm loss: 2.099804E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 174.923 | TFLOPs: 28.91 | 15: iteration 31020/ 125429 | consumed samples: 7941120 | consumed tokens: 16263413760 | elapsed time per iteration (s): 1.04 | learning rate: 1.757E-04 | global batch size: 256 | lm loss: 2.137166E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.245 | TFLOPs: 40.53 | 15: iteration 31030/ 125429 | consumed samples: 7943680 | consumed tokens: 16268656640 | elapsed time per iteration (s): 1.05 | learning rate: 1.756E-04 | global batch size: 256 | lm loss: 2.084248E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.788 | TFLOPs: 40.29 | 15: iteration 31040/ 125429 | consumed samples: 7946240 | consumed tokens: 16273899520 | elapsed time per iteration (s): 1.03 | learning rate: 1.756E-04 | global batch size: 256 | lm loss: 2.147338E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.114 | TFLOPs: 41.00 | 15: iteration 31050/ 125429 | consumed samples: 7948800 | consumed tokens: 16279142400 | elapsed time per iteration (s): 1.08 | learning rate: 1.756E-04 | global batch size: 256 | lm loss: 2.080739E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.579 | TFLOPs: 39.26 | 15: iteration 31060/ 125429 | consumed samples: 7951360 | consumed tokens: 16284385280 | elapsed time per iteration (s): 1.04 | learning 
rate: 1.756E-04 | global batch size: 256 | lm loss: 2.076691E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.279 | TFLOPs: 40.70 | 15: iteration 31070/ 125429 | consumed samples: 7953920 | consumed tokens: 16289628160 | elapsed time per iteration (s): 1.03 | learning rate: 1.756E-04 | global batch size: 256 | lm loss: 2.125444E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.165 | TFLOPs: 41.18 | 15: iteration 31080/ 125429 | consumed samples: 7956480 | consumed tokens: 16294871040 | elapsed time per iteration (s): 1.06 | learning rate: 1.756E-04 | global batch size: 256 | lm loss: 2.105479E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.873 | TFLOPs: 39.97 | 15: iteration 31090/ 125429 | consumed samples: 7959040 | consumed tokens: 16300113920 | elapsed time per iteration (s): 1.05 | learning rate: 1.756E-04 | global batch size: 256 | lm loss: 2.110537E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.718 | TFLOPs: 40.28 | 15: iteration 31100/ 125429 | consumed samples: 7961600 | consumed tokens: 16305356800 | elapsed time per iteration (s): 1.02 | learning rate: 1.755E-04 | global batch size: 256 | lm loss: 2.076660E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.126 | TFLOPs: 41.50 | 15: iteration 31110/ 125429 | consumed samples: 7964160 | consumed tokens: 16310599680 | elapsed time per iteration (s): 1.04 | learning rate: 1.755E-04 | global batch size: 256 | lm loss: 2.100028E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.202 | TFLOPs: 40.85 | 15: iteration 31120/ 125429 | 
consumed samples: 7966720 | consumed tokens: 16315842560 | elapsed time per iteration (s): 1.03 | learning rate: 1.755E-04 | global batch size: 256 | lm loss: 2.081158E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.677 | TFLOPs: 41.26 | 15: iteration 31130/ 125429 | consumed samples: 7969280 | consumed tokens: 16321085440 | elapsed time per iteration (s): 1.02 | learning rate: 1.755E-04 | global batch size: 256 | lm loss: 2.115551E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.868 | TFLOPs: 41.29 | 15: iteration 31140/ 125429 | consumed samples: 7971840 | consumed tokens: 16326328320 | elapsed time per iteration (s): 1.03 | learning rate: 1.755E-04 | global batch size: 256 | lm loss: 2.088512E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.459 | TFLOPs: 41.22 | 15: iteration 31150/ 125429 | consumed samples: 7974400 | consumed tokens: 16331571200 | elapsed time per iteration (s): 1.02 | learning rate: 1.755E-04 | global batch size: 256 | lm loss: 2.072178E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.879 | TFLOPs: 41.46 | 15: iteration 31160/ 125429 | consumed samples: 7976960 | consumed tokens: 16336814080 | elapsed time per iteration (s): 1.07 | learning rate: 1.754E-04 | global batch size: 256 | lm loss: 2.079854E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.211 | TFLOPs: 39.70 | 15: iteration 31170/ 125429 | consumed samples: 7979520 | consumed tokens: 16342056960 | elapsed time per iteration (s): 1.04 | learning rate: 1.754E-04 | global batch size: 256 | lm loss: 2.090577E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 246.407 | TFLOPs: 40.72 | 15: iteration 31180/ 125429 | consumed samples: 7982080 | consumed tokens: 16347299840 | elapsed time per iteration (s): 1.02 | learning rate: 1.754E-04 | global batch size: 256 | lm loss: 2.103204E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.760 | TFLOPs: 41.27 | 15: iteration 31190/ 125429 | consumed samples: 7984640 | consumed tokens: 16352542720 | elapsed time per iteration (s): 1.06 | learning rate: 1.754E-04 | global batch size: 256 | lm loss: 2.080689E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.186 | TFLOPs: 40.02 | 15: iteration 31200/ 125429 | consumed samples: 7987200 | consumed tokens: 16357785600 | elapsed time per iteration (s): 1.06 | learning rate: 1.754E-04 | global batch size: 256 | lm loss: 2.089248E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.363 | TFLOPs: 39.89 | 15: iteration 31210/ 125429 | consumed samples: 7989760 | consumed tokens: 16363028480 | elapsed time per iteration (s): 1.06 | learning rate: 1.754E-04 | global batch size: 256 | lm loss: 2.080248E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.910 | TFLOPs: 39.81 | 15: iteration 31220/ 125429 | consumed samples: 7992320 | consumed tokens: 16368271360 | elapsed time per iteration (s): 1.07 | learning rate: 1.754E-04 | global batch size: 256 | lm loss: 2.104524E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.523 | TFLOPs: 39.42 | 15: iteration 31230/ 125429 | consumed samples: 7994880 | consumed tokens: 16373514240 | elapsed time per iteration (s): 1.08 | learning rate: 1.753E-04 | global batch size: 
256 | lm loss: 2.081774E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.084 | TFLOPs: 39.35 | 15: iteration 31240/ 125429 | consumed samples: 7997440 | consumed tokens: 16378757120 | elapsed time per iteration (s): 1.06 | learning rate: 1.753E-04 | global batch size: 256 | lm loss: 2.103983E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.865 | TFLOPs: 39.80 | 15: iteration 31250/ 125429 | consumed samples: 8000000 | consumed tokens: 16384000000 | elapsed time per iteration (s): 1.05 | learning rate: 1.753E-04 | global batch size: 256 | lm loss: 2.078317E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.384 | TFLOPs: 40.22 | 15: iteration 31260/ 125429 | consumed samples: 8002560 | consumed tokens: 16389242880 | elapsed time per iteration (s): 1.06 | learning rate: 1.753E-04 | global batch size: 256 | lm loss: 2.059738E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.351 | TFLOPs: 40.05 | 15: iteration 31270/ 125429 | consumed samples: 8005120 | consumed tokens: 16394485760 | elapsed time per iteration (s): 1.03 | learning rate: 1.753E-04 | global batch size: 256 | lm loss: 2.057043E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.186 | TFLOPs: 41.01 | 15: iteration 31280/ 125429 | consumed samples: 8007680 | consumed tokens: 16399728640 | elapsed time per iteration (s): 1.03 | learning rate: 1.753E-04 | global batch size: 256 | lm loss: 2.066974E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.350 | TFLOPs: 40.88 | 15: iteration 31290/ 125429 | consumed samples: 8010240 | consumed 
tokens: 16404971520 | elapsed time per iteration (s): 1.03 | learning rate: 1.752E-04 | global batch size: 256 | lm loss: 2.061518E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.043 | TFLOPs: 40.99 | 15: iteration 31300/ 125429 | consumed samples: 8012800 | consumed tokens: 16410214400 | elapsed time per iteration (s): 1.04 | learning rate: 1.752E-04 | global batch size: 256 | lm loss: 2.091862E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.459 | TFLOPs: 40.56 | 15: iteration 31310/ 125429 | consumed samples: 8015360 | consumed tokens: 16415457280 | elapsed time per iteration (s): 1.04 | learning rate: 1.752E-04 | global batch size: 256 | lm loss: 2.053018E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.253 | TFLOPs: 40.86 | 15: iteration 31320/ 125429 | consumed samples: 8017920 | consumed tokens: 16420700160 | elapsed time per iteration (s): 1.03 | learning rate: 1.752E-04 | global batch size: 256 | lm loss: 2.103154E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.205 | TFLOPs: 41.02 | 15: iteration 31330/ 125429 | consumed samples: 8020480 | consumed tokens: 16425943040 | elapsed time per iteration (s): 1.04 | learning rate: 1.752E-04 | global batch size: 256 | lm loss: 2.069328E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.099 | TFLOPs: 40.84 | 15: iteration 31340/ 125429 | consumed samples: 8023040 | consumed tokens: 16431185920 | elapsed time per iteration (s): 1.05 | learning rate: 1.752E-04 | global batch size: 256 | lm loss: 2.108176E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 244.045 | TFLOPs: 40.33 | 15: iteration 31350/ 125429 | consumed samples: 8025600 | consumed tokens: 16436428800 | elapsed time per iteration (s): 1.05 | learning rate: 1.751E-04 | global batch size: 256 | lm loss: 2.072722E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.883 | TFLOPs: 40.47 | 15: iteration 31360/ 125429 | consumed samples: 8028160 | consumed tokens: 16441671680 | elapsed time per iteration (s): 1.04 | learning rate: 1.751E-04 | global batch size: 256 | lm loss: 2.107678E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.053 | TFLOPs: 40.83 | 15: iteration 31370/ 125429 | consumed samples: 8030720 | consumed tokens: 16446914560 | elapsed time per iteration (s): 1.02 | learning rate: 1.751E-04 | global batch size: 256 | lm loss: 2.069990E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.835 | TFLOPs: 41.29 | 15: iteration 31380/ 125429 | consumed samples: 8033280 | consumed tokens: 16452157440 | elapsed time per iteration (s): 1.04 | learning rate: 1.751E-04 | global batch size: 256 | lm loss: 2.069127E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.356 | TFLOPs: 40.55 | 15: iteration 31390/ 125429 | consumed samples: 8035840 | consumed tokens: 16457400320 | elapsed time per iteration (s): 1.03 | learning rate: 1.751E-04 | global batch size: 256 | lm loss: 2.104559E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.514 | TFLOPs: 40.90 | 15: iteration 31400/ 125429 | consumed samples: 8038400 | consumed tokens: 16462643200 | elapsed time per iteration (s): 1.03 | learning rate: 1.751E-04 | global batch size: 256 | lm loss: 2.080371E+00 | grad norm: 
0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.222 | TFLOPs: 41.02 | 15: iteration 31410/ 125429 | consumed samples: 8040960 | consumed tokens: 16467886080 | elapsed time per iteration (s): 1.05 | learning rate: 1.751E-04 | global batch size: 256 | lm loss: 2.087226E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.083 | TFLOPs: 40.17 | 15: iteration 31420/ 125429 | consumed samples: 8043520 | consumed tokens: 16473128960 | elapsed time per iteration (s): 1.03 | learning rate: 1.750E-04 | global batch size: 256 | lm loss: 2.068364E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.764 | TFLOPs: 40.94 | 15: iteration 31430/ 125429 | consumed samples: 8046080 | consumed tokens: 16478371840 | elapsed time per iteration (s): 1.08 | learning rate: 1.750E-04 | global batch size: 256 | lm loss: 2.104661E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.192 | TFLOPs: 39.20 | 15: iteration 31440/ 125429 | consumed samples: 8048640 | consumed tokens: 16483614720 | elapsed time per iteration (s): 1.03 | learning rate: 1.750E-04 | global batch size: 256 | lm loss: 2.091561E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.285 | TFLOPs: 41.20 | 15: iteration 31450/ 125429 | consumed samples: 8051200 | consumed tokens: 16488857600 | elapsed time per iteration (s): 1.07 | learning rate: 1.750E-04 | global batch size: 256 | lm loss: 2.065530E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.988 | TFLOPs: 39.66 | 15: iteration 31460/ 125429 | consumed samples: 8053760 | consumed tokens: 16494100480 | elapsed time per 
iteration (s): 1.08 | learning rate: 1.750E-04 | global batch size: 256 | lm loss: 2.113960E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.208 | TFLOPs: 39.04 | 15: iteration 31470/ 125429 | consumed samples: 8056320 | consumed tokens: 16499343360 | elapsed time per iteration (s): 1.13 | learning rate: 1.750E-04 | global batch size: 256 | lm loss: 2.120380E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.451 | TFLOPs: 37.42 | 15: iteration 31480/ 125429 | consumed samples: 8058880 | consumed tokens: 16504586240 | elapsed time per iteration (s): 1.04 | learning rate: 1.749E-04 | global batch size: 256 | lm loss: 2.097522E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.244 | TFLOPs: 40.69 | 15: iteration 31490/ 125429 | consumed samples: 8061440 | consumed tokens: 16509829120 | elapsed time per iteration (s): 1.11 | learning rate: 1.749E-04 | global batch size: 256 | lm loss: 2.073109E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.073 | TFLOPs: 38.19 | 15: iteration 31500/ 125429 | consumed samples: 8064000 | consumed tokens: 16515072000 | elapsed time per iteration (s): 1.05 | learning rate: 1.749E-04 | global batch size: 256 | lm loss: 2.082616E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.835 | TFLOPs: 40.13 | 15: iteration 31510/ 125429 | consumed samples: 8066560 | consumed tokens: 16520314880 | elapsed time per iteration (s): 1.06 | learning rate: 1.749E-04 | global batch size: 256 | lm loss: 2.124192E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.599 | TFLOPs: 40.09 | 15: 
iteration 31520/ 125429 | consumed samples: 8069120 | consumed tokens: 16525557760 | elapsed time per iteration (s): 1.05 | learning rate: 1.749E-04 | global batch size: 256 | lm loss: 2.046237E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.317 | TFLOPs: 40.21 | 15: iteration 31530/ 125429 | consumed samples: 8071680 | consumed tokens: 16530800640 | elapsed time per iteration (s): 1.06 | learning rate: 1.749E-04 | global batch size: 256 | lm loss: 2.061749E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.889 | TFLOPs: 39.97 | 15: iteration 31540/ 125429 | consumed samples: 8074240 | consumed tokens: 16536043520 | elapsed time per iteration (s): 1.06 | learning rate: 1.748E-04 | global batch size: 256 | lm loss: 2.083812E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.233 | TFLOPs: 39.87 | 15: iteration 31550/ 125429 | consumed samples: 8076800 | consumed tokens: 16541286400 | elapsed time per iteration (s): 1.06 | learning rate: 1.748E-04 | global batch size: 256 | lm loss: 2.088103E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.594 | TFLOPs: 40.09 | 15: iteration 31560/ 125429 | consumed samples: 8079360 | consumed tokens: 16546529280 | elapsed time per iteration (s): 1.06 | learning rate: 1.748E-04 | global batch size: 256 | lm loss: 2.090440E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.758 | TFLOPs: 39.95 | 15: iteration 31570/ 125429 | consumed samples: 8081920 | consumed tokens: 16551772160 | elapsed time per iteration (s): 1.06 | learning rate: 1.748E-04 | global batch size: 256 | lm loss: 2.077266E+00 | grad norm: 0.139 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.066 | TFLOPs: 40.00 | 15: iteration 31580/ 125429 | consumed samples: 8084480 | consumed tokens: 16557015040 | elapsed time per iteration (s): 1.03 | learning rate: 1.748E-04 | global batch size: 256 | lm loss: 2.094287E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.818 | TFLOPs: 40.95 | 15: iteration 31590/ 125429 | consumed samples: 8087040 | consumed tokens: 16562257920 | elapsed time per iteration (s): 1.47 | learning rate: 1.748E-04 | global batch size: 256 | lm loss: 2.069889E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 173.895 | TFLOPs: 28.74 | 15: iteration 31600/ 125429 | consumed samples: 8089600 | consumed tokens: 16567500800 | elapsed time per iteration (s): 1.06 | learning rate: 1.748E-04 | global batch size: 256 | lm loss: 2.065867E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.787 | TFLOPs: 39.79 | 15: iteration 31610/ 125429 | consumed samples: 8092160 | consumed tokens: 16572743680 | elapsed time per iteration (s): 1.04 | learning rate: 1.747E-04 | global batch size: 256 | lm loss: 2.067763E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.293 | TFLOPs: 40.54 | 15: iteration 31620/ 125429 | consumed samples: 8094720 | consumed tokens: 16577986560 | elapsed time per iteration (s): 1.02 | learning rate: 1.747E-04 | global batch size: 256 | lm loss: 2.093371E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.969 | TFLOPs: 41.31 | 15: iteration 31630/ 125429 | consumed samples: 8097280 | consumed tokens: 16583229440 | elapsed time per iteration (s): 1.05 | learning rate: 
1.747E-04 | global batch size: 256 | lm loss: 2.084023E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.372 | TFLOPs: 40.38 | 15: iteration 31640/ 125429 | consumed samples: 8099840 | consumed tokens: 16588472320 | elapsed time per iteration (s): 1.05 | learning rate: 1.747E-04 | global batch size: 256 | lm loss: 2.090969E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.375 | TFLOPs: 40.22 | 15: iteration 31650/ 125429 | consumed samples: 8102400 | consumed tokens: 16593715200 | elapsed time per iteration (s): 1.02 | learning rate: 1.747E-04 | global batch size: 256 | lm loss: 2.085041E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.407 | TFLOPs: 41.55 | 15: iteration 31660/ 125429 | consumed samples: 8104960 | consumed tokens: 16598958080 | elapsed time per iteration (s): 1.06 | learning rate: 1.747E-04 | global batch size: 256 | lm loss: 2.070277E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.047 | TFLOPs: 39.83 | 15: iteration 31670/ 125429 | consumed samples: 8107520 | consumed tokens: 16604200960 | elapsed time per iteration (s): 1.06 | learning rate: 1.746E-04 | global batch size: 256 | lm loss: 2.091488E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.425 | TFLOPs: 40.06 | 15: iteration 31680/ 125429 | consumed samples: 8110080 | consumed tokens: 16609443840 | elapsed time per iteration (s): 1.08 | learning rate: 1.746E-04 | global batch size: 256 | lm loss: 2.076811E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.112 | TFLOPs: 39.18 | 15: iteration 31690/ 125429 | consumed 
samples: 8112640 | consumed tokens: 16614686720 | elapsed time per iteration (s): 1.08 | learning rate: 1.746E-04 | global batch size: 256 | lm loss: 2.097066E+00 | grad norm: 0.184 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.707 | TFLOPs: 39.12 | 15: iteration 31700/ 125429 | consumed samples: 8115200 | consumed tokens: 16619929600 | elapsed time per iteration (s): 1.07 | learning rate: 1.746E-04 | global batch size: 256 | lm loss: 2.061450E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.388 | TFLOPs: 39.40 | 15: iteration 31710/ 125429 | consumed samples: 8117760 | consumed tokens: 16625172480 | elapsed time per iteration (s): 1.06 | learning rate: 1.746E-04 | global batch size: 256 | lm loss: 2.082516E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.252 | TFLOPs: 40.03 | 15: iteration 31720/ 125429 | consumed samples: 8120320 | consumed tokens: 16630415360 | elapsed time per iteration (s): 1.05 | learning rate: 1.746E-04 | global batch size: 256 | lm loss: 2.110137E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.834 | TFLOPs: 40.30 | 15: iteration 31730/ 125429 | consumed samples: 8122880 | consumed tokens: 16635658240 | elapsed time per iteration (s): 1.08 | learning rate: 1.745E-04 | global batch size: 256 | lm loss: 2.102264E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.190 | TFLOPs: 39.20 | 15: iteration 31740/ 125429 | consumed samples: 8125440 | consumed tokens: 16640901120 | elapsed time per iteration (s): 1.05 | learning rate: 1.745E-04 | global batch size: 256 | lm loss: 2.101504E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 242.885 | TFLOPs: 40.14 | 15: iteration 31750/ 125429 | consumed samples: 8128000 | consumed tokens: 16646144000 | elapsed time per iteration (s): 1.09 | learning rate: 1.745E-04 | global batch size: 256 | lm loss: 2.074182E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.354 | TFLOPs: 38.73 | 15: iteration 31760/ 125429 | consumed samples: 8130560 | consumed tokens: 16651386880 | elapsed time per iteration (s): 1.06 | learning rate: 1.745E-04 | global batch size: 256 | lm loss: 2.109759E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.661 | TFLOPs: 39.94 | 15: iteration 31770/ 125429 | consumed samples: 8133120 | consumed tokens: 16656629760 | elapsed time per iteration (s): 1.06 | learning rate: 1.745E-04 | global batch size: 256 | lm loss: 2.079453E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.711 | TFLOPs: 39.78 | 15: iteration 31780/ 125429 | consumed samples: 8135680 | consumed tokens: 16661872640 | elapsed time per iteration (s): 1.04 | learning rate: 1.745E-04 | global batch size: 256 | lm loss: 2.091917E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.137 | TFLOPs: 40.84 | 15: iteration 31790/ 125429 | consumed samples: 8138240 | consumed tokens: 16667115520 | elapsed time per iteration (s): 1.06 | learning rate: 1.745E-04 | global batch size: 256 | lm loss: 2.092440E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.689 | TFLOPs: 39.94 | 15: iteration 31800/ 125429 | consumed samples: 8140800 | consumed tokens: 16672358400 | elapsed time per iteration (s): 1.05 | learning rate: 1.744E-04 | global batch size: 256 | lm 
loss: 2.089502E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.325 | TFLOPs: 40.38 | 15: iteration 31810/ 125429 | consumed samples: 8143360 | consumed tokens: 16677601280 | elapsed time per iteration (s): 1.03 | learning rate: 1.744E-04 | global batch size: 256 | lm loss: 2.077683E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.887 | TFLOPs: 40.97 | 15: iteration 31820/ 125429 | consumed samples: 8145920 | consumed tokens: 16682844160 | elapsed time per iteration (s): 1.03 | learning rate: 1.744E-04 | global batch size: 256 | lm loss: 2.086398E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.736 | TFLOPs: 41.11 | 15: iteration 31830/ 125429 | consumed samples: 8148480 | consumed tokens: 16688087040 | elapsed time per iteration (s): 1.03 | learning rate: 1.744E-04 | global batch size: 256 | lm loss: 2.109716E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.065 | TFLOPs: 40.99 | 15: iteration 31840/ 125429 | consumed samples: 8151040 | consumed tokens: 16693329920 | elapsed time per iteration (s): 1.04 | learning rate: 1.744E-04 | global batch size: 256 | lm loss: 2.078567E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.029 | TFLOPs: 40.49 | 15: iteration 31850/ 125429 | consumed samples: 8153600 | consumed tokens: 16698572800 | elapsed time per iteration (s): 1.05 | learning rate: 1.744E-04 | global batch size: 256 | lm loss: 2.085633E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.227 | TFLOPs: 40.20 | 15: iteration 31860/ 125429 | consumed samples: 8156160 | consumed tokens: 
16703815680 | elapsed time per iteration (s): 1.04 | learning rate: 1.743E-04 | global batch size: 256 | lm loss: 2.054404E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.228 | TFLOPs: 40.86 | 15: iteration 31870/ 125429 | consumed samples: 8158720 | consumed tokens: 16709058560 | elapsed time per iteration (s): 1.03 | learning rate: 1.743E-04 | global batch size: 256 | lm loss: 2.086131E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.015 | TFLOPs: 41.15 | 15: iteration 31880/ 125429 | consumed samples: 8161280 | consumed tokens: 16714301440 | elapsed time per iteration (s): 1.07 | learning rate: 1.743E-04 | global batch size: 256 | lm loss: 2.077430E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.486 | TFLOPs: 39.41 | 15: iteration 31890/ 125429 | consumed samples: 8163840 | consumed tokens: 16719544320 | elapsed time per iteration (s): 1.03 | learning rate: 1.743E-04 | global batch size: 256 | lm loss: 2.054421E+00 | grad norm: 0.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.008 | TFLOPs: 40.99 | 15: iteration 31900/ 125429 | consumed samples: 8166400 | consumed tokens: 16724787200 | elapsed time per iteration (s): 1.04 | learning rate: 1.743E-04 | global batch size: 256 | lm loss: 2.118294E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.017 | TFLOPs: 40.66 | 15: iteration 31910/ 125429 | consumed samples: 8168960 | consumed tokens: 16730030080 | elapsed time per iteration (s): 2.45 | learning rate: 1.743E-04 | global batch size: 256 | lm loss: 2.098715E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
104.676 | TFLOPs: 17.30 | 15: iteration 31920/ 125429 | consumed samples: 8171520 | consumed tokens: 16735272960 | elapsed time per iteration (s): 1.12 | learning rate: 1.742E-04 | global batch size: 256 | lm loss: 2.063402E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.665 | TFLOPs: 37.79 | 15: iteration 31930/ 125429 | consumed samples: 8174080 | consumed tokens: 16740515840 | elapsed time per iteration (s): 1.04 | learning rate: 1.742E-04 | global batch size: 256 | lm loss: 2.089367E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.339 | TFLOPs: 40.87 | 15: iteration 31940/ 125429 | consumed samples: 8176640 | consumed tokens: 16745758720 | elapsed time per iteration (s): 1.02 | learning rate: 1.742E-04 | global batch size: 256 | lm loss: 2.103276E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.591 | TFLOPs: 41.58 | 15: iteration 31950/ 125429 | consumed samples: 8179200 | consumed tokens: 16751001600 | elapsed time per iteration (s): 1.03 | learning rate: 1.742E-04 | global batch size: 256 | lm loss: 2.071457E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.864 | TFLOPs: 41.13 | 15: iteration 31960/ 125429 | consumed samples: 8181760 | consumed tokens: 16756244480 | elapsed time per iteration (s): 1.03 | learning rate: 1.742E-04 | global batch size: 256 | lm loss: 2.053873E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.046 | TFLOPs: 41.16 | 15: iteration 31970/ 125429 | consumed samples: 8184320 | consumed tokens: 16761487360 | elapsed time per iteration (s): 1.04 | learning rate: 1.742E-04 | global batch size: 256 | lm loss: 2.102843E+00 | grad norm: 0.138 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.263 | TFLOPs: 40.86 |
15: iteration 31980/ 125429 | consumed samples: 8186880 | consumed tokens: 16766730240 | elapsed time per iteration (s): 1.08 | learning rate: 1.741E-04 | global batch size: 256 | lm loss: 2.083720E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.826 | TFLOPs: 39.30 |
15: iteration 31990/ 125429 | consumed samples: 8189440 | consumed tokens: 16771973120 | elapsed time per iteration (s): 1.02 | learning rate: 1.741E-04 | global batch size: 256 | lm loss: 2.084553E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.272 | TFLOPs: 41.36 |
0: [2022-11-26 05:21:59,101] [INFO] [logging.py:68:log_dist] [Rank 0] step=32000, skipped=0, lr=[0.00017411756748273423, 0.00017411756748273423, 0.00017411756748273423], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
15: iteration 32000/ 125429 | consumed samples: 8192000 | consumed tokens: 16777216000 | elapsed time per iteration (s): 1.04 | learning rate: 1.741E-04 | global batch size: 256 | lm loss: 2.069052E+00 | grad norm: 0.270 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.137 | TFLOPs: 40.84 |
0: steps: 32000 loss: 2.0009 iter time (s): 1.059 samples/sec: 241.790
15: -------------------------------------------------------------------------------------------
15: valid loss at iteration 32000 | lm loss value: 1.968428E+00 | lm loss PPL: 7.159411E+00 |
15: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 32000 to checkpoints_1b5
0: [2022-11-26 05:21:59,448] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step32000 is begin to save!
0: [2022-11-26 05:21:59,454] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_01-model_00-model_states.pt... 0: [2022-11-26 05:21:59,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_01-model_00-model_states.pt. 0: [2022-11-26 05:21:59,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_03-model_00-model_states.pt... 0: [2022-11-26 05:21:59,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_03-model_00-model_states.pt. 0: [2022-11-26 05:21:59,791] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_04-model_00-model_states.pt... 0: [2022-11-26 05:21:59,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_04-model_00-model_states.pt. 0: [2022-11-26 05:21:59,900] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_05-model_00-model_states.pt... 0: [2022-11-26 05:22:00,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_05-model_00-model_states.pt. 0: [2022-11-26 05:22:00,005] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_06-model_00-model_states.pt... 0: [2022-11-26 05:22:00,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_06-model_00-model_states.pt. 0: [2022-11-26 05:22:00,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_07-model_00-model_states.pt... 0: [2022-11-26 05:22:00,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 05:22:00,212] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_08-model_00-model_states.pt... 0: [2022-11-26 05:22:00,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_08-model_00-model_states.pt. 0: [2022-11-26 05:22:00,314] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_09-model_00-model_states.pt... 0: [2022-11-26 05:22:00,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_09-model_00-model_states.pt. 0: [2022-11-26 05:22:00,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_10-model_00-model_states.pt... 0: [2022-11-26 05:22:00,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_10-model_00-model_states.pt. 0: [2022-11-26 05:22:00,526] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_11-model_00-model_states.pt... 0: [2022-11-26 05:22:00,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_11-model_00-model_states.pt. 0: [2022-11-26 05:22:00,627] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_12-model_00-model_states.pt... 0: [2022-11-26 05:22:00,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_12-model_00-model_states.pt. 0: [2022-11-26 05:22:00,736] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_13-model_00-model_states.pt... 0: [2022-11-26 05:22:00,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 05:22:00,838] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_14-model_00-model_states.pt... 0: [2022-11-26 05:22:00,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_14-model_00-model_states.pt. 0: [2022-11-26 05:22:00,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_15-model_00-model_states.pt... 0: [2022-11-26 05:22:01,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_15-model_00-model_states.pt. 0: [2022-11-26 05:22:01,051] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_16-model_00-model_states.pt... 0: [2022-11-26 05:22:01,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_16-model_00-model_states.pt. 0: [2022-11-26 05:22:01,156] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_17-model_00-model_states.pt... 0: [2022-11-26 05:22:01,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_17-model_00-model_states.pt. 0: [2022-11-26 05:22:01,259] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_18-model_00-model_states.pt... 0: [2022-11-26 05:22:01,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_18-model_00-model_states.pt. 0: [2022-11-26 05:22:01,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_19-model_00-model_states.pt... 0: [2022-11-26 05:22:01,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 05:22:01,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_20-model_00-model_states.pt... 0: [2022-11-26 05:22:01,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_20-model_00-model_states.pt. 0: [2022-11-26 05:22:01,575] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_21-model_00-model_states.pt... 0: [2022-11-26 05:22:01,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_21-model_00-model_states.pt. 0: [2022-11-26 05:22:01,679] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_22-model_00-model_states.pt... 0: [2022-11-26 05:22:01,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_22-model_00-model_states.pt. 0: [2022-11-26 05:22:01,784] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_23-model_00-model_states.pt... 0: [2022-11-26 05:22:01,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_23-model_00-model_states.pt. 0: [2022-11-26 05:22:01,889] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_24-model_00-model_states.pt... 0: [2022-11-26 05:22:01,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_24-model_00-model_states.pt. 0: [2022-11-26 05:22:01,991] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_25-model_00-model_states.pt... 0: [2022-11-26 05:22:02,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 05:22:02,099] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_26-model_00-model_states.pt... 0: [2022-11-26 05:22:02,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_26-model_00-model_states.pt. 0: [2022-11-26 05:22:02,203] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_27-model_00-model_states.pt... 0: [2022-11-26 05:22:02,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_27-model_00-model_states.pt. 0: [2022-11-26 05:22:02,303] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_28-model_00-model_states.pt... 0: [2022-11-26 05:22:02,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_28-model_00-model_states.pt. 0: [2022-11-26 05:22:02,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_29-model_00-model_states.pt... 0: [2022-11-26 05:22:02,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_29-model_00-model_states.pt. 0: [2022-11-26 05:22:02,513] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_30-model_00-model_states.pt... 0: [2022-11-26 05:22:02,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_30-model_00-model_states.pt. 0: [2022-11-26 05:22:02,616] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/layer_32-model_00-model_states.pt... 0: [2022-11-26 05:22:02,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 05:22:02,621] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step32000/mp_rank_00_model_states.pt
0: [2022-11-26 05:22:02,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/mp_rank_00_model_states.pt...
0: [2022-11-26 05:22:02,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/mp_rank_00_model_states.pt.
0: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
0: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt...
0: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt...
0: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt...
0: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt...
10: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt...
10: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt...
10: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt...
10: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
15: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
7: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
13: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
12: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
15: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
6: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
11: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
8: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:22:02,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step32000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:22:02,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
10: [2022-11-26 05:22:02,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:22:02,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 05:22:02,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 05:22:02,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:22:02,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 05:22:02,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 05:22:02,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:22:02,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 05:22:02,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 05:22:02,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:22:02,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 05:22:02,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
0: [2022-11-26 05:22:02,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:22:02,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 05:22:02,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 05:22:02,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:22:02,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 05:22:02,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 05:22:02,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:22:02,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 05:22:02,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 11: [2022-11-26 05:22:02,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:22:02,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 05:22:02,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 05:22:02,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 05:22:02,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:22:02,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 05:22:02,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 9: [2022-11-26 05:22:02,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:22:02,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 05:22:02,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 15: [2022-11-26 05:22:02,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:22:02,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:22:02,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 05:22:02,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
7: [2022-11-26 05:22:02,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:22:02,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 05:22:02,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 05:22:02,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:22:02,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 05:22:02,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 05:22:02,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:22:02,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:22:02,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 05:22:02,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 05:22:02,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 05:22:02,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
5: [2022-11-26 05:22:02,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:22:02,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:22:02,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:22:02,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:22:02,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:22:02,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 05:22:02,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 05:22:02,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 05:22:02,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 05:22:02,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 05:22:02,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
5: [2022-11-26 05:22:02,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 5: [2022-11-26 05:22:02,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 5: [2022-11-26 05:22:02,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 05:22:02,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:22:02,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 05:22:02,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 05:22:02,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 05:22:02,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:22:02,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 05:22:02,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 5: [2022-11-26 05:22:02,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:22:02,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 05:22:02,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
3: [2022-11-26 05:22:02,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:22:02,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:22:02,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 05:22:02,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 05:22:02,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 3: [2022-11-26 05:22:02,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 10: [2022-11-26 05:22:02,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:22:02,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:22:02,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 05:22:02,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 05:22:02,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 10: [2022-11-26 05:22:02,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
10: [2022-11-26 05:22:02,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:22:02,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 05:22:02,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 9: [2022-11-26 05:22:02,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:22:02,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:22:02,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:22:02,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 05:22:02,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 15: [2022-11-26 05:22:02,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:22:02,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 05:22:02,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
15: [2022-11-26 05:22:02,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 05:22:02,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 9: [2022-11-26 05:22:02,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 05:22:02,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 15: [2022-11-26 05:22:02,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:22:02,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 15: [2022-11-26 05:22:02,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 9: [2022-11-26 05:22:02,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 15: [2022-11-26 05:22:02,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 15: [2022-11-26 05:22:02,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:22:02,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 05:22:02,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 05:22:02,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 05:22:02,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 15: [2022-11-26 05:22:02,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 9: [2022-11-26 05:22:02,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:22:02,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 05:22:02,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 10: [2022-11-26 05:22:02,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:22:02,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 05:22:02,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 10: [2022-11-26 05:22:02,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 05:22:02,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 05:22:02,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 10: [2022-11-26 05:22:02,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:22:02,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 05:22:02,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 05:22:02,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:22:02,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 05:22:02,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:22:02,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 05:22:02,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 05:22:02,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 05:22:02,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 05:22:02,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:22:02,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 05:22:02,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 05:22:02,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 7: [2022-11-26 05:22:02,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 15: [2022-11-26 05:22:02,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:22:02,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:22:02,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 05:22:02,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 5: [2022-11-26 05:22:02,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:22:02,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 05:22:02,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
7: [2022-11-26 05:22:02,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:22:02,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 05:22:02,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 13: [2022-11-26 05:22:02,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:22:02,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 8: [2022-11-26 05:22:02,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:22:02,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 11: [2022-11-26 05:22:02,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 8: [2022-11-26 05:22:02,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 13: [2022-11-26 05:22:02,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 11: [2022-11-26 05:22:02,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:22:02,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 05:22:02,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:22:02,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 13: [2022-11-26 05:22:02,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:22:02,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 05:22:02,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 05:22:02,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 8: [2022-11-26 05:22:02,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:22:02,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 11: [2022-11-26 05:22:02,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 8: [2022-11-26 05:22:02,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 13: [2022-11-26 05:22:02,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 11: [2022-11-26 05:22:02,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
8: [2022-11-26 05:22:02,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 13: [2022-11-26 05:22:02,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:22:02,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 8: [2022-11-26 05:22:02,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:22:02,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 8: [2022-11-26 05:22:02,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 13: [2022-11-26 05:22:02,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 8: [2022-11-26 05:22:02,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 13: [2022-11-26 05:22:02,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:22:02,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
13: [2022-11-26 05:22:02,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 8: [2022-11-26 05:22:02,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 13: [2022-11-26 05:22:02,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 8: [2022-11-26 05:22:02,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 13: [2022-11-26 05:22:02,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:22:02,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:22:02,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 05:22:02,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 13: [2022-11-26 05:22:02,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 8: [2022-11-26 05:22:02,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:22:02,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
8: [2022-11-26 05:22:02,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 13: [2022-11-26 05:22:02,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:22:02,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 13: [2022-11-26 05:22:02,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 8: [2022-11-26 05:22:02,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:22:02,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 8: [2022-11-26 05:22:02,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 13: [2022-11-26 05:22:02,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:22:02,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 13: [2022-11-26 05:22:02,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 8: [2022-11-26 05:22:02,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:22:02,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
8: [2022-11-26 05:22:02,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 05:22:02,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 10: [2022-11-26 05:22:02,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:22:02,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 05:22:02,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 11: [2022-11-26 05:22:02,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:22:02,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 05:22:02,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 11: [2022-11-26 05:22:02,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:22:02,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 05:22:02,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 11: [2022-11-26 05:22:02,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 05:22:02,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 05:22:02,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 11: [2022-11-26 05:22:02,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:22:02,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 05:22:02,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 5: [2022-11-26 05:22:02,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:22:02,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 05:22:02,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 15: [2022-11-26 05:22:02,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 05:22:02,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 15: [2022-11-26 05:22:02,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 05:22:02,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 05:22:02,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 15: [2022-11-26 05:22:02,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:22:02,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 05:22:02,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 05:22:02,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:22:02,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:22:02,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:22:02,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 05:22:02,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 05:22:02,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 05:22:02,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
4: [2022-11-26 05:22:02,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 05:22:02,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 05:22:02,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:22:02,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 05:22:02,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 05:22:02,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:22:02,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 05:22:02,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 05:22:02,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:22:02,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 05:22:02,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 05:22:02,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 05:22:02,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 05:22:02,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 4: [2022-11-26 05:22:02,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:22:02,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 05:22:02,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 12: [2022-11-26 05:22:02,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:22:02,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:22:02,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:22:02,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:22:02,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:22:02,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:22:02,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 05:22:02,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:22:02,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 05:22:02,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 05:22:02,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 05:22:02,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 05:22:02,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 05:22:02,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 05:22:02,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 05:22:02,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 12: [2022-11-26 05:22:02,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 12: [2022-11-26 05:22:02,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
12: [2022-11-26 05:22:02,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 05:22:02,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 12: [2022-11-26 05:22:02,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 12: [2022-11-26 05:22:02,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 12: [2022-11-26 05:22:02,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 12: [2022-11-26 05:22:02,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 05:22:02,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:22:02,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 05:22:02,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 05:22:02,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:22:02,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 05:22:02,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:22:02,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
0: [2022-11-26 05:22:02,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 05:22:02,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 13: [2022-11-26 05:22:02,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:22:02,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 05:22:02,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 0: [2022-11-26 05:22:02,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 05:22:02,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 14: [2022-11-26 05:22:02,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:22:02,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:22:02,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:22:02,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:22:02,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-26 05:22:02,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:22:02,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 05:22:02,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:22:02,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 05:22:02,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 14: [2022-11-26 05:22:02,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 05:22:02,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 05:22:02,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 05:22:02,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 05:22:02,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
14: [2022-11-26 05:22:02,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 05:22:02,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 14: [2022-11-26 05:22:02,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 14: [2022-11-26 05:22:02,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 14: [2022-11-26 05:22:02,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 14: [2022-11-26 05:22:02,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 14: [2022-11-26 05:22:02,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:22:02,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 05:22:02,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 05:22:02,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:22:02,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:22:02,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:22:02,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 05:22:02,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:22:02,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:22:02,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:22:02,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:22:02,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 05:22:02,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 05:22:02,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 05:22:02,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 05:22:02,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 05:22:02,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 05:22:02,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 05:22:02,973] [INFO] 
[engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 05:22:02,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 05:22:02,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 05:22:02,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 05:22:02,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 05:22:02,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 05:22:02,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 05:22:02,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 1: [2022-11-26 05:22:02,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 6: [2022-11-26 05:22:02,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:22:02,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 05:22:02,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 6: [2022-11-26 05:22:02,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 05:22:02,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 05:22:02,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 6: [2022-11-26 05:22:02,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:22:02,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 05:22:02,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 6: [2022-11-26 05:22:02,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:22:02,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 05:22:02,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 6: [2022-11-26 05:22:02,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:22:02,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 05:22:02,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 05:22:02,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 05:22:02,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:22:02,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:22:02,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:22:02,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 05:22:02,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:22:02,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 05:22:02,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 05:22:02,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 05:22:02,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 05:22:02,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 05:22:02,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
2: [2022-11-26 05:22:02,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 05:22:02,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 05:22:02,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 05:22:02,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:22:02,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:22:02,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 05:22:02,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 05:22:02,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 05:22:02,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 2: [2022-11-26 05:22:02,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:22:02,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 05:22:02,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now! 
6: [2022-11-26 05:22:03,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt.
6: [2022-11-26 05:22:03,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt
6: [2022-11-26 05:22:03,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now!
6: [2022-11-26 05:22:03,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt.
6: [2022-11-26 05:22:03,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt
6: [2022-11-26 05:22:03,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now!
6: [2022-11-26 05:22:03,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt.
6: [2022-11-26 05:22:03,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step32000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt
6: [2022-11-26 05:22:03,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step32000 is ready now!
0: successfully saved checkpoint at iteration 32000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3574.90
15: iteration 32010/ 125429 | consumed samples: 8194560 | consumed tokens: 16782458880 | elapsed time per iteration (s): 1.45 | learning rate: 1.741E-04 | global batch size: 256 | lm loss: 2.092041E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 176.523 | TFLOPs: 29.17 |
15: iteration 32020/ 125429 | consumed samples: 8197120 | consumed tokens: 16787701760 | elapsed time per iteration (s): 1.06 | learning rate: 1.741E-04 | global batch size: 256 | lm loss: 2.069195E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.494 | TFLOPs: 39.74 |
15: iteration 32030/ 125429 | consumed samples: 8199680 | consumed tokens: 16792944640 | elapsed time per iteration (s): 1.04 | learning rate: 1.741E-04 | global batch size: 256 | lm loss: 2.091117E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.241 | TFLOPs: 40.53 |
15: iteration 32040/ 125429 | consumed samples: 8202240 | consumed tokens: 16798187520 | elapsed time per iteration (s): 1.04 | learning rate: 1.741E-04 | global batch size: 256 | lm loss: 2.114166E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.630 | TFLOPs: 40.76 |
15: iteration 32050/ 125429 | consumed samples: 8204800 | consumed tokens: 16803430400 | elapsed time per iteration (s): 1.02 | learning rate: 1.740E-04 | global batch size: 256 | lm loss: 2.113195E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.346 | TFLOPs: 41.37 |
15: iteration 32060/ 125429 | consumed samples: 8207360 | consumed tokens: 16808673280 | elapsed time per iteration (s): 1.06 | learning rate: 1.740E-04 | global batch size: 256 | lm loss: 2.076140E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.251 | TFLOPs: 40.03 |
15: iteration 32070/ 125429 | consumed samples: 8209920 | consumed tokens: 16813916160 | elapsed time per iteration (s): 1.03 | learning rate: 1.740E-04 | global batch size: 256 | lm loss: 2.086631E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.025 | TFLOPs: 40.99 |
15: iteration 32080/ 125429 | consumed samples: 8212480 | consumed tokens: 16819159040 | elapsed time per iteration (s): 1.11 | learning rate: 1.740E-04 | global batch size: 256 | lm loss: 2.107235E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.217 | TFLOPs: 38.21 |
15: iteration 32090/ 125429 | consumed samples: 8215040 | consumed tokens: 16824401920 | elapsed time per iteration (s): 1.05 | learning rate: 1.740E-04 | global batch size: 256 | lm loss: 2.101922E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.263 | TFLOPs: 40.37 |
15: iteration 32100/ 125429 | consumed samples: 8217600 | consumed tokens: 16829644800 | elapsed time per iteration (s): 1.03 | learning rate: 1.740E-04 | global batch size: 256 | lm loss: 2.091010E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.322 | TFLOPs: 41.20 |
15: iteration 32110/ 125429 | consumed samples: 8220160 | consumed tokens: 16834887680 | elapsed time per iteration (s): 1.03 | learning rate: 1.739E-04 | global batch size: 256 | lm loss: 2.082806E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.482 | TFLOPs: 40.90 |
15: iteration 32120/ 125429 | consumed samples: 8222720 | consumed tokens: 16840130560 | elapsed time per iteration (s): 1.05 | learning rate: 1.739E-04 | global batch size: 256 | lm loss: 2.082925E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.030 | TFLOPs: 40.33 |
15: iteration 32130/ 125429 | consumed samples: 8225280 | consumed tokens: 16845373440 | elapsed time per iteration (s): 1.06 | learning rate: 1.739E-04 | global batch size: 256 | lm loss: 2.135730E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.532 | TFLOPs: 39.91 |
15: iteration 32140/ 125429 | consumed samples: 8227840 | consumed tokens: 16850616320 | elapsed time per iteration (s): 1.03 | learning rate: 1.739E-04 | global batch size: 256 | lm loss: 2.056242E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.595 | TFLOPs: 40.92 |
15: iteration 32150/ 125429 | consumed samples: 8230400 | consumed tokens: 16855859200 | elapsed time per iteration (s): 1.05 | learning rate: 1.739E-04 | global batch size: 256 | lm loss: 2.075299E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.960 | TFLOPs: 40.48 |
15: iteration 32160/ 125429 | consumed samples: 8232960 | consumed tokens: 16861102080 | elapsed time per iteration (s): 1.05 | learning rate: 1.739E-04 | global batch size: 256 | lm loss: 2.108046E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.386 | TFLOPs: 40.39 |
15: iteration 32170/ 125429 | consumed samples: 8235520 | consumed tokens: 16866344960 | elapsed time per iteration (s): 1.02 | learning rate: 1.738E-04 | global batch size: 256 | lm loss: 2.073058E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.912 | TFLOPs: 41.30 |
15: iteration 32180/ 125429 | consumed samples: 8238080 | consumed tokens: 16871587840 | elapsed time per iteration (s): 1.05 | learning rate: 1.738E-04 | global batch size: 256 | lm loss: 2.081741E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.161 | TFLOPs: 40.18 |
15: iteration 32190/ 125429 | consumed samples: 8240640 | consumed tokens: 16876830720 | elapsed time per iteration (s): 1.04 | learning rate: 1.738E-04 | global batch size: 256 | lm loss: 2.084521E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.591 | TFLOPs: 40.59 |
15: iteration 32200/ 125429 | consumed samples: 8243200 | consumed tokens: 16882073600 | elapsed time per iteration (s): 1.07 | learning rate: 1.738E-04 | global batch size: 256 | lm loss: 2.098295E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.851 | TFLOPs: 39.47 |
15: iteration 32210/ 125429 | consumed samples: 8245760 | consumed tokens: 16887316480 | elapsed time per iteration (s): 1.08 | learning rate: 1.738E-04 | global batch size: 256 | lm loss: 2.087330E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.356 | TFLOPs: 39.06 |
15: iteration 32220/ 125429 | consumed samples: 8248320 | consumed tokens: 16892559360 | elapsed time per iteration (s): 1.03 | learning rate: 1.738E-04 | global batch size: 256 | lm loss: 2.076639E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.058 | TFLOPs: 40.99 |
15: iteration 32230/ 125429 | consumed samples: 8250880 | consumed tokens: 16897802240 | elapsed time per iteration (s): 1.02 | learning rate: 1.737E-04 | global batch size: 256 | lm loss: 2.069548E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.792 | TFLOPs: 41.28 |
15: iteration 32240/ 125429 | consumed samples: 8253440 | consumed tokens: 16903045120 | elapsed time per iteration (s): 1.03 | learning rate: 1.737E-04 | global batch size: 256 | lm loss: 2.102697E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.905 | TFLOPs: 41.13 |
15: iteration 32250/ 125429 | consumed samples: 8256000 | consumed tokens: 16908288000 | elapsed time per iteration (s): 1.05 | learning rate: 1.737E-04 | global batch size: 256 | lm loss: 2.093395E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.561 | TFLOPs: 40.42 |
15: iteration 32260/ 125429 | consumed samples: 8258560 | consumed tokens: 16913530880 | elapsed time per iteration (s): 1.02 | learning rate: 1.737E-04 | global batch size: 256 | lm loss: 2.161083E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.859 | TFLOPs: 41.29 |
15: iteration 32270/ 125429 | consumed samples: 8261120 | consumed tokens: 16918773760 | elapsed time per iteration (s): 1.02 | learning rate: 1.737E-04 | global batch size: 256 | lm loss: 2.099089E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.992 | TFLOPs: 41.31 |
15: iteration 32280/ 125429 | consumed samples: 8263680 | consumed tokens: 16924016640 | elapsed time per iteration (s): 1.03 | learning rate: 1.737E-04 | global batch size: 256 | lm loss: 2.076079E+00 | grad norm: 0.210 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.453 | TFLOPs: 41.22 |
15: iteration 32290/ 125429 | consumed samples: 8266240 | consumed tokens: 16929259520 | elapsed time per iteration (s): 1.03 | learning rate: 1.737E-04 | global batch size: 256 | lm loss: 2.112110E+00 | grad norm: 0.187 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.571 | TFLOPs: 40.91 |
15: iteration 32300/ 125429 | consumed samples: 8268800 | consumed tokens: 16934502400 | elapsed time per iteration (s): 1.96 | learning rate: 1.736E-04 | global batch size: 256 | lm loss: 2.094328E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 130.633 | TFLOPs: 21.59 |
15: iteration 32310/ 125429 | consumed samples: 8271360 | consumed tokens: 16939745280 | elapsed time per iteration (s): 1.03 | learning rate: 1.736E-04 | global batch size: 256 | lm loss: 2.179337E+00 | grad norm: 13.798 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.611 | TFLOPs: 41.25 |
15: iteration 32320/ 125429 | consumed samples: 8273920 | consumed tokens: 16944988160 | elapsed time per iteration (s): 1.04 | learning rate: 1.736E-04 | global batch size: 256 | lm loss: 2.939831E+00 | grad norm: 0.590 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.596 | TFLOPs: 40.75 |
15: iteration 32330/ 125429 | consumed samples: 8276480 | consumed tokens: 16950231040 | elapsed time per iteration (s): 1.05 | learning rate: 1.736E-04 | global batch size: 256 | lm loss: 2.179296E+00 | grad norm: 0.380 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.630 | TFLOPs: 40.43 |
15: iteration 32340/ 125429 | consumed samples: 8279040 | consumed tokens: 16955473920 | elapsed time per iteration (s): 1.04 | learning rate: 1.736E-04 | global batch size: 256 | lm loss: 2.146231E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.140 | TFLOPs: 40.68 |
15: iteration 32350/ 125429 | consumed samples: 8281600 | consumed tokens: 16960716800 | elapsed time per iteration (s): 1.02 | learning rate: 1.736E-04 | global batch size: 256 | lm loss: 2.124873E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.289 | TFLOPs: 41.53 |
15: iteration 32360/ 125429 | consumed samples: 8284160 | consumed tokens: 16965959680 | elapsed time per iteration (s): 1.03 | learning rate: 1.735E-04 | global batch size: 256 | lm loss: 2.094100E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.761 | TFLOPs: 41.11 |
15: iteration 32370/ 125429 | consumed samples: 8286720 | consumed tokens: 16971202560 | elapsed time per iteration (s): 1.05 | learning rate: 1.735E-04 | global batch size: 256 | lm loss: 2.095872E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.791 | TFLOPs: 40.29 |
15: iteration 32380/ 125429 | consumed samples: 8289280 | consumed tokens: 16976445440 | elapsed time per iteration (s): 1.05 | learning rate: 1.735E-04 | global batch size: 256 | lm loss: 2.088613E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.213 | TFLOPs: 40.36 |
15: iteration 32390/ 125429 | consumed samples: 8291840 | consumed tokens: 16981688320 | elapsed time per iteration (s): 1.05 | learning rate: 1.735E-04 | global batch size: 256 | lm loss: 2.089967E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.514 | TFLOPs: 40.24 |
15: iteration 32400/ 125429 | consumed samples: 8294400 | consumed tokens: 16986931200 | elapsed time per iteration (s): 1.05 | learning rate: 1.735E-04 | global batch size: 256 | lm loss: 2.076343E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.860 | TFLOPs: 40.13 |
15: iteration 32410/ 125429 | consumed samples: 8296960 | consumed tokens: 16992174080 | elapsed time per iteration (s): 1.03 | learning rate: 1.735E-04 | global batch size: 256 | lm loss: 2.082401E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.528 | TFLOPs: 41.07 |
15: iteration 32420/ 125429 | consumed samples: 8299520 | consumed tokens: 16997416960 | elapsed time per iteration (s): 1.03 | learning rate: 1.734E-04 | global batch size: 256 | lm loss: 2.101314E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.534 | TFLOPs: 41.24 |
15: iteration 32430/ 125429 | consumed samples: 8302080 | consumed tokens: 17002659840 | elapsed time per iteration (s): 1.05 | learning rate: 1.734E-04 | global batch size: 256 | lm loss: 2.092960E+00 | grad norm: 0.233 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.821 | TFLOPs: 40.46 |
15: iteration 32440/ 125429 | consumed samples: 8304640 | consumed tokens: 17007902720 | elapsed time per iteration (s): 1.03 | learning rate: 1.734E-04 | global batch size: 256 | lm loss: 2.123026E+00 | grad norm: 0.615 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.515 | TFLOPs: 40.90 |
15: iteration 32450/ 125429 | consumed samples: 8307200 | consumed tokens: 17013145600 | elapsed time per iteration (s): 1.06 | learning rate: 1.734E-04 | global batch size: 256 | lm loss: 2.093929E+00 | grad norm: 0.374 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.193 | TFLOPs: 40.02 |
15: iteration 32460/ 125429 | consumed samples: 8309760 | consumed tokens: 17018388480 | elapsed time per iteration (s): 1.04 | learning rate: 1.734E-04 | global batch size: 256 | lm loss: 2.105227E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.003 | TFLOPs: 40.82 |
15: iteration 32470/ 125429 | consumed samples: 8312320 | consumed tokens: 17023631360 | elapsed time per iteration (s): 1.03 | learning rate: 1.734E-04 | global batch size: 256 | lm loss: 2.107861E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.204 | TFLOPs: 41.02 |
15: iteration 32480/ 125429 | consumed samples: 8314880 | consumed tokens: 17028874240 | elapsed time per iteration (s): 1.06 | learning rate: 1.733E-04 | global batch size: 256 | lm loss: 2.094082E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.371 | TFLOPs: 39.89 |
15: iteration 32490/ 125429 | consumed samples: 8317440 | consumed tokens: 17034117120 | elapsed time per iteration (s): 1.04 | learning rate: 1.733E-04 | global batch size: 256 | lm loss: 2.047015E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.562 | TFLOPs: 40.58 |
15: iteration 32500/ 125429 | consumed samples: 8320000 | consumed tokens: 17039360000 | elapsed time per iteration (s): 1.04 | learning rate: 1.733E-04 | global batch size: 256 | lm loss: 2.071778E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.354 | TFLOPs: 40.71 |
15: iteration 32510/ 125429 | consumed samples: 8322560 | consumed tokens: 17044602880 | elapsed time per iteration (s): 1.04 | learning rate: 1.733E-04 | global batch size: 256 | lm loss: 2.072616E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.353 | TFLOPs: 40.71 |
15: iteration 32520/ 125429 | consumed samples: 8325120 | consumed tokens: 17049845760 | elapsed time per iteration (s): 1.06 | learning rate: 1.733E-04 | global batch size: 256 | lm loss: 2.087955E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.752 | TFLOPs: 39.95 |
15: iteration 32530/ 125429 | consumed samples: 8327680 | consumed tokens: 17055088640 | elapsed time per iteration (s): 1.05 | learning rate: 1.733E-04 | global batch size: 256 | lm loss: 2.074782E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.194 | TFLOPs: 40.19 |
15: iteration 32540/ 125429 | consumed samples: 8330240 | consumed tokens: 17060331520 | elapsed time per iteration (s): 1.04 | learning rate: 1.732E-04 | global batch size: 256 | lm loss: 2.073518E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.309 | TFLOPs: 40.54 |
15: iteration 32550/ 125429 | consumed samples: 8332800 | consumed tokens: 17065574400 | elapsed time per iteration (s): 1.04 | learning rate: 1.732E-04 | global batch size: 256 | lm loss: 2.107947E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.379 | TFLOPs: 40.55 |
15: iteration 32560/ 125429 | consumed samples: 8335360 | consumed tokens: 17070817280 | elapsed time per iteration (s): 1.04 | learning rate: 1.732E-04 | global batch size: 256 | lm loss: 2.088169E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.289 | TFLOPs: 40.70 |
15: iteration 32570/ 125429 | consumed samples: 8337920 | consumed tokens: 17076060160 | elapsed time per iteration (s): 1.06 | learning rate: 1.732E-04 | global batch size: 256 | lm loss: 2.076238E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.321 | TFLOPs: 40.05 |
15: iteration 32580/ 125429 | consumed samples: 8340480 | consumed tokens: 17081303040 | elapsed time per iteration (s): 2.59 | learning rate: 1.732E-04 | global batch size: 256 | lm loss: 2.070587E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 98.692 | TFLOPs: 16.31 |
15: iteration 32590/ 125429 | consumed samples: 8343040 | consumed tokens: 17086545920 | elapsed time per iteration (s): 1.02 | learning rate: 1.732E-04 | global batch size: 256 | lm loss: 2.091150E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.538 | TFLOPs: 41.57 |
15: iteration 32600/ 125429 | consumed samples: 8345600 | consumed tokens: 17091788800 | elapsed time per iteration (s): 1.05 | learning rate: 1.732E-04 | global batch size: 256 | lm loss: 2.078116E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.894 | TFLOPs: 40.14 |
15: iteration 32610/ 125429 | consumed samples: 8348160 | consumed tokens: 17097031680 | elapsed time per iteration (s): 1.04 | learning rate: 1.731E-04 | global batch size: 256 | lm loss: 2.047244E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.853 | TFLOPs: 40.63 |
15: iteration 32620/ 125429 | consumed samples: 8350720 | consumed tokens: 17102274560 | elapsed time per iteration (s): 1.07 | learning rate: 1.731E-04 | global batch size: 256 | lm loss: 2.086200E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.581 | TFLOPs: 39.43 |
15: iteration 32630/ 125429 | consumed samples: 8353280 | consumed tokens: 17107517440 | elapsed time per iteration (s): 1.09 | learning rate: 1.731E-04 | global batch size: 256 | lm loss: 2.066686E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.519 | TFLOPs: 38.92 |
15: iteration 32640/ 125429 | consumed samples: 8355840 | consumed tokens: 17112760320 | elapsed time per iteration (s): 1.06 | learning rate: 1.731E-04 | global batch size: 256 | lm loss: 2.069846E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.285 | TFLOPs: 39.87 |
15: iteration 32650/ 125429 | consumed samples: 8358400 | consumed tokens: 17118003200 | elapsed time per iteration (s): 1.06 | learning rate: 1.731E-04 | global batch size: 256 | lm loss: 2.052551E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.578 | TFLOPs: 39.92 |
15: iteration 32660/ 125429 | consumed samples: 8360960 | consumed tokens: 17123246080 | elapsed time per iteration (s): 1.07 | learning rate: 1.731E-04 | global batch size: 256 | lm loss: 2.047260E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.735 | TFLOPs: 39.45 |
15: iteration 32670/ 125429 | consumed samples: 8363520 | consumed tokens: 17128488960 | elapsed time per iteration (s): 1.05 | learning rate: 1.730E-04 | global batch size: 256 | lm loss: 2.090760E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.277 | TFLOPs: 40.37 |
15: iteration 32680/ 125429 | consumed samples: 8366080 | consumed tokens: 17133731840 | elapsed time per iteration (s): 1.04 | learning rate: 1.730E-04 | global batch size: 256 | lm loss: 2.088780E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.263 | TFLOPs: 40.86 |
15: iteration 32690/ 125429 | consumed samples: 8368640 | consumed tokens: 17138974720 | elapsed time per iteration (s): 1.02 | learning rate: 1.730E-04 | global batch size: 256 | lm loss: 2.075077E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.474 | TFLOPs: 41.39 |
15: iteration 32700/ 125429 | consumed samples: 8371200 | consumed tokens: 17144217600 | elapsed time per iteration (s): 1.05 | learning rate: 1.730E-04 | global batch size: 256 | lm loss: 2.072641E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.846 | TFLOPs: 40.30 |
15: iteration 32710/ 125429 | consumed samples: 8373760 | consumed tokens: 17149460480 | elapsed time per iteration (s): 1.05 | learning rate: 1.730E-04 | global batch size: 256 | lm loss: 2.079718E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.906 | TFLOPs: 40.47 |
15: iteration 32720/ 125429 | consumed samples: 8376320 | consumed tokens: 17154703360 | elapsed time per iteration (s): 1.04 | learning rate: 1.730E-04 | global batch size: 256 | lm loss: 2.060456E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.988 | TFLOPs: 40.49 |
15: iteration 32730/ 125429 | consumed samples: 8378880 | consumed tokens: 17159946240 | elapsed time per iteration (s): 1.03 | learning rate: 1.729E-04 | global batch size: 256 | lm loss: 2.085178E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.919 | TFLOPs: 40.97 |
15: iteration 32740/ 125429 | consumed samples: 8381440 | consumed tokens: 17165189120 | elapsed time per iteration (s): 1.05 | learning rate: 1.729E-04 | global batch size: 256 | lm loss: 2.066718E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.019 | TFLOPs: 40.16 |
15: iteration 32750/ 125429 | consumed samples: 8384000 | consumed tokens: 17170432000 | elapsed time per iteration (s): 1.04 | learning rate: 1.729E-04 | global batch size: 256 | lm loss: 2.092118E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.481 | TFLOPs: 40.57 |
15: iteration 32760/ 125429 | consumed samples: 8386560 | consumed tokens: 17175674880 | elapsed time per iteration (s): 1.03 | learning rate: 1.729E-04 | global batch size: 256 | lm loss: 2.079902E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.380 | TFLOPs: 41.05 |
15: iteration 32770/ 125429 | consumed samples: 8389120 | consumed tokens: 17180917760 | elapsed time per iteration (s): 1.06 | learning rate: 1.729E-04 | global batch size: 256 | lm loss: 2.090312E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.013 | TFLOPs: 39.99 |
15: iteration 32780/ 125429 | consumed samples: 8391680 | consumed tokens: 17186160640 | elapsed time per iteration (s): 1.09 | learning rate: 1.729E-04 | global batch size: 256 | lm loss: 2.063731E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.910 | TFLOPs: 38.99 |
15: iteration 32790/ 125429 | consumed samples: 8394240 | consumed tokens: 17191403520 | elapsed time per iteration (s): 1.06 | learning rate: 1.728E-04 | global batch size: 256 | lm loss: 2.085356E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.216 | TFLOPs: 39.86 |
15: iteration 32800/ 125429 | consumed samples: 8396800 | consumed tokens: 17196646400 | elapsed time per iteration (s): 1.08 | learning rate: 1.728E-04 | global batch size: 256 | lm loss: 2.074112E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.817 | TFLOPs: 39.14 |
15: iteration 32810/ 125429 | consumed samples: 8399360 | consumed tokens: 17201889280 | elapsed time per iteration (s): 1.09 | learning rate: 1.728E-04 | global batch size: 256 | lm loss: 2.093841E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.601 | TFLOPs: 38.93 |
15: iteration 32820/ 125429 | consumed samples: 8401920 | consumed tokens: 17207132160 | elapsed time per iteration (s): 1.03 | learning rate: 1.728E-04 | global batch size: 256 | lm loss: 2.097379E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.648 | TFLOPs: 41.09 |
15: iteration 32830/ 125429 | consumed samples: 8404480 | consumed tokens: 17212375040 | elapsed time per iteration (s): 1.03 | learning rate: 1.728E-04 | global batch size: 256 | lm loss: 2.115670E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.221 | TFLOPs: 41.02 |
15: iteration 32840/ 125429 | consumed samples: 8407040 | consumed tokens: 17217617920 | elapsed time per iteration (s): 1.09 | learning rate: 1.728E-04 | global batch size: 256 | lm loss: 2.093085E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.135 | TFLOPs: 38.86 |
15: iteration 32850/ 125429 | consumed samples: 8409600 | consumed tokens: 17222860800 | elapsed time per iteration (s): 1.03 | learning rate: 1.727E-04 | global batch size: 256 | lm loss: 2.061459E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.337 | TFLOPs: 41.20 |
15: iteration 32860/ 125429 | consumed samples: 8412160 | consumed tokens: 17228103680 | elapsed time per iteration (s): 1.02 | learning rate: 1.727E-04 | global batch size: 256 | lm loss: 2.089540E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.435 | TFLOPs: 41.39 |
15: iteration 32870/ 125429 | consumed samples: 8414720 | consumed tokens: 17233346560 | elapsed time per iteration (s): 1.03 | learning rate: 1.727E-04 | global batch size: 256 | lm loss: 2.070535E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.910 | TFLOPs: 41.13 |
15: iteration 32880/ 125429 | consumed samples: 8417280 | consumed tokens: 17238589440 | elapsed time per iteration (s): 1.02 | learning rate: 1.727E-04 | global batch size: 256 | lm loss: 2.075169E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.203 | TFLOPs: 41.35 |
15: iteration 32890/ 125429 | consumed samples: 8419840 | consumed tokens: 17243832320 | elapsed time per iteration (s): 1.04 | learning rate: 1.727E-04 | global batch size: 256 | lm loss: 2.062092E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.689 | TFLOPs: 40.77 |
15: iteration 32900/ 125429 | consumed samples: 8422400 | consumed tokens: 17249075200 | elapsed time per iteration (s): 1.04 | learning rate: 1.727E-04 | global batch size: 256 | lm loss: 2.072580E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.619 | TFLOPs: 40.76 |
15: iteration 32910/ 125429 | consumed samples: 8424960 | consumed tokens: 17254318080 | elapsed time per iteration (s): 1.05 | learning rate: 1.726E-04 | global batch size: 256 | lm loss: 2.071889E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.520 | TFLOPs: 40.41 |
15: iteration 32920/ 125429 | consumed samples: 8427520 | consumed tokens: 17259560960 | elapsed time per iteration (s): 1.13 | learning rate: 1.726E-04 | global batch size: 256 | lm loss: 2.063366E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.283 | TFLOPs: 37.40 |
15: iteration 32930/ 125429 | consumed samples: 8430080 | consumed tokens: 17264803840 | elapsed time per iteration (s): 1.04 | learning rate: 1.726E-04 | global batch size: 256 | lm loss: 2.110367E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.669 | TFLOPs: 40.76 |
15: iteration 32940/ 125429 | consumed samples: 8432640 | consumed tokens: 17270046720 | elapsed time per iteration (s): 1.07 | learning rate: 1.726E-04 | global batch size: 256 | lm loss: 2.104080E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.327 | TFLOPs: 39.72 |
15: iteration 32950/ 125429 | consumed samples: 8435200 | consumed tokens: 17275289600 | elapsed time per iteration (s): 1.03 | learning rate: 1.726E-04 | global batch size: 256 | lm loss: 2.102145E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.140 | TFLOPs: 41.01 |
15: iteration 32960/ 125429 | consumed samples: 8437760 | consumed tokens: 17280532480 | elapsed time per iteration (s): 1.04 | learning rate: 1.726E-04 | global batch size: 256 | lm loss: 2.075461E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.904 | TFLOPs: 40.64 |
15: iteration 32970/ 125429 | consumed samples: 8440320 | consumed tokens: 17285775360 | elapsed time per iteration (s): 1.02 | learning rate: 1.725E-04 | global batch size: 256 | lm loss: 2.086187E+00 | grad norm: 0.128 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.900 | TFLOPs: 41.30 | 15: iteration 32980/ 125429 | consumed samples: 8442880 | consumed tokens: 17291018240 | elapsed time per iteration (s): 1.06 | learning rate: 1.725E-04 | global batch size: 256 | lm loss: 2.102929E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.907 | TFLOPs: 39.81 | 15: iteration 32990/ 125429 | consumed samples: 8445440 | consumed tokens: 17296261120 | elapsed time per iteration (s): 1.03 | learning rate: 1.725E-04 | global batch size: 256 | lm loss: 2.054186E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.377 | TFLOPs: 41.05 | 15: iteration 33000/ 125429 | consumed samples: 8448000 | consumed tokens: 17301504000 | elapsed time per iteration (s): 1.07 | learning rate: 1.725E-04 | global batch size: 256 | lm loss: 2.109677E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.172 | TFLOPs: 39.69 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 33000 | lm loss value: 2.049441E+00 | lm loss PPL: 7.763563E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 33000 to checkpoints_1b5 0: [2022-11-26 05:39:54,214] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step33000 is begin to save! 0: [2022-11-26 05:39:54,221] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_01-model_00-model_states.pt... 0: [2022-11-26 05:39:54,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_01-model_00-model_states.pt. 
0: [2022-11-26 05:39:54,456] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_03-model_00-model_states.pt... 0: [2022-11-26 05:39:54,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_03-model_00-model_states.pt. 0: [2022-11-26 05:39:54,557] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_04-model_00-model_states.pt... 0: [2022-11-26 05:39:54,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_04-model_00-model_states.pt. 0: [2022-11-26 05:39:54,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_05-model_00-model_states.pt... 0: [2022-11-26 05:39:54,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_05-model_00-model_states.pt. 0: [2022-11-26 05:39:54,765] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_06-model_00-model_states.pt... 0: [2022-11-26 05:39:54,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_06-model_00-model_states.pt. 0: [2022-11-26 05:39:54,870] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_07-model_00-model_states.pt... 0: [2022-11-26 05:39:54,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_07-model_00-model_states.pt. 0: [2022-11-26 05:39:54,977] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_08-model_00-model_states.pt... 0: [2022-11-26 05:39:55,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_08-model_00-model_states.pt. 
0: [2022-11-26 05:39:55,078] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_09-model_00-model_states.pt... 0: [2022-11-26 05:39:55,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_09-model_00-model_states.pt. 0: [2022-11-26 05:39:55,178] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_10-model_00-model_states.pt... 0: [2022-11-26 05:39:55,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_10-model_00-model_states.pt. 0: [2022-11-26 05:39:55,283] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_11-model_00-model_states.pt... 0: [2022-11-26 05:39:55,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_11-model_00-model_states.pt. 0: [2022-11-26 05:39:55,390] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_12-model_00-model_states.pt... 0: [2022-11-26 05:39:55,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_12-model_00-model_states.pt. 0: [2022-11-26 05:39:55,496] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_13-model_00-model_states.pt... 0: [2022-11-26 05:39:55,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_13-model_00-model_states.pt. 0: [2022-11-26 05:39:55,606] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_14-model_00-model_states.pt... 0: [2022-11-26 05:39:55,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_14-model_00-model_states.pt. 
0: [2022-11-26 05:39:55,716] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_15-model_00-model_states.pt... 0: [2022-11-26 05:39:55,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_15-model_00-model_states.pt. 0: [2022-11-26 05:39:55,833] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_16-model_00-model_states.pt... 0: [2022-11-26 05:39:55,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_16-model_00-model_states.pt. 0: [2022-11-26 05:39:55,937] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_17-model_00-model_states.pt... 0: [2022-11-26 05:39:56,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_17-model_00-model_states.pt. 0: [2022-11-26 05:39:56,050] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_18-model_00-model_states.pt... 0: [2022-11-26 05:39:56,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_18-model_00-model_states.pt. 0: [2022-11-26 05:39:56,155] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_19-model_00-model_states.pt... 0: [2022-11-26 05:39:56,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_19-model_00-model_states.pt. 0: [2022-11-26 05:39:56,261] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_20-model_00-model_states.pt... 0: [2022-11-26 05:39:56,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_20-model_00-model_states.pt. 
0: [2022-11-26 05:39:56,374] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_21-model_00-model_states.pt... 0: [2022-11-26 05:39:56,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_21-model_00-model_states.pt. 0: [2022-11-26 05:39:56,472] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_22-model_00-model_states.pt... 0: [2022-11-26 05:39:56,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_22-model_00-model_states.pt. 0: [2022-11-26 05:39:56,581] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_23-model_00-model_states.pt... 0: [2022-11-26 05:39:56,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_23-model_00-model_states.pt. 0: [2022-11-26 05:39:56,684] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_24-model_00-model_states.pt... 0: [2022-11-26 05:39:56,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_24-model_00-model_states.pt. 0: [2022-11-26 05:39:56,791] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_25-model_00-model_states.pt... 0: [2022-11-26 05:39:56,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_25-model_00-model_states.pt. 0: [2022-11-26 05:39:56,899] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_26-model_00-model_states.pt... 0: [2022-11-26 05:39:57,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_26-model_00-model_states.pt. 
0: [2022-11-26 05:39:57,008] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_27-model_00-model_states.pt... 0: [2022-11-26 05:39:57,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_27-model_00-model_states.pt. 0: [2022-11-26 05:39:57,116] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_28-model_00-model_states.pt... 0: [2022-11-26 05:39:57,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_28-model_00-model_states.pt. 0: [2022-11-26 05:39:57,224] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_29-model_00-model_states.pt... 0: [2022-11-26 05:39:57,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_29-model_00-model_states.pt. 0: [2022-11-26 05:39:57,331] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_30-model_00-model_states.pt... 0: [2022-11-26 05:39:57,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_30-model_00-model_states.pt. 0: [2022-11-26 05:39:57,433] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/layer_32-model_00-model_states.pt... 0: [2022-11-26 05:39:57,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/layer_32-model_00-model_states.pt. 0: [2022-11-26 05:39:57,439] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step33000/mp_rank_00_model_states.pt 0: [2022-11-26 05:39:57,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/mp_rank_00_model_states.pt... 
0: [2022-11-26 05:39:57,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/mp_rank_00_model_states.pt. 0: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
15: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
5: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
6: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
12: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
5: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
4: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
3: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
9: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
11: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:39:57,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step33000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:39:57,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:39:57,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 05:39:57,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
3: [2022-11-26 05:39:57,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:39:57,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 05:39:57,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 8: [2022-11-26 05:39:57,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:39:57,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 05:39:57,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 9: [2022-11-26 05:39:57,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:39:57,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 05:39:57,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 12: [2022-11-26 05:39:57,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:39:57,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 05:39:57,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
15: [2022-11-26 05:39:57,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:39:57,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:39:57,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 05:39:57,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 12: [2022-11-26 05:39:57,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:39:57,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 14: [2022-11-26 05:39:57,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:39:57,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 14: [2022-11-26 05:39:57,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 05:39:57,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 12: [2022-11-26 05:39:57,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 05:39:57,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 05:39:57,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-26 05:39:57,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:39:57,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 05:39:57,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 2: [2022-11-26 05:39:57,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:39:57,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 05:39:57,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-26 05:39:57,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:39:57,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 05:39:57,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 9: [2022-11-26 05:39:57,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 05:39:57,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 05:39:57,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 9: [2022-11-26 05:39:57,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:39:57,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 05:39:57,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 13: [2022-11-26 05:39:57,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:39:57,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:39:57,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:39:57,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 05:39:57,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 8: [2022-11-26 05:39:57,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 13: [2022-11-26 05:39:57,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
13: [2022-11-26 05:39:57,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 8: [2022-11-26 05:39:57,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-26 05:39:57,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:39:57,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 05:39:57,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 2: [2022-11-26 05:39:57,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:39:57,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 05:39:57,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 0: [2022-11-26 05:39:57,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:39:57,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:39:57,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 05:39:57,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
10: [2022-11-26 05:39:57,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:39:57,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 05:39:57,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 12: [2022-11-26 05:39:57,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:39:57,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 05:39:57,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 10: [2022-11-26 05:39:57,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:39:57,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 1: [2022-11-26 05:39:57,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:39:57,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-26 05:39:57,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 9: [2022-11-26 05:39:57,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
1: [2022-11-26 05:39:57,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 9: [2022-11-26 05:39:57,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 05:39:57,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-26 05:39:57,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:39:57,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:39:57,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 05:39:57,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 0: [2022-11-26 05:39:57,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 05:39:57,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 10: [2022-11-26 05:39:57,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:39:57,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 05:39:57,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
1: [2022-11-26 05:39:57,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:39:57,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 05:39:57,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-26 05:39:57,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:39:57,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 05:39:57,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 14: [2022-11-26 05:39:57,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:39:57,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:39:57,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 05:39:57,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 3: [2022-11-26 05:39:57,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 05:39:57,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:39:57,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 05:39:57,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-26 05:39:57,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-26 05:39:57,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 14: [2022-11-26 05:39:57,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-26 05:39:57,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 2: [2022-11-26 05:39:57,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:39:57,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 05:39:57,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
0: [2022-11-26 05:39:57,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:39:57,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 05:39:57,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 8: [2022-11-26 05:39:57,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:39:57,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 05:39:57,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 10: [2022-11-26 05:39:57,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:39:57,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 05:39:57,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 13: [2022-11-26 05:39:57,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:39:57,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:39:57,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
13: [2022-11-26 05:39:57,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 05:39:57,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 2: [2022-11-26 05:39:57,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 05:39:57,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 13: [2022-11-26 05:39:57,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 13: [2022-11-26 05:39:57,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 4: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 05:39:57,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 05:39:57,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 4: [2022-11-26 05:39:57,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 05:39:57,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 05:39:57,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 2: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 4: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 4: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
2: [2022-11-26 05:39:57,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:39:57,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:39:57,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 6: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:39:57,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 05:39:57,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 05:39:57,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 05:39:57,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 05:39:57,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 05:39:57,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 6: [2022-11-26 05:39:57,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 6: [2022-11-26 05:39:57,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 6: [2022-11-26 05:39:57,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 6: [2022-11-26 05:39:57,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 0: [2022-11-26 05:39:57,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 05:39:57,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 05:39:57,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 8: [2022-11-26 05:39:57,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:39:57,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 05:39:57,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-26 05:39:57,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:39:57,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 05:39:57,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 10: [2022-11-26 05:39:57,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:39:57,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 05:39:57,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-26 05:39:57,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 05:39:57,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 05:39:57,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-26 05:39:57,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:39:57,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:39:57,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:39:57,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 05:39:57,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 05:39:57,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 8: [2022-11-26 05:39:57,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 9: [2022-11-26 05:39:57,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:39:57,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 05:39:57,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
12: [2022-11-26 05:39:57,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:39:57,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 05:39:57,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 12: [2022-11-26 05:39:57,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:39:57,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 05:39:57,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 14: [2022-11-26 05:39:57,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:39:57,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:39:57,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 05:39:57,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 05:39:57,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:39:57,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
14: [2022-11-26 05:39:57,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 14: [2022-11-26 05:39:57,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 05:39:57,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 13: [2022-11-26 05:39:57,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:39:57,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:39:57,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 05:39:57,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-26 05:39:57,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 13: [2022-11-26 05:39:57,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:39:57,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 7: [2022-11-26 05:39:57,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 13: [2022-11-26 05:39:57,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
13: [2022-11-26 05:39:57,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:39:57,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 05:39:57,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 14: [2022-11-26 05:39:57,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:39:57,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:39:57,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:39:57,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 05:39:57,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 13: [2022-11-26 05:39:57,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 14: [2022-11-26 05:39:57,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 14: [2022-11-26 05:39:57,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 13: [2022-11-26 05:39:57,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
6: [2022-11-26 05:39:57,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:39:57,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:39:57,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 05:39:57,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 05:39:57,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 6: [2022-11-26 05:39:57,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-26 05:39:57,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:39:57,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:39:57,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 05:39:57,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-26 05:39:57,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 05:39:57,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
5: [2022-11-26 05:39:57,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:39:57,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 05:39:57,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 0: [2022-11-26 05:39:57,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:39:57,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 05:39:57,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 12: [2022-11-26 05:39:57,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:39:57,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 05:39:57,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 0: [2022-11-26 05:39:57,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:39:57,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 05:39:57,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
0: [2022-11-26 05:39:57,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:39:57,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 05:39:57,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 0: [2022-11-26 05:39:57,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 05:39:57,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 9: [2022-11-26 05:39:57,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:39:57,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 05:39:57,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 0: [2022-11-26 05:39:57,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:39:57,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 05:39:57,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 9: [2022-11-26 05:39:57,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 05:39:57,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 05:39:57,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 9: [2022-11-26 05:39:57,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:39:57,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 05:39:57,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 2: [2022-11-26 05:39:57,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:39:57,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 05:39:57,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 2: [2022-11-26 05:39:57,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 05:39:57,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 1: [2022-11-26 05:39:57,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 05:39:57,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 05:39:57,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 12: [2022-11-26 05:39:57,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:39:57,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 05:39:57,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-26 05:39:57,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:39:57,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:39:57,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 05:39:57,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 05:39:57,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-26 05:39:57,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 
15: [2022-11-26 05:39:57,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 05:39:57,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 15: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:39:57,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 05:39:57,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 15: [2022-11-26 05:39:57,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 15: [2022-11-26 05:39:57,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:39:57,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 05:39:57,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 15: [2022-11-26 05:39:57,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 05:39:57,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:39:57,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 05:39:57,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 05:39:57,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 15: [2022-11-26 05:39:57,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 15: [2022-11-26 05:39:57,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:39:57,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 05:39:57,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 8: [2022-11-26 05:39:57,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:39:57,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 05:39:57,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 4: [2022-11-26 05:39:57,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 05:39:57,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:39:57,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 05:39:57,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 05:39:57,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 4: [2022-11-26 05:39:57,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 4: [2022-11-26 05:39:57,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:39:57,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 05:39:57,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 10: [2022-11-26 05:39:57,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:39:57,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 05:39:57,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 6: [2022-11-26 05:39:57,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 05:39:57,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 05:39:57,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-26 05:39:57,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:39:57,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:39:57,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:39:57,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 05:39:57,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 7: [2022-11-26 05:39:57,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 3: [2022-11-26 05:39:57,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-26 05:39:57,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-26 05:39:57,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 8: [2022-11-26 05:39:57,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 05:39:57,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 05:39:57,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 5: [2022-11-26 05:39:57,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:39:57,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 05:39:57,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-26 05:39:57,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:39:57,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:39:57,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 05:39:57,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-26 05:39:57,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 05:39:57,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 10: [2022-11-26 05:39:57,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 05:39:57,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 05:39:57,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-26 05:39:57,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:39:57,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 05:39:57,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-26 05:39:57,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:39:57,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 05:39:57,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 7: [2022-11-26 05:39:57,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:39:57,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 05:39:57,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 3: [2022-11-26 05:39:57,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 05:39:57,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 05:39:57,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 15: [2022-11-26 05:39:57,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:39:57,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 05:39:57,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 11: [2022-11-26 05:39:57,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:39:57,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:39:57,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 05:39:57,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 05:39:57,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:39:57,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 05:39:57,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:39:57,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:39:57,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:39:57,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:39:57,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 11: [2022-11-26 05:39:57,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now! 11: [2022-11-26 05:39:57,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 05:39:57,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 05:39:57,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 05:39:57,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 05:39:57,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step33000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 05:39:57,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved 
checkpoints_1b5/global_step33000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt
11: [2022-11-26 05:39:57,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
11: [2022-11-26 05:39:57,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
11: [2022-11-26 05:39:57,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
11: [2022-11-26 05:39:57,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
11: [2022-11-26 05:39:57,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
11: [2022-11-26 05:39:57,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step33000 is ready now!
0: successfully saved checkpoint at iteration 33000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3711.11
15: iteration 33010/ 125429 | consumed samples: 8450560 | consumed tokens: 17306746880 | elapsed time per iteration (s): 1.45 | learning rate: 1.725E-04 | global batch size: 256 | lm loss: 2.096006E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 176.305 | TFLOPs: 29.14 |
15: iteration 33020/ 125429 | consumed samples: 8453120 | consumed tokens: 17311989760 | elapsed time per iteration (s): 1.07 | learning rate: 1.725E-04 | global batch size: 256 | lm loss: 2.093254E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.327 | TFLOPs: 39.55 |
15: iteration 33030/ 125429 | consumed samples: 8455680 | consumed tokens: 17317232640 | elapsed time per iteration (s): 1.03 | learning rate: 1.725E-04 | global batch size: 256 | lm loss: 2.085898E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.668 | TFLOPs: 41.09 |
15: iteration 33040/ 125429 | consumed samples: 8458240 | consumed tokens: 17322475520 | elapsed time per iteration (s): 1.04 | learning rate: 1.724E-04 | global batch size: 256 | lm loss: 2.054418E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.520 | TFLOPs: 40.74 |
15: iteration 33050/ 125429 | consumed samples: 8460800 | consumed tokens: 17327718400 | elapsed time per iteration (s): 1.14 | learning rate: 1.724E-04 | global batch size: 256 | lm loss: 2.048767E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 224.449 | TFLOPs: 37.09 |
15: iteration 33060/ 125429 | consumed samples: 8463360 | consumed tokens: 17332961280 | elapsed time per iteration (s): 1.06 | learning rate: 1.724E-04 | global batch size: 256 | lm loss: 2.067808E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.057 | TFLOPs: 39.84 |
15: iteration 33070/ 125429 | consumed samples: 8465920 | consumed tokens: 17338204160 | elapsed time per iteration (s): 1.04 | learning rate: 1.724E-04 | global batch size: 256 | lm loss: 2.071114E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.831 | TFLOPs: 40.63 |
15: iteration 33080/ 125429 | consumed samples: 8468480 | consumed tokens: 17343447040 | elapsed time per iteration (s): 1.04 | learning rate: 1.724E-04 | global batch size: 256 | lm loss: 2.044182E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.095 | TFLOPs: 40.50 |
15: iteration 33090/ 125429 | consumed samples: 8471040 | consumed tokens: 17348689920 | elapsed time per iteration (s): 1.06 | learning rate: 1.724E-04 | global batch size: 256 | lm loss: 2.079775E+00 | grad norm: 0.133 | num zeros: 0.0 | number of
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.287 | TFLOPs: 40.04 | 15: iteration 33100/ 125429 | consumed samples: 8473600 | consumed tokens: 17353932800 | elapsed time per iteration (s): 1.08 | learning rate: 1.723E-04 | global batch size: 256 | lm loss: 2.077582E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.864 | TFLOPs: 39.31 | 15: iteration 33110/ 125429 | consumed samples: 8476160 | consumed tokens: 17359175680 | elapsed time per iteration (s): 1.06 | learning rate: 1.723E-04 | global batch size: 256 | lm loss: 2.077531E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.353 | TFLOPs: 39.89 | 15: iteration 33120/ 125429 | consumed samples: 8478720 | consumed tokens: 17364418560 | elapsed time per iteration (s): 1.04 | learning rate: 1.723E-04 | global batch size: 256 | lm loss: 2.042002E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.212 | TFLOPs: 40.69 | 15: iteration 33130/ 125429 | consumed samples: 8481280 | consumed tokens: 17369661440 | elapsed time per iteration (s): 1.06 | learning rate: 1.723E-04 | global batch size: 256 | lm loss: 2.054109E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.442 | TFLOPs: 39.73 | 15: iteration 33140/ 125429 | consumed samples: 8483840 | consumed tokens: 17374904320 | elapsed time per iteration (s): 1.06 | learning rate: 1.723E-04 | global batch size: 256 | lm loss: 2.055196E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.611 | TFLOPs: 40.09 | 15: iteration 33150/ 125429 | consumed samples: 8486400 | consumed tokens: 17380147200 | elapsed time per iteration (s): 1.07 | learning rate: 
1.723E-04 | global batch size: 256 | lm loss: 2.065584E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.786 | TFLOPs: 39.63 | 15: iteration 33160/ 125429 | consumed samples: 8488960 | consumed tokens: 17385390080 | elapsed time per iteration (s): 1.08 | learning rate: 1.722E-04 | global batch size: 256 | lm loss: 2.083047E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.685 | TFLOPs: 39.28 | 15: iteration 33170/ 125429 | consumed samples: 8491520 | consumed tokens: 17390632960 | elapsed time per iteration (s): 1.12 | learning rate: 1.722E-04 | global batch size: 256 | lm loss: 2.074901E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.482 | TFLOPs: 37.76 | 15: iteration 33180/ 125429 | consumed samples: 8494080 | consumed tokens: 17395875840 | elapsed time per iteration (s): 2.48 | learning rate: 1.722E-04 | global batch size: 256 | lm loss: 2.077323E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 103.043 | TFLOPs: 17.03 | 15: iteration 33190/ 125429 | consumed samples: 8496640 | consumed tokens: 17401118720 | elapsed time per iteration (s): 1.03 | learning rate: 1.722E-04 | global batch size: 256 | lm loss: 2.075371E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.495 | TFLOPs: 40.90 | 15: iteration 33200/ 125429 | consumed samples: 8499200 | consumed tokens: 17406361600 | elapsed time per iteration (s): 1.07 | learning rate: 1.722E-04 | global batch size: 256 | lm loss: 2.067166E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.838 | TFLOPs: 39.47 | 15: iteration 33210/ 125429 | consumed 
samples: 8501760 | consumed tokens: 17411604480 | elapsed time per iteration (s): 1.09 | learning rate: 1.722E-04 | global batch size: 256 | lm loss: 2.080500E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.945 | TFLOPs: 38.83 | 15: iteration 33220/ 125429 | consumed samples: 8504320 | consumed tokens: 17416847360 | elapsed time per iteration (s): 1.08 | learning rate: 1.721E-04 | global batch size: 256 | lm loss: 2.077583E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.046 | TFLOPs: 39.17 | 15: iteration 33230/ 125429 | consumed samples: 8506880 | consumed tokens: 17422090240 | elapsed time per iteration (s): 1.06 | learning rate: 1.721E-04 | global batch size: 256 | lm loss: 2.071929E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.143 | TFLOPs: 40.02 | 15: iteration 33240/ 125429 | consumed samples: 8509440 | consumed tokens: 17427333120 | elapsed time per iteration (s): 1.03 | learning rate: 1.721E-04 | global batch size: 256 | lm loss: 2.095127E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.786 | TFLOPs: 40.95 | 15: iteration 33250/ 125429 | consumed samples: 8512000 | consumed tokens: 17432576000 | elapsed time per iteration (s): 1.06 | learning rate: 1.721E-04 | global batch size: 256 | lm loss: 2.076205E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.085 | TFLOPs: 40.01 | 15: iteration 33260/ 125429 | consumed samples: 8514560 | consumed tokens: 17437818880 | elapsed time per iteration (s): 1.09 | learning rate: 1.721E-04 | global batch size: 256 | lm loss: 2.046174E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 233.884 | TFLOPs: 38.65 | 15: iteration 33270/ 125429 | consumed samples: 8517120 | consumed tokens: 17443061760 | elapsed time per iteration (s): 1.03 | learning rate: 1.721E-04 | global batch size: 256 | lm loss: 2.109257E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.647 | TFLOPs: 40.93 | 15: iteration 33280/ 125429 | consumed samples: 8519680 | consumed tokens: 17448304640 | elapsed time per iteration (s): 1.11 | learning rate: 1.720E-04 | global batch size: 256 | lm loss: 2.115067E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.218 | TFLOPs: 38.21 | 15: iteration 33290/ 125429 | consumed samples: 8522240 | consumed tokens: 17453547520 | elapsed time per iteration (s): 1.06 | learning rate: 1.720E-04 | global batch size: 256 | lm loss: 2.095524E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.211 | TFLOPs: 40.03 | 15: iteration 33300/ 125429 | consumed samples: 8524800 | consumed tokens: 17458790400 | elapsed time per iteration (s): 1.02 | learning rate: 1.720E-04 | global batch size: 256 | lm loss: 2.051272E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.228 | TFLOPs: 41.35 | 15: iteration 33310/ 125429 | consumed samples: 8527360 | consumed tokens: 17464033280 | elapsed time per iteration (s): 1.05 | learning rate: 1.720E-04 | global batch size: 256 | lm loss: 2.114069E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.860 | TFLOPs: 40.30 | 15: iteration 33320/ 125429 | consumed samples: 8529920 | consumed tokens: 17469276160 | elapsed time per iteration (s): 1.03 | learning rate: 1.720E-04 | global batch size: 256 | lm 
loss: 2.105125E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.983 | TFLOPs: 40.98 | 15: iteration 33330/ 125429 | consumed samples: 8532480 | consumed tokens: 17474519040 | elapsed time per iteration (s): 1.02 | learning rate: 1.720E-04 | global batch size: 256 | lm loss: 2.080025E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.415 | TFLOPs: 41.38 | 15: iteration 33340/ 125429 | consumed samples: 8535040 | consumed tokens: 17479761920 | elapsed time per iteration (s): 1.07 | learning rate: 1.719E-04 | global batch size: 256 | lm loss: 2.084986E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.459 | TFLOPs: 39.57 | 15: iteration 33350/ 125429 | consumed samples: 8537600 | consumed tokens: 17485004800 | elapsed time per iteration (s): 1.05 | learning rate: 1.719E-04 | global batch size: 256 | lm loss: 2.052600E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.792 | TFLOPs: 40.45 | 15: iteration 33360/ 125429 | consumed samples: 8540160 | consumed tokens: 17490247680 | elapsed time per iteration (s): 1.03 | learning rate: 1.719E-04 | global batch size: 256 | lm loss: 2.073511E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.563 | TFLOPs: 41.24 | 15: iteration 33370/ 125429 | consumed samples: 8542720 | consumed tokens: 17495490560 | elapsed time per iteration (s): 1.03 | learning rate: 1.719E-04 | global batch size: 256 | lm loss: 2.077593E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.055 | TFLOPs: 40.99 | 15: iteration 33380/ 125429 | consumed samples: 8545280 | consumed tokens: 
17500733440 | elapsed time per iteration (s): 1.05 | learning rate: 1.719E-04 | global batch size: 256 | lm loss: 2.100169E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.367 | TFLOPs: 40.38 | 15: iteration 33390/ 125429 | consumed samples: 8547840 | consumed tokens: 17505976320 | elapsed time per iteration (s): 1.04 | learning rate: 1.719E-04 | global batch size: 256 | lm loss: 2.087807E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.854 | TFLOPs: 40.79 | 15: iteration 33400/ 125429 | consumed samples: 8550400 | consumed tokens: 17511219200 | elapsed time per iteration (s): 1.02 | learning rate: 1.718E-04 | global batch size: 256 | lm loss: 2.080518E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.437 | TFLOPs: 41.39 | 15: iteration 33410/ 125429 | consumed samples: 8552960 | consumed tokens: 17516462080 | elapsed time per iteration (s): 1.02 | learning rate: 1.718E-04 | global batch size: 256 | lm loss: 2.091743E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.775 | TFLOPs: 41.44 | 15: iteration 33420/ 125429 | consumed samples: 8555520 | consumed tokens: 17521704960 | elapsed time per iteration (s): 1.05 | learning rate: 1.718E-04 | global batch size: 256 | lm loss: 2.043192E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.239 | TFLOPs: 40.36 | 15: iteration 33430/ 125429 | consumed samples: 8558080 | consumed tokens: 17526947840 | elapsed time per iteration (s): 1.02 | learning rate: 1.718E-04 | global batch size: 256 | lm loss: 2.074576E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
249.965 | TFLOPs: 41.31 | 15: iteration 33440/ 125429 | consumed samples: 8560640 | consumed tokens: 17532190720 | elapsed time per iteration (s): 1.07 | learning rate: 1.718E-04 | global batch size: 256 | lm loss: 2.052969E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.573 | TFLOPs: 39.59 | 15: iteration 33450/ 125429 | consumed samples: 8563200 | consumed tokens: 17537433600 | elapsed time per iteration (s): 1.03 | learning rate: 1.718E-04 | global batch size: 256 | lm loss: 2.107329E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.773 | TFLOPs: 41.11 | 15: iteration 33460/ 125429 | consumed samples: 8565760 | consumed tokens: 17542676480 | elapsed time per iteration (s): 1.04 | learning rate: 1.717E-04 | global batch size: 256 | lm loss: 2.097189E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.765 | TFLOPs: 40.61 | 15: iteration 33470/ 125429 | consumed samples: 8568320 | consumed tokens: 17547919360 | elapsed time per iteration (s): 1.07 | learning rate: 1.717E-04 | global batch size: 256 | lm loss: 2.066214E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.780 | TFLOPs: 39.63 | 15: iteration 33480/ 125429 | consumed samples: 8570880 | consumed tokens: 17553162240 | elapsed time per iteration (s): 1.04 | learning rate: 1.717E-04 | global batch size: 256 | lm loss: 2.078147E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.918 | TFLOPs: 40.64 | 15: iteration 33490/ 125429 | consumed samples: 8573440 | consumed tokens: 17558405120 | elapsed time per iteration (s): 1.05 | learning rate: 1.717E-04 | global batch size: 256 | lm loss: 2.086846E+00 | grad norm: 0.136 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.548 | TFLOPs: 40.41 | 15: iteration 33500/ 125429 | consumed samples: 8576000 | consumed tokens: 17563648000 | elapsed time per iteration (s): 1.03 | learning rate: 1.717E-04 | global batch size: 256 | lm loss: 2.072745E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.663 | TFLOPs: 41.26 | 15: iteration 33510/ 125429 | consumed samples: 8578560 | consumed tokens: 17568890880 | elapsed time per iteration (s): 1.03 | learning rate: 1.717E-04 | global batch size: 256 | lm loss: 2.078300E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.637 | TFLOPs: 40.92 | 15: iteration 33520/ 125429 | consumed samples: 8581120 | consumed tokens: 17574133760 | elapsed time per iteration (s): 1.02 | learning rate: 1.716E-04 | global batch size: 256 | lm loss: 2.063317E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.131 | TFLOPs: 41.34 | 15: iteration 33530/ 125429 | consumed samples: 8583680 | consumed tokens: 17579376640 | elapsed time per iteration (s): 1.04 | learning rate: 1.716E-04 | global batch size: 256 | lm loss: 2.068791E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.275 | TFLOPs: 40.53 | 15: iteration 33540/ 125429 | consumed samples: 8586240 | consumed tokens: 17584619520 | elapsed time per iteration (s): 1.05 | learning rate: 1.716E-04 | global batch size: 256 | lm loss: 2.076641E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.879 | TFLOPs: 40.30 | 15: iteration 33550/ 125429 | consumed samples: 8588800 | consumed tokens: 17589862400 | elapsed time per iteration (s): 
1.07 | learning rate: 1.716E-04 | global batch size: 256 | lm loss: 2.101474E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.473 | TFLOPs: 39.41 | 15: iteration 33560/ 125429 | consumed samples: 8591360 | consumed tokens: 17595105280 | elapsed time per iteration (s): 1.02 | learning rate: 1.716E-04 | global batch size: 256 | lm loss: 2.060417E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.392 | TFLOPs: 41.38 | 15: iteration 33570/ 125429 | consumed samples: 8593920 | consumed tokens: 17600348160 | elapsed time per iteration (s): 1.04 | learning rate: 1.716E-04 | global batch size: 256 | lm loss: 2.085826E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.232 | TFLOPs: 40.69 | 15: iteration 33580/ 125429 | consumed samples: 8596480 | consumed tokens: 17605591040 | elapsed time per iteration (s): 1.05 | learning rate: 1.715E-04 | global batch size: 256 | lm loss: 2.076817E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.552 | TFLOPs: 40.41 | 15: iteration 33590/ 125429 | consumed samples: 8599040 | consumed tokens: 17610833920 | elapsed time per iteration (s): 1.05 | learning rate: 1.715E-04 | global batch size: 256 | lm loss: 2.083651E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.310 | TFLOPs: 40.37 | 15: iteration 33600/ 125429 | consumed samples: 8601600 | consumed tokens: 17616076800 | elapsed time per iteration (s): 1.04 | learning rate: 1.715E-04 | global batch size: 256 | lm loss: 2.065825E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.477 | TFLOPs: 40.73 | 15: iteration 33610/ 
125429 | consumed samples: 8604160 | consumed tokens: 17621319680 | elapsed time per iteration (s): 1.04 | learning rate: 1.715E-04 | global batch size: 256 | lm loss: 2.084860E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.072 | TFLOPs: 40.50 | 15: iteration 33620/ 125429 | consumed samples: 8606720 | consumed tokens: 17626562560 | elapsed time per iteration (s): 1.04 | learning rate: 1.715E-04 | global batch size: 256 | lm loss: 2.064648E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.030 | TFLOPs: 40.82 | 15: iteration 33630/ 125429 | consumed samples: 8609280 | consumed tokens: 17631805440 | elapsed time per iteration (s): 1.07 | learning rate: 1.715E-04 | global batch size: 256 | lm loss: 2.063200E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.861 | TFLOPs: 39.64 | 15: iteration 33640/ 125429 | consumed samples: 8611840 | consumed tokens: 17637048320 | elapsed time per iteration (s): 1.05 | learning rate: 1.714E-04 | global batch size: 256 | lm loss: 2.089346E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.653 | TFLOPs: 40.43 | 15: iteration 33650/ 125429 | consumed samples: 8614400 | consumed tokens: 17642291200 | elapsed time per iteration (s): 1.05 | learning rate: 1.714E-04 | global batch size: 256 | lm loss: 2.083513E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.926 | TFLOPs: 40.15 | 15: iteration 33660/ 125429 | consumed samples: 8616960 | consumed tokens: 17647534080 | elapsed time per iteration (s): 1.05 | learning rate: 1.714E-04 | global batch size: 256 | lm loss: 2.078724E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | samples per second: 243.098 | TFLOPs: 40.17 | 15: iteration 33670/ 125429 | consumed samples: 8619520 | consumed tokens: 17652776960 | elapsed time per iteration (s): 1.04 | learning rate: 1.714E-04 | global batch size: 256 | lm loss: 2.059706E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.728 | TFLOPs: 40.61 | 15: iteration 33680/ 125429 | consumed samples: 8622080 | consumed tokens: 17658019840 | elapsed time per iteration (s): 1.05 | learning rate: 1.714E-04 | global batch size: 256 | lm loss: 2.070206E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.119 | TFLOPs: 40.18 | 15: iteration 33690/ 125429 | consumed samples: 8624640 | consumed tokens: 17663262720 | elapsed time per iteration (s): 1.07 | learning rate: 1.714E-04 | global batch size: 256 | lm loss: 2.065736E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.148 | TFLOPs: 39.36 | 15: iteration 33700/ 125429 | consumed samples: 8627200 | consumed tokens: 17668505600 | elapsed time per iteration (s): 1.03 | learning rate: 1.713E-04 | global batch size: 256 | lm loss: 2.093485E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.631 | TFLOPs: 41.09 | 15: iteration 33710/ 125429 | consumed samples: 8629760 | consumed tokens: 17673748480 | elapsed time per iteration (s): 1.04 | learning rate: 1.713E-04 | global batch size: 256 | lm loss: 2.110175E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.661 | TFLOPs: 40.76 | 15: iteration 33720/ 125429 | consumed samples: 8632320 | consumed tokens: 17678991360 | elapsed time per iteration (s): 1.03 | learning rate: 1.713E-04 | global batch 
size: 256 | lm loss: 2.071654E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.401 | TFLOPs: 41.05 | 15: iteration 33730/ 125429 | consumed samples: 8634880 | consumed tokens: 17684234240 | elapsed time per iteration (s): 1.02 | learning rate: 1.713E-04 | global batch size: 256 | lm loss: 2.097255E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.561 | TFLOPs: 41.41 | 15: iteration 33740/ 125429 | consumed samples: 8637440 | consumed tokens: 17689477120 | elapsed time per iteration (s): 1.06 | learning rate: 1.713E-04 | global batch size: 256 | lm loss: 2.084569E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.779 | TFLOPs: 39.96 | 15: iteration 33750/ 125429 | consumed samples: 8640000 | consumed tokens: 17694720000 | elapsed time per iteration (s): 1.06 | learning rate: 1.713E-04 | global batch size: 256 | lm loss: 2.087046E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.931 | TFLOPs: 39.82 | 15: iteration 33760/ 125429 | consumed samples: 8642560 | consumed tokens: 17699962880 | elapsed time per iteration (s): 1.03 | learning rate: 1.712E-04 | global batch size: 256 | lm loss: 2.116736E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.185 | TFLOPs: 41.01 | 15: iteration 33770/ 125429 | consumed samples: 8645120 | consumed tokens: 17705205760 | elapsed time per iteration (s): 1.04 | learning rate: 1.712E-04 | global batch size: 256 | lm loss: 2.079731E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.719 | TFLOPs: 40.77 | 15: iteration 33780/ 125429 | consumed samples: 8647680 | consumed 
tokens: 17710448640 | elapsed time per iteration (s): 1.06 | learning rate: 1.712E-04 | global batch size: 256 | lm loss: 2.076111E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.594 | TFLOPs: 39.93 | 15: iteration 33790/ 125429 | consumed samples: 8650240 | consumed tokens: 17715691520 | elapsed time per iteration (s): 1.03 | learning rate: 1.712E-04 | global batch size: 256 | lm loss: 2.092229E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.224 | TFLOPs: 41.02 | 15: iteration 33800/ 125429 | consumed samples: 8652800 | consumed tokens: 17720934400 | elapsed time per iteration (s): 1.03 | learning rate: 1.712E-04 | global batch size: 256 | lm loss: 2.080393E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.279 | TFLOPs: 41.03 | 15: iteration 33810/ 125429 | consumed samples: 8655360 | consumed tokens: 17726177280 | elapsed time per iteration (s): 1.03 | learning rate: 1.712E-04 | global batch size: 256 | lm loss: 2.096778E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.372 | TFLOPs: 41.21 | 15: iteration 33820/ 125429 | consumed samples: 8657920 | consumed tokens: 17731420160 | elapsed time per iteration (s): 1.02 | learning rate: 1.711E-04 | global batch size: 256 | lm loss: 2.080802E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.228 | TFLOPs: 41.52 | 15: iteration 33830/ 125429 | consumed samples: 8660480 | consumed tokens: 17736663040 | elapsed time per iteration (s): 1.03 | learning rate: 1.711E-04 | global batch size: 256 | lm loss: 2.093870E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 247.767 | TFLOPs: 40.95 | 15: iteration 33840/ 125429 | consumed samples: 8663040 | consumed tokens: 17741905920 | elapsed time per iteration (s): 1.03 | learning rate: 1.711E-04 | global batch size: 256 | lm loss: 2.108047E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.363 | TFLOPs: 40.88 | 15: iteration 33850/ 125429 | consumed samples: 8665600 | consumed tokens: 17747148800 | elapsed time per iteration (s): 1.03 | learning rate: 1.711E-04 | global batch size: 256 | lm loss: 2.110097E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.415 | TFLOPs: 41.22 | 15: iteration 33860/ 125429 | consumed samples: 8668160 | consumed tokens: 17752391680 | elapsed time per iteration (s): 1.05 | learning rate: 1.711E-04 | global batch size: 256 | lm loss: 2.056079E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.773 | TFLOPs: 40.12 | 15: iteration 33870/ 125429 | consumed samples: 8670720 | consumed tokens: 17757634560 | elapsed time per iteration (s): 1.03 | learning rate: 1.711E-04 | global batch size: 256 | lm loss: 2.073285E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.077 | TFLOPs: 41.00 | 15: iteration 33880/ 125429 | consumed samples: 8673280 | consumed tokens: 17762877440 | elapsed time per iteration (s): 1.03 | learning rate: 1.710E-04 | global batch size: 256 | lm loss: 2.058531E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.378 | TFLOPs: 40.88 | 15: iteration 33890/ 125429 | consumed samples: 8675840 | consumed tokens: 17768120320 | elapsed time per iteration (s): 1.08 | learning rate: 1.710E-04 | global batch size: 256 | lm loss: 2.067362E+00 | grad norm: 
0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.675 | TFLOPs: 39.28 | 15: iteration 33900/ 125429 | consumed samples: 8678400 | consumed tokens: 17773363200 | elapsed time per iteration (s): 1.02 | learning rate: 1.710E-04 | global batch size: 256 | lm loss: 2.069526E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.345 | TFLOPs: 41.54 | 15: iteration 33910/ 125429 | consumed samples: 8680960 | consumed tokens: 17778606080 | elapsed time per iteration (s): 1.04 | learning rate: 1.710E-04 | global batch size: 256 | lm loss: 2.087922E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.745 | TFLOPs: 40.61 | 15: iteration 33920/ 125429 | consumed samples: 8683520 | consumed tokens: 17783848960 | elapsed time per iteration (s): 1.05 | learning rate: 1.710E-04 | global batch size: 256 | lm loss: 2.077230E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.760 | TFLOPs: 40.28 | 15: iteration 33930/ 125429 | consumed samples: 8686080 | consumed tokens: 17789091840 | elapsed time per iteration (s): 1.06 | learning rate: 1.710E-04 | global batch size: 256 | lm loss: 2.059933E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.438 | TFLOPs: 39.90 | 15: iteration 33940/ 125429 | consumed samples: 8688640 | consumed tokens: 17794334720 | elapsed time per iteration (s): 1.16 | learning rate: 1.709E-04 | global batch size: 256 | lm loss: 2.077185E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 220.225 | TFLOPs: 36.39 | 15: iteration 33950/ 125429 | consumed samples: 8691200 | consumed tokens: 17799577600 | elapsed time per 
iteration (s): 1.03 | learning rate: 1.709E-04 | global batch size: 256 | lm loss: 2.077219E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.459 | TFLOPs: 41.06 | 15: iteration 33960/ 125429 | consumed samples: 8693760 | consumed tokens: 17804820480 | elapsed time per iteration (s): 1.07 | learning rate: 1.709E-04 | global batch size: 256 | lm loss: 2.079839E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.738 | TFLOPs: 39.45 | 15: iteration 33970/ 125429 | consumed samples: 8696320 | consumed tokens: 17810063360 | elapsed time per iteration (s): 1.03 | learning rate: 1.709E-04 | global batch size: 256 | lm loss: 2.053654E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.680 | TFLOPs: 41.10 | 15: iteration 33980/ 125429 | consumed samples: 8698880 | consumed tokens: 17815306240 | elapsed time per iteration (s): 1.04 | learning rate: 1.709E-04 | global batch size: 256 | lm loss: 2.074000E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.219 | TFLOPs: 40.85 | 15: iteration 33990/ 125429 | consumed samples: 8701440 | consumed tokens: 17820549120 | elapsed time per iteration (s): 1.05 | learning rate: 1.709E-04 | global batch size: 256 | lm loss: 2.082308E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.870 | TFLOPs: 40.14 | 0: [2022-11-26 05:57:41,585] [INFO] [logging.py:68:log_dist] [Rank 0] step=34000, skipped=0, lr=[0.00017084110088462937, 0.00017084110088462937, 0.00017084110088462937], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 0: steps: 34000 loss: 2.1330 iter time (s): 1.065 samples/sec: 240.452 15: iteration 34000/ 125429 | consumed samples: 8704000 | consumed 
tokens: 17825792000 | elapsed time per iteration (s): 1.04 | learning rate: 1.708E-04 | global batch size: 256 | lm loss: 2.092585E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.008 | TFLOPs: 40.65 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 34000 | lm loss value: 2.024457E+00 | lm loss PPL: 7.571996E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 34000 to checkpoints_1b5 0: [2022-11-26 05:57:41,930] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step34000 is begin to save! 0: [2022-11-26 05:57:41,937] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_01-model_00-model_states.pt... 0: [2022-11-26 05:57:42,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_01-model_00-model_states.pt. 0: [2022-11-26 05:57:42,225] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_03-model_00-model_states.pt... 0: [2022-11-26 05:57:42,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_03-model_00-model_states.pt. 0: [2022-11-26 05:57:42,331] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_04-model_00-model_states.pt... 0: [2022-11-26 05:57:42,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_04-model_00-model_states.pt. 0: [2022-11-26 05:57:42,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_05-model_00-model_states.pt... 
0: [2022-11-26 05:57:42,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_05-model_00-model_states.pt. 0: [2022-11-26 05:57:42,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_06-model_00-model_states.pt... 0: [2022-11-26 05:57:42,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_06-model_00-model_states.pt. 0: [2022-11-26 05:57:42,653] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_07-model_00-model_states.pt... 0: [2022-11-26 05:57:42,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_07-model_00-model_states.pt. 0: [2022-11-26 05:57:42,762] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_08-model_00-model_states.pt... 0: [2022-11-26 05:57:42,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_08-model_00-model_states.pt. 0: [2022-11-26 05:57:42,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_09-model_00-model_states.pt... 0: [2022-11-26 05:57:42,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_09-model_00-model_states.pt. 0: [2022-11-26 05:57:42,974] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_10-model_00-model_states.pt... 0: [2022-11-26 05:57:43,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_10-model_00-model_states.pt. 0: [2022-11-26 05:57:43,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_11-model_00-model_states.pt... 
0: [2022-11-26 05:57:43,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_11-model_00-model_states.pt. 0: [2022-11-26 05:57:43,186] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_12-model_00-model_states.pt... 0: [2022-11-26 05:57:43,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_12-model_00-model_states.pt. 0: [2022-11-26 05:57:43,286] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_13-model_00-model_states.pt... 0: [2022-11-26 05:57:43,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_13-model_00-model_states.pt. 0: [2022-11-26 05:57:43,390] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_14-model_00-model_states.pt... 0: [2022-11-26 05:57:43,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_14-model_00-model_states.pt. 0: [2022-11-26 05:57:43,495] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_15-model_00-model_states.pt... 0: [2022-11-26 05:57:43,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_15-model_00-model_states.pt. 0: [2022-11-26 05:57:43,600] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_16-model_00-model_states.pt... 0: [2022-11-26 05:57:43,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_16-model_00-model_states.pt. 0: [2022-11-26 05:57:43,708] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_17-model_00-model_states.pt... 
0: [2022-11-26 05:57:43,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_17-model_00-model_states.pt. 0: [2022-11-26 05:57:43,811] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_18-model_00-model_states.pt... 0: [2022-11-26 05:57:43,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_18-model_00-model_states.pt. 0: [2022-11-26 05:57:43,922] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_19-model_00-model_states.pt... 0: [2022-11-26 05:57:44,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_19-model_00-model_states.pt. 0: [2022-11-26 05:57:44,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_20-model_00-model_states.pt... 0: [2022-11-26 05:57:44,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_20-model_00-model_states.pt. 0: [2022-11-26 05:57:44,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_21-model_00-model_states.pt... 0: [2022-11-26 05:57:44,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_21-model_00-model_states.pt. 0: [2022-11-26 05:57:44,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_22-model_00-model_states.pt... 0: [2022-11-26 05:57:44,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_22-model_00-model_states.pt. 0: [2022-11-26 05:57:44,341] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_23-model_00-model_states.pt... 
0: [2022-11-26 05:57:44,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_23-model_00-model_states.pt. 0: [2022-11-26 05:57:44,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_24-model_00-model_states.pt... 0: [2022-11-26 05:57:44,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_24-model_00-model_states.pt. 0: [2022-11-26 05:57:44,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_25-model_00-model_states.pt... 0: [2022-11-26 05:57:44,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_25-model_00-model_states.pt. 0: [2022-11-26 05:57:44,659] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_26-model_00-model_states.pt... 0: [2022-11-26 05:57:44,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_26-model_00-model_states.pt. 0: [2022-11-26 05:57:44,760] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_27-model_00-model_states.pt... 0: [2022-11-26 05:57:44,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_27-model_00-model_states.pt. 0: [2022-11-26 05:57:44,862] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_28-model_00-model_states.pt... 0: [2022-11-26 05:57:44,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_28-model_00-model_states.pt. 0: [2022-11-26 05:57:44,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_29-model_00-model_states.pt... 
0: [2022-11-26 05:57:45,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_29-model_00-model_states.pt. 0: [2022-11-26 05:57:45,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_30-model_00-model_states.pt... 0: [2022-11-26 05:57:45,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_30-model_00-model_states.pt. 0: [2022-11-26 05:57:45,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/layer_32-model_00-model_states.pt... 0: [2022-11-26 05:57:45,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/layer_32-model_00-model_states.pt. 0: [2022-11-26 05:57:45,181] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step34000/mp_rank_00_model_states.pt 0: [2022-11-26 05:57:45,182] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/mp_rank_00_model_states.pt... 0: [2022-11-26 05:57:45,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/mp_rank_00_model_states.pt. 0: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
0: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
5: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
14: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
7: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
6: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
12: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
8: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
14: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
2: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
11: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 5: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 13: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 05:57:45,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step34000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 4: [2022-11-26 05:57:45,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:57:45,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 05:57:45,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-26 05:57:45,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:57:45,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 05:57:45,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-26 05:57:45,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:57:45,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 05:57:45,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 05:57:45,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-26 05:57:45,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:57:45,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:57:45,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 05:57:45,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 05:57:45,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-26 05:57:45,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 10: [2022-11-26 05:57:45,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:57:45,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 05:57:45,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-26 05:57:45,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 05:57:45,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 05:57:45,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 8: [2022-11-26 05:57:45,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:57:45,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 05:57:45,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-26 05:57:45,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:57:45,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 05:57:45,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 9: [2022-11-26 05:57:45,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:57:45,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 05:57:45,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:57:45,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 05:57:45,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 9: [2022-11-26 05:57:45,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 05:57:45,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 05:57:45,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 9: [2022-11-26 05:57:45,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 14: [2022-11-26 05:57:45,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:57:45,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 05:57:45,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 12: [2022-11-26 05:57:45,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:57:45,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 05:57:45,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 05:57:45,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 05:57:45,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 12: [2022-11-26 05:57:45,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-26 05:57:45,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:57:45,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:57:45,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 05:57:45,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-26 05:57:45,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 05:57:45,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-26 05:57:45,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 05:57:45,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 05:57:45,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-26 05:57:45,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:57:45,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:57:45,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 5: [2022-11-26 05:57:45,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 9: [2022-11-26 05:57:45,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:57:45,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-26 05:57:45,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 9: [2022-11-26 05:57:45,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 05:57:45,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-26 05:57:45,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 05:57:45,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:57:45,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 05:57:45,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 05:57:45,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-26 05:57:45,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 12: [2022-11-26 05:57:45,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:57:45,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 05:57:45,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-26 05:57:45,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:57:45,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 05:57:45,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-26 05:57:45,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 05:57:45,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 05:57:45,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-26 05:57:45,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:57:45,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:57:45,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 0: [2022-11-26 05:57:45,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 4: [2022-11-26 05:57:45,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-26 05:57:45,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-26 05:57:45,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:57:45,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 05:57:45,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-26 05:57:45,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 05:57:45,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 8: [2022-11-26 05:57:45,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:57:45,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 05:57:45,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-26 05:57:45,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-26 05:57:45,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:57:45,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 05:57:45,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 9: [2022-11-26 05:57:45,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:57:45,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 05:57:45,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 10: [2022-11-26 05:57:45,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 05:57:45,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:57:45,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 05:57:45,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 05:57:45,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 10: [2022-11-26 05:57:45,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 9: [2022-11-26 05:57:45,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 05:57:45,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 05:57:45,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-26 05:57:45,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:57:45,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 05:57:45,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 9: [2022-11-26 05:57:45,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 05:57:45,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 05:57:45,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-26 05:57:45,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:57:45,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 05:57:45,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 13: [2022-11-26 05:57:45,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:57:45,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 05:57:45,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 13: [2022-11-26 05:57:45,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:57:45,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 05:57:45,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 13: [2022-11-26 05:57:45,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 05:57:45,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 05:57:45,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 13: [2022-11-26 05:57:45,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:57:45,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 05:57:45,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 13: [2022-11-26 05:57:45,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:57:45,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 05:57:45,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 8: [2022-11-26 05:57:45,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:57:45,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 05:57:45,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-26 05:57:45,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 05:57:45,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 05:57:45,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 8: [2022-11-26 05:57:45,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:57:45,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 05:57:45,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-26 05:57:45,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:57:45,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:57:45,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 6: [2022-11-26 05:57:45,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 05:57:45,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-26 05:57:45,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-26 05:57:45,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 05:57:45,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 05:57:45,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 11: [2022-11-26 05:57:45,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:57:45,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 05:57:45,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 12: [2022-11-26 05:57:45,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:57:45,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 05:57:45,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-26 05:57:45,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:57:45,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 05:57:45,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-26 05:57:45,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 05:57:45,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:57:45,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 05:57:45,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 05:57:45,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-26 05:57:45,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 13: [2022-11-26 05:57:45,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:57:45,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 05:57:45,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 10: [2022-11-26 05:57:45,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:57:45,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 05:57:45,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 05:57:45,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 05:57:45,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 10: [2022-11-26 05:57:45,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-26 05:57:45,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:57:45,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 05:57:45,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-26 05:57:45,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:57:45,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 05:57:45,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 14: [2022-11-26 05:57:45,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 05:57:45,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 05:57:45,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 14: [2022-11-26 05:57:45,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:57:45,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 05:57:45,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 14: [2022-11-26 05:57:45,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:57:45,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 05:57:45,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 14: [2022-11-26 05:57:45,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:57:45,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 05:57:45,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 14: [2022-11-26 05:57:45,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 05:57:45,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 05:57:45,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 8: [2022-11-26 05:57:45,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:57:45,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 05:57:45,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-26 05:57:45,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:57:45,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 05:57:45,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-26 05:57:45,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:57:45,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 05:57:45,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-26 05:57:45,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 05:57:45,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 14: [2022-11-26 05:57:45,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:57:45,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 14: [2022-11-26 05:57:45,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 05:57:45,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-26 05:57:45,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:57:45,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:57:45,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 05:57:45,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 05:57:45,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-26 05:57:45,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-26 05:57:45,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 05:57:45,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 05:57:45,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 10: [2022-11-26 05:57:45,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:57:45,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 05:57:45,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:57:45,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 10: [2022-11-26 05:57:45,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 05:57:45,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 10: [2022-11-26 05:57:45,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 05:57:45,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 05:57:45,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-26 05:57:45,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 05:57:45,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:57:45,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 05:57:45,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:57:45,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:57:45,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 05:57:45,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-26 05:57:45,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-26 05:57:45,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 05:57:45,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 05:57:45,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-26 05:57:45,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-26 05:57:45,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 05:57:45,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 05:57:45,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 1: [2022-11-26 05:57:45,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:57:45,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 1: [2022-11-26 05:57:45,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:57:45,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:57:45,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 1: [2022-11-26 05:57:45,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 5: [2022-11-26 05:57:45,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 1: [2022-11-26 05:57:45,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 05:57:45,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-26 05:57:45,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
1: [2022-11-26 05:57:45,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-26 05:57:45,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 0: [2022-11-26 05:57:45,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 5: [2022-11-26 05:57:45,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 0: [2022-11-26 05:57:45,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 05:57:45,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 5: [2022-11-26 05:57:45,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 05:57:45,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 2: [2022-11-26 05:57:45,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:57:45,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 05:57:45,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 15: [2022-11-26 05:57:45,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 05:57:45,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 05:57:45,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:57:45,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 15: [2022-11-26 05:57:45,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:57:45,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 05:57:45,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 05:57:45,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 15: [2022-11-26 05:57:45,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 15: [2022-11-26 05:57:45,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:57:45,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 05:57:45,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 13: [2022-11-26 05:57:45,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 05:57:45,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 05:57:45,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 15: [2022-11-26 05:57:45,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:57:45,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 05:57:45,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 15: [2022-11-26 05:57:45,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:57:45,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:57:45,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 05:57:45,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 05:57:45,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 9: [2022-11-26 05:57:45,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 05:57:45,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 05:57:45,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 13: [2022-11-26 05:57:45,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 05:57:45,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 05:57:45,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 8: [2022-11-26 05:57:45,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:57:45,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 05:57:45,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 12: [2022-11-26 05:57:45,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:57:45,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 05:57:45,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 05:57:45,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 05:57:45,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 12: [2022-11-26 05:57:45,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 12: [2022-11-26 05:57:45,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:57:45,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 05:57:45,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 05:57:45,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 05:57:45,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 12: [2022-11-26 05:57:45,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-26 05:57:45,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 05:57:45,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 05:57:45,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:57:45,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-26 05:57:45,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 05:57:45,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 4: [2022-11-26 05:57:45,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 05:57:45,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 05:57:45,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-26 05:57:45,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:57:45,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 05:57:45,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 11: [2022-11-26 05:57:45,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-26 05:57:45,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:57:45,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 05:57:45,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 05:57:45,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 15: [2022-11-26 05:57:45,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 05:57:45,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 15: [2022-11-26 05:57:45,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 15: [2022-11-26 05:57:45,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-26 05:57:45,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 05:57:45,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 05:57:45,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 3: [2022-11-26 05:57:45,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 05:57:45,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 05:57:45,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-26 05:57:45,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:57:45,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 05:57:45,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 05:57:45,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 05:57:45,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 6: [2022-11-26 05:57:45,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-26 05:57:45,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:57:45,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 05:57:45,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
0: [2022-11-26 05:57:45,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 05:57:45,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 11: [2022-11-26 05:57:45,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:57:45,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 05:57:45,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 05:57:45,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 05:57:45,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 11: [2022-11-26 05:57:45,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 11: [2022-11-26 05:57:45,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 11: [2022-11-26 05:57:45,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:57:45,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 05:57:45,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
11: [2022-11-26 05:57:45,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:57:45,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 05:57:45,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 11: [2022-11-26 05:57:45,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:57:45,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 05:57:45,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 11: [2022-11-26 05:57:45,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 05:57:45,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 05:57:45,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 14: [2022-11-26 05:57:45,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 05:57:45,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 05:57:45,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
7: [2022-11-26 05:57:45,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:57:45,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 05:57:45,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-26 05:57:45,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:57:45,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 05:57:45,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 7: [2022-11-26 05:57:45,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 05:57:45,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 05:57:45,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 8: [2022-11-26 05:57:45,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 05:57:45,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 05:57:45,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now! 
8: [2022-11-26 05:57:45,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt.
8: [2022-11-26 05:57:45,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step34000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt
8: [2022-11-26 05:57:45,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step34000 is ready now!
0: successfully saved checkpoint at iteration 34000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3669.53
15: iteration 34010/ 125429 | consumed samples: 8706560 | consumed tokens: 17831034880 | elapsed time per iteration (s): 1.42 | learning rate: 1.708E-04 | global batch size: 256 | lm loss: 2.097430E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 180.664 | TFLOPs: 29.86 |
15: iteration 34020/ 125429 | consumed samples: 8709120 | consumed tokens: 17836277760 | elapsed time per iteration (s): 1.03 | learning rate: 1.708E-04 | global batch size: 256 | lm loss: 2.084137E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.197 | TFLOPs: 41.02 |
15: iteration 34030/ 125429 | consumed samples: 8711680 | consumed tokens: 17841520640 | elapsed time per iteration (s): 1.04 | learning rate: 1.708E-04 | global batch size: 256 | lm loss: 2.074447E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.560 | TFLOPs: 40.75 |
15: iteration 34040/ 125429 | consumed samples: 8714240 | consumed tokens: 17846763520 | elapsed time per iteration (s): 1.02 | learning rate: 1.708E-04 | global batch size: 256 | lm loss: 2.086501E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.784 | TFLOPs: 41.28 |
15: iteration 34050/ 125429 | consumed samples: 8716800 | consumed tokens: 17852006400 | elapsed time per iteration (s): 1.03 | learning rate: 1.708E-04 | global batch size: 256 | lm loss: 2.075193E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.444 | TFLOPs: 41.06 |
15: iteration 34060/ 125429 | consumed samples: 8719360 | consumed tokens: 17857249280 | elapsed time per iteration (s): 1.17 | learning rate: 1.707E-04 | global batch size: 256 | lm loss: 2.047791E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 218.392 | TFLOPs: 36.09 |
15: iteration 34070/ 125429 | consumed samples: 8721920 | consumed tokens: 17862492160 | elapsed time per iteration (s): 1.04 | learning rate: 1.707E-04 | global batch size: 256 | lm loss: 2.055295E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.590 | TFLOPs: 40.75 |
15: iteration 34080/ 125429 | consumed samples: 8724480 | consumed tokens: 17867735040 | elapsed time per iteration (s): 1.05 | learning rate: 1.707E-04 | global batch size: 256 | lm loss: 2.095770E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.542 | TFLOPs: 40.41 |
15: iteration 34090/ 125429 | consumed samples: 8727040 | consumed tokens: 17872977920 | elapsed time per iteration (s): 1.03 | learning rate: 1.707E-04 | global batch size: 256 | lm loss: 2.091376E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.819 | TFLOPs: 41.12 |
15: iteration 34100/ 125429 | consumed samples: 8729600 | consumed tokens: 17878220800 | elapsed time per iteration (s): 1.07 | learning rate: 1.707E-04 | global batch size: 256 | lm loss: 2.073442E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.804 | TFLOPs: 39.63 |
15: iteration 34110/ 125429 | consumed samples: 8732160 | consumed tokens: 17883463680 | elapsed time per iteration (s): 1.05 | learning rate: 1.707E-04 | global batch size: 256 | lm loss: 2.060953E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.582 | TFLOPs: 40.25 |
15: iteration 34120/ 125429 | consumed samples: 8734720 | consumed tokens: 17888706560 | elapsed time per iteration (s): 1.05 | learning rate: 1.706E-04 | global batch size: 256 | lm loss: 2.076675E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.541 | TFLOPs: 40.41 |
15: iteration 34130/ 125429 | consumed samples: 8737280 | consumed tokens: 17893949440 | elapsed time per iteration (s): 1.06 | learning rate: 1.706E-04 | global batch size: 256 | lm loss: 2.097099E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.297 | TFLOPs: 39.88 |
15: iteration 34140/ 125429 | consumed samples: 8739840 | consumed tokens: 17899192320 | elapsed time per iteration (s): 1.02 | learning rate: 1.706E-04 | global batch size: 256 | lm loss: 2.047300E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.686 | TFLOPs: 41.43 |
15: iteration 34150/ 125429 | consumed samples: 8742400 | consumed tokens: 17904435200 | elapsed time per iteration (s): 1.04 | learning rate: 1.706E-04 | global batch size: 256 | lm loss: 2.071077E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.231 | TFLOPs: 40.69 |
15: iteration 34160/ 125429 | consumed samples: 8744960 | consumed tokens: 17909678080 | elapsed time per iteration (s): 1.02 | learning rate: 1.706E-04 | global batch size: 256 | lm loss: 2.065263E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.047 | TFLOPs: 41.32 |
15: iteration 34170/ 125429 | consumed samples: 8747520 | consumed tokens: 17914920960 | elapsed time per iteration (s): 1.11 | learning rate: 1.706E-04 | global batch size: 256 | lm loss: 2.075276E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.394 | TFLOPs: 38.07 |
15: iteration 34180/ 125429 | consumed samples: 8750080 | consumed tokens: 17920163840 | elapsed time per iteration (s): 1.04 | learning rate: 1.705E-04 | global batch size: 256 | lm loss: 2.105079E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.019 | TFLOPs: 40.49 |
15: iteration 34190/ 125429 | consumed samples: 8752640 | consumed tokens: 17925406720 | elapsed time per iteration (s): 1.07 | learning rate: 1.705E-04 | global batch size: 256 | lm loss: 2.084101E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.773 | TFLOPs: 39.62 |
15: iteration 34200/ 125429 | consumed samples: 8755200 | consumed tokens: 17930649600 | elapsed time per iteration (s): 1.04 | learning rate: 1.705E-04 | global batch size: 256 | lm loss: 2.048858E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.204 | TFLOPs: 40.52 |
15: iteration 34210/ 125429 | consumed samples: 8757760 | consumed tokens: 17935892480 | elapsed time per iteration (s): 1.28 | learning rate: 1.705E-04 | global batch size: 256 | lm loss: 2.066852E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 200.336 | TFLOPs: 33.11 |
15: iteration 34220/ 125429 | consumed samples: 8760320 | consumed tokens: 17941135360 | elapsed time per iteration (s): 1.02 | learning rate: 1.705E-04 | global batch size: 256 | lm loss: 2.077406E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.084 | TFLOPs: 41.33 |
15: iteration 34230/ 125429 | consumed samples: 8762880 | consumed tokens: 17946378240 | elapsed time per iteration (s): 1.26 | learning rate: 1.705E-04 | global batch size: 256 | lm loss: 2.051301E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 203.240 | TFLOPs: 33.59 |
15: iteration 34240/ 125429 | consumed samples: 8765440 | consumed tokens: 17951621120 | elapsed time per iteration (s): 1.11 | learning rate: 1.704E-04 | global batch size: 256 | lm loss: 2.057191E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.687 | TFLOPs: 38.12 |
15: iteration 34250/ 125429 | consumed samples: 8768000 | consumed tokens: 17956864000 | elapsed time per iteration (s): 1.08 | learning rate: 1.704E-04 | global batch size: 256 | lm loss: 2.069559E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.017 | TFLOPs: 39.00 |
15: iteration 34260/ 125429 | consumed samples: 8770560 | consumed tokens: 17962106880 | elapsed time per iteration (s): 1.10 | learning rate: 1.704E-04 | global batch size: 256 | lm loss: 2.073823E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.358 | TFLOPs: 38.40 |
15: iteration 34270/ 125429 | consumed samples: 8773120 | consumed tokens: 17967349760 | elapsed time per iteration (s): 1.10 | learning rate: 1.704E-04 | global batch size: 256 | lm loss: 2.092025E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan
iterations: 0 | samples per second: 233.523 | TFLOPs: 38.59 | 15: iteration 34280/ 125429 | consumed samples: 8775680 | consumed tokens: 17972592640 | elapsed time per iteration (s): 1.25 | learning rate: 1.704E-04 | global batch size: 256 | lm loss: 2.083292E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 205.120 | TFLOPs: 33.90 | 15: iteration 34290/ 125429 | consumed samples: 8778240 | consumed tokens: 17977835520 | elapsed time per iteration (s): 1.05 | learning rate: 1.704E-04 | global batch size: 256 | lm loss: 2.096917E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.941 | TFLOPs: 40.15 | 15: iteration 34300/ 125429 | consumed samples: 8780800 | consumed tokens: 17983078400 | elapsed time per iteration (s): 1.04 | learning rate: 1.703E-04 | global batch size: 256 | lm loss: 2.095827E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.481 | TFLOPs: 40.57 | 15: iteration 34310/ 125429 | consumed samples: 8783360 | consumed tokens: 17988321280 | elapsed time per iteration (s): 1.03 | learning rate: 1.703E-04 | global batch size: 256 | lm loss: 2.074207E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.030 | TFLOPs: 40.99 | 15: iteration 34320/ 125429 | consumed samples: 8785920 | consumed tokens: 17993564160 | elapsed time per iteration (s): 1.04 | learning rate: 1.703E-04 | global batch size: 256 | lm loss: 2.116045E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.630 | TFLOPs: 40.59 | 15: iteration 34330/ 125429 | consumed samples: 8788480 | consumed tokens: 17998807040 | elapsed time per iteration (s): 1.05 | learning rate: 1.703E-04 | global batch size: 256 | lm 
loss: 2.095353E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.960 | TFLOPs: 40.15 | 15: iteration 34340/ 125429 | consumed samples: 8791040 | consumed tokens: 18004049920 | elapsed time per iteration (s): 1.06 | learning rate: 1.703E-04 | global batch size: 256 | lm loss: 2.051809E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.541 | TFLOPs: 39.92 | 15: iteration 34350/ 125429 | consumed samples: 8793600 | consumed tokens: 18009292800 | elapsed time per iteration (s): 1.03 | learning rate: 1.703E-04 | global batch size: 256 | lm loss: 2.054266E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.257 | TFLOPs: 41.19 | 15: iteration 34360/ 125429 | consumed samples: 8796160 | consumed tokens: 18014535680 | elapsed time per iteration (s): 1.04 | learning rate: 1.702E-04 | global batch size: 256 | lm loss: 2.048421E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.133 | TFLOPs: 40.51 | 15: iteration 34370/ 125429 | consumed samples: 8798720 | consumed tokens: 18019778560 | elapsed time per iteration (s): 1.04 | learning rate: 1.702E-04 | global batch size: 256 | lm loss: 2.090077E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.124 | TFLOPs: 40.84 | 15: iteration 34380/ 125429 | consumed samples: 8801280 | consumed tokens: 18025021440 | elapsed time per iteration (s): 1.05 | learning rate: 1.702E-04 | global batch size: 256 | lm loss: 2.097971E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.936 | TFLOPs: 40.48 | 15: iteration 34390/ 125429 | consumed samples: 8803840 | consumed tokens: 
18030264320 | elapsed time per iteration (s): 1.03 | learning rate: 1.702E-04 | global batch size: 256 | lm loss: 2.051593E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.317 | TFLOPs: 41.20 | 15: iteration 34400/ 125429 | consumed samples: 8806400 | consumed tokens: 18035507200 | elapsed time per iteration (s): 1.05 | learning rate: 1.702E-04 | global batch size: 256 | lm loss: 2.098000E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.855 | TFLOPs: 40.30 | 15: iteration 34410/ 125429 | consumed samples: 8808960 | consumed tokens: 18040750080 | elapsed time per iteration (s): 1.08 | learning rate: 1.701E-04 | global batch size: 256 | lm loss: 2.091664E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.002 | TFLOPs: 39.17 | 15: iteration 34420/ 125429 | consumed samples: 8811520 | consumed tokens: 18045992960 | elapsed time per iteration (s): 1.03 | learning rate: 1.701E-04 | global batch size: 256 | lm loss: 2.080406E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.425 | TFLOPs: 41.22 | 15: iteration 34430/ 125429 | consumed samples: 8814080 | consumed tokens: 18051235840 | elapsed time per iteration (s): 1.06 | learning rate: 1.701E-04 | global batch size: 256 | lm loss: 2.093166E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.509 | TFLOPs: 39.75 | 15: iteration 34440/ 125429 | consumed samples: 8816640 | consumed tokens: 18056478720 | elapsed time per iteration (s): 1.03 | learning rate: 1.701E-04 | global batch size: 256 | lm loss: 2.080664E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
247.518 | TFLOPs: 40.90 | 15: iteration 34450/ 125429 | consumed samples: 8819200 | consumed tokens: 18061721600 | elapsed time per iteration (s): 1.03 | learning rate: 1.701E-04 | global batch size: 256 | lm loss: 2.084875E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.787 | TFLOPs: 41.11 | 15: iteration 34460/ 125429 | consumed samples: 8821760 | consumed tokens: 18066964480 | elapsed time per iteration (s): 1.02 | learning rate: 1.701E-04 | global batch size: 256 | lm loss: 2.049511E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.028 | TFLOPs: 41.32 | 15: iteration 34470/ 125429 | consumed samples: 8824320 | consumed tokens: 18072207360 | elapsed time per iteration (s): 1.05 | learning rate: 1.700E-04 | global batch size: 256 | lm loss: 2.048706E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.323 | TFLOPs: 40.21 | 15: iteration 34480/ 125429 | consumed samples: 8826880 | consumed tokens: 18077450240 | elapsed time per iteration (s): 1.04 | learning rate: 1.700E-04 | global batch size: 256 | lm loss: 2.049367E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.261 | TFLOPs: 40.53 | 15: iteration 34490/ 125429 | consumed samples: 8829440 | consumed tokens: 18082693120 | elapsed time per iteration (s): 1.06 | learning rate: 1.700E-04 | global batch size: 256 | lm loss: 2.070738E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.853 | TFLOPs: 39.97 | 15: iteration 34500/ 125429 | consumed samples: 8832000 | consumed tokens: 18087936000 | elapsed time per iteration (s): 1.05 | learning rate: 1.700E-04 | global batch size: 256 | lm loss: 2.034770E+00 | grad norm: 0.132 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.004 | TFLOPs: 40.16 | 15: iteration 34510/ 125429 | consumed samples: 8834560 | consumed tokens: 18093178880 | elapsed time per iteration (s): 1.04 | learning rate: 1.700E-04 | global batch size: 256 | lm loss: 2.046864E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.173 | TFLOPs: 40.68 | 15: iteration 34520/ 125429 | consumed samples: 8837120 | consumed tokens: 18098421760 | elapsed time per iteration (s): 1.03 | learning rate: 1.700E-04 | global batch size: 256 | lm loss: 2.059622E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.225 | TFLOPs: 41.02 | 15: iteration 34530/ 125429 | consumed samples: 8839680 | consumed tokens: 18103664640 | elapsed time per iteration (s): 1.05 | learning rate: 1.699E-04 | global batch size: 256 | lm loss: 2.078503E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.366 | TFLOPs: 40.38 | 15: iteration 34540/ 125429 | consumed samples: 8842240 | consumed tokens: 18108907520 | elapsed time per iteration (s): 1.10 | learning rate: 1.699E-04 | global batch size: 256 | lm loss: 2.059868E+00 | grad norm: 0.292 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.903 | TFLOPs: 38.49 | 15: iteration 34550/ 125429 | consumed samples: 8844800 | consumed tokens: 18114150400 | elapsed time per iteration (s): 1.05 | learning rate: 1.699E-04 | global batch size: 256 | lm loss: 2.829204E+00 | grad norm: 1.617 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.941 | TFLOPs: 40.15 | 15: iteration 34560/ 125429 | consumed samples: 8847360 | consumed tokens: 18119393280 | elapsed time per iteration (s): 
1.06 | learning rate: 1.699E-04 | global batch size: 256 | lm loss: 2.156553E+00 | grad norm: 0.252 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.030 | TFLOPs: 39.83 | 15: iteration 34570/ 125429 | consumed samples: 8849920 | consumed tokens: 18124636160 | elapsed time per iteration (s): 1.03 | learning rate: 1.699E-04 | global batch size: 256 | lm loss: 2.110189E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.519 | TFLOPs: 40.90 | 15: iteration 34580/ 125429 | consumed samples: 8852480 | consumed tokens: 18129879040 | elapsed time per iteration (s): 1.02 | learning rate: 1.699E-04 | global batch size: 256 | lm loss: 2.122898E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.785 | TFLOPs: 41.28 | 15: iteration 34590/ 125429 | consumed samples: 8855040 | consumed tokens: 18135121920 | elapsed time per iteration (s): 1.09 | learning rate: 1.698E-04 | global batch size: 256 | lm loss: 2.062619E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.628 | TFLOPs: 38.94 | 15: iteration 34600/ 125429 | consumed samples: 8857600 | consumed tokens: 18140364800 | elapsed time per iteration (s): 1.04 | learning rate: 1.698E-04 | global batch size: 256 | lm loss: 2.079736E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.994 | TFLOPs: 40.82 | 15: iteration 34610/ 125429 | consumed samples: 8860160 | consumed tokens: 18145607680 | elapsed time per iteration (s): 1.03 | learning rate: 1.698E-04 | global batch size: 256 | lm loss: 2.034684E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.517 | TFLOPs: 41.07 | 15: iteration 34620/ 
125429 | consumed samples: 8862720 | consumed tokens: 18150850560 | elapsed time per iteration (s): 1.03 | learning rate: 1.698E-04 | global batch size: 256 | lm loss: 2.066356E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.945 | TFLOPs: 40.97 | 15: iteration 34630/ 125429 | consumed samples: 8865280 | consumed tokens: 18156093440 | elapsed time per iteration (s): 1.04 | learning rate: 1.698E-04 | global batch size: 256 | lm loss: 2.072418E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.305 | TFLOPs: 40.70 | 15: iteration 34640/ 125429 | consumed samples: 8867840 | consumed tokens: 18161336320 | elapsed time per iteration (s): 1.05 | learning rate: 1.698E-04 | global batch size: 256 | lm loss: 2.085015E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.228 | TFLOPs: 40.20 | 15: iteration 34650/ 125429 | consumed samples: 8870400 | consumed tokens: 18166579200 | elapsed time per iteration (s): 1.02 | learning rate: 1.697E-04 | global batch size: 256 | lm loss: 2.085501E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.183 | TFLOPs: 41.34 | 15: iteration 34660/ 125429 | consumed samples: 8872960 | consumed tokens: 18171822080 | elapsed time per iteration (s): 1.03 | learning rate: 1.697E-04 | global batch size: 256 | lm loss: 2.074793E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.469 | TFLOPs: 40.90 | 15: iteration 34670/ 125429 | consumed samples: 8875520 | consumed tokens: 18177064960 | elapsed time per iteration (s): 1.05 | learning rate: 1.697E-04 | global batch size: 256 | lm loss: 2.083516E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 
0 | number of nan iterations: 0 | samples per second: 244.063 | TFLOPs: 40.33 | 15: iteration 34680/ 125429 | consumed samples: 8878080 | consumed tokens: 18182307840 | elapsed time per iteration (s): 1.14 | learning rate: 1.697E-04 | global batch size: 256 | lm loss: 2.077797E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.469 | TFLOPs: 37.26 | 15: iteration 34690/ 125429 | consumed samples: 8880640 | consumed tokens: 18187550720 | elapsed time per iteration (s): 1.05 | learning rate: 1.697E-04 | global batch size: 256 | lm loss: 2.078155E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.544 | TFLOPs: 40.41 | 15: iteration 34700/ 125429 | consumed samples: 8883200 | consumed tokens: 18192793600 | elapsed time per iteration (s): 1.06 | learning rate: 1.697E-04 | global batch size: 256 | lm loss: 2.064595E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.342 | TFLOPs: 40.05 | 15: iteration 34710/ 125429 | consumed samples: 8885760 | consumed tokens: 18198036480 | elapsed time per iteration (s): 1.04 | learning rate: 1.696E-04 | global batch size: 256 | lm loss: 2.077960E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.650 | TFLOPs: 40.60 | 15: iteration 34720/ 125429 | consumed samples: 8888320 | consumed tokens: 18203279360 | elapsed time per iteration (s): 1.04 | learning rate: 1.696E-04 | global batch size: 256 | lm loss: 2.083146E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.169 | TFLOPs: 40.52 | 15: iteration 34730/ 125429 | consumed samples: 8890880 | consumed tokens: 18208522240 | elapsed time per iteration (s): 1.04 | learning rate: 1.696E-04 | global batch 
size: 256 | lm loss: 2.090363E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.926 | TFLOPs: 40.81 | 15: iteration 34740/ 125429 | consumed samples: 8893440 | consumed tokens: 18213765120 | elapsed time per iteration (s): 1.03 | learning rate: 1.696E-04 | global batch size: 256 | lm loss: 2.081741E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.559 | TFLOPs: 40.91 | 15: iteration 34750/ 125429 | consumed samples: 8896000 | consumed tokens: 18219008000 | elapsed time per iteration (s): 1.06 | learning rate: 1.696E-04 | global batch size: 256 | lm loss: 2.077847E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.737 | TFLOPs: 39.95 | 15: iteration 34760/ 125429 | consumed samples: 8898560 | consumed tokens: 18224250880 | elapsed time per iteration (s): 1.10 | learning rate: 1.696E-04 | global batch size: 256 | lm loss: 2.068932E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.292 | TFLOPs: 38.55 | 15: iteration 34770/ 125429 | consumed samples: 8901120 | consumed tokens: 18229493760 | elapsed time per iteration (s): 1.04 | learning rate: 1.695E-04 | global batch size: 256 | lm loss: 2.056911E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.570 | TFLOPs: 40.75 | 15: iteration 34780/ 125429 | consumed samples: 8903680 | consumed tokens: 18234736640 | elapsed time per iteration (s): 1.03 | learning rate: 1.695E-04 | global batch size: 256 | lm loss: 2.041285E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.546 | TFLOPs: 41.24 | 15: iteration 34790/ 125429 | consumed samples: 8906240 | consumed 
tokens: 18239979520 | elapsed time per iteration (s): 1.03 | learning rate: 1.695E-04 | global batch size: 256 | lm loss: 2.061354E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.434 | TFLOPs: 41.06 | 15: iteration 34800/ 125429 | consumed samples: 8908800 | consumed tokens: 18245222400 | elapsed time per iteration (s): 1.04 | learning rate: 1.695E-04 | global batch size: 256 | lm loss: 2.068153E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.905 | TFLOPs: 40.80 | 15: iteration 34810/ 125429 | consumed samples: 8911360 | consumed tokens: 18250465280 | elapsed time per iteration (s): 1.05 | learning rate: 1.695E-04 | global batch size: 256 | lm loss: 2.078114E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.520 | TFLOPs: 40.41 | 15: iteration 34820/ 125429 | consumed samples: 8913920 | consumed tokens: 18255708160 | elapsed time per iteration (s): 1.03 | learning rate: 1.695E-04 | global batch size: 256 | lm loss: 2.048569E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.727 | TFLOPs: 40.94 | 15: iteration 34830/ 125429 | consumed samples: 8916480 | consumed tokens: 18260951040 | elapsed time per iteration (s): 1.05 | learning rate: 1.694E-04 | global batch size: 256 | lm loss: 2.053990E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.735 | TFLOPs: 40.44 | 15: iteration 34840/ 125429 | consumed samples: 8919040 | consumed tokens: 18266193920 | elapsed time per iteration (s): 1.02 | learning rate: 1.694E-04 | global batch size: 256 | lm loss: 2.076657E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 249.812 | TFLOPs: 41.28 | 15: iteration 34850/ 125429 | consumed samples: 8921600 | consumed tokens: 18271436800 | elapsed time per iteration (s): 1.04 | learning rate: 1.694E-04 | global batch size: 256 | lm loss: 2.087396E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.998 | TFLOPs: 40.65 | 15: iteration 34860/ 125429 | consumed samples: 8924160 | consumed tokens: 18276679680 | elapsed time per iteration (s): 1.08 | learning rate: 1.694E-04 | global batch size: 256 | lm loss: 2.070304E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.507 | TFLOPs: 39.25 | 15: iteration 34870/ 125429 | consumed samples: 8926720 | consumed tokens: 18281922560 | elapsed time per iteration (s): 1.05 | learning rate: 1.694E-04 | global batch size: 256 | lm loss: 2.112885E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.469 | TFLOPs: 40.24 | 15: iteration 34880/ 125429 | consumed samples: 8929280 | consumed tokens: 18287165440 | elapsed time per iteration (s): 1.03 | learning rate: 1.693E-04 | global batch size: 256 | lm loss: 2.049262E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.327 | TFLOPs: 41.20 | 15: iteration 34890/ 125429 | consumed samples: 8931840 | consumed tokens: 18292408320 | elapsed time per iteration (s): 1.06 | learning rate: 1.693E-04 | global batch size: 256 | lm loss: 2.089645E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.928 | TFLOPs: 39.98 | 15: iteration 34900/ 125429 | consumed samples: 8934400 | consumed tokens: 18297651200 | elapsed time per iteration (s): 1.03 | learning rate: 1.693E-04 | global batch size: 256 | lm loss: 2.085945E+00 | grad norm: 
0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.608 | TFLOPs: 41.08 | 15: iteration 34910/ 125429 | consumed samples: 8936960 | consumed tokens: 18302894080 | elapsed time per iteration (s): 1.03 | learning rate: 1.693E-04 | global batch size: 256 | lm loss: 2.083396E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.351 | TFLOPs: 40.88 | 15: iteration 34920/ 125429 | consumed samples: 8939520 | consumed tokens: 18308136960 | elapsed time per iteration (s): 1.05 | learning rate: 1.693E-04 | global batch size: 256 | lm loss: 2.057709E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.524 | TFLOPs: 40.24 | 15: iteration 34930/ 125429 | consumed samples: 8942080 | consumed tokens: 18313379840 | elapsed time per iteration (s): 1.06 | learning rate: 1.693E-04 | global batch size: 256 | lm loss: 2.078639E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.901 | TFLOPs: 39.81 | 15: iteration 34940/ 125429 | consumed samples: 8944640 | consumed tokens: 18318622720 | elapsed time per iteration (s): 1.05 | learning rate: 1.692E-04 | global batch size: 256 | lm loss: 2.062443E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.660 | TFLOPs: 40.27 | 15: iteration 34950/ 125429 | consumed samples: 8947200 | consumed tokens: 18323865600 | elapsed time per iteration (s): 1.03 | learning rate: 1.692E-04 | global batch size: 256 | lm loss: 2.069236E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.968 | TFLOPs: 40.98 | 15: iteration 34960/ 125429 | consumed samples: 8949760 | consumed tokens: 18329108480 | elapsed time per 
iteration (s): 1.22 | learning rate: 1.692E-04 | global batch size: 256 | lm loss: 2.065216E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 210.193 | TFLOPs: 34.74 | 15: iteration 34970/ 125429 | consumed samples: 8952320 | consumed tokens: 18334351360 | elapsed time per iteration (s): 1.05 | learning rate: 1.692E-04 | global batch size: 256 | lm loss: 2.090446E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.660 | TFLOPs: 40.10 | 15: iteration 34980/ 125429 | consumed samples: 8954880 | consumed tokens: 18339594240 | elapsed time per iteration (s): 1.05 | learning rate: 1.692E-04 | global batch size: 256 | lm loss: 2.084967E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.112 | TFLOPs: 40.18 | 15: iteration 34990/ 125429 | consumed samples: 8957440 | consumed tokens: 18344837120 | elapsed time per iteration (s): 1.07 | learning rate: 1.692E-04 | global batch size: 256 | lm loss: 2.106025E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.810 | TFLOPs: 39.63 | 15: iteration 35000/ 125429 | consumed samples: 8960000 | consumed tokens: 18350080000 | elapsed time per iteration (s): 1.10 | learning rate: 1.691E-04 | global batch size: 256 | lm loss: 2.097562E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.248 | TFLOPs: 38.55 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 35000 | lm loss value: 2.095829E+00 | lm loss PPL: 8.132180E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 35000 to checkpoints_1b5 0: 
[2022-11-26 06:15:24,050] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step35000 is begin to save!
0: [2022-11-26 06:15:24,060] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_01-model_00-model_states.pt...
0: [2022-11-26 06:15:24,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_01-model_00-model_states.pt.
0: [2022-11-26 06:15:24,326] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_03-model_00-model_states.pt...
0: [2022-11-26 06:15:24,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_03-model_00-model_states.pt.
0: [2022-11-26 06:15:24,431] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_04-model_00-model_states.pt...
0: [2022-11-26 06:15:24,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_04-model_00-model_states.pt.
0: [2022-11-26 06:15:24,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_05-model_00-model_states.pt...
0: [2022-11-26 06:15:24,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_05-model_00-model_states.pt.
0: [2022-11-26 06:15:24,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_06-model_00-model_states.pt...
0: [2022-11-26 06:15:24,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_06-model_00-model_states.pt.
0: [2022-11-26 06:15:24,762] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_07-model_00-model_states.pt...
0: [2022-11-26 06:15:24,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_07-model_00-model_states.pt.
0: [2022-11-26 06:15:24,870] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_08-model_00-model_states.pt...
0: [2022-11-26 06:15:24,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_08-model_00-model_states.pt.
0: [2022-11-26 06:15:24,978] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_09-model_00-model_states.pt...
0: [2022-11-26 06:15:25,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_09-model_00-model_states.pt.
0: [2022-11-26 06:15:25,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_10-model_00-model_states.pt...
0: [2022-11-26 06:15:25,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_10-model_00-model_states.pt.
0: [2022-11-26 06:15:25,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_11-model_00-model_states.pt...
0: [2022-11-26 06:15:25,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_11-model_00-model_states.pt.
0: [2022-11-26 06:15:25,304] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_12-model_00-model_states.pt...
0: [2022-11-26 06:15:25,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_12-model_00-model_states.pt.
0: [2022-11-26 06:15:25,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_13-model_00-model_states.pt...
0: [2022-11-26 06:15:25,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_13-model_00-model_states.pt.
0: [2022-11-26 06:15:25,521] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_14-model_00-model_states.pt... 0: [2022-11-26 06:15:25,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_14-model_00-model_states.pt. 0: [2022-11-26 06:15:25,630] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_15-model_00-model_states.pt... 0: [2022-11-26 06:15:25,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_15-model_00-model_states.pt. 0: [2022-11-26 06:15:25,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_16-model_00-model_states.pt... 0: [2022-11-26 06:15:25,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_16-model_00-model_states.pt. 0: [2022-11-26 06:15:25,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_17-model_00-model_states.pt... 0: [2022-11-26 06:15:25,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_17-model_00-model_states.pt. 0: [2022-11-26 06:15:25,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_18-model_00-model_states.pt... 0: [2022-11-26 06:15:26,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_18-model_00-model_states.pt. 0: [2022-11-26 06:15:26,062] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_19-model_00-model_states.pt... 0: [2022-11-26 06:15:26,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 06:15:26,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_20-model_00-model_states.pt... 0: [2022-11-26 06:15:26,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_20-model_00-model_states.pt. 0: [2022-11-26 06:15:26,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_21-model_00-model_states.pt... 0: [2022-11-26 06:15:26,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_21-model_00-model_states.pt. 0: [2022-11-26 06:15:26,380] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_22-model_00-model_states.pt... 0: [2022-11-26 06:15:26,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_22-model_00-model_states.pt. 0: [2022-11-26 06:15:26,489] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_23-model_00-model_states.pt... 0: [2022-11-26 06:15:26,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_23-model_00-model_states.pt. 0: [2022-11-26 06:15:26,603] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_24-model_00-model_states.pt... 0: [2022-11-26 06:15:26,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_24-model_00-model_states.pt. 0: [2022-11-26 06:15:26,717] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_25-model_00-model_states.pt... 0: [2022-11-26 06:15:26,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 06:15:26,827] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_26-model_00-model_states.pt... 0: [2022-11-26 06:15:26,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_26-model_00-model_states.pt. 0: [2022-11-26 06:15:26,939] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_27-model_00-model_states.pt... 0: [2022-11-26 06:15:27,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_27-model_00-model_states.pt. 0: [2022-11-26 06:15:27,051] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_28-model_00-model_states.pt... 0: [2022-11-26 06:15:27,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_28-model_00-model_states.pt. 0: [2022-11-26 06:15:27,164] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_29-model_00-model_states.pt... 0: [2022-11-26 06:15:27,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_29-model_00-model_states.pt. 0: [2022-11-26 06:15:27,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_30-model_00-model_states.pt... 0: [2022-11-26 06:15:27,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_30-model_00-model_states.pt. 0: [2022-11-26 06:15:27,387] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/layer_32-model_00-model_states.pt... 0: [2022-11-26 06:15:27,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 06:15:27,393] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step35000/mp_rank_00_model_states.pt 0: [2022-11-26 06:15:27,393] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/mp_rank_00_model_states.pt... 0: [2022-11-26 06:15:27,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/mp_rank_00_model_states.pt. 0: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
5: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
15: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
2: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
12: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
8: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
11: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:15:27,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step35000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:15:27,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 06:15:27,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 06:15:27,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 9: [2022-11-26 06:15:27,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:15:27,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 06:15:27,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 10: [2022-11-26 06:15:27,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:15:27,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 2: [2022-11-26 06:15:27,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:15:27,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-26 06:15:27,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 06:15:27,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 12: [2022-11-26 06:15:27,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 06:15:27,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 06:15:27,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 13: [2022-11-26 06:15:27,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:15:27,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 06:15:27,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-26 06:15:27,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:15:27,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:15:27,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:15:27,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 06:15:27,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-26 06:15:27,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 06:15:27,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
3: [2022-11-26 06:15:27,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:15:27,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 06:15:27,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 6: [2022-11-26 06:15:27,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:15:27,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 06:15:27,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-26 06:15:27,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:15:27,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:15:27,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 2: [2022-11-26 06:15:27,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 3: [2022-11-26 06:15:27,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-26 06:15:27,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
12: [2022-11-26 06:15:27,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:15:27,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 06:15:27,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 11: [2022-11-26 06:15:27,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:15:27,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 06:15:27,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 11: [2022-11-26 06:15:27,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:15:27,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 06:15:27,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-26 06:15:27,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:15:27,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 06:15:27,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
11: [2022-11-26 06:15:27,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:15:27,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:15:27,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 06:15:27,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 12: [2022-11-26 06:15:27,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:15:27,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 06:15:27,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 13: [2022-11-26 06:15:27,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:15:27,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:15:27,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 3: [2022-11-26 06:15:27,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 13: [2022-11-26 06:15:27,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
3: [2022-11-26 06:15:27,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 9: [2022-11-26 06:15:27,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:15:27,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 06:15:27,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 9: [2022-11-26 06:15:27,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:15:27,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 06:15:27,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 6: [2022-11-26 06:15:27,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:15:27,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 06:15:27,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-26 06:15:27,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 06:15:27,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 06:15:27,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-26 06:15:27,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:15:27,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 1: [2022-11-26 06:15:27,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:15:27,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:15:27,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:15:27,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:15:27,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-26 06:15:27,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 06:15:27,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 06:15:27,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 06:15:27,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 06:15:27,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 06:15:27,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 06:15:27,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-26 06:15:27,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-26 06:15:27,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-26 06:15:27,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-26 06:15:27,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 6: [2022-11-26 06:15:27,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 06:15:27,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 06:15:27,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-26 06:15:27,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:15:27,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:15:27,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 06:15:27,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 06:15:27,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-26 06:15:27,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 9: [2022-11-26 06:15:27,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:15:27,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:15:27,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
9: [2022-11-26 06:15:27,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 06:15:27,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-26 06:15:27,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:15:27,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:15:27,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 06:15:27,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 6: [2022-11-26 06:15:27,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:15:27,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:15:27,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 06:15:27,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 06:15:27,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 6: [2022-11-26 06:15:27,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
13: [2022-11-26 06:15:27,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:15:27,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 06:15:27,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 10: [2022-11-26 06:15:27,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 06:15:27,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 06:15:27,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 10: [2022-11-26 06:15:27,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 13: [2022-11-26 06:15:27,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:15:27,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:15:27,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 06:15:27,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 06:15:27,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
13: [2022-11-26 06:15:27,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 10: [2022-11-26 06:15:27,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:15:27,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 06:15:27,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-26 06:15:27,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:15:27,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 06:15:27,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 11: [2022-11-26 06:15:27,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 06:15:27,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 12: [2022-11-26 06:15:27,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:15:27,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 06:15:27,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 06:15:27,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 12: [2022-11-26 06:15:27,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 06:15:27,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 10: [2022-11-26 06:15:27,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:15:27,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 06:15:27,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 11: [2022-11-26 06:15:27,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:15:27,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 06:15:27,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 11: [2022-11-26 06:15:27,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 06:15:27,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 06:15:27,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-26 06:15:27,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:15:27,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:15:27,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:15:27,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:15:27,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 06:15:27,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 06:15:27,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 06:15:27,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-26 06:15:27,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-26 06:15:27,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
7: [2022-11-26 06:15:27,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 06:15:27,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 10: [2022-11-26 06:15:27,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:15:27,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:15:27,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 06:15:27,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 06:15:27,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 10: [2022-11-26 06:15:27,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-26 06:15:27,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:15:27,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 06:15:27,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 06:15:27,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 06:15:27,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-26 06:15:27,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-26 06:15:27,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:15:27,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 06:15:27,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 12: [2022-11-26 06:15:27,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:15:27,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 06:15:27,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-26 06:15:27,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 06:15:27,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 06:15:27,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-26 06:15:27,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:15:27,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:15:27,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 06:15:27,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 7: [2022-11-26 06:15:27,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 06:15:27,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-26 06:15:27,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:15:27,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 06:15:27,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 06:15:27,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 3: [2022-11-26 06:15:27,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:15:27,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 2: [2022-11-26 06:15:27,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 3: [2022-11-26 06:15:27,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 06:15:27,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 11: [2022-11-26 06:15:27,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:15:27,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 06:15:27,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 11: [2022-11-26 06:15:27,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 06:15:27,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 06:15:27,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 9: [2022-11-26 06:15:27,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:15:27,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:15:27,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:15:27,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:15:27,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 06:15:27,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 06:15:27,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 06:15:27,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 06:15:27,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
12: [2022-11-26 06:15:27,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:15:27,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 9: [2022-11-26 06:15:27,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 9: [2022-11-26 06:15:27,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 12: [2022-11-26 06:15:27,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 06:15:27,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 12: [2022-11-26 06:15:27,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:15:27,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 06:15:27,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-26 06:15:27,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:15:27,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 06:15:27,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-26 06:15:27,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 06:15:27,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 1: [2022-11-26 06:15:27,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 06:15:27,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-26 06:15:27,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:15:27,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 06:15:27,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-26 06:15:27,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 1: [2022-11-26 06:15:27,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:15:27,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 11: [2022-11-26 06:15:27,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:15:27,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
11: [2022-11-26 06:15:27,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 06:15:27,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 0: [2022-11-26 06:15:27,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:15:27,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 06:15:27,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 13: [2022-11-26 06:15:27,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:15:27,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:15:27,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 06:15:27,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 6: [2022-11-26 06:15:27,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:15:27,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 06:15:27,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
6: [2022-11-26 06:15:27,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:15:27,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 06:15:27,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 6: [2022-11-26 06:15:27,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:15:27,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 06:15:27,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 13: [2022-11-26 06:15:27,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 06:15:27,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 13: [2022-11-26 06:15:27,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:15:27,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 06:15:27,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-26 06:15:27,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 06:15:27,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:15:27,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:15:27,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:15:27,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:15:27,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 06:15:27,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 06:15:27,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 06:15:27,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:15:27,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 06:15:27,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 06:15:27,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
4: [2022-11-26 06:15:27,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-26 06:15:27,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 06:15:27,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-26 06:15:27,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-26 06:15:27,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-26 06:15:27,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:15:27,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-26 06:15:27,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 06:15:27,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 4: [2022-11-26 06:15:27,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:15:27,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 06:15:27,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
0: [2022-11-26 06:15:27,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 06:15:27,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 8: [2022-11-26 06:15:27,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:15:27,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:15:27,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:15:27,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:15:27,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:15:27,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:15:27,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:15:27,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 06:15:27,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 06:15:27,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 06:15:27,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 06:15:27,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 06:15:27,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 06:15:27,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 8: [2022-11-26 06:15:27,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 8: [2022-11-26 06:15:27,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 06:15:27,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 8: [2022-11-26 06:15:27,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 06:15:27,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
8: [2022-11-26 06:15:27,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 06:15:27,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 8: [2022-11-26 06:15:27,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 8: [2022-11-26 06:15:27,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 8: [2022-11-26 06:15:27,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 14: [2022-11-26 06:15:27,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:15:27,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:15:27,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:15:27,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 06:15:27,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 06:15:27,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 06:15:27,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 06:15:27,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 14: [2022-11-26 06:15:27,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:15:27,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 14: [2022-11-26 06:15:27,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:15:27,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 14: [2022-11-26 06:15:27,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 06:15:27,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 06:15:27,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 06:15:27,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 14: [2022-11-26 06:15:27,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 14: [2022-11-26 06:15:27,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
14: [2022-11-26 06:15:27,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:15:27,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 06:15:27,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:15:27,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 14: [2022-11-26 06:15:27,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 06:15:27,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 15: [2022-11-26 06:15:27,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:15:27,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 06:15:27,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:15:27,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:15:27,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:15:27,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
15: [2022-11-26 06:15:27,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 06:15:27,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 06:15:27,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 06:15:27,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:15:27,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 15: [2022-11-26 06:15:27,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:15:27,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 15: [2022-11-26 06:15:27,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 15: [2022-11-26 06:15:27,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 06:15:27,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:15:27,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 06:15:27,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
15: [2022-11-26 06:15:27,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 15: [2022-11-26 06:15:27,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 06:15:27,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 15: [2022-11-26 06:15:27,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:15:27,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 06:15:27,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 5: [2022-11-26 06:15:27,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:15:27,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:15:27,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:15:27,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:15:27,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:15:27,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 06:15:27,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:15:27,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 06:15:27,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 06:15:27,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 06:15:27,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:15:27,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 06:15:27,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 06:15:27,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 06:15:27,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 5: [2022-11-26 06:15:27,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 5: [2022-11-26 06:15:27,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now! 
5: [2022-11-26 06:15:27,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt
5: [2022-11-26 06:15:27,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now!
5: [2022-11-26 06:15:27,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now!
5: [2022-11-26 06:15:27,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step35000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt
5: [2022-11-26 06:15:27,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now!
5: [2022-11-26 06:15:27,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now!
5: [2022-11-26 06:15:27,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step35000 is ready now!
0: successfully saved checkpoint at iteration 35000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3935.00
15: iteration 35010/ 125429 | consumed samples: 8962560 | consumed tokens: 18355322880 | elapsed time per iteration (s): 1.48 | learning rate: 1.691E-04 | global batch size: 256 | lm loss: 2.083826E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 172.393 | TFLOPs: 28.49 |
15: iteration 35020/ 125429 | consumed samples: 8965120 | consumed tokens: 18360565760 | elapsed time per iteration (s): 1.09 | learning rate: 1.691E-04 | global batch size: 256 | lm loss: 2.092482E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.028 | TFLOPs: 38.84 |
15: iteration 35030/ 125429 | consumed samples: 8967680 | consumed tokens: 18365808640 | elapsed time per iteration (s): 1.07 | learning rate: 1.691E-04 | global batch size: 256 | lm loss: 2.091177E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.126 | TFLOPs: 39.52 |
15: iteration 35040/ 125429 | consumed samples: 8970240 | consumed tokens: 18371051520 | elapsed time per iteration (s): 1.04 | learning rate: 1.691E-04 | global batch size: 256 | lm loss: 2.076794E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.127 | TFLOPs: 40.67 |
15: iteration 35050/ 125429 | consumed samples: 8972800 | consumed tokens: 18376294400 | elapsed time per iteration (s): 1.05 | learning rate: 1.691E-04 | global batch size: 256 | lm loss: 2.099749E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.675 | TFLOPs: 40.27 |
15: iteration 35060/ 125429 | consumed samples: 8975360 | consumed tokens: 18381537280 | elapsed time per iteration (s): 1.05 | learning rate: 1.690E-04 | global batch size: 256 | lm loss: 2.069448E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.045 | TFLOPs: 40.17 |
15: iteration 35070/ 125429 | consumed samples: 8977920 | consumed tokens: 18386780160 | elapsed time per iteration (s): 1.12 | learning rate: 1.690E-04 | global batch size: 256 | lm loss: 2.087416E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.393 | TFLOPs: 37.91 |
15: iteration 35080/ 125429 | consumed samples: 8980480 | consumed tokens: 18392023040 | elapsed time per iteration (s): 1.06 | learning rate: 1.690E-04 | global batch size: 256 | lm loss: 2.048064E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.122 | TFLOPs: 40.01 |
15: iteration 35090/ 125429 | consumed samples: 8983040 | consumed tokens: 18397265920 | elapsed time per iteration (s): 1.15 | learning rate: 1.690E-04 | global batch size: 256 | lm loss: 2.065522E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.950 | TFLOPs: 36.84 |
15: iteration 35100/ 125429 | consumed samples: 8985600 | consumed tokens: 18402508800 | elapsed time per iteration (s): 1.05 | learning rate: 1.690E-04 | global batch size: 256 | lm loss: 2.046373E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.310 | TFLOPs: 40.37 |
15: iteration 35110/ 125429 | consumed samples: 8988160 | consumed tokens: 18407751680 | elapsed time per iteration (s): 1.04 | learning rate: 1.690E-04 | global batch size: 256 | lm loss: 2.054258E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.075 | TFLOPs: 40.50 |
15: iteration 35120/ 125429 | consumed samples: 8990720 | consumed tokens: 18412994560 | elapsed time per iteration (s): 1.06 | learning rate: 1.689E-04 | global batch size: 256 | lm loss: 2.066534E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.891 | TFLOPs: 39.97 |
15: iteration 35130/ 125429 | consumed samples: 8993280 | consumed tokens: 18418237440 | elapsed time per iteration (s): 1.03 | learning rate: 1.689E-04 | global batch size: 256 | lm loss: 2.079153E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.526 | TFLOPs: 41.07 |
15: iteration 35140/ 125429 | consumed samples: 8995840 | consumed tokens: 18423480320 | elapsed time per iteration (s): 1.05 | learning rate: 1.689E-04 | global batch size: 256 | lm loss: 2.077474E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.487 | TFLOPs: 40.40 |
15: iteration 35150/ 125429 | consumed samples: 8998400 | consumed tokens: 18428723200 | elapsed time per iteration (s): 1.04 | learning rate: 1.689E-04 | global batch size: 256 | lm loss: 2.075111E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.025 | TFLOPs: 40.49 |
15: iteration 35160/ 125429 | consumed samples: 9000960 | consumed tokens: 18433966080 | elapsed time per iteration (s): 1.07 | learning rate: 1.689E-04 | global batch size: 256 | lm loss: 2.061710E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.945 | TFLOPs: 39.65 |
15: iteration 35170/ 125429 | consumed samples: 9003520 | consumed tokens: 18439208960 | elapsed time per iteration (s): 1.07 | learning rate: 1.689E-04 | global batch size: 256 | lm loss: 2.050406E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.211 | TFLOPs: 39.70 |
15: iteration 35180/ 125429 | consumed samples: 9006080 | consumed tokens: 18444451840 | elapsed time per iteration (s): 1.08 | learning rate: 1.688E-04 | global batch size: 256 | lm loss: 2.063853E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.792 | TFLOPs: 39.13 |
15: iteration 35190/ 125429 | consumed samples: 9008640 | consumed tokens: 18449694720 | elapsed time per iteration (s): 1.05 | learning rate: 1.688E-04 | global batch size: 256 | lm loss: 2.080303E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.832 | TFLOPs: 40.30 |
15: iteration 35200/ 125429 | consumed samples: 9011200 | consumed tokens: 18454937600 | elapsed time per iteration (s): 1.03 | learning rate: 1.688E-04 | global batch size: 256 | lm loss: 2.083022E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.411 | TFLOPs: 40.89 |
15: iteration 35210/ 125429 | consumed samples: 9013760 | consumed tokens: 18460180480 | elapsed time per iteration (s): 1.07 | learning rate: 1.688E-04 | global batch size: 256 | lm loss: 2.071629E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.420 | TFLOPs: 39.57 |
15: iteration 35220/ 125429 | consumed samples: 9016320 | consumed tokens: 18465423360 | elapsed time per iteration (s): 1.04 | learning rate: 1.688E-04 | global batch size: 256 | lm loss: 2.075595E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.015 | TFLOPs: 40.82 |
15: iteration 35230/ 125429 | consumed samples: 9018880 | consumed tokens: 18470666240 | elapsed time per iteration (s): 1.03 | learning rate: 1.687E-04 | global batch size: 256 | lm loss: 2.073841E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.849 | TFLOPs: 41.12 |
15: iteration 35240/ 125429 | consumed samples: 9021440 | consumed tokens: 18475909120 | elapsed time per iteration (s): 1.03 | learning rate: 1.687E-04 | global batch size: 256 | lm loss: 2.057898E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.258 | TFLOPs: 41.19 |
15: iteration 35250/ 125429 | consumed samples: 9024000 | consumed tokens: 18481152000 | elapsed time per iteration (s): 1.04 | learning rate: 1.687E-04 | global batch size: 256 | lm loss: 2.070433E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.240 | TFLOPs: 40.69 |
15: iteration 35260/ 125429 | consumed samples: 9026560 | consumed tokens: 18486394880 | elapsed time per iteration (s): 1.03 | learning rate: 1.687E-04 | global batch size: 256 | lm loss: 2.051389E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.591 | TFLOPs: 40.92 |
15: iteration 35270/ 125429 | consumed samples: 9029120 | consumed tokens: 18491637760 | elapsed time per iteration (s): 1.05 | learning rate: 1.687E-04 | global batch size: 256 | lm loss: 2.061360E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.919 | TFLOPs: 40.47 |
15: iteration 35280/ 125429 | consumed samples: 9031680 | consumed tokens: 18496880640 | elapsed time per iteration (s): 3.11 | learning rate: 1.687E-04 | global batch size: 256 | lm loss: 2.085324E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 82.306 | TFLOPs: 13.60 |
15: iteration 35290/ 125429 | consumed samples: 9034240 | consumed tokens: 18502123520 | elapsed time per iteration (s): 1.08 | learning rate: 1.686E-04 | global batch size: 256 | lm loss: 2.047333E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.489 | TFLOPs: 39.25 |
15: iteration 35300/ 125429 | consumed samples: 9036800 | consumed tokens: 18507366400 | elapsed time per iteration (s): 1.05 | learning rate: 1.686E-04 | global batch size: 256 | lm loss: 2.080841E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.164 | TFLOPs: 40.18 |
15: iteration 35310/ 125429 | consumed samples: 9039360 | consumed tokens: 18512609280 | elapsed time per iteration (s): 1.10 | learning rate: 1.686E-04 | global batch size: 256 | lm loss: 2.058190E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.242 | TFLOPs: 38.38 |
15: iteration 35320/ 125429 
| consumed samples: 9041920 | consumed tokens: 18517852160 | elapsed time per iteration (s): 1.13 | learning rate: 1.686E-04 | global batch size: 256 | lm loss: 2.091762E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.254 | TFLOPs: 37.56 | 15: iteration 35330/ 125429 | consumed samples: 9044480 | consumed tokens: 18523095040 | elapsed time per iteration (s): 1.05 | learning rate: 1.686E-04 | global batch size: 256 | lm loss: 2.043520E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.802 | TFLOPs: 40.46 | 15: iteration 35340/ 125429 | consumed samples: 9047040 | consumed tokens: 18528337920 | elapsed time per iteration (s): 1.05 | learning rate: 1.686E-04 | global batch size: 256 | lm loss: 2.083843E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.232 | TFLOPs: 40.20 | 15: iteration 35350/ 125429 | consumed samples: 9049600 | consumed tokens: 18533580800 | elapsed time per iteration (s): 1.07 | learning rate: 1.685E-04 | global batch size: 256 | lm loss: 2.065227E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.267 | TFLOPs: 39.54 | 15: iteration 35360/ 125429 | consumed samples: 9052160 | consumed tokens: 18538823680 | elapsed time per iteration (s): 1.04 | learning rate: 1.685E-04 | global batch size: 256 | lm loss: 2.054352E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.452 | TFLOPs: 40.56 | 15: iteration 35370/ 125429 | consumed samples: 9054720 | consumed tokens: 18544066560 | elapsed time per iteration (s): 1.05 | learning rate: 1.685E-04 | global batch size: 256 | lm loss: 2.053366E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 243.265 | TFLOPs: 40.20 | 15: iteration 35380/ 125429 | consumed samples: 9057280 | consumed tokens: 18549309440 | elapsed time per iteration (s): 1.07 | learning rate: 1.685E-04 | global batch size: 256 | lm loss: 2.076142E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.988 | TFLOPs: 39.66 | 15: iteration 35390/ 125429 | consumed samples: 9059840 | consumed tokens: 18554552320 | elapsed time per iteration (s): 1.03 | learning rate: 1.685E-04 | global batch size: 256 | lm loss: 2.086181E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.805 | TFLOPs: 40.95 | 15: iteration 35400/ 125429 | consumed samples: 9062400 | consumed tokens: 18559795200 | elapsed time per iteration (s): 1.09 | learning rate: 1.685E-04 | global batch size: 256 | lm loss: 2.086026E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.700 | TFLOPs: 38.95 | 15: iteration 35410/ 125429 | consumed samples: 9064960 | consumed tokens: 18565038080 | elapsed time per iteration (s): 1.07 | learning rate: 1.684E-04 | global batch size: 256 | lm loss: 2.049004E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.285 | TFLOPs: 39.54 | 15: iteration 35420/ 125429 | consumed samples: 9067520 | consumed tokens: 18570280960 | elapsed time per iteration (s): 1.08 | learning rate: 1.684E-04 | global batch size: 256 | lm loss: 2.078126E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.771 | TFLOPs: 39.13 | 15: iteration 35430/ 125429 | consumed samples: 9070080 | consumed tokens: 18575523840 | elapsed time per iteration (s): 1.05 | learning rate: 1.684E-04 | global batch size: 
256 | lm loss: 2.062277E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.509 | TFLOPs: 40.41 | 15: iteration 35440/ 125429 | consumed samples: 9072640 | consumed tokens: 18580766720 | elapsed time per iteration (s): 1.07 | learning rate: 1.684E-04 | global batch size: 256 | lm loss: 2.034788E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.048 | TFLOPs: 39.50 | 15: iteration 35450/ 125429 | consumed samples: 9075200 | consumed tokens: 18586009600 | elapsed time per iteration (s): 1.09 | learning rate: 1.684E-04 | global batch size: 256 | lm loss: 2.039068E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.520 | TFLOPs: 38.92 | 15: iteration 35460/ 125429 | consumed samples: 9077760 | consumed tokens: 18591252480 | elapsed time per iteration (s): 1.03 | learning rate: 1.684E-04 | global batch size: 256 | lm loss: 2.053924E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.656 | TFLOPs: 40.93 | 15: iteration 35470/ 125429 | consumed samples: 9080320 | consumed tokens: 18596495360 | elapsed time per iteration (s): 1.05 | learning rate: 1.683E-04 | global batch size: 256 | lm loss: 2.071700E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.065 | TFLOPs: 40.17 | 15: iteration 35480/ 125429 | consumed samples: 9082880 | consumed tokens: 18601738240 | elapsed time per iteration (s): 1.05 | learning rate: 1.683E-04 | global batch size: 256 | lm loss: 2.083086E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.598 | TFLOPs: 40.42 | 15: iteration 35490/ 125429 | consumed samples: 9085440 | consumed 
tokens: 18606981120 | elapsed time per iteration (s): 1.07 | learning rate: 1.683E-04 | global batch size: 256 | lm loss: 2.093506E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.318 | TFLOPs: 39.38 | 15: iteration 35500/ 125429 | consumed samples: 9088000 | consumed tokens: 18612224000 | elapsed time per iteration (s): 1.06 | learning rate: 1.683E-04 | global batch size: 256 | lm loss: 2.068077E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.791 | TFLOPs: 39.96 | 15: iteration 35510/ 125429 | consumed samples: 9090560 | consumed tokens: 18617466880 | elapsed time per iteration (s): 1.03 | learning rate: 1.683E-04 | global batch size: 256 | lm loss: 2.068262E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.171 | TFLOPs: 41.18 | 15: iteration 35520/ 125429 | consumed samples: 9093120 | consumed tokens: 18622709760 | elapsed time per iteration (s): 1.04 | learning rate: 1.682E-04 | global batch size: 256 | lm loss: 2.067147E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.911 | TFLOPs: 40.80 | 15: iteration 35530/ 125429 | consumed samples: 9095680 | consumed tokens: 18627952640 | elapsed time per iteration (s): 1.07 | learning rate: 1.682E-04 | global batch size: 256 | lm loss: 2.061748E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.248 | TFLOPs: 39.54 | 15: iteration 35540/ 125429 | consumed samples: 9098240 | consumed tokens: 18633195520 | elapsed time per iteration (s): 1.05 | learning rate: 1.682E-04 | global batch size: 256 | lm loss: 2.053649E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 243.469 | TFLOPs: 40.24 | 15: iteration 35550/ 125429 | consumed samples: 9100800 | consumed tokens: 18638438400 | elapsed time per iteration (s): 1.03 | learning rate: 1.682E-04 | global batch size: 256 | lm loss: 2.082878E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.516 | TFLOPs: 41.07 | 15: iteration 35560/ 125429 | consumed samples: 9103360 | consumed tokens: 18643681280 | elapsed time per iteration (s): 1.05 | learning rate: 1.682E-04 | global batch size: 256 | lm loss: 2.080284E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.802 | TFLOPs: 40.12 | 15: iteration 35570/ 125429 | consumed samples: 9105920 | consumed tokens: 18648924160 | elapsed time per iteration (s): 1.06 | learning rate: 1.682E-04 | global batch size: 256 | lm loss: 2.093394E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.411 | TFLOPs: 39.90 | 15: iteration 35580/ 125429 | consumed samples: 9108480 | consumed tokens: 18654167040 | elapsed time per iteration (s): 1.06 | learning rate: 1.681E-04 | global batch size: 256 | lm loss: 2.058669E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.878 | TFLOPs: 39.81 | 15: iteration 35590/ 125429 | consumed samples: 9111040 | consumed tokens: 18659409920 | elapsed time per iteration (s): 1.05 | learning rate: 1.681E-04 | global batch size: 256 | lm loss: 2.022635E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.971 | TFLOPs: 40.48 | 15: iteration 35600/ 125429 | consumed samples: 9113600 | consumed tokens: 18664652800 | elapsed time per iteration (s): 1.04 | learning rate: 1.681E-04 | global batch size: 256 | lm loss: 2.077300E+00 | grad norm: 
0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.592 | TFLOPs: 40.75 | 15: iteration 35610/ 125429 | consumed samples: 9116160 | consumed tokens: 18669895680 | elapsed time per iteration (s): 1.38 | learning rate: 1.681E-04 | global batch size: 256 | lm loss: 2.072733E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 185.599 | TFLOPs: 30.67 | 15: iteration 35620/ 125429 | consumed samples: 9118720 | consumed tokens: 18675138560 | elapsed time per iteration (s): 1.05 | learning rate: 1.681E-04 | global batch size: 256 | lm loss: 2.059888E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.018 | TFLOPs: 40.16 | 15: iteration 35630/ 125429 | consumed samples: 9121280 | consumed tokens: 18680381440 | elapsed time per iteration (s): 1.07 | learning rate: 1.681E-04 | global batch size: 256 | lm loss: 2.074339E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.186 | TFLOPs: 39.69 | 15: iteration 35640/ 125429 | consumed samples: 9123840 | consumed tokens: 18685624320 | elapsed time per iteration (s): 1.04 | learning rate: 1.680E-04 | global batch size: 256 | lm loss: 2.060730E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.042 | TFLOPs: 40.66 | 15: iteration 35650/ 125429 | consumed samples: 9126400 | consumed tokens: 18690867200 | elapsed time per iteration (s): 1.04 | learning rate: 1.680E-04 | global batch size: 256 | lm loss: 2.056737E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.971 | TFLOPs: 40.81 | 15: iteration 35660/ 125429 | consumed samples: 9128960 | consumed tokens: 18696110080 | elapsed time per 
iteration (s): 1.07 | learning rate: 1.680E-04 | global batch size: 256 | lm loss: 2.081518E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.301 | TFLOPs: 39.38 | 15: iteration 35670/ 125429 | consumed samples: 9131520 | consumed tokens: 18701352960 | elapsed time per iteration (s): 1.05 | learning rate: 1.680E-04 | global batch size: 256 | lm loss: 2.065346E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.008 | TFLOPs: 40.16 | 15: iteration 35680/ 125429 | consumed samples: 9134080 | consumed tokens: 18706595840 | elapsed time per iteration (s): 1.10 | learning rate: 1.680E-04 | global batch size: 256 | lm loss: 2.078872E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.375 | TFLOPs: 38.57 | 15: iteration 35690/ 125429 | consumed samples: 9136640 | consumed tokens: 18711838720 | elapsed time per iteration (s): 1.06 | learning rate: 1.680E-04 | global batch size: 256 | lm loss: 2.038600E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.544 | TFLOPs: 40.08 | 15: iteration 35700/ 125429 | consumed samples: 9139200 | consumed tokens: 18717081600 | elapsed time per iteration (s): 1.08 | learning rate: 1.679E-04 | global batch size: 256 | lm loss: 2.051292E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.998 | TFLOPs: 39.33 | 15: iteration 35710/ 125429 | consumed samples: 9141760 | consumed tokens: 18722324480 | elapsed time per iteration (s): 1.04 | learning rate: 1.679E-04 | global batch size: 256 | lm loss: 2.079351E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.067 | TFLOPs: 40.66 | 15: 
iteration 35720/ 125429 | consumed samples: 9144320 | consumed tokens: 18727567360 | elapsed time per iteration (s): 1.03 | learning rate: 1.679E-04 | global batch size: 256 | lm loss: 2.035273E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.951 | TFLOPs: 41.14 | 15: iteration 35730/ 125429 | consumed samples: 9146880 | consumed tokens: 18732810240 | elapsed time per iteration (s): 1.06 | learning rate: 1.679E-04 | global batch size: 256 | lm loss: 2.088028E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.458 | TFLOPs: 40.07 | 15: iteration 35740/ 125429 | consumed samples: 9149440 | consumed tokens: 18738053120 | elapsed time per iteration (s): 1.06 | learning rate: 1.679E-04 | global batch size: 256 | lm loss: 2.076829E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.279 | TFLOPs: 40.04 | 15: iteration 35750/ 125429 | consumed samples: 9152000 | consumed tokens: 18743296000 | elapsed time per iteration (s): 1.03 | learning rate: 1.678E-04 | global batch size: 256 | lm loss: 2.074790E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.491 | TFLOPs: 41.07 | 15: iteration 35760/ 125429 | consumed samples: 9154560 | consumed tokens: 18748538880 | elapsed time per iteration (s): 1.04 | learning rate: 1.678E-04 | global batch size: 256 | lm loss: 2.056629E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.768 | TFLOPs: 40.61 | 15: iteration 35770/ 125429 | consumed samples: 9157120 | consumed tokens: 18753781760 | elapsed time per iteration (s): 1.04 | learning rate: 1.678E-04 | global batch size: 256 | lm loss: 2.050959E+00 | grad norm: 0.134 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.080 | TFLOPs: 40.67 | 15: iteration 35780/ 125429 | consumed samples: 9159680 | consumed tokens: 18759024640 | elapsed time per iteration (s): 1.05 | learning rate: 1.678E-04 | global batch size: 256 | lm loss: 2.054850E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.788 | TFLOPs: 40.45 | 15: iteration 35790/ 125429 | consumed samples: 9162240 | consumed tokens: 18764267520 | elapsed time per iteration (s): 1.04 | learning rate: 1.678E-04 | global batch size: 256 | lm loss: 2.048365E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.502 | TFLOPs: 40.74 | 15: iteration 35800/ 125429 | consumed samples: 9164800 | consumed tokens: 18769510400 | elapsed time per iteration (s): 1.04 | learning rate: 1.678E-04 | global batch size: 256 | lm loss: 2.066602E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.041 | TFLOPs: 40.66 | 15: iteration 35810/ 125429 | consumed samples: 9167360 | consumed tokens: 18774753280 | elapsed time per iteration (s): 1.05 | learning rate: 1.677E-04 | global batch size: 256 | lm loss: 2.072241E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.933 | TFLOPs: 40.48 | 15: iteration 35820/ 125429 | consumed samples: 9169920 | consumed tokens: 18779996160 | elapsed time per iteration (s): 1.10 | learning rate: 1.677E-04 | global batch size: 256 | lm loss: 2.058355E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.704 | TFLOPs: 38.62 | 15: iteration 35830/ 125429 | consumed samples: 9172480 | consumed tokens: 18785239040 | elapsed time per iteration (s): 1.10 | learning rate: 
1.677E-04 | global batch size: 256 | lm loss: 2.037204E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.177 | TFLOPs: 38.37 | 15: iteration 35840/ 125429 | consumed samples: 9175040 | consumed tokens: 18790481920 | elapsed time per iteration (s): 1.08 | learning rate: 1.677E-04 | global batch size: 256 | lm loss: 2.039635E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.413 | TFLOPs: 39.07 | 15: iteration 35850/ 125429 | consumed samples: 9177600 | consumed tokens: 18795724800 | elapsed time per iteration (s): 1.04 | learning rate: 1.677E-04 | global batch size: 256 | lm loss: 2.080453E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.143 | TFLOPs: 40.51 | 15: iteration 35860/ 125429 | consumed samples: 9180160 | consumed tokens: 18800967680 | elapsed time per iteration (s): 1.07 | learning rate: 1.677E-04 | global batch size: 256 | lm loss: 2.060844E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.015 | TFLOPs: 39.66 | 15: iteration 35870/ 125429 | consumed samples: 9182720 | consumed tokens: 18806210560 | elapsed time per iteration (s): 1.09 | learning rate: 1.676E-04 | global batch size: 256 | lm loss: 2.096259E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.052 | TFLOPs: 38.84 | 15: iteration 35880/ 125429 | consumed samples: 9185280 | consumed tokens: 18811453440 | elapsed time per iteration (s): 1.04 | learning rate: 1.676E-04 | global batch size: 256 | lm loss: 2.058692E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.978 | TFLOPs: 40.82 | 15: iteration 35890/ 125429 | consumed 
samples: 9187840 | consumed tokens: 18816696320 | elapsed time per iteration (s): 1.04 | learning rate: 1.676E-04 | global batch size: 256 | lm loss: 2.060601E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.741 | TFLOPs: 40.78 | 15: iteration 35900/ 125429 | consumed samples: 9190400 | consumed tokens: 18821939200 | elapsed time per iteration (s): 1.06 | learning rate: 1.676E-04 | global batch size: 256 | lm loss: 2.067165E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.983 | TFLOPs: 39.82 | 15: iteration 35910/ 125429 | consumed samples: 9192960 | consumed tokens: 18827182080 | elapsed time per iteration (s): 1.05 | learning rate: 1.676E-04 | global batch size: 256 | lm loss: 2.056902E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.639 | TFLOPs: 40.43 | 15: iteration 35920/ 125429 | consumed samples: 9195520 | consumed tokens: 18832424960 | elapsed time per iteration (s): 1.03 | learning rate: 1.675E-04 | global batch size: 256 | lm loss: 2.079096E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.832 | TFLOPs: 40.96 | 15: iteration 35930/ 125429 | consumed samples: 9198080 | consumed tokens: 18837667840 | elapsed time per iteration (s): 1.04 | learning rate: 1.675E-04 | global batch size: 256 | lm loss: 2.046641E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.951 | TFLOPs: 40.65 | 15: iteration 35940/ 125429 | consumed samples: 9200640 | consumed tokens: 18842910720 | elapsed time per iteration (s): 1.03 | learning rate: 1.675E-04 | global batch size: 256 | lm loss: 2.053309E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 247.616 | TFLOPs: 40.92 | 15: iteration 35950/ 125429 | consumed samples: 9203200 | consumed tokens: 18848153600 | elapsed time per iteration (s): 1.05 | learning rate: 1.675E-04 | global batch size: 256 | lm loss: 2.078564E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.933 | TFLOPs: 40.31 | 15: iteration 35960/ 125429 | consumed samples: 9205760 | consumed tokens: 18853396480 | elapsed time per iteration (s): 1.56 | learning rate: 1.675E-04 | global batch size: 256 | lm loss: 2.086045E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 164.084 | TFLOPs: 27.12 | 15: iteration 35970/ 125429 | consumed samples: 9208320 | consumed tokens: 18858639360 | elapsed time per iteration (s): 1.11 | learning rate: 1.675E-04 | global batch size: 256 | lm loss: 2.071039E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.157 | TFLOPs: 38.20 | 15: iteration 35980/ 125429 | consumed samples: 9210880 | consumed tokens: 18863882240 | elapsed time per iteration (s): 1.06 | learning rate: 1.674E-04 | global batch size: 256 | lm loss: 2.047477E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.193 | TFLOPs: 39.86 | 15: iteration 35990/ 125429 | consumed samples: 9213440 | consumed tokens: 18869125120 | elapsed time per iteration (s): 1.05 | learning rate: 1.674E-04 | global batch size: 256 | lm loss: 2.048755E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.550 | TFLOPs: 40.41 | 0: [2022-11-26 06:33:33,916] [INFO] [logging.py:68:log_dist] [Rank 0] step=36000, skipped=0, lr=[0.0001674088952755169, 0.0001674088952755169, 0.0001674088952755169], mom=[(0.9, 0.999), 
(0.9, 0.999), (0.9, 0.999)] 15: iteration 36000/ 125429 | consumed samples: 9216000 | consumed tokens: 18874368000 | elapsed time per iteration (s): 1.06 | learning rate: 1.674E-04 | global batch size: 256 | lm loss: 2.081419E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.423 | TFLOPs: 39.73 | 0: steps: 36000 loss: 2.1169 iter time (s): 1.070 samples/sec: 239.349 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 36000 | lm loss value: 2.031078E+00 | lm loss PPL: 7.622300E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 36000 to checkpoints_1b5 0: [2022-11-26 06:33:34,382] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step36000 is begin to save! 0: [2022-11-26 06:33:34,393] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_01-model_00-model_states.pt... 0: [2022-11-26 06:33:34,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_01-model_00-model_states.pt. 0: [2022-11-26 06:33:34,627] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_03-model_00-model_states.pt... 0: [2022-11-26 06:33:34,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_03-model_00-model_states.pt. 0: [2022-11-26 06:33:34,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_04-model_00-model_states.pt... 0: [2022-11-26 06:33:34,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_04-model_00-model_states.pt. 
0: [2022-11-26 06:33:34,838] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_05-model_00-model_states.pt... 0: [2022-11-26 06:33:34,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_05-model_00-model_states.pt. 0: [2022-11-26 06:33:34,948] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_06-model_00-model_states.pt... 0: [2022-11-26 06:33:35,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_06-model_00-model_states.pt. 0: [2022-11-26 06:33:35,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_07-model_00-model_states.pt... 0: [2022-11-26 06:33:35,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_07-model_00-model_states.pt. 0: [2022-11-26 06:33:35,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_08-model_00-model_states.pt... 0: [2022-11-26 06:33:35,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_08-model_00-model_states.pt. 0: [2022-11-26 06:33:35,271] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_09-model_00-model_states.pt... 0: [2022-11-26 06:33:35,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_09-model_00-model_states.pt. 0: [2022-11-26 06:33:35,377] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_10-model_00-model_states.pt... 0: [2022-11-26 06:33:35,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_10-model_00-model_states.pt. 
0: [2022-11-26 06:33:35,479] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_11-model_00-model_states.pt... 0: [2022-11-26 06:33:35,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_11-model_00-model_states.pt. 0: [2022-11-26 06:33:35,587] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_12-model_00-model_states.pt... 0: [2022-11-26 06:33:35,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_12-model_00-model_states.pt. 0: [2022-11-26 06:33:35,694] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_13-model_00-model_states.pt... 0: [2022-11-26 06:33:35,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_13-model_00-model_states.pt. 0: [2022-11-26 06:33:35,800] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_14-model_00-model_states.pt... 0: [2022-11-26 06:33:35,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_14-model_00-model_states.pt. 0: [2022-11-26 06:33:35,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_15-model_00-model_states.pt... 0: [2022-11-26 06:33:36,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_15-model_00-model_states.pt. 0: [2022-11-26 06:33:36,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_16-model_00-model_states.pt... 0: [2022-11-26 06:33:36,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_16-model_00-model_states.pt. 
0: [2022-11-26 06:33:36,120] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_17-model_00-model_states.pt... 0: [2022-11-26 06:33:36,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_17-model_00-model_states.pt. 0: [2022-11-26 06:33:36,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_18-model_00-model_states.pt... 0: [2022-11-26 06:33:36,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_18-model_00-model_states.pt. 0: [2022-11-26 06:33:36,334] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_19-model_00-model_states.pt... 0: [2022-11-26 06:33:36,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_19-model_00-model_states.pt. 0: [2022-11-26 06:33:36,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_20-model_00-model_states.pt... 0: [2022-11-26 06:33:36,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_20-model_00-model_states.pt. 0: [2022-11-26 06:33:36,548] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_21-model_00-model_states.pt... 0: [2022-11-26 06:33:36,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_21-model_00-model_states.pt. 0: [2022-11-26 06:33:36,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_22-model_00-model_states.pt... 0: [2022-11-26 06:33:36,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_22-model_00-model_states.pt. 
0: [2022-11-26 06:33:36,760] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_23-model_00-model_states.pt... 0: [2022-11-26 06:33:36,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_23-model_00-model_states.pt. 0: [2022-11-26 06:33:36,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_24-model_00-model_states.pt... 0: [2022-11-26 06:33:36,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_24-model_00-model_states.pt. 0: [2022-11-26 06:33:36,975] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_25-model_00-model_states.pt... 0: [2022-11-26 06:33:37,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_25-model_00-model_states.pt. 0: [2022-11-26 06:33:37,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_26-model_00-model_states.pt... 0: [2022-11-26 06:33:37,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_26-model_00-model_states.pt. 0: [2022-11-26 06:33:37,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_27-model_00-model_states.pt... 0: [2022-11-26 06:33:37,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_27-model_00-model_states.pt. 0: [2022-11-26 06:33:37,305] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_28-model_00-model_states.pt... 0: [2022-11-26 06:33:37,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_28-model_00-model_states.pt. 
0: [2022-11-26 06:33:37,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_29-model_00-model_states.pt... 0: [2022-11-26 06:33:37,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_29-model_00-model_states.pt. 0: [2022-11-26 06:33:37,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_30-model_00-model_states.pt... 0: [2022-11-26 06:33:37,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_30-model_00-model_states.pt. 0: [2022-11-26 06:33:37,633] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/layer_32-model_00-model_states.pt... 0: [2022-11-26 06:33:37,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/layer_32-model_00-model_states.pt. 0: [2022-11-26 06:33:37,638] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step36000/mp_rank_00_model_states.pt 0: [2022-11-26 06:33:37,639] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/mp_rank_00_model_states.pt... 0: [2022-11-26 06:33:37,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/mp_rank_00_model_states.pt. 0: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
0: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
15: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
14: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
7: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
6: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
12: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
8: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
15: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
13: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:33:37,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step36000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:33:37,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:33:37,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 06:33:37,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-26 06:33:37,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:33:37,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:33:37,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 06:33:37,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-26 06:33:37,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 06:33:37,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 06:33:37,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-26 06:33:37,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:33:37,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 06:33:37,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-26 06:33:37,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:33:37,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 06:33:37,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 13: [2022-11-26 06:33:37,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:33:37,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:33:37,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 06:33:37,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
0: [2022-11-26 06:33:37,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:33:37,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 06:33:37,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-26 06:33:37,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:33:37,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 06:33:37,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 15: [2022-11-26 06:33:37,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:33:37,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:33:37,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 06:33:37,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-26 06:33:37,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 06:33:37,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 06:33:37,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 10: [2022-11-26 06:33:37,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:33:37,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 06:33:37,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 12: [2022-11-26 06:33:37,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:33:37,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 06:33:37,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-26 06:33:37,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:33:37,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 06:33:37,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 11: [2022-11-26 06:33:37,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 06:33:37,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:33:37,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 06:33:37,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 06:33:37,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 11: [2022-11-26 06:33:37,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 11: [2022-11-26 06:33:37,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:33:37,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 06:33:37,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 14: [2022-11-26 06:33:37,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:33:37,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:33:37,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
14: [2022-11-26 06:33:37,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 06:33:37,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 06:33:37,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 14: [2022-11-26 06:33:37,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 12: [2022-11-26 06:33:37,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:33:37,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 11: [2022-11-26 06:33:37,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 12: [2022-11-26 06:33:37,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 11: [2022-11-26 06:33:37,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 12: [2022-11-26 06:33:37,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:33:37,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:33:37,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
11: [2022-11-26 06:33:37,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 12: [2022-11-26 06:33:37,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 14: [2022-11-26 06:33:37,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 11: [2022-11-26 06:33:37,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 12: [2022-11-26 06:33:37,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 14: [2022-11-26 06:33:37,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 11: [2022-11-26 06:33:37,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:33:37,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 06:33:37,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-26 06:33:37,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:33:37,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 06:33:37,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
5: [2022-11-26 06:33:37,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:33:37,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 06:33:37,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-26 06:33:37,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:33:37,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 06:33:37,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 9: [2022-11-26 06:33:37,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:33:37,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 06:33:37,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 9: [2022-11-26 06:33:37,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:33:37,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 06:33:37,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
9: [2022-11-26 06:33:37,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:33:37,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 06:33:37,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:33:37,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 9: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:33:37,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 14: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 06:33:37,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 06:33:37,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 14: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:33:37,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-26 06:33:37,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:33:37,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 6: [2022-11-26 06:33:37,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:33:37,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
6: [2022-11-26 06:33:37,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 06:33:37,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 10: [2022-11-26 06:33:37,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:33:37,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 06:33:37,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-26 06:33:37,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:33:37,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 06:33:37,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-26 06:33:37,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:33:37,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 06:33:37,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
13: [2022-11-26 06:33:37,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 8: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:33:37,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 13: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:33:37,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 13: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 13: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
13: [2022-11-26 06:33:37,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 8: [2022-11-26 06:33:37,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 13: [2022-11-26 06:33:37,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 06:33:37,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 11: [2022-11-26 06:33:37,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 13: [2022-11-26 06:33:37,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 13: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 11: [2022-11-26 06:33:37,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 13: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 13: [2022-11-26 06:33:37,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
11: [2022-11-26 06:33:37,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 10: [2022-11-26 06:33:37,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:33:37,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:33:37,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:33:37,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 06:33:37,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 3: [2022-11-26 06:33:37,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 10: [2022-11-26 06:33:37,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 10: [2022-11-26 06:33:37,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-26 06:33:37,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 12: [2022-11-26 06:33:37,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 06:33:37,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 06:33:37,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 9: [2022-11-26 06:33:37,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:33:37,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:33:37,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 8: [2022-11-26 06:33:37,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:33:37,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 06:33:37,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 9: [2022-11-26 06:33:37,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 8: [2022-11-26 06:33:37,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 8: [2022-11-26 06:33:37,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 9: [2022-11-26 06:33:37,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
0: [2022-11-26 06:33:37,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:33:37,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 0: [2022-11-26 06:33:37,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 9: [2022-11-26 06:33:37,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-26 06:33:37,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-26 06:33:37,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:33:37,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:33:37,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 06:33:37,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 06:33:37,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-26 06:33:37,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 6: [2022-11-26 06:33:37,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 06:33:37,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 06:33:37,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-26 06:33:37,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:33:37,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 06:33:37,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 14: [2022-11-26 06:33:37,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:33:37,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:33:37,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 06:33:37,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:33:37,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 12: [2022-11-26 06:33:37,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 06:33:37,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
12: [2022-11-26 06:33:37,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:33:37,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 06:33:37,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 12: [2022-11-26 06:33:37,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:33:37,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 06:33:37,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 11: [2022-11-26 06:33:37,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:33:37,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 06:33:37,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 14: [2022-11-26 06:33:37,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 06:33:37,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 14: [2022-11-26 06:33:37,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 06:33:37,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 06:33:37,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 14: [2022-11-26 06:33:37,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:33:37,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 06:33:37,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-26 06:33:37,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:33:37,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:33:37,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 06:33:37,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 06:33:37,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 3: [2022-11-26 06:33:37,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 13: [2022-11-26 06:33:37,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
8: [2022-11-26 06:33:37,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:33:37,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 06:33:37,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 8: [2022-11-26 06:33:37,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:33:37,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 06:33:37,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 8: [2022-11-26 06:33:37,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:33:37,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 06:33:37,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-26 06:33:37,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:33:37,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 06:33:37,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
13: [2022-11-26 06:33:37,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 06:33:37,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 13: [2022-11-26 06:33:37,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:33:37,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 06:33:37,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 13: [2022-11-26 06:33:37,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 06:33:37,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 06:33:37,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 10: [2022-11-26 06:33:37,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:33:37,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 06:33:37,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 10: [2022-11-26 06:33:37,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 06:33:37,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 06:33:37,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 10: [2022-11-26 06:33:37,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:33:37,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 06:33:37,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 10: [2022-11-26 06:33:37,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:33:37,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 06:33:37,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 15: [2022-11-26 06:33:37,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 06:33:37,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 15: [2022-11-26 06:33:37,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 06:33:37,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 06:33:37,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 15: [2022-11-26 06:33:37,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:33:37,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 06:33:37,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 15: [2022-11-26 06:33:37,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:33:37,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 06:33:37,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 15: [2022-11-26 06:33:37,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:33:37,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 06:33:37,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 15: [2022-11-26 06:33:37,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-26 06:33:37,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 06:33:37,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 15: [2022-11-26 06:33:37,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:33:37,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:33:37,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 06:33:37,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 06:33:37,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 15: [2022-11-26 06:33:37,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 5: [2022-11-26 06:33:37,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:33:37,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 06:33:37,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 9: [2022-11-26 06:33:37,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 06:33:37,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:33:37,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 06:33:37,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 06:33:37,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 9: [2022-11-26 06:33:37,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-26 06:33:37,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:33:37,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:33:37,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:33:37,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:33:37,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 06:33:37,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 06:33:37,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 06:33:37,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 06:33:37,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 06:33:37,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 06:33:37,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-26 06:33:37,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-26 06:33:37,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-26 06:33:37,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-26 06:33:37,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-26 06:33:37,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 06:33:37,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 06:33:37,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-26 06:33:37,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:33:37,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 06:33:37,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-26 06:33:37,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:33:37,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:33:37,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:33:37,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 06:33:37,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 06:33:37,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-26 06:33:37,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
7: [2022-11-26 06:33:37,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 06:33:37,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-26 06:33:37,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:33:37,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:33:37,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:33:37,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:33:37,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 06:33:37,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 06:33:37,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 06:33:37,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-26 06:33:37,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 06:33:37,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
7: [2022-11-26 06:33:37,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-26 06:33:37,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 7: [2022-11-26 06:33:37,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:33:37,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 06:33:37,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 4: [2022-11-26 06:33:37,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 06:33:37,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 06:33:37,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-26 06:33:37,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:33:37,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 06:33:37,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-26 06:33:37,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 06:33:37,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 06:33:37,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 0: [2022-11-26 06:33:37,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 06:33:37,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-26 06:33:38,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:33:38,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:33:38,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:33:38,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:33:38,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:33:38,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:33:38,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 06:33:38,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 06:33:38,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 06:33:38,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 06:33:38,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 06:33:38,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 06:33:38,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 06:33:38,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:33:38,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 06:33:38,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-26 06:33:38,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-26 06:33:38,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-26 06:33:38,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
1: [2022-11-26 06:33:38,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-26 06:33:38,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-26 06:33:38,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 06:33:38,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 1: [2022-11-26 06:33:38,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-26 06:33:38,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:33:38,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:33:38,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:33:38,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 06:33:38,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:33:38,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 06:33:38,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 06:33:38,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:33:38,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:33:38,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:33:38,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 06:33:38,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-26 06:33:38,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 06:33:38,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
2: [2022-11-26 06:33:38,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 06:33:38,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 06:33:38,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 06:33:38,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step36000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 06:33:38,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-26 06:33:38,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-26 06:33:38,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-26 06:33:38,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-26 06:33:38,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 2: [2022-11-26 06:33:38,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step36000 is ready now! 
0: successfully saved checkpoint at iteration 36000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3742.83 15: iteration 36010/ 125429 | consumed samples: 9218560 | consumed tokens: 18879610880 | elapsed time per iteration (s): 1.46 | learning rate: 1.674E-04 | global batch size: 256 | lm loss: 2.051425E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 175.786 | TFLOPs: 29.05 | 15: iteration 36020/ 125429 | consumed samples: 9221120 | consumed tokens: 18884853760 | elapsed time per iteration (s): 1.05 | learning rate: 1.674E-04 | global batch size: 256 | lm loss: 2.061073E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.711 | TFLOPs: 40.11 | 15: iteration 36030/ 125429 | consumed samples: 9223680 | consumed tokens: 18890096640 | elapsed time per iteration (s): 1.09 | learning rate: 1.674E-04 | global batch size: 256 | lm loss: 2.045381E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.940 | TFLOPs: 38.83 | 15: iteration 36040/ 125429 | consumed samples: 9226240 | consumed tokens: 18895339520 | elapsed time per iteration (s): 1.06 | learning rate: 1.673E-04 | global batch size: 256 | lm loss: 2.056553E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.109 | TFLOPs: 39.85 | 15: iteration 36050/ 125429 | consumed samples: 9228800 | consumed tokens: 18900582400 | elapsed time per iteration (s): 1.11 | learning rate: 1.673E-04 | global batch size: 256 | lm loss: 2.076552E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.960 | TFLOPs: 38.00 | 15: iteration 36060/ 125429 | consumed samples: 9231360 | consumed tokens: 18905825280 | elapsed time per iteration (s): 1.06 | learning 
rate: 1.673E-04 | global batch size: 256 | lm loss: 2.082853E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.540 | TFLOPs: 39.92 | 15: iteration 36070/ 125429 | consumed samples: 9233920 | consumed tokens: 18911068160 | elapsed time per iteration (s): 1.05 | learning rate: 1.673E-04 | global batch size: 256 | lm loss: 2.069613E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.297 | TFLOPs: 40.37 | 15: iteration 36080/ 125429 | consumed samples: 9236480 | consumed tokens: 18916311040 | elapsed time per iteration (s): 1.04 | learning rate: 1.673E-04 | global batch size: 256 | lm loss: 2.053477E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.761 | TFLOPs: 40.61 | 15: iteration 36090/ 125429 | consumed samples: 9239040 | consumed tokens: 18921553920 | elapsed time per iteration (s): 1.06 | learning rate: 1.673E-04 | global batch size: 256 | lm loss: 2.051638E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.334 | TFLOPs: 39.88 | 15: iteration 36100/ 125429 | consumed samples: 9241600 | consumed tokens: 18926796800 | elapsed time per iteration (s): 1.06 | learning rate: 1.672E-04 | global batch size: 256 | lm loss: 2.090715E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.459 | TFLOPs: 39.74 | 15: iteration 36110/ 125429 | consumed samples: 9244160 | consumed tokens: 18932039680 | elapsed time per iteration (s): 1.03 | learning rate: 1.672E-04 | global batch size: 256 | lm loss: 2.072790E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.239 | TFLOPs: 41.02 | 15: iteration 36120/ 125429 | 
consumed samples: 9246720 | consumed tokens: 18937282560 | elapsed time per iteration (s): 1.03 | learning rate: 1.672E-04 | global batch size: 256 | lm loss: 2.059304E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.628 | TFLOPs: 41.09 | 15: iteration 36130/ 125429 | consumed samples: 9249280 | consumed tokens: 18942525440 | elapsed time per iteration (s): 1.07 | learning rate: 1.672E-04 | global batch size: 256 | lm loss: 2.040271E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.977 | TFLOPs: 39.66 | 15: iteration 36140/ 125429 | consumed samples: 9251840 | consumed tokens: 18947768320 | elapsed time per iteration (s): 1.03 | learning rate: 1.672E-04 | global batch size: 256 | lm loss: 2.048054E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.408 | TFLOPs: 40.89 | 15: iteration 36150/ 125429 | consumed samples: 9254400 | consumed tokens: 18953011200 | elapsed time per iteration (s): 1.03 | learning rate: 1.671E-04 | global batch size: 256 | lm loss: 2.099191E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.172 | TFLOPs: 41.01 | 15: iteration 36160/ 125429 | consumed samples: 9256960 | consumed tokens: 18958254080 | elapsed time per iteration (s): 1.04 | learning rate: 1.671E-04 | global batch size: 256 | lm loss: 2.086303E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.180 | TFLOPs: 40.85 | 15: iteration 36170/ 125429 | consumed samples: 9259520 | consumed tokens: 18963496960 | elapsed time per iteration (s): 1.10 | learning rate: 1.671E-04 | global batch size: 256 | lm loss: 2.078825E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 233.374 | TFLOPs: 38.57 | 15: iteration 36180/ 125429 | consumed samples: 9262080 | consumed tokens: 18968739840 | elapsed time per iteration (s): 1.03 | learning rate: 1.671E-04 | global batch size: 256 | lm loss: 2.065459E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.346 | TFLOPs: 41.04 | 15: iteration 36190/ 125429 | consumed samples: 9264640 | consumed tokens: 18973982720 | elapsed time per iteration (s): 1.06 | learning rate: 1.671E-04 | global batch size: 256 | lm loss: 2.071988E+00 | grad norm: 0.673 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.779 | TFLOPs: 39.96 | 15: iteration 36200/ 125429 | consumed samples: 9267200 | consumed tokens: 18979225600 | elapsed time per iteration (s): 1.05 | learning rate: 1.671E-04 | global batch size: 256 | lm loss: 2.088008E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.205 | TFLOPs: 40.19 | 15: iteration 36210/ 125429 | consumed samples: 9269760 | consumed tokens: 18984468480 | elapsed time per iteration (s): 1.05 | learning rate: 1.670E-04 | global batch size: 256 | lm loss: 2.059151E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.349 | TFLOPs: 40.22 | 15: iteration 36220/ 125429 | consumed samples: 9272320 | consumed tokens: 18989711360 | elapsed time per iteration (s): 1.07 | learning rate: 1.670E-04 | global batch size: 256 | lm loss: 2.093586E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.969 | TFLOPs: 39.49 | 15: iteration 36230/ 125429 | consumed samples: 9274880 | consumed tokens: 18994954240 | elapsed time per iteration (s): 1.04 | learning rate: 1.670E-04 | global batch size: 
256 | lm loss: 2.085375E+00 | grad norm: 0.488 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.386 | TFLOPs: 40.72 | 15: iteration 36240/ 125429 | consumed samples: 9277440 | consumed tokens: 19000197120 | elapsed time per iteration (s): 1.07 | learning rate: 1.670E-04 | global batch size: 256 | lm loss: 2.094510E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.679 | TFLOPs: 39.61 | 15: iteration 36250/ 125429 | consumed samples: 9280000 | consumed tokens: 19005440000 | elapsed time per iteration (s): 1.07 | learning rate: 1.670E-04 | global batch size: 256 | lm loss: 2.063506E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.011 | TFLOPs: 39.66 | 15: iteration 36260/ 125429 | consumed samples: 9282560 | consumed tokens: 19010682880 | elapsed time per iteration (s): 1.04 | learning rate: 1.670E-04 | global batch size: 256 | lm loss: 2.051304E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.546 | TFLOPs: 40.58 | 15: iteration 36270/ 125429 | consumed samples: 9285120 | consumed tokens: 19015925760 | elapsed time per iteration (s): 1.04 | learning rate: 1.669E-04 | global batch size: 256 | lm loss: 2.084656E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.676 | TFLOPs: 40.77 | 15: iteration 36280/ 125429 | consumed samples: 9287680 | consumed tokens: 19021168640 | elapsed time per iteration (s): 1.04 | learning rate: 1.669E-04 | global batch size: 256 | lm loss: 2.060534E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.217 | TFLOPs: 40.85 | 15: iteration 36290/ 125429 | consumed samples: 9290240 | consumed 
tokens: 19026411520 | elapsed time per iteration (s): 1.05 | learning rate: 1.669E-04 | global batch size: 256 | lm loss: 2.086981E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.487 | TFLOPs: 40.40 | 15: iteration 36300/ 125429 | consumed samples: 9292800 | consumed tokens: 19031654400 | elapsed time per iteration (s): 1.09 | learning rate: 1.669E-04 | global batch size: 256 | lm loss: 2.080102E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.044 | TFLOPs: 38.84 | 15: iteration 36310/ 125429 | consumed samples: 9295360 | consumed tokens: 19036897280 | elapsed time per iteration (s): 1.05 | learning rate: 1.669E-04 | global batch size: 256 | lm loss: 2.074659E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.150 | TFLOPs: 40.35 | 15: iteration 36320/ 125429 | consumed samples: 9297920 | consumed tokens: 19042140160 | elapsed time per iteration (s): 1.08 | learning rate: 1.668E-04 | global batch size: 256 | lm loss: 2.055790E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.573 | TFLOPs: 39.10 | 15: iteration 36330/ 125429 | consumed samples: 9300480 | consumed tokens: 19047383040 | elapsed time per iteration (s): 1.07 | learning rate: 1.668E-04 | global batch size: 256 | lm loss: 2.041615E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.798 | TFLOPs: 39.46 | 15: iteration 36340/ 125429 | consumed samples: 9303040 | consumed tokens: 19052625920 | elapsed time per iteration (s): 1.04 | learning rate: 1.668E-04 | global batch size: 256 | lm loss: 2.070212E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 246.437 | TFLOPs: 40.73 | 15: iteration 36350/ 125429 | consumed samples: 9305600 | consumed tokens: 19057868800 | elapsed time per iteration (s): 1.02 | learning rate: 1.668E-04 | global batch size: 256 | lm loss: 2.040174E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.108 | TFLOPs: 41.50 | 15: iteration 36360/ 125429 | consumed samples: 9308160 | consumed tokens: 19063111680 | elapsed time per iteration (s): 1.06 | learning rate: 1.668E-04 | global batch size: 256 | lm loss: 2.076775E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.888 | TFLOPs: 39.81 | 15: iteration 36370/ 125429 | consumed samples: 9310720 | consumed tokens: 19068354560 | elapsed time per iteration (s): 1.05 | learning rate: 1.668E-04 | global batch size: 256 | lm loss: 2.068049E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.466 | TFLOPs: 40.40 | 15: iteration 36380/ 125429 | consumed samples: 9313280 | consumed tokens: 19073597440 | elapsed time per iteration (s): 1.04 | learning rate: 1.667E-04 | global batch size: 256 | lm loss: 2.076631E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.885 | TFLOPs: 40.80 | 15: iteration 36390/ 125429 | consumed samples: 9315840 | consumed tokens: 19078840320 | elapsed time per iteration (s): 1.04 | learning rate: 1.667E-04 | global batch size: 256 | lm loss: 2.059618E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.093 | TFLOPs: 40.67 | 15: iteration 36400/ 125429 | consumed samples: 9318400 | consumed tokens: 19084083200 | elapsed time per iteration (s): 1.03 | learning rate: 1.667E-04 | global batch size: 256 | lm loss: 2.079794E+00 | grad norm: 
0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.723 | TFLOPs: 41.27 | 15: iteration 36410/ 125429 | consumed samples: 9320960 | consumed tokens: 19089326080 | elapsed time per iteration (s): 1.03 | learning rate: 1.667E-04 | global batch size: 256 | lm loss: 2.049816E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.729 | TFLOPs: 41.10 | 15: iteration 36420/ 125429 | consumed samples: 9323520 | consumed tokens: 19094568960 | elapsed time per iteration (s): 1.07 | learning rate: 1.667E-04 | global batch size: 256 | lm loss: 2.053236E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.681 | TFLOPs: 39.44 | 15: iteration 36430/ 125429 | consumed samples: 9326080 | consumed tokens: 19099811840 | elapsed time per iteration (s): 1.04 | learning rate: 1.667E-04 | global batch size: 256 | lm loss: 2.064986E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.179 | TFLOPs: 40.68 | 15: iteration 36440/ 125429 | consumed samples: 9328640 | consumed tokens: 19105054720 | elapsed time per iteration (s): 1.03 | learning rate: 1.666E-04 | global batch size: 256 | lm loss: 2.049202E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.820 | TFLOPs: 40.95 | 15: iteration 36450/ 125429 | consumed samples: 9331200 | consumed tokens: 19110297600 | elapsed time per iteration (s): 1.05 | learning rate: 1.666E-04 | global batch size: 256 | lm loss: 2.080625E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.749 | TFLOPs: 40.28 | 15: iteration 36460/ 125429 | consumed samples: 9333760 | consumed tokens: 19115540480 | elapsed time per 
iteration (s): 1.06 | learning rate: 1.666E-04 | global batch size: 256 | lm loss: 2.062066E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.374 | TFLOPs: 40.05 | 15: iteration 36470/ 125429 | consumed samples: 9336320 | consumed tokens: 19120783360 | elapsed time per iteration (s): 1.03 | learning rate: 1.666E-04 | global batch size: 256 | lm loss: 2.051863E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.184 | TFLOPs: 41.18 | 15: iteration 36480/ 125429 | consumed samples: 9338880 | consumed tokens: 19126026240 | elapsed time per iteration (s): 1.04 | learning rate: 1.666E-04 | global batch size: 256 | lm loss: 2.069785E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.724 | TFLOPs: 40.61 | 15: iteration 36490/ 125429 | consumed samples: 9341440 | consumed tokens: 19131269120 | elapsed time per iteration (s): 1.04 | learning rate: 1.665E-04 | global batch size: 256 | lm loss: 2.082225E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.532 | TFLOPs: 40.74 | 15: iteration 36500/ 125429 | consumed samples: 9344000 | consumed tokens: 19136512000 | elapsed time per iteration (s): 1.05 | learning rate: 1.665E-04 | global batch size: 256 | lm loss: 2.040094E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.124 | TFLOPs: 40.18 | 15: iteration 36510/ 125429 | consumed samples: 9346560 | consumed tokens: 19141754880 | elapsed time per iteration (s): 1.04 | learning rate: 1.665E-04 | global batch size: 256 | lm loss: 2.072034E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.077 | TFLOPs: 40.67 | 15: 
iteration 36520/ 125429 | consumed samples: 9349120 | consumed tokens: 19146997760 | elapsed time per iteration (s): 1.07 | learning rate: 1.665E-04 | global batch size: 256 | lm loss: 2.035628E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.839 | TFLOPs: 39.47 | 15: iteration 36530/ 125429 | consumed samples: 9351680 | consumed tokens: 19152240640 | elapsed time per iteration (s): 1.04 | learning rate: 1.665E-04 | global batch size: 256 | lm loss: 2.067249E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.816 | TFLOPs: 40.62 | 15: iteration 36540/ 125429 | consumed samples: 9354240 | consumed tokens: 19157483520 | elapsed time per iteration (s): 1.04 | learning rate: 1.665E-04 | global batch size: 256 | lm loss: 2.064900E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.953 | TFLOPs: 40.81 | 15: iteration 36550/ 125429 | consumed samples: 9356800 | consumed tokens: 19162726400 | elapsed time per iteration (s): 1.08 | learning rate: 1.664E-04 | global batch size: 256 | lm loss: 2.059743E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.374 | TFLOPs: 39.23 | 15: iteration 36560/ 125429 | consumed samples: 9359360 | consumed tokens: 19167969280 | elapsed time per iteration (s): 1.05 | learning rate: 1.664E-04 | global batch size: 256 | lm loss: 2.078593E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.545 | TFLOPs: 40.41 | 15: iteration 36570/ 125429 | consumed samples: 9361920 | consumed tokens: 19173212160 | elapsed time per iteration (s): 1.05 | learning rate: 1.664E-04 | global batch size: 256 | lm loss: 2.052599E+00 | grad norm: 0.152 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.649 | TFLOPs: 40.43 | 15: iteration 36580/ 125429 | consumed samples: 9364480 | consumed tokens: 19178455040 | elapsed time per iteration (s): 1.03 | learning rate: 1.664E-04 | global batch size: 256 | lm loss: 2.046035E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.629 | TFLOPs: 41.25 | 15: iteration 36590/ 125429 | consumed samples: 9367040 | consumed tokens: 19183697920 | elapsed time per iteration (s): 1.02 | learning rate: 1.664E-04 | global batch size: 256 | lm loss: 2.053996E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.238 | TFLOPs: 41.35 | 15: iteration 36600/ 125429 | consumed samples: 9369600 | consumed tokens: 19188940800 | elapsed time per iteration (s): 1.03 | learning rate: 1.664E-04 | global batch size: 256 | lm loss: 2.026930E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.484 | TFLOPs: 40.90 | 15: iteration 36610/ 125429 | consumed samples: 9372160 | consumed tokens: 19194183680 | elapsed time per iteration (s): 1.02 | learning rate: 1.663E-04 | global batch size: 256 | lm loss: 2.094803E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.947 | TFLOPs: 41.47 | 15: iteration 36620/ 125429 | consumed samples: 9374720 | consumed tokens: 19199426560 | elapsed time per iteration (s): 1.02 | learning rate: 1.663E-04 | global batch size: 256 | lm loss: 2.067207E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.061 | TFLOPs: 41.49 | 15: iteration 36630/ 125429 | consumed samples: 9377280 | consumed tokens: 19204669440 | elapsed time per iteration (s): 1.20 | learning rate: 
1.663E-04 | global batch size: 256 | lm loss: 2.020173E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 212.840 | TFLOPs: 35.17 | 15: iteration 36640/ 125429 | consumed samples: 9379840 | consumed tokens: 19209912320 | elapsed time per iteration (s): 1.02 | learning rate: 1.663E-04 | global batch size: 256 | lm loss: 2.068848E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.196 | TFLOPs: 41.35 | 15: iteration 36650/ 125429 | consumed samples: 9382400 | consumed tokens: 19215155200 | elapsed time per iteration (s): 1.09 | learning rate: 1.663E-04 | global batch size: 256 | lm loss: 2.044673E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.396 | TFLOPs: 38.90 | 15: iteration 36660/ 125429 | consumed samples: 9384960 | consumed tokens: 19220398080 | elapsed time per iteration (s): 1.07 | learning rate: 1.662E-04 | global batch size: 256 | lm loss: 2.062008E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.888 | TFLOPs: 39.64 | 15: iteration 36670/ 125429 | consumed samples: 9387520 | consumed tokens: 19225640960 | elapsed time per iteration (s): 1.08 | learning rate: 1.662E-04 | global batch size: 256 | lm loss: 2.078127E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.096 | TFLOPs: 39.18 | 15: iteration 36680/ 125429 | consumed samples: 9390080 | consumed tokens: 19230883840 | elapsed time per iteration (s): 1.09 | learning rate: 1.662E-04 | global batch size: 256 | lm loss: 2.065351E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.319 | TFLOPs: 38.89 | 15: iteration 36690/ 125429 | consumed 
samples: 9392640 | consumed tokens: 19236126720 | elapsed time per iteration (s): 1.04 | learning rate: 1.662E-04 | global batch size: 256 | lm loss: 2.060093E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.381 | TFLOPs: 40.72 | 15: iteration 36700/ 125429 | consumed samples: 9395200 | consumed tokens: 19241369600 | elapsed time per iteration (s): 1.06 | learning rate: 1.662E-04 | global batch size: 256 | lm loss: 2.059069E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.039 | TFLOPs: 39.83 | 15: iteration 36710/ 125429 | consumed samples: 9397760 | consumed tokens: 19246612480 | elapsed time per iteration (s): 1.04 | learning rate: 1.662E-04 | global batch size: 256 | lm loss: 2.050443E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.266 | TFLOPs: 40.70 | 15: iteration 36720/ 125429 | consumed samples: 9400320 | consumed tokens: 19251855360 | elapsed time per iteration (s): 1.05 | learning rate: 1.661E-04 | global batch size: 256 | lm loss: 2.049042E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.140 | TFLOPs: 40.35 | 15: iteration 36730/ 125429 | consumed samples: 9402880 | consumed tokens: 19257098240 | elapsed time per iteration (s): 1.02 | learning rate: 1.661E-04 | global batch size: 256 | lm loss: 2.068577E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.338 | TFLOPs: 41.37 | 15: iteration 36740/ 125429 | consumed samples: 9405440 | consumed tokens: 19262341120 | elapsed time per iteration (s): 1.02 | learning rate: 1.661E-04 | global batch size: 256 | lm loss: 2.083145E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 250.750 | TFLOPs: 41.44 | 15: iteration 36750/ 125429 | consumed samples: 9408000 | consumed tokens: 19267584000 | elapsed time per iteration (s): 1.04 | learning rate: 1.661E-04 | global batch size: 256 | lm loss: 2.054399E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.366 | TFLOPs: 40.55 | 15: iteration 36760/ 125429 | consumed samples: 9410560 | consumed tokens: 19272826880 | elapsed time per iteration (s): 1.05 | learning rate: 1.661E-04 | global batch size: 256 | lm loss: 2.063587E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.440 | TFLOPs: 40.40 | 15: iteration 36770/ 125429 | consumed samples: 9413120 | consumed tokens: 19278069760 | elapsed time per iteration (s): 1.03 | learning rate: 1.660E-04 | global batch size: 256 | lm loss: 2.054810E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.638 | TFLOPs: 40.92 | 15: iteration 36780/ 125429 | consumed samples: 9415680 | consumed tokens: 19283312640 | elapsed time per iteration (s): 1.03 | learning rate: 1.660E-04 | global batch size: 256 | lm loss: 2.090772E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.688 | TFLOPs: 41.26 | 15: iteration 36790/ 125429 | consumed samples: 9418240 | consumed tokens: 19288555520 | elapsed time per iteration (s): 1.03 | learning rate: 1.660E-04 | global batch size: 256 | lm loss: 2.073355E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.669 | TFLOPs: 41.09 | 15: iteration 36800/ 125429 | consumed samples: 9420800 | consumed tokens: 19293798400 | elapsed time per iteration (s): 1.03 | learning rate: 1.660E-04 | global batch size: 256 | lm 
loss: 2.048906E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.157 | TFLOPs: 41.01 | 15: iteration 36810/ 125429 | consumed samples: 9423360 | consumed tokens: 19299041280 | elapsed time per iteration (s): 1.05 | learning rate: 1.660E-04 | global batch size: 256 | lm loss: 2.065896E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.854 | TFLOPs: 40.30 | 15: iteration 36820/ 125429 | consumed samples: 9425920 | consumed tokens: 19304284160 | elapsed time per iteration (s): 1.05 | learning rate: 1.660E-04 | global batch size: 256 | lm loss: 2.071498E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.561 | TFLOPs: 40.42 | 15: iteration 36830/ 125429 | consumed samples: 9428480 | consumed tokens: 19309527040 | elapsed time per iteration (s): 1.05 | learning rate: 1.659E-04 | global batch size: 256 | lm loss: 2.073518E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.965 | TFLOPs: 40.15 | 15: iteration 36840/ 125429 | consumed samples: 9431040 | consumed tokens: 19314769920 | elapsed time per iteration (s): 1.05 | learning rate: 1.659E-04 | global batch size: 256 | lm loss: 2.048200E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.494 | TFLOPs: 40.40 | 15: iteration 36850/ 125429 | consumed samples: 9433600 | consumed tokens: 19320012800 | elapsed time per iteration (s): 1.05 | learning rate: 1.659E-04 | global batch size: 256 | lm loss: 2.082887E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.733 | TFLOPs: 40.44 | 15: iteration 36860/ 125429 | consumed samples: 9436160 | consumed tokens: 
19325255680 | elapsed time per iteration (s): 1.03 | learning rate: 1.659E-04 | global batch size: 256 | lm loss: 2.071805E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.321 | TFLOPs: 41.04 | 15: iteration 36870/ 125429 | consumed samples: 9438720 | consumed tokens: 19330498560 | elapsed time per iteration (s): 1.04 | learning rate: 1.659E-04 | global batch size: 256 | lm loss: 2.039639E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.338 | TFLOPs: 40.71 | 15: iteration 36880/ 125429 | consumed samples: 9441280 | consumed tokens: 19335741440 | elapsed time per iteration (s): 1.03 | learning rate: 1.659E-04 | global batch size: 256 | lm loss: 2.032995E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.541 | TFLOPs: 41.24 | 15: iteration 36890/ 125429 | consumed samples: 9443840 | consumed tokens: 19340984320 | elapsed time per iteration (s): 1.03 | learning rate: 1.658E-04 | global batch size: 256 | lm loss: 2.052616E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.744 | TFLOPs: 41.11 | 15: iteration 36900/ 125429 | consumed samples: 9446400 | consumed tokens: 19346227200 | elapsed time per iteration (s): 1.05 | learning rate: 1.658E-04 | global batch size: 256 | lm loss: 2.088130E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.215 | TFLOPs: 40.36 | 15: iteration 36910/ 125429 | consumed samples: 9448960 | consumed tokens: 19351470080 | elapsed time per iteration (s): 1.06 | learning rate: 1.658E-04 | global batch size: 256 | lm loss: 2.090633E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
240.770 | TFLOPs: 39.79 | 15: iteration 36920/ 125429 | consumed samples: 9451520 | consumed tokens: 19356712960 | elapsed time per iteration (s): 1.04 | learning rate: 1.658E-04 | global batch size: 256 | lm loss: 2.060701E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.198 | TFLOPs: 40.85 | 15: iteration 36930/ 125429 | consumed samples: 9454080 | consumed tokens: 19361955840 | elapsed time per iteration (s): 1.03 | learning rate: 1.658E-04 | global batch size: 256 | lm loss: 2.075396E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.487 | TFLOPs: 40.90 | 15: iteration 36940/ 125429 | consumed samples: 9456640 | consumed tokens: 19367198720 | elapsed time per iteration (s): 1.04 | learning rate: 1.657E-04 | global batch size: 256 | lm loss: 2.098425E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.252 | TFLOPs: 40.69 | 15: iteration 36950/ 125429 | consumed samples: 9459200 | consumed tokens: 19372441600 | elapsed time per iteration (s): 1.05 | learning rate: 1.657E-04 | global batch size: 256 | lm loss: 2.066928E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.556 | TFLOPs: 40.25 | 15: iteration 36960/ 125429 | consumed samples: 9461760 | consumed tokens: 19377684480 | elapsed time per iteration (s): 1.03 | learning rate: 1.657E-04 | global batch size: 256 | lm loss: 2.048831E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.641 | TFLOPs: 40.92 | 15: iteration 36970/ 125429 | consumed samples: 9464320 | consumed tokens: 19382927360 | elapsed time per iteration (s): 1.05 | learning rate: 1.657E-04 | global batch size: 256 | lm loss: 2.093942E+00 | grad norm: 0.156 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.979 | TFLOPs: 40.32 |
15: iteration 36980/ 125429 | consumed samples: 9466880 | consumed tokens: 19388170240 | elapsed time per iteration (s): 1.06 | learning rate: 1.657E-04 | global batch size: 256 | lm loss: 2.081045E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.497 | TFLOPs: 39.74 |
15: iteration 36990/ 125429 | consumed samples: 9469440 | consumed tokens: 19393413120 | elapsed time per iteration (s): 1.07 | learning rate: 1.657E-04 | global batch size: 256 | lm loss: 2.054610E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.253 | TFLOPs: 39.70 |
15: iteration 37000/ 125429 | consumed samples: 9472000 | consumed tokens: 19398656000 | elapsed time per iteration (s): 1.07 | learning rate: 1.656E-04 | global batch size: 256 | lm loss: 2.063689E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.115 | TFLOPs: 39.68 |
15: -------------------------------------------------------------------------------------------
15: valid loss at iteration 37000 | lm loss value: 2.072623E+00 | lm loss PPL: 7.945635E+00 |
15: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 37000 to checkpoints_1b5
0: [2022-11-26 06:51:08,187] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step37000 is begin to save!
0: [2022-11-26 06:51:08,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_01-model_00-model_states.pt...
0: [2022-11-26 06:51:08,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_01-model_00-model_states.pt.
0: [2022-11-26 06:51:08,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_03-model_00-model_states.pt... 0: [2022-11-26 06:51:08,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_03-model_00-model_states.pt. 0: [2022-11-26 06:51:08,527] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_04-model_00-model_states.pt... 0: [2022-11-26 06:51:08,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_04-model_00-model_states.pt. 0: [2022-11-26 06:51:08,636] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_05-model_00-model_states.pt... 0: [2022-11-26 06:51:08,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_05-model_00-model_states.pt. 0: [2022-11-26 06:51:08,749] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_06-model_00-model_states.pt... 0: [2022-11-26 06:51:08,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_06-model_00-model_states.pt. 0: [2022-11-26 06:51:08,856] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_07-model_00-model_states.pt... 0: [2022-11-26 06:51:08,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_07-model_00-model_states.pt. 0: [2022-11-26 06:51:08,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_08-model_00-model_states.pt... 0: [2022-11-26 06:51:09,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_08-model_00-model_states.pt. 
0: [2022-11-26 06:51:09,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_09-model_00-model_states.pt... 0: [2022-11-26 06:51:09,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_09-model_00-model_states.pt. 0: [2022-11-26 06:51:09,182] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_10-model_00-model_states.pt... 0: [2022-11-26 06:51:09,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_10-model_00-model_states.pt. 0: [2022-11-26 06:51:09,292] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_11-model_00-model_states.pt... 0: [2022-11-26 06:51:09,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_11-model_00-model_states.pt. 0: [2022-11-26 06:51:09,403] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_12-model_00-model_states.pt... 0: [2022-11-26 06:51:09,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_12-model_00-model_states.pt. 0: [2022-11-26 06:51:09,511] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_13-model_00-model_states.pt... 0: [2022-11-26 06:51:09,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_13-model_00-model_states.pt. 0: [2022-11-26 06:51:09,622] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_14-model_00-model_states.pt... 0: [2022-11-26 06:51:09,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_14-model_00-model_states.pt. 
0: [2022-11-26 06:51:09,726] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_15-model_00-model_states.pt... 0: [2022-11-26 06:51:09,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_15-model_00-model_states.pt. 0: [2022-11-26 06:51:09,836] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_16-model_00-model_states.pt... 0: [2022-11-26 06:51:09,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_16-model_00-model_states.pt. 0: [2022-11-26 06:51:09,942] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_17-model_00-model_states.pt... 0: [2022-11-26 06:51:10,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_17-model_00-model_states.pt. 0: [2022-11-26 06:51:10,052] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_18-model_00-model_states.pt... 0: [2022-11-26 06:51:10,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_18-model_00-model_states.pt. 0: [2022-11-26 06:51:10,159] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_19-model_00-model_states.pt... 0: [2022-11-26 06:51:10,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_19-model_00-model_states.pt. 0: [2022-11-26 06:51:10,262] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_20-model_00-model_states.pt... 0: [2022-11-26 06:51:10,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_20-model_00-model_states.pt. 
0: [2022-11-26 06:51:10,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_21-model_00-model_states.pt... 0: [2022-11-26 06:51:10,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_21-model_00-model_states.pt. 0: [2022-11-26 06:51:10,476] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_22-model_00-model_states.pt... 0: [2022-11-26 06:51:10,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_22-model_00-model_states.pt. 0: [2022-11-26 06:51:10,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_23-model_00-model_states.pt... 0: [2022-11-26 06:51:10,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_23-model_00-model_states.pt. 0: [2022-11-26 06:51:10,684] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_24-model_00-model_states.pt... 0: [2022-11-26 06:51:10,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_24-model_00-model_states.pt. 0: [2022-11-26 06:51:10,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_25-model_00-model_states.pt... 0: [2022-11-26 06:51:10,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_25-model_00-model_states.pt. 0: [2022-11-26 06:51:10,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_26-model_00-model_states.pt... 0: [2022-11-26 06:51:11,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_26-model_00-model_states.pt. 
0: [2022-11-26 06:51:11,003] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_27-model_00-model_states.pt... 0: [2022-11-26 06:51:11,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_27-model_00-model_states.pt. 0: [2022-11-26 06:51:11,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_28-model_00-model_states.pt... 0: [2022-11-26 06:51:11,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_28-model_00-model_states.pt. 0: [2022-11-26 06:51:11,219] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_29-model_00-model_states.pt... 0: [2022-11-26 06:51:11,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_29-model_00-model_states.pt. 0: [2022-11-26 06:51:11,325] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_30-model_00-model_states.pt... 0: [2022-11-26 06:51:11,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_30-model_00-model_states.pt. 0: [2022-11-26 06:51:11,424] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/layer_32-model_00-model_states.pt... 0: [2022-11-26 06:51:11,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/layer_32-model_00-model_states.pt. 0: [2022-11-26 06:51:11,430] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step37000/mp_rank_00_model_states.pt 0: [2022-11-26 06:51:11,430] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/mp_rank_00_model_states.pt... 
0: [2022-11-26 06:51:11,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/mp_rank_00_model_states.pt. 0: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
10: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
9: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
13: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
11: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
5: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
12: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 2: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
2: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 4: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 0: [2022-11-26 06:51:11,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step37000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 14: [2022-11-26 06:51:11,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:51:11,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 06:51:11,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
9: [2022-11-26 06:51:11,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:51:11,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 06:51:11,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-26 06:51:11,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:51:11,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 06:51:11,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 9: [2022-11-26 06:51:11,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:51:11,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 06:51:11,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 8: [2022-11-26 06:51:11,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:51:11,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 06:51:11,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
10: [2022-11-26 06:51:11,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:51:11,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 06:51:11,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-26 06:51:11,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:51:11,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 06:51:11,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 12: [2022-11-26 06:51:11,634] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:51:11,634] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 06:51:11,634] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 5: [2022-11-26 06:51:11,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:51:11,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 06:51:11,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
2: [2022-11-26 06:51:11,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:51:11,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:51:11,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 06:51:11,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 06:51:11,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 2: [2022-11-26 06:51:11,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 3: [2022-11-26 06:51:11,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:51:11,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 06:51:11,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 3: [2022-11-26 06:51:11,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 06:51:11,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 06:51:11,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
5: [2022-11-26 06:51:11,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:51:11,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 7: [2022-11-26 06:51:11,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:51:11,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 7: [2022-11-26 06:51:11,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 06:51:11,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 5: [2022-11-26 06:51:11,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:51:11,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 06:51:11,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 0: [2022-11-26 06:51:11,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:51:11,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 06:51:11,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
0: [2022-11-26 06:51:11,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:51:11,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 06:51:11,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 10: [2022-11-26 06:51:11,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:51:11,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 9: [2022-11-26 06:51:11,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:51:11,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 10: [2022-11-26 06:51:11,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 9: [2022-11-26 06:51:11,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 7: [2022-11-26 06:51:11,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:51:11,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 06:51:11,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 06:51:11,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 8: [2022-11-26 06:51:11,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:51:11,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 7: [2022-11-26 06:51:11,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 8: [2022-11-26 06:51:11,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 06:51:11,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 8: [2022-11-26 06:51:11,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:51:11,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 06:51:11,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 12: [2022-11-26 06:51:11,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 06:51:11,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 06:51:11,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 0: [2022-11-26 06:51:11,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:51:11,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:51:11,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 06:51:11,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 1: [2022-11-26 06:51:11,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:51:11,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:51:11,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:51:11,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 06:51:11,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 06:51:11,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
1: [2022-11-26 06:51:11,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 1: [2022-11-26 06:51:11,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 06:51:11,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 9: [2022-11-26 06:51:11,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:51:11,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 06:51:11,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-26 06:51:11,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:51:11,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 06:51:11,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 3: [2022-11-26 06:51:11,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:51:11,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:51:11,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
3: [2022-11-26 06:51:11,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 14: [2022-11-26 06:51:11,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 06:51:11,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 06:51:11,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 14: [2022-11-26 06:51:11,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 3: [2022-11-26 06:51:11,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 1: [2022-11-26 06:51:11,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:51:11,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:51:11,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 06:51:11,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 15: [2022-11-26 06:51:11,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:51:11,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
6: [2022-11-26 06:51:11,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:51:11,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 12: [2022-11-26 06:51:11,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:51:11,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 12: [2022-11-26 06:51:11,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 06:51:11,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 0: [2022-11-26 06:51:11,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:51:11,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 06:51:11,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 9: [2022-11-26 06:51:11,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:51:11,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 06:51:11,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
0: [2022-11-26 06:51:11,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 06:51:11,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 06:51:11,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 9: [2022-11-26 06:51:11,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:51:11,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 06:51:11,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 14: [2022-11-26 06:51:11,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 06:51:11,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 06:51:11,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 10: [2022-11-26 06:51:11,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:51:11,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
10: [2022-11-26 06:51:11,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 9: [2022-11-26 06:51:11,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 10: [2022-11-26 06:51:11,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 9: [2022-11-26 06:51:11,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 10: [2022-11-26 06:51:11,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:51:11,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 06:51:11,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 15: [2022-11-26 06:51:11,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 06:51:11,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 06:51:11,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 15: [2022-11-26 06:51:11,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 15: [2022-11-26 06:51:11,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 06:51:11,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 06:51:11,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 9: [2022-11-26 06:51:11,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 06:51:11,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 06:51:11,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 15: [2022-11-26 06:51:11,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:51:11,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 06:51:11,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 12: [2022-11-26 06:51:11,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:51:11,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 06:51:11,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 3: [2022-11-26 06:51:11,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 06:51:11,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 06:51:11,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 12: [2022-11-26 06:51:11,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:51:11,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 06:51:11,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-26 06:51:11,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:51:11,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 06:51:11,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 12: [2022-11-26 06:51:11,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:51:11,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 06:51:11,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-26 06:51:11,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 06:51:11,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:51:11,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 06:51:11,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 06:51:11,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 06:51:11,655] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 06:51:11,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-26 06:51:11,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 6: [2022-11-26 06:51:11,655] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 15: [2022-11-26 06:51:11,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:51:11,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 06:51:11,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 0: [2022-11-26 06:51:11,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
12: [2022-11-26 06:51:11,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 06:51:11,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 0: [2022-11-26 06:51:11,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 06:51:11,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 12: [2022-11-26 06:51:11,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 5: [2022-11-26 06:51:11,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:51:11,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 06:51:11,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 5: [2022-11-26 06:51:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:51:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 06:51:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 15: [2022-11-26 06:51:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 06:51:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 06:51:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 10: [2022-11-26 06:51:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:51:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 06:51:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 5: [2022-11-26 06:51:11,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 06:51:11,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 06:51:11,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 15: [2022-11-26 06:51:11,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 06:51:11,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 06:51:11,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 15: [2022-11-26 06:51:11,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 06:51:11,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 06:51:11,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 1: [2022-11-26 06:51:11,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 06:51:11,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 1: [2022-11-26 06:51:11,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:51:11,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 06:51:11,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 1: [2022-11-26 06:51:11,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 06:51:11,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 06:51:11,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 1: [2022-11-26 06:51:11,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 06:51:11,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 06:51:11,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 7: [2022-11-26 06:51:11,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:51:11,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:51:11,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 06:51:11,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 06:51:11,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 7: [2022-11-26 06:51:11,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 1: [2022-11-26 06:51:11,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:51:11,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 8: [2022-11-26 06:51:11,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
10: [2022-11-26 06:51:11,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 8: [2022-11-26 06:51:11,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 06:51:11,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 8: [2022-11-26 06:51:11,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:51:11,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 8: [2022-11-26 06:51:11,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 06:51:11,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 10: [2022-11-26 06:51:11,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:51:11,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 06:51:11,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 06:51:11,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
10: [2022-11-26 06:51:11,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 06:51:11,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 7: [2022-11-26 06:51:11,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 06:51:11,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 06:51:11,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 11: [2022-11-26 06:51:11,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:51:11,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:51:11,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:51:11,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:51:11,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:51:11,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:51:11,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-26 06:51:11,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 06:51:11,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 06:51:11,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 06:51:11,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 06:51:11,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 11: [2022-11-26 06:51:11,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 11: [2022-11-26 06:51:11,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 06:51:11,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 
11: [2022-11-26 06:51:11,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 06:51:11,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 06:51:11,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 06:51:11,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 06:51:11,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 11: [2022-11-26 06:51:11,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 11: [2022-11-26 06:51:11,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 11: [2022-11-26 06:51:11,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 11: [2022-11-26 06:51:11,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now! 2: [2022-11-26 06:51:11,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 06:51:11,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 06:51:11,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt
2: [2022-11-26 06:51:11,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt
2: [2022-11-26 06:51:11,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
2: [2022-11-26 06:51:11,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
8: [2022-11-26 06:51:11,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt.
8: [2022-11-26 06:51:11,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt
8: [2022-11-26 06:51:11,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
8: [2022-11-26 06:51:11,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt.
8: [2022-11-26 06:51:11,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt
8: [2022-11-26 06:51:11,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
7: [2022-11-26 06:51:11,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt.
7: [2022-11-26 06:51:11,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt
7: [2022-11-26 06:51:11,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
0: [2022-11-26 06:51:11,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt.
0: [2022-11-26 06:51:11,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
0: [2022-11-26 06:51:11,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
1: [2022-11-26 06:51:11,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt
1: [2022-11-26 06:51:11,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
5: [2022-11-26 06:51:11,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt.
5: [2022-11-26 06:51:11,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt
5: [2022-11-26 06:51:11,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
5: [2022-11-26 06:51:11,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt.
5: [2022-11-26 06:51:11,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt
5: [2022-11-26 06:51:11,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
4: [2022-11-26 06:51:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt.
4: [2022-11-26 06:51:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt
4: [2022-11-26 06:51:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
4: [2022-11-26 06:51:11,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt.
4: [2022-11-26 06:51:11,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt
4: [2022-11-26 06:51:11,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
4: [2022-11-26 06:51:11,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt.
4: [2022-11-26 06:51:11,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt
4: [2022-11-26 06:51:11,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
4: [2022-11-26 06:51:11,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt.
4: [2022-11-26 06:51:11,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt
4: [2022-11-26 06:51:11,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
4: [2022-11-26 06:51:11,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt.
4: [2022-11-26 06:51:11,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt.
4: [2022-11-26 06:51:11,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt
4: [2022-11-26 06:51:11,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt
4: [2022-11-26 06:51:11,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
4: [2022-11-26 06:51:11,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
4: [2022-11-26 06:51:11,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt.
4: [2022-11-26 06:51:11,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt
4: [2022-11-26 06:51:11,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
4: [2022-11-26 06:51:11,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt.
4: [2022-11-26 06:51:11,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt
4: [2022-11-26 06:51:11,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
14: [2022-11-26 06:51:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt.
14: [2022-11-26 06:51:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt
14: [2022-11-26 06:51:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
14: [2022-11-26 06:51:11,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt.
14: [2022-11-26 06:51:11,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt
14: [2022-11-26 06:51:11,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
14: [2022-11-26 06:51:11,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt.
14: [2022-11-26 06:51:11,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt
14: [2022-11-26 06:51:11,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
14: [2022-11-26 06:51:11,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt.
14: [2022-11-26 06:51:11,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt
14: [2022-11-26 06:51:11,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
3: [2022-11-26 06:51:11,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt.
3: [2022-11-26 06:51:11,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt
3: [2022-11-26 06:51:11,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
3: [2022-11-26 06:51:11,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt.
3: [2022-11-26 06:51:11,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt
3: [2022-11-26 06:51:11,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
12: [2022-11-26 06:51:11,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt.
12: [2022-11-26 06:51:11,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt
12: [2022-11-26 06:51:11,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
2: [2022-11-26 06:51:11,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt.
2: [2022-11-26 06:51:11,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt
2: [2022-11-26 06:51:11,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
2: [2022-11-26 06:51:11,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt.
2: [2022-11-26 06:51:11,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt
2: [2022-11-26 06:51:11,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
2: [2022-11-26 06:51:11,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt.
2: [2022-11-26 06:51:11,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt
8: [2022-11-26 06:51:11,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt.
2: [2022-11-26 06:51:11,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
8: [2022-11-26 06:51:11,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt
8: [2022-11-26 06:51:11,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
2: [2022-11-26 06:51:11,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt.
2: [2022-11-26 06:51:11,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt
2: [2022-11-26 06:51:11,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
3: [2022-11-26 06:51:11,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt.
3: [2022-11-26 06:51:11,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt.
3: [2022-11-26 06:51:11,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt
3: [2022-11-26 06:51:11,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt
3: [2022-11-26 06:51:11,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
3: [2022-11-26 06:51:11,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
13: [2022-11-26 06:51:11,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt.
13: [2022-11-26 06:51:11,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt.
13: [2022-11-26 06:51:11,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt.
13: [2022-11-26 06:51:11,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt.
13: [2022-11-26 06:51:11,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt.
13: [2022-11-26 06:51:11,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt.
13: [2022-11-26 06:51:11,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt.
13: [2022-11-26 06:51:11,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt.
13: [2022-11-26 06:51:11,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt
13: [2022-11-26 06:51:11,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt
13: [2022-11-26 06:51:11,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt
13: [2022-11-26 06:51:11,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt
13: [2022-11-26 06:51:11,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt
13: [2022-11-26 06:51:11,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt
13: [2022-11-26 06:51:11,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt
13: [2022-11-26 06:51:11,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt
13: [2022-11-26 06:51:11,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
13: [2022-11-26 06:51:11,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
13: [2022-11-26 06:51:11,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
13: [2022-11-26 06:51:11,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
13: [2022-11-26 06:51:11,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
13: [2022-11-26 06:51:11,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
13: [2022-11-26 06:51:11,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
13: [2022-11-26 06:51:11,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
0: [2022-11-26 06:51:11,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step37000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
0: [2022-11-26 06:51:11,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step37000 is ready now!
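Note on reading the per-iteration records that follow the checkpoint messages: each record carries counters that can be cross-checked against one another. The sketch below is a minimal, hypothetical helper (the names `parse_record` and `check_record` are not part of Megatron-DeepSpeed) that parses one record and tests three relations. It assumes the sequence length of 2048 from `--seq-length` in the launch command and a constant global batch size of 256 since the start of the run; the 1% tolerance on throughput is also an assumption, chosen because the logged elapsed time is rounded to two decimals.

```python
import re

# Assumption: taken from --seq-length 2048 in the launch command.
SEQ_LENGTH = 2048

ITERATION_RE = re.compile(r"iteration\s+(\d+)/\s*(\d+)")


def parse_record(record):
    """Parse one 'iteration N/ M | key: value | ...' record into a dict."""
    match = ITERATION_RE.search(record)
    if match is None:
        raise ValueError("not an iteration record")
    fields = {
        "iteration": int(match.group(1)),
        "total_iterations": int(match.group(2)),
    }
    # Every part after the first '|' has the form 'label: number'.
    for part in record.split("|")[1:]:
        key, sep, value = part.partition(":")
        if sep:
            fields[key.strip()] = float(value)
    return fields


def check_record(fields, rel_tol=0.01):
    """Cross-check the counters of one parsed record.

    Assumes a constant global batch size since iteration 0, so that
    consumed samples == iteration * global batch size.
    """
    batch = fields["global batch size"]
    samples = fields["consumed samples"]
    expected_rate = batch / fields["elapsed time per iteration (s)"]
    rate_error = abs(expected_rate - fields["samples per second"]) / expected_rate
    return {
        "samples": samples == fields["iteration"] * batch,
        "tokens": fields["consumed tokens"] == samples * SEQ_LENGTH,
        "rate": rate_error <= rel_tol,
    }


# First record after the checkpoint at iteration 37000, copied from the log.
RECORD = (
    "15: iteration 37010/ 125429 | consumed samples: 9474560 | "
    "consumed tokens: 19403898880 | elapsed time per iteration (s): 1.44 | "
    "learning rate: 1.656E-04 | global batch size: 256 | "
    "lm loss: 2.083788E+00 | grad norm: 0.140 | num zeros: 0.0 | "
    "number of skipped iterations: 0 | number of nan iterations: 0 | "
    "samples per second: 177.863 | TFLOPs: 29.39 |"
)

if __name__ == "__main__":
    print(check_record(parse_record(RECORD)))
```

The same check can be run over every record in the log. A record that fails the `rate` test is not necessarily wrong, since the logged elapsed time is an average over the 10-iteration log interval and is rounded.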
0: successfully saved checkpoint at iteration 37000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3716.42
15: iteration 37010/ 125429 | consumed samples: 9474560 | consumed tokens: 19403898880 | elapsed time per iteration (s): 1.44 | learning rate: 1.656E-04 | global batch size: 256 | lm loss: 2.083788E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.863 | TFLOPs: 29.39 |
15: iteration 37020/ 125429 | consumed samples: 9477120 | consumed tokens: 19409141760 | elapsed time per iteration (s): 2.61 | learning rate: 1.656E-04 | global batch size: 256 | lm loss: 2.043877E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 98.145 | TFLOPs: 16.22 |
15: iteration 37030/ 125429 | consumed samples: 9479680 | consumed tokens: 19414384640 | elapsed time per iteration (s): 1.03 | learning rate: 1.656E-04 | global batch size: 256 | lm loss: 2.082506E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.901 | TFLOPs: 40.97 |
15: iteration 37040/ 125429 | consumed samples: 9482240 | consumed tokens: 19419627520 | elapsed time per iteration (s): 1.03 | learning rate: 1.656E-04 | global batch size: 256 | lm loss: 2.054390E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.880 | TFLOPs: 41.13 |
15: iteration 37050/ 125429 | consumed samples: 9484800 | consumed tokens: 19424870400 | elapsed time per iteration (s): 1.02 | learning rate: 1.655E-04 | global batch size: 256 | lm loss: 2.079205E+00 | grad norm: 0.284 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.157 | TFLOPs: 41.51 |
15: iteration 37060/ 125429 | consumed samples: 9487360 | consumed tokens: 19430113280 | elapsed time per iteration (s): 1.05 | learning rate: 1.655E-04 | global batch size: 256 | lm loss: 2.057975E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.874 | TFLOPs: 40.47 |
15: iteration 37070/ 125429 | consumed samples: 9489920 | consumed tokens: 19435356160 | elapsed time per iteration (s): 1.04 | learning rate: 1.655E-04 | global batch size: 256 | lm loss: 2.060263E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.238 | TFLOPs: 40.86 |
15: iteration 37080/ 125429 | consumed samples: 9492480 | consumed tokens: 19440599040 | elapsed time per iteration (s): 1.04 | learning rate: 1.655E-04 | global batch size: 256 | lm loss: 2.060580E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.090 | TFLOPs: 40.67 |
15: iteration 37090/ 125429 | consumed samples: 9495040 | consumed tokens: 19445841920 | elapsed time per iteration (s): 1.03 | learning rate: 1.655E-04 | global batch size: 256 | lm loss: 2.076611E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.157 | TFLOPs: 41.01 |
15: iteration 37100/ 125429 | consumed samples: 9497600 | consumed tokens: 19451084800 | elapsed time per iteration (s): 1.04 | learning rate: 1.655E-04 | global batch size: 256 | lm loss: 2.056396E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.992 | TFLOPs: 40.49 |
15: iteration 37110/ 125429 | consumed samples: 9500160 | consumed tokens: 19456327680 | elapsed time per iteration (s): 1.04 | learning rate: 1.654E-04 | global batch size: 256 | lm loss: 2.079477E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.069 | TFLOPs: 40.83 |
15: iteration 37120/ 125429 | consumed samples: 9502720 | consumed tokens: 19461570560 | elapsed time per iteration (s): 1.06 | learning rate: 1.654E-04 | global batch size: 256 | lm loss: 2.042146E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.576 | TFLOPs: 39.76 |
15: iteration 37130/ 125429 | consumed samples: 9505280 | consumed tokens: 19466813440 | elapsed time per iteration (s): 1.03 | learning rate: 1.654E-04 | global batch size: 256 | lm loss: 2.096904E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.825 | TFLOPs: 40.95 |
15: iteration 37140/ 125429 | consumed samples: 9507840 | consumed tokens: 19472056320 | elapsed time per iteration (s): 1.05 | learning rate: 1.654E-04 | global batch size: 256 | lm loss: 2.051268E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.799 | TFLOPs: 40.29 |
15: iteration 37150/ 125429 | consumed samples: 9510400 | consumed tokens: 19477299200 | elapsed time per iteration (s): 1.03 | learning rate: 1.654E-04 | global batch size: 256 | lm loss: 2.064815E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.051 | TFLOPs: 40.99 |
15: iteration 37160/ 125429 | consumed samples: 9512960 | consumed tokens: 19482542080 | elapsed time per iteration (s): 1.03 | learning rate: 1.654E-04 | global batch size: 256 | lm loss: 2.075959E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.436 | TFLOPs: 41.06 |
15: iteration 37170/ 125429 | consumed samples: 9515520 | consumed tokens: 19487784960 | elapsed time per iteration (s): 1.02 | learning rate: 1.653E-04 | global batch size: 256 | lm loss: 2.041421E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.390 | TFLOPs: 41.38 |
15: iteration 37180/ 125429 | consumed samples: 9518080 | consumed tokens: 19493027840 | elapsed time per iteration (s): 1.04 | learning rate: 1.653E-04 | global batch size: 256 | lm loss: 2.061933E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.007 | TFLOPs: 40.65 |
15: iteration 37190/ 125429 | consumed samples: 9520640 | consumed tokens: 19498270720 | elapsed time per iteration (s): 1.07 | learning rate: 1.653E-04 | global batch size: 256 | lm loss: 2.055948E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.083 | TFLOPs: 39.68 |
15: iteration 37200/ 125429 | consumed samples: 9523200 | consumed tokens: 19503513600 | elapsed time per iteration (s): 1.03 | learning rate: 1.653E-04 | global batch size: 256 | lm loss: 2.049921E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.347 | TFLOPs: 40.88 |
15: iteration 37210/ 125429 | consumed samples: 9525760 | consumed tokens: 19508756480 | elapsed time per iteration (s): 1.03 | learning rate: 1.653E-04 | global batch size: 256 | lm loss: 2.058377E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.910 | TFLOPs: 40.97 |
15: iteration 37220/ 125429 | consumed samples: 9528320 | consumed tokens: 19513999360 | elapsed time per iteration (s): 1.03 | learning rate: 1.652E-04 | global batch size: 256 | lm loss: 2.068988E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.302 | TFLOPs: 41.03 |
15: iteration 37230/ 125429 | consumed samples: 9530880 | consumed tokens: 19519242240 | elapsed time per iteration (s): 1.04 | learning rate: 1.652E-04 | global batch size: 256 | lm loss: 2.076274E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.007 | TFLOPs: 40.49 |
15: iteration 37240/ 125429 | consumed samples: 9533440 | consumed tokens: 19524485120 | elapsed time per iteration (s): 1.05 | learning rate: 1.652E-04 | global batch size: 256 | lm loss: 2.033357E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.379 | TFLOPs: 40.22 |
15: iteration 37250/ 125429 | consumed samples: 9536000 | consumed tokens: 19529728000 | elapsed time per iteration (s): 1.07 | learning rate: 1.652E-04 | global batch size: 256 | lm loss: 2.071203E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.350 | TFLOPs: 39.39 |
15: iteration 37260/ 125429 | consumed samples: 9538560 | consumed tokens: 19534970880 | elapsed time per iteration (s): 1.03 | learning rate: 1.652E-04 | global batch size: 256 | lm loss: 2.044723E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.493 | TFLOPs: 41.07 |
15: iteration 37270/ 125429 | consumed samples: 9541120 | consumed tokens: 19540213760 | elapsed time per iteration (s): 1.09 | learning rate: 1.652E-04 | global batch size: 256 | lm loss: 2.084440E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.253 | TFLOPs: 38.71 |
15: iteration 37280/ 125429 | consumed samples: 9543680 | consumed tokens: 19545456640 | elapsed time per iteration (s): 1.05 | learning rate: 1.651E-04 | global batch size: 256 | lm loss: 2.067601E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.671 | TFLOPs: 40.10 |
15: iteration 37290/ 125429 | consumed samples: 9546240 | consumed tokens: 19550699520 | elapsed time per iteration (s): 1.09 | learning rate: 1.651E-04 | global batch size: 256 | lm loss: 2.062983E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.561 | TFLOPs: 38.76 |
15: iteration 37300/ 125429 | consumed samples: 9548800 | consumed tokens: 19555942400 | elapsed time per iteration (s): 1.04 | learning rate: 1.651E-04 | global batch size: 256 | lm loss: 2.055378E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.095 | TFLOPs: 40.83 |
15: iteration 37310/ 125429 | consumed samples: 9551360 | consumed tokens: 19561185280 | elapsed time per iteration (s): 1.06 | learning rate: 1.651E-04 | global batch size: 256 | lm loss: 2.061280E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.975 | TFLOPs: 39.82 |
15: iteration 37320/ 125429 | consumed samples: 9553920 | consumed tokens: 19566428160 | elapsed time per iteration (s): 1.07 | learning rate: 1.651E-04 | global batch size: 256 | lm loss: 2.058724E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.953 | TFLOPs: 39.65 |
15: iteration 37330/ 125429 | consumed samples: 9556480 | consumed tokens: 19571671040 | elapsed time per iteration (s): 1.04 | learning rate: 1.650E-04 | global batch size: 256 | lm loss: 2.060916E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.030 | TFLOPs: 40.82 |
15: iteration 37340/ 125429 | consumed samples: 9559040 | consumed tokens: 19576913920 | elapsed time per iteration (s): 1.04 | learning rate: 1.650E-04 | global batch size: 256 | lm loss: 2.046479E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.615 | TFLOPs: 40.59 |
15: iteration 37350/ 125429 | consumed samples: 9561600 | consumed tokens: 19582156800 | elapsed time per iteration (s): 1.03 | learning rate: 1.650E-04 | global batch size: 256 | lm loss: 2.061049E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.647 | TFLOPs: 41.26 |
15: iteration 37360/ 125429 | consumed samples: 9564160 | consumed tokens: 19587399680 | elapsed time per iteration (s): 1.05 | learning rate: 1.650E-04 | global batch size: 256 | lm loss: 2.073575E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.192 | TFLOPs: 40.19 |
15: iteration 37370/ 125429 | consumed samples: 9566720 | consumed tokens: 19592642560 | elapsed time per iteration (s): 1.04 | learning rate: 1.650E-04 | global batch size: 256 | lm loss: 2.049363E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.287 | TFLOPs: 40.70 |
15: iteration 37380/ 125429 | consumed samples: 9569280 | consumed tokens: 19597885440 | elapsed time per iteration (s): 1.09 | learning rate: 1.650E-04 | global batch size: 256 | lm loss: 2.095549E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.980 | TFLOPs: 38.67 |
15: iteration 37390/ 125429 | consumed samples: 9571840 | consumed tokens: 19603128320 | elapsed time per iteration (s): 1.04 | learning rate: 1.649E-04 | global batch size: 256 | lm loss: 2.048480E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.280 | TFLOPs: 40.70 |
15: iteration 37400/ 125429 | consumed samples: 9574400 | consumed tokens: 19608371200 | elapsed time per iteration (s): 1.06 | learning rate: 1.649E-04 | global batch size: 256 | lm loss: 2.048965E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.687 | TFLOPs: 39.78 |
15: iteration 37410/ 125429 | consumed samples: 9576960 | consumed tokens: 19613614080 | elapsed time per iteration (s): 1.06 | learning rate: 1.649E-04 | global batch size: 256 | lm loss: 2.026882E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.456 | TFLOPs: 40.07 |
15: iteration 37420/ 125429 | consumed samples: 9579520 | consumed tokens: 19618856960 | elapsed time per iteration (s): 1.03 | learning rate: 1.649E-04 | global batch size: 256 | lm loss: 2.041397E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.870 | TFLOPs: 40.96 |
15: iteration 37430/ 125429 | consumed samples: 9582080 | consumed tokens: 19624099840 | elapsed time per iteration (s): 1.07 | learning rate: 1.649E-04 | global batch size: 256 | lm loss: 2.052831E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.963 | TFLOPs: 39.66 |
15: iteration 37440/ 125429 | consumed samples: 9584640 | consumed tokens: 19629342720 | elapsed time per iteration (s): 1.02 | learning rate: 1.648E-04 | global batch size: 256 | lm loss: 2.042042E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.602 | TFLOPs: 41.58 |
15: iteration 37450/ 125429 | consumed samples: 9587200 | consumed tokens: 19634585600 | elapsed time per iteration (s): 1.04 | learning rate: 1.648E-04 | global batch size: 256 | lm loss: 2.068900E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.013 | TFLOPs: 40.49 |
15: iteration 37460/ 125429 | consumed samples: 9589760 | consumed tokens: 19639828480 | elapsed time per iteration (s): 1.03 | learning rate: 1.648E-04 | global batch size: 256 | lm loss: 2.064725E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.455 | TFLOPs: 40.89 |
15: iteration 37470/ 125429 | consumed samples: 9592320 | consumed tokens: 19645071360 | elapsed time per iteration (s): 1.06 | learning rate: 1.648E-04 | global batch size: 256 | lm loss: 2.080974E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.394 | TFLOPs: 40.06 |
15: iteration 37480/ 125429 | consumed samples: 9594880 | consumed tokens: 19650314240 | elapsed time per iteration (s): 1.04 | learning rate: 1.648E-04 | global batch size: 256 | lm loss: 2.057222E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.187 | TFLOPs: 40.85 |
15: iteration 37490/ 125429 | consumed samples: 9597440 | consumed tokens: 19655557120 | elapsed time per iteration (s): 1.08 | learning rate: 1.648E-04 | global batch size: 256 | lm loss: 2.063951E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.366 | TFLOPs: 39.23 |
15: iteration 37500/ 125429 | consumed samples: 9600000 | consumed tokens: 19660800000 | elapsed time per iteration (s): 1.08 | learning rate: 1.647E-04 | global batch size: 256 | lm loss: 2.076431E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.694 | TFLOPs: 39.28 |
15: iteration 37510/ 125429 | consumed samples: 9602560 | consumed tokens: 19666042880 | elapsed time per iteration (s): 1.09 | learning rate: 1.647E-04 | global batch size: 256 | lm loss: 2.072345E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.524 | TFLOPs: 38.76 |
15: 
iteration 37520/ 125429 | consumed samples: 9605120 | consumed tokens: 19671285760 | elapsed time per iteration (s): 1.07 | learning rate: 1.647E-04 | global batch size: 256 | lm loss: 2.038909E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.261 | TFLOPs: 39.70 | 15: iteration 37530/ 125429 | consumed samples: 9607680 | consumed tokens: 19676528640 | elapsed time per iteration (s): 1.05 | learning rate: 1.647E-04 | global batch size: 256 | lm loss: 2.061480E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.728 | TFLOPs: 40.11 | 15: iteration 37540/ 125429 | consumed samples: 9610240 | consumed tokens: 19681771520 | elapsed time per iteration (s): 1.05 | learning rate: 1.647E-04 | global batch size: 256 | lm loss: 2.007618E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.139 | TFLOPs: 40.18 | 15: iteration 37550/ 125429 | consumed samples: 9612800 | consumed tokens: 19687014400 | elapsed time per iteration (s): 1.02 | learning rate: 1.646E-04 | global batch size: 256 | lm loss: 2.053540E+00 | grad norm: 0.715 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.898 | TFLOPs: 41.30 | 15: iteration 37560/ 125429 | consumed samples: 9615360 | consumed tokens: 19692257280 | elapsed time per iteration (s): 1.04 | learning rate: 1.646E-04 | global batch size: 256 | lm loss: 2.443045E+00 | grad norm: 5.015 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.765 | TFLOPs: 40.61 | 15: iteration 37570/ 125429 | consumed samples: 9617920 | consumed tokens: 19697500160 | elapsed time per iteration (s): 1.04 | learning rate: 1.646E-04 | global batch size: 256 | lm loss: 2.218541E+00 | grad norm: 0.314 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.736 | TFLOPs: 40.61 | 15: iteration 37580/ 125429 | consumed samples: 9620480 | consumed tokens: 19702743040 | elapsed time per iteration (s): 1.12 | learning rate: 1.646E-04 | global batch size: 256 | lm loss: 2.204940E+00 | grad norm: 0.300 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.792 | TFLOPs: 37.81 | 15: iteration 37590/ 125429 | consumed samples: 9623040 | consumed tokens: 19707985920 | elapsed time per iteration (s): 1.03 | learning rate: 1.646E-04 | global batch size: 256 | lm loss: 2.112421E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.240 | TFLOPs: 41.19 | 15: iteration 37600/ 125429 | consumed samples: 9625600 | consumed tokens: 19713228800 | elapsed time per iteration (s): 1.03 | learning rate: 1.646E-04 | global batch size: 256 | lm loss: 2.120068E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.243 | TFLOPs: 41.19 | 15: iteration 37610/ 125429 | consumed samples: 9628160 | consumed tokens: 19718471680 | elapsed time per iteration (s): 1.04 | learning rate: 1.645E-04 | global batch size: 256 | lm loss: 2.084492E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.081 | TFLOPs: 40.83 | 15: iteration 37620/ 125429 | consumed samples: 9630720 | consumed tokens: 19723714560 | elapsed time per iteration (s): 1.02 | learning rate: 1.645E-04 | global batch size: 256 | lm loss: 2.101250E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.575 | TFLOPs: 41.57 | 15: iteration 37630/ 125429 | consumed samples: 9633280 | consumed tokens: 19728957440 | elapsed time per iteration (s): 1.05 | learning rate: 
1.645E-04 | global batch size: 256 | lm loss: 2.057262E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.043 | TFLOPs: 40.16 | 15: iteration 37640/ 125429 | consumed samples: 9635840 | consumed tokens: 19734200320 | elapsed time per iteration (s): 1.25 | learning rate: 1.645E-04 | global batch size: 256 | lm loss: 2.067796E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 204.540 | TFLOPs: 33.80 | 15: iteration 37650/ 125429 | consumed samples: 9638400 | consumed tokens: 19739443200 | elapsed time per iteration (s): 1.02 | learning rate: 1.645E-04 | global batch size: 256 | lm loss: 2.069146E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.319 | TFLOPs: 41.37 | 15: iteration 37660/ 125429 | consumed samples: 9640960 | consumed tokens: 19744686080 | elapsed time per iteration (s): 1.05 | learning rate: 1.644E-04 | global batch size: 256 | lm loss: 2.051443E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.972 | TFLOPs: 40.48 | 15: iteration 37670/ 125429 | consumed samples: 9643520 | consumed tokens: 19749928960 | elapsed time per iteration (s): 1.06 | learning rate: 1.644E-04 | global batch size: 256 | lm loss: 2.046765E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.375 | TFLOPs: 39.89 | 15: iteration 37680/ 125429 | consumed samples: 9646080 | consumed tokens: 19755171840 | elapsed time per iteration (s): 1.07 | learning rate: 1.644E-04 | global batch size: 256 | lm loss: 2.065336E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.006 | TFLOPs: 39.50 | 15: iteration 37690/ 125429 | consumed 
samples: 9648640 | consumed tokens: 19760414720 | elapsed time per iteration (s): 1.04 | learning rate: 1.644E-04 | global batch size: 256 | lm loss: 2.037558E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.306 | TFLOPs: 40.70 | 15: iteration 37700/ 125429 | consumed samples: 9651200 | consumed tokens: 19765657600 | elapsed time per iteration (s): 1.03 | learning rate: 1.644E-04 | global batch size: 256 | lm loss: 2.101571E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.512 | TFLOPs: 41.07 | 15: iteration 37710/ 125429 | consumed samples: 9653760 | consumed tokens: 19770900480 | elapsed time per iteration (s): 1.13 | learning rate: 1.644E-04 | global batch size: 256 | lm loss: 2.098985E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.159 | TFLOPs: 37.54 | 15: iteration 37720/ 125429 | consumed samples: 9656320 | consumed tokens: 19776143360 | elapsed time per iteration (s): 1.03 | learning rate: 1.643E-04 | global batch size: 256 | lm loss: 2.092274E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.056 | TFLOPs: 40.99 | 15: iteration 37730/ 125429 | consumed samples: 9658880 | consumed tokens: 19781386240 | elapsed time per iteration (s): 1.04 | learning rate: 1.643E-04 | global batch size: 256 | lm loss: 2.067460E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.068 | TFLOPs: 40.66 | 15: iteration 37740/ 125429 | consumed samples: 9661440 | consumed tokens: 19786629120 | elapsed time per iteration (s): 1.04 | learning rate: 1.643E-04 | global batch size: 256 | lm loss: 2.071833E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 246.026 | TFLOPs: 40.66 | 15: iteration 37750/ 125429 | consumed samples: 9664000 | consumed tokens: 19791872000 | elapsed time per iteration (s): 1.05 | learning rate: 1.643E-04 | global batch size: 256 | lm loss: 2.063668E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.171 | TFLOPs: 40.35 | 15: iteration 37760/ 125429 | consumed samples: 9666560 | consumed tokens: 19797114880 | elapsed time per iteration (s): 1.12 | learning rate: 1.643E-04 | global batch size: 256 | lm loss: 2.056365E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.836 | TFLOPs: 37.82 | 15: iteration 37770/ 125429 | consumed samples: 9669120 | consumed tokens: 19802357760 | elapsed time per iteration (s): 1.04 | learning rate: 1.642E-04 | global batch size: 256 | lm loss: 2.054230E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.304 | TFLOPs: 40.54 | 15: iteration 37780/ 125429 | consumed samples: 9671680 | consumed tokens: 19807600640 | elapsed time per iteration (s): 1.05 | learning rate: 1.642E-04 | global batch size: 256 | lm loss: 2.065473E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.115 | TFLOPs: 40.34 | 15: iteration 37790/ 125429 | consumed samples: 9674240 | consumed tokens: 19812843520 | elapsed time per iteration (s): 1.03 | learning rate: 1.642E-04 | global batch size: 256 | lm loss: 2.053092E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.817 | TFLOPs: 41.12 | 15: iteration 37800/ 125429 | consumed samples: 9676800 | consumed tokens: 19818086400 | elapsed time per iteration (s): 1.03 | learning rate: 1.642E-04 | global batch size: 256 | lm 
loss: 2.046585E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.993 | TFLOPs: 41.15 | 15: iteration 37810/ 125429 | consumed samples: 9679360 | consumed tokens: 19823329280 | elapsed time per iteration (s): 1.09 | learning rate: 1.642E-04 | global batch size: 256 | lm loss: 2.024304E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.827 | TFLOPs: 38.81 | 15: iteration 37820/ 125429 | consumed samples: 9681920 | consumed tokens: 19828572160 | elapsed time per iteration (s): 1.16 | learning rate: 1.642E-04 | global batch size: 256 | lm loss: 2.050291E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 220.760 | TFLOPs: 36.48 | 15: iteration 37830/ 125429 | consumed samples: 9684480 | consumed tokens: 19833815040 | elapsed time per iteration (s): 1.12 | learning rate: 1.641E-04 | global batch size: 256 | lm loss: 2.085936E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.242 | TFLOPs: 37.88 | 15: iteration 37840/ 125429 | consumed samples: 9687040 | consumed tokens: 19839057920 | elapsed time per iteration (s): 1.06 | learning rate: 1.641E-04 | global batch size: 256 | lm loss: 2.037255E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.766 | TFLOPs: 39.95 | 15: iteration 37850/ 125429 | consumed samples: 9689600 | consumed tokens: 19844300800 | elapsed time per iteration (s): 1.03 | learning rate: 1.641E-04 | global batch size: 256 | lm loss: 2.053545E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.123 | TFLOPs: 41.00 | 15: iteration 37860/ 125429 | consumed samples: 9692160 | consumed tokens: 
19849543680 | elapsed time per iteration (s): 1.18 | learning rate: 1.641E-04 | global batch size: 256 | lm loss: 2.076763E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.252 | TFLOPs: 35.74 | 15: iteration 37870/ 125429 | consumed samples: 9694720 | consumed tokens: 19854786560 | elapsed time per iteration (s): 1.24 | learning rate: 1.641E-04 | global batch size: 256 | lm loss: 2.071414E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 207.158 | TFLOPs: 34.23 | 15: iteration 37880/ 125429 | consumed samples: 9697280 | consumed tokens: 19860029440 | elapsed time per iteration (s): 1.23 | learning rate: 1.640E-04 | global batch size: 256 | lm loss: 2.070264E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 207.439 | TFLOPs: 34.28 | 15: iteration 37890/ 125429 | consumed samples: 9699840 | consumed tokens: 19865272320 | elapsed time per iteration (s): 1.06 | learning rate: 1.640E-04 | global batch size: 256 | lm loss: 2.076538E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.159 | TFLOPs: 40.02 | 15: iteration 37900/ 125429 | consumed samples: 9702400 | consumed tokens: 19870515200 | elapsed time per iteration (s): 1.14 | learning rate: 1.640E-04 | global batch size: 256 | lm loss: 2.043539E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.501 | TFLOPs: 37.27 | 15: iteration 37910/ 125429 | consumed samples: 9704960 | consumed tokens: 19875758080 | elapsed time per iteration (s): 1.12 | learning rate: 1.640E-04 | global batch size: 256 | lm loss: 2.049174E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
228.266 | TFLOPs: 37.72 | 15: iteration 37920/ 125429 | consumed samples: 9707520 | consumed tokens: 19881000960 | elapsed time per iteration (s): 1.10 | learning rate: 1.640E-04 | global batch size: 256 | lm loss: 2.090087E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.698 | TFLOPs: 38.29 | 15: iteration 37930/ 125429 | consumed samples: 9710080 | consumed tokens: 19886243840 | elapsed time per iteration (s): 1.07 | learning rate: 1.640E-04 | global batch size: 256 | lm loss: 2.049147E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.776 | TFLOPs: 39.62 | 15: iteration 37940/ 125429 | consumed samples: 9712640 | consumed tokens: 19891486720 | elapsed time per iteration (s): 1.06 | learning rate: 1.639E-04 | global batch size: 256 | lm loss: 2.062784E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.306 | TFLOPs: 40.04 | 15: iteration 37950/ 125429 | consumed samples: 9715200 | consumed tokens: 19896729600 | elapsed time per iteration (s): 1.05 | learning rate: 1.639E-04 | global batch size: 256 | lm loss: 2.074243E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.859 | TFLOPs: 40.46 | 15: iteration 37960/ 125429 | consumed samples: 9717760 | consumed tokens: 19901972480 | elapsed time per iteration (s): 1.04 | learning rate: 1.639E-04 | global batch size: 256 | lm loss: 2.074081E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.441 | TFLOPs: 40.56 | 15: iteration 37970/ 125429 | consumed samples: 9720320 | consumed tokens: 19907215360 | elapsed time per iteration (s): 1.04 | learning rate: 1.639E-04 | global batch size: 256 | lm loss: 2.061039E+00 | grad norm: 0.132 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.072 | TFLOPs: 40.67 | 15: iteration 37980/ 125429 | consumed samples: 9722880 | consumed tokens: 19912458240 | elapsed time per iteration (s): 1.12 | learning rate: 1.639E-04 | global batch size: 256 | lm loss: 2.048725E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.836 | TFLOPs: 37.65 | 15: iteration 37990/ 125429 | consumed samples: 9725440 | consumed tokens: 19917701120 | elapsed time per iteration (s): 1.05 | learning rate: 1.638E-04 | global batch size: 256 | lm loss: 2.065233E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.199 | TFLOPs: 40.36 | 0: [2022-11-26 07:09:08,168] [INFO] [logging.py:68:log_dist] [Rank 0] step=38000, skipped=0, lr=[0.00016382973630028766, 0.00016382973630028766, 0.00016382973630028766], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 38000/ 125429 | consumed samples: 9728000 | consumed tokens: 19922944000 | elapsed time per iteration (s): 1.05 | learning rate: 1.638E-04 | global batch size: 256 | lm loss: 2.061610E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.350 | TFLOPs: 40.22 | 0: steps: 38000 loss: 2.1388 iter time (s): 1.060 samples/sec: 241.519 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 38000 | lm loss value: 2.072891E+00 | lm loss PPL: 7.947763E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 38000 to checkpoints_1b5 0: [2022-11-26 07:09:08,549] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step38000 is begin to save! 
0: [2022-11-26 07:09:08,557] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_01-model_00-model_states.pt... 0: [2022-11-26 07:09:08,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_01-model_00-model_states.pt. 0: [2022-11-26 07:09:08,844] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_03-model_00-model_states.pt... 0: [2022-11-26 07:09:08,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_03-model_00-model_states.pt. 0: [2022-11-26 07:09:08,957] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_04-model_00-model_states.pt... 0: [2022-11-26 07:09:09,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_04-model_00-model_states.pt. 0: [2022-11-26 07:09:09,076] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_05-model_00-model_states.pt... 0: [2022-11-26 07:09:09,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_05-model_00-model_states.pt. 0: [2022-11-26 07:09:09,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_06-model_00-model_states.pt... 0: [2022-11-26 07:09:09,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_06-model_00-model_states.pt. 0: [2022-11-26 07:09:09,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_07-model_00-model_states.pt... 0: [2022-11-26 07:09:09,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 07:09:09,429] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_08-model_00-model_states.pt... 0: [2022-11-26 07:09:09,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_08-model_00-model_states.pt. 0: [2022-11-26 07:09:09,545] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_09-model_00-model_states.pt... 0: [2022-11-26 07:09:09,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_09-model_00-model_states.pt. 0: [2022-11-26 07:09:09,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_10-model_00-model_states.pt... 0: [2022-11-26 07:09:09,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_10-model_00-model_states.pt. 0: [2022-11-26 07:09:09,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_11-model_00-model_states.pt... 0: [2022-11-26 07:09:09,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_11-model_00-model_states.pt. 0: [2022-11-26 07:09:09,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_12-model_00-model_states.pt... 0: [2022-11-26 07:09:09,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_12-model_00-model_states.pt. 0: [2022-11-26 07:09:09,998] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_13-model_00-model_states.pt... 0: [2022-11-26 07:09:10,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 07:09:10,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_14-model_00-model_states.pt... 0: [2022-11-26 07:09:10,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_14-model_00-model_states.pt. 0: [2022-11-26 07:09:10,227] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_15-model_00-model_states.pt... 0: [2022-11-26 07:09:10,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_15-model_00-model_states.pt. 0: [2022-11-26 07:09:10,346] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_16-model_00-model_states.pt... 0: [2022-11-26 07:09:10,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_16-model_00-model_states.pt. 0: [2022-11-26 07:09:10,458] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_17-model_00-model_states.pt... 0: [2022-11-26 07:09:10,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_17-model_00-model_states.pt. 0: [2022-11-26 07:09:10,572] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_18-model_00-model_states.pt... 0: [2022-11-26 07:09:10,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_18-model_00-model_states.pt. 0: [2022-11-26 07:09:10,686] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_19-model_00-model_states.pt... 0: [2022-11-26 07:09:10,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 07:09:10,801] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_20-model_00-model_states.pt... 0: [2022-11-26 07:09:10,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_20-model_00-model_states.pt. 0: [2022-11-26 07:09:10,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_21-model_00-model_states.pt... 0: [2022-11-26 07:09:11,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_21-model_00-model_states.pt. 0: [2022-11-26 07:09:11,030] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_22-model_00-model_states.pt... 0: [2022-11-26 07:09:11,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_22-model_00-model_states.pt. 0: [2022-11-26 07:09:11,145] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_23-model_00-model_states.pt... 0: [2022-11-26 07:09:11,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_23-model_00-model_states.pt. 0: [2022-11-26 07:09:11,259] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_24-model_00-model_states.pt... 0: [2022-11-26 07:09:11,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_24-model_00-model_states.pt. 0: [2022-11-26 07:09:11,376] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_25-model_00-model_states.pt... 0: [2022-11-26 07:09:11,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 07:09:11,489] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_26-model_00-model_states.pt... 0: [2022-11-26 07:09:11,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_26-model_00-model_states.pt. 0: [2022-11-26 07:09:11,603] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_27-model_00-model_states.pt... 0: [2022-11-26 07:09:11,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_27-model_00-model_states.pt. 0: [2022-11-26 07:09:11,727] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_28-model_00-model_states.pt... 0: [2022-11-26 07:09:11,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_28-model_00-model_states.pt. 0: [2022-11-26 07:09:11,836] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_29-model_00-model_states.pt... 0: [2022-11-26 07:09:11,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_29-model_00-model_states.pt. 0: [2022-11-26 07:09:11,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_30-model_00-model_states.pt... 0: [2022-11-26 07:09:12,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_30-model_00-model_states.pt. 0: [2022-11-26 07:09:12,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/layer_32-model_00-model_states.pt... 0: [2022-11-26 07:09:12,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 07:09:12,067] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step38000/mp_rank_00_model_states.pt 0: [2022-11-26 07:09:12,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/mp_rank_00_model_states.pt... 0: [2022-11-26 07:09:12,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/mp_rank_00_model_states.pt. 0: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
10: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
8: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
10: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
14: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
13: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
11: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
2: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:09:12,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step38000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:09:12,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 07:09:12,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 07:09:12,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 7: [2022-11-26 07:09:12,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:09:12,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 07:09:12,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 3: [2022-11-26 07:09:12,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:09:12,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 07:09:12,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 14: [2022-11-26 07:09:12,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:09:12,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 07:09:12,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 2: [2022-11-26 07:09:12,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 07:09:12,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 07:09:12,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 0: [2022-11-26 07:09:12,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:09:12,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:09:12,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 07:09:12,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 2: [2022-11-26 07:09:12,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:09:12,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 07:09:12,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 2: [2022-11-26 07:09:12,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:09:12,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 07:09:12,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
0: [2022-11-26 07:09:12,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:09:12,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 07:09:12,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 10: [2022-11-26 07:09:12,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:09:12,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 07:09:12,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 10: [2022-11-26 07:09:12,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:09:12,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 07:09:12,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 14: [2022-11-26 07:09:12,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:09:12,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 07:09:12,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 07:09:12,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 07:09:12,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 14: [2022-11-26 07:09:12,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 0: [2022-11-26 07:09:12,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:09:12,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 07:09:12,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 10: [2022-11-26 07:09:12,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:09:12,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 07:09:12,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 5: [2022-11-26 07:09:12,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 07:09:12,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 07:09:12,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:09:12,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 07:09:12,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 5: [2022-11-26 07:09:12,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 5: [2022-11-26 07:09:12,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:09:12,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 07:09:12,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 3: [2022-11-26 07:09:12,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:09:12,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:09:12,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:09:12,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
3: [2022-11-26 07:09:12,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 07:09:12,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 6: [2022-11-26 07:09:12,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 07:09:12,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 3: [2022-11-26 07:09:12,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 3: [2022-11-26 07:09:12,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 6: [2022-11-26 07:09:12,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 6: [2022-11-26 07:09:12,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 11: [2022-11-26 07:09:12,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:09:12,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:09:12,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 07:09:12,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 6: [2022-11-26 07:09:12,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 07:09:12,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 12: [2022-11-26 07:09:12,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 12: [2022-11-26 07:09:12,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:09:12,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 07:09:12,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 3: [2022-11-26 07:09:12,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:09:12,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 07:09:12,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 3: [2022-11-26 07:09:12,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 07:09:12,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 07:09:12,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 0: [2022-11-26 07:09:12,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:09:12,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 07:09:12,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 10: [2022-11-26 07:09:12,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:09:12,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 07:09:12,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 0: [2022-11-26 07:09:12,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:09:12,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 07:09:12,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 7: [2022-11-26 07:09:12,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 07:09:12,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:09:12,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 07:09:12,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 07:09:12,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 7: [2022-11-26 07:09:12,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 10: [2022-11-26 07:09:12,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:09:12,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:09:12,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 07:09:12,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 5: [2022-11-26 07:09:12,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 6: [2022-11-26 07:09:12,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:09:12,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
6: [2022-11-26 07:09:12,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 07:09:12,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 5: [2022-11-26 07:09:12,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:09:12,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 8: [2022-11-26 07:09:12,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:09:12,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 13: [2022-11-26 07:09:12,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:09:12,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 07:09:12,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 13: [2022-11-26 07:09:12,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:09:12,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 07:09:12,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
13: [2022-11-26 07:09:12,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:09:12,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 07:09:12,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 11: [2022-11-26 07:09:12,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 8: [2022-11-26 07:09:12,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 11: [2022-11-26 07:09:12,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 8: [2022-11-26 07:09:12,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 8: [2022-11-26 07:09:12,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:09:12,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 07:09:12,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 11: [2022-11-26 07:09:12,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 07:09:12,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 07:09:12,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 14: [2022-11-26 07:09:12,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:09:12,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 07:09:12,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 0: [2022-11-26 07:09:12,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:09:12,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 07:09:12,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 5: [2022-11-26 07:09:12,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:09:12,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 07:09:12,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 07:09:12,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 07:09:12,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 5: [2022-11-26 07:09:12,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 11: [2022-11-26 07:09:12,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:09:12,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:09:12,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 07:09:12,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:09:12,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 14: [2022-11-26 07:09:12,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:09:12,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 07:09:12,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 07:09:12,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
14: [2022-11-26 07:09:12,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 7: [2022-11-26 07:09:12,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:09:12,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 07:09:12,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 8: [2022-11-26 07:09:12,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:09:12,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:09:12,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:09:12,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:09:12,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 07:09:12,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 13: [2022-11-26 07:09:12,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 07:09:12,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 07:09:12,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 13: [2022-11-26 07:09:12,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 8: [2022-11-26 07:09:12,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:09:12,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 8: [2022-11-26 07:09:12,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:09:12,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:09:12,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 07:09:12,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 13: [2022-11-26 07:09:12,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 8: [2022-11-26 07:09:12,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 13: [2022-11-26 07:09:12,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
8: [2022-11-26 07:09:12,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 13: [2022-11-26 07:09:12,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:09:12,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 13: [2022-11-26 07:09:12,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 8: [2022-11-26 07:09:12,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 8: [2022-11-26 07:09:12,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 13: [2022-11-26 07:09:12,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 8: [2022-11-26 07:09:12,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 11: [2022-11-26 07:09:12,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 07:09:12,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 11: [2022-11-26 07:09:12,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:09:12,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 07:09:12,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
8: [2022-11-26 07:09:12,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:09:12,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 07:09:12,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:09:12,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 8: [2022-11-26 07:09:12,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 07:09:12,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 14: [2022-11-26 07:09:12,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:09:12,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 07:09:12,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 10: [2022-11-26 07:09:12,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:09:12,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:09:12,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
3: [2022-11-26 07:09:12,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:09:12,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 10: [2022-11-26 07:09:12,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 07:09:12,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 07:09:12,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 07:09:12,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 10: [2022-11-26 07:09:12,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 10: [2022-11-26 07:09:12,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 3: [2022-11-26 07:09:12,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 4: [2022-11-26 07:09:12,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:09:12,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 07:09:12,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
4: [2022-11-26 07:09:12,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:09:12,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 07:09:12,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 4: [2022-11-26 07:09:12,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:09:12,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 07:09:12,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 4: [2022-11-26 07:09:12,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:09:12,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 07:09:12,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 4: [2022-11-26 07:09:12,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:09:12,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 07:09:12,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
7: [2022-11-26 07:09:12,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:09:12,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 07:09:12,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 7: [2022-11-26 07:09:12,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:09:12,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 07:09:12,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:09:12,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 7: [2022-11-26 07:09:12,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 07:09:12,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 5: [2022-11-26 07:09:12,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:09:12,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 07:09:12,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
12: [2022-11-26 07:09:12,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:09:12,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:09:12,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 07:09:12,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 07:09:12,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 12: [2022-11-26 07:09:12,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 12: [2022-11-26 07:09:12,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:09:12,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 07:09:12,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 12: [2022-11-26 07:09:12,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:09:12,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 07:09:12,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
12: [2022-11-26 07:09:12,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:09:12,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 07:09:12,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 12: [2022-11-26 07:09:12,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:09:12,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 07:09:12,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 3: [2022-11-26 07:09:12,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:09:12,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 07:09:12,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 3: [2022-11-26 07:09:12,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:09:12,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 07:09:12,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
11: [2022-11-26 07:09:12,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:09:12,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:09:12,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 07:09:12,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 6: [2022-11-26 07:09:12,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:09:12,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 07:09:12,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 6: [2022-11-26 07:09:12,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:09:12,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 07:09:12,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 9: [2022-11-26 07:09:12,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:09:12,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 07:09:12,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:09:12,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:09:12,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:09:12,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:09:12,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:09:12,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:09:12,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 07:09:12,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 07:09:12,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 07:09:12,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 9: [2022-11-26 07:09:12,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 9: [2022-11-26 07:09:12,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
9: [2022-11-26 07:09:12,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 07:09:12,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 07:09:12,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 07:09:12,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 07:09:12,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 07:09:12,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 9: [2022-11-26 07:09:12,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 9: [2022-11-26 07:09:12,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 9: [2022-11-26 07:09:12,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 9: [2022-11-26 07:09:12,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 4: [2022-11-26 07:09:12,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 07:09:12,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 07:09:12,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 4: [2022-11-26 07:09:12,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:09:12,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 07:09:12,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 4: [2022-11-26 07:09:12,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:09:12,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 07:09:12,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 11: [2022-11-26 07:09:12,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 07:09:12,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 11: [2022-11-26 07:09:12,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 07:09:12,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 07:09:12,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 11: [2022-11-26 07:09:12,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:09:12,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 07:09:12,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 2: [2022-11-26 07:09:12,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:09:12,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:09:12,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 07:09:12,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 07:09:12,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 2: [2022-11-26 07:09:12,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 11: [2022-11-26 07:09:12,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 07:09:12,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 07:09:12,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 1: [2022-11-26 07:09:12,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:09:12,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:09:12,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:09:12,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:09:12,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 07:09:12,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 07:09:12,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 0: [2022-11-26 07:09:12,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 6: [2022-11-26 07:09:12,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 07:09:12,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 07:09:12,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 2: [2022-11-26 07:09:12,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:09:12,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:09:12,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 07:09:12,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 07:09:12,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 2: [2022-11-26 07:09:12,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 15: [2022-11-26 07:09:12,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:09:12,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:09:12,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:09:12,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 07:09:12,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:09:12,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:09:12,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:09:12,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:09:12,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 07:09:12,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 07:09:12,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 07:09:12,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 07:09:12,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 07:09:12,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 07:09:12,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 
07:09:12,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 07:09:12,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 15: [2022-11-26 07:09:12,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 15: [2022-11-26 07:09:12,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 15: [2022-11-26 07:09:12,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 15: [2022-11-26 07:09:12,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 15: [2022-11-26 07:09:12,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 15: [2022-11-26 07:09:12,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 15: [2022-11-26 07:09:12,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 0: [2022-11-26 07:09:12,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 07:09:12,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 1: [2022-11-26 07:09:12,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 07:09:12,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 07:09:12,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 07:09:12,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 1: [2022-11-26 07:09:12,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 07:09:12,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 1: [2022-11-26 07:09:12,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:09:12,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 1: [2022-11-26 07:09:12,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 07:09:12,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 1: [2022-11-26 07:09:12,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:09:12,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 07:09:12,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
1: [2022-11-26 07:09:12,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:09:12,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 07:09:12,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 1: [2022-11-26 07:09:12,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:09:12,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 07:09:12,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 1: [2022-11-26 07:09:12,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:09:12,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step38000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 07:09:12,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step38000 is ready now! 
0: successfully saved checkpoint at iteration 38000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3858.31 15: iteration 38010/ 125429 | consumed samples: 9730560 | consumed tokens: 19928186880 | elapsed time per iteration (s): 1.47 | learning rate: 1.638E-04 | global batch size: 256 | lm loss: 2.057635E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 173.814 | TFLOPs: 28.72 | 15: iteration 38020/ 125429 | consumed samples: 9733120 | consumed tokens: 19933429760 | elapsed time per iteration (s): 1.07 | learning rate: 1.638E-04 | global batch size: 256 | lm loss: 2.018600E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.796 | TFLOPs: 39.63 | 15: iteration 38030/ 125429 | consumed samples: 9735680 | consumed tokens: 19938672640 | elapsed time per iteration (s): 1.06 | learning rate: 1.638E-04 | global batch size: 256 | lm loss: 2.058092E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.312 | TFLOPs: 40.04 | 15: iteration 38040/ 125429 | consumed samples: 9738240 | consumed tokens: 19943915520 | elapsed time per iteration (s): 1.08 | learning rate: 1.638E-04 | global batch size: 256 | lm loss: 2.028992E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.344 | TFLOPs: 39.06 | 15: iteration 38050/ 125429 | consumed samples: 9740800 | consumed tokens: 19949158400 | elapsed time per iteration (s): 1.05 | learning rate: 1.637E-04 | global batch size: 256 | lm loss: 2.066323E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.741 | TFLOPs: 40.45 | 15: iteration 38060/ 125429 | consumed samples: 9743360 | consumed tokens: 19954401280 | elapsed time per iteration (s): 1.03 | learning 
rate: 1.637E-04 | global batch size: 256 | lm loss: 2.056785E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.483 | TFLOPs: 41.06 | 15: iteration 38070/ 125429 | consumed samples: 9745920 | consumed tokens: 19959644160 | elapsed time per iteration (s): 1.05 | learning rate: 1.637E-04 | global batch size: 256 | lm loss: 2.045534E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.887 | TFLOPs: 40.30 | 15: iteration 38080/ 125429 | consumed samples: 9748480 | consumed tokens: 19964887040 | elapsed time per iteration (s): 1.03 | learning rate: 1.637E-04 | global batch size: 256 | lm loss: 2.044464E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.091 | TFLOPs: 41.00 | 15: iteration 38090/ 125429 | consumed samples: 9751040 | consumed tokens: 19970129920 | elapsed time per iteration (s): 1.07 | learning rate: 1.637E-04 | global batch size: 256 | lm loss: 2.042553E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.565 | TFLOPs: 39.42 | 15: iteration 38100/ 125429 | consumed samples: 9753600 | consumed tokens: 19975372800 | elapsed time per iteration (s): 1.03 | learning rate: 1.636E-04 | global batch size: 256 | lm loss: 2.049639E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.110 | TFLOPs: 41.17 | 15: iteration 38110/ 125429 | consumed samples: 9756160 | consumed tokens: 19980615680 | elapsed time per iteration (s): 1.08 | learning rate: 1.636E-04 | global batch size: 256 | lm loss: 2.063753E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.806 | TFLOPs: 39.13 | 15: iteration 38120/ 125429 | 
consumed samples: 9758720 | consumed tokens: 19985858560 | elapsed time per iteration (s): 1.04 | learning rate: 1.636E-04 | global batch size: 256 | lm loss: 2.067680E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.514 | TFLOPs: 40.74 | 15: iteration 38130/ 125429 | consumed samples: 9761280 | consumed tokens: 19991101440 | elapsed time per iteration (s): 1.07 | learning rate: 1.636E-04 | global batch size: 256 | lm loss: 2.073576E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.600 | TFLOPs: 39.60 | 15: iteration 38140/ 125429 | consumed samples: 9763840 | consumed tokens: 19996344320 | elapsed time per iteration (s): 1.05 | learning rate: 1.636E-04 | global batch size: 256 | lm loss: 2.078229E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.521 | TFLOPs: 40.24 | 15: iteration 38150/ 125429 | consumed samples: 9766400 | consumed tokens: 20001587200 | elapsed time per iteration (s): 1.03 | learning rate: 1.636E-04 | global batch size: 256 | lm loss: 2.064470E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.636 | TFLOPs: 40.92 | 15: iteration 38160/ 125429 | consumed samples: 9768960 | consumed tokens: 20006830080 | elapsed time per iteration (s): 1.06 | learning rate: 1.635E-04 | global batch size: 256 | lm loss: 2.063142E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.429 | TFLOPs: 39.90 | 15: iteration 38170/ 125429 | consumed samples: 9771520 | consumed tokens: 20012072960 | elapsed time per iteration (s): 1.03 | learning rate: 1.635E-04 | global batch size: 256 | lm loss: 2.090118E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 247.540 | TFLOPs: 40.91 | 15: iteration 38180/ 125429 | consumed samples: 9774080 | consumed tokens: 20017315840 | elapsed time per iteration (s): 1.06 | learning rate: 1.635E-04 | global batch size: 256 | lm loss: 2.062989E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.296 | TFLOPs: 39.88 | 15: iteration 38190/ 125429 | consumed samples: 9776640 | consumed tokens: 20022558720 | elapsed time per iteration (s): 1.04 | learning rate: 1.635E-04 | global batch size: 256 | lm loss: 2.069397E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.047 | TFLOPs: 40.83 | 15: iteration 38200/ 125429 | consumed samples: 9779200 | consumed tokens: 20027801600 | elapsed time per iteration (s): 1.04 | learning rate: 1.635E-04 | global batch size: 256 | lm loss: 2.056775E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.387 | TFLOPs: 40.72 | 15: iteration 38210/ 125429 | consumed samples: 9781760 | consumed tokens: 20033044480 | elapsed time per iteration (s): 1.05 | learning rate: 1.634E-04 | global batch size: 256 | lm loss: 2.078986E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.509 | TFLOPs: 40.24 | 15: iteration 38220/ 125429 | consumed samples: 9784320 | consumed tokens: 20038287360 | elapsed time per iteration (s): 1.07 | learning rate: 1.634E-04 | global batch size: 256 | lm loss: 2.046606E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.302 | TFLOPs: 39.38 | 15: iteration 38230/ 125429 | consumed samples: 9786880 | consumed tokens: 20043530240 | elapsed time per iteration (s): 1.07 | learning rate: 1.634E-04 | global batch size: 
256 | lm loss: 2.056127E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.322 | TFLOPs: 39.38 | 15: iteration 38240/ 125429 | consumed samples: 9789440 | consumed tokens: 20048773120 | elapsed time per iteration (s): 1.04 | learning rate: 1.634E-04 | global batch size: 256 | lm loss: 2.099028E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.479 | TFLOPs: 40.73 | 15: iteration 38250/ 125429 | consumed samples: 9792000 | consumed tokens: 20054016000 | elapsed time per iteration (s): 1.03 | learning rate: 1.634E-04 | global batch size: 256 | lm loss: 2.073171E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.662 | TFLOPs: 40.93 | 15: iteration 38260/ 125429 | consumed samples: 9794560 | consumed tokens: 20059258880 | elapsed time per iteration (s): 1.03 | learning rate: 1.634E-04 | global batch size: 256 | lm loss: 2.034992E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.669 | TFLOPs: 40.93 | 15: iteration 38270/ 125429 | consumed samples: 9797120 | consumed tokens: 20064501760 | elapsed time per iteration (s): 1.06 | learning rate: 1.633E-04 | global batch size: 256 | lm loss: 2.068318E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.317 | TFLOPs: 40.04 | 15: iteration 38280/ 125429 | consumed samples: 9799680 | consumed tokens: 20069744640 | elapsed time per iteration (s): 1.06 | learning rate: 1.633E-04 | global batch size: 256 | lm loss: 2.050742E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.796 | TFLOPs: 39.79 | 15: iteration 38290/ 125429 | consumed samples: 9802240 | consumed 
tokens: 20074987520 | elapsed time per iteration (s): 1.05 | learning rate: 1.633E-04 | global batch size: 256 | lm loss: 2.061956E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.556 | TFLOPs: 40.41 | 15: iteration 38300/ 125429 | consumed samples: 9804800 | consumed tokens: 20080230400 | elapsed time per iteration (s): 1.07 | learning rate: 1.633E-04 | global batch size: 256 | lm loss: 2.048326E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.274 | TFLOPs: 39.71 | 15: iteration 38310/ 125429 | consumed samples: 9807360 | consumed tokens: 20085473280 | elapsed time per iteration (s): 1.07 | learning rate: 1.633E-04 | global batch size: 256 | lm loss: 2.067852E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.737 | TFLOPs: 39.62 | 15: iteration 38320/ 125429 | consumed samples: 9809920 | consumed tokens: 20090716160 | elapsed time per iteration (s): 1.23 | learning rate: 1.632E-04 | global batch size: 256 | lm loss: 2.045933E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 208.969 | TFLOPs: 34.53 | 15: iteration 38330/ 125429 | consumed samples: 9812480 | consumed tokens: 20095959040 | elapsed time per iteration (s): 1.10 | learning rate: 1.632E-04 | global batch size: 256 | lm loss: 2.017873E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.608 | TFLOPs: 38.44 | 15: iteration 38340/ 125429 | consumed samples: 9815040 | consumed tokens: 20101201920 | elapsed time per iteration (s): 1.03 | learning rate: 1.632E-04 | global batch size: 256 | lm loss: 2.050850E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 247.498 | TFLOPs: 40.90 | 15: iteration 38350/ 125429 | consumed samples: 9817600 | consumed tokens: 20106444800 | elapsed time per iteration (s): 1.04 | learning rate: 1.632E-04 | global batch size: 256 | lm loss: 2.068593E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.956 | TFLOPs: 40.65 | 15: iteration 38360/ 125429 | consumed samples: 9820160 | consumed tokens: 20111687680 | elapsed time per iteration (s): 1.09 | learning rate: 1.632E-04 | global batch size: 256 | lm loss: 2.051072E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.890 | TFLOPs: 38.82 | 15: iteration 38370/ 125429 | consumed samples: 9822720 | consumed tokens: 20116930560 | elapsed time per iteration (s): 1.06 | learning rate: 1.632E-04 | global batch size: 256 | lm loss: 2.039767E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.588 | TFLOPs: 39.92 | 15: iteration 38380/ 125429 | consumed samples: 9825280 | consumed tokens: 20122173440 | elapsed time per iteration (s): 1.04 | learning rate: 1.631E-04 | global batch size: 256 | lm loss: 2.065394E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.383 | TFLOPs: 40.55 | 15: iteration 38390/ 125429 | consumed samples: 9827840 | consumed tokens: 20127416320 | elapsed time per iteration (s): 1.08 | learning rate: 1.631E-04 | global batch size: 256 | lm loss: 2.066066E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.891 | TFLOPs: 39.15 | 15: iteration 38400/ 125429 | consumed samples: 9830400 | consumed tokens: 20132659200 | elapsed time per iteration (s): 1.05 | learning rate: 1.631E-04 | global batch size: 256 | lm loss: 2.057891E+00 | grad norm: 
0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.757 | TFLOPs: 40.12 | 15: iteration 38410/ 125429 | consumed samples: 9832960 | consumed tokens: 20137902080 | elapsed time per iteration (s): 1.05 | learning rate: 1.631E-04 | global batch size: 256 | lm loss: 2.047810E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.747 | TFLOPs: 40.12 | 15: iteration 38420/ 125429 | consumed samples: 9835520 | consumed tokens: 20143144960 | elapsed time per iteration (s): 1.04 | learning rate: 1.631E-04 | global batch size: 256 | lm loss: 2.081187E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.492 | TFLOPs: 40.73 | 15: iteration 38430/ 125429 | consumed samples: 9838080 | consumed tokens: 20148387840 | elapsed time per iteration (s): 1.04 | learning rate: 1.630E-04 | global batch size: 256 | lm loss: 2.052252E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.771 | TFLOPs: 40.78 | 15: iteration 38440/ 125429 | consumed samples: 9840640 | consumed tokens: 20153630720 | elapsed time per iteration (s): 1.06 | learning rate: 1.630E-04 | global batch size: 256 | lm loss: 2.064284E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.790 | TFLOPs: 39.96 | 15: iteration 38450/ 125429 | consumed samples: 9843200 | consumed tokens: 20158873600 | elapsed time per iteration (s): 1.06 | learning rate: 1.630E-04 | global batch size: 256 | lm loss: 2.061936E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.580 | TFLOPs: 40.09 | 15: iteration 38460/ 125429 | consumed samples: 9845760 | consumed tokens: 20164116480 | elapsed time per 
iteration (s): 1.13 | learning rate: 1.630E-04 | global batch size: 256 | lm loss: 2.043287E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.340 | TFLOPs: 37.40 | 15: iteration 38470/ 125429 | consumed samples: 9848320 | consumed tokens: 20169359360 | elapsed time per iteration (s): 1.09 | learning rate: 1.630E-04 | global batch size: 256 | lm loss: 2.026277E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.984 | TFLOPs: 38.67 | 15: iteration 38480/ 125429 | consumed samples: 9850880 | consumed tokens: 20174602240 | elapsed time per iteration (s): 1.08 | learning rate: 1.629E-04 | global batch size: 256 | lm loss: 2.063957E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.705 | TFLOPs: 39.28 | 15: iteration 38490/ 125429 | consumed samples: 9853440 | consumed tokens: 20179845120 | elapsed time per iteration (s): 1.05 | learning rate: 1.629E-04 | global batch size: 256 | lm loss: 2.056797E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.738 | TFLOPs: 40.44 | 15: iteration 38500/ 125429 | consumed samples: 9856000 | consumed tokens: 20185088000 | elapsed time per iteration (s): 1.08 | learning rate: 1.629E-04 | global batch size: 256 | lm loss: 2.060903E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.145 | TFLOPs: 39.19 | 15: iteration 38510/ 125429 | consumed samples: 9858560 | consumed tokens: 20190330880 | elapsed time per iteration (s): 1.05 | learning rate: 1.629E-04 | global batch size: 256 | lm loss: 2.057240E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.346 | TFLOPs: 40.21 | 15: 
iteration 38520/ 125429 | consumed samples: 9861120 | consumed tokens: 20195573760 | elapsed time per iteration (s): 1.07 | learning rate: 1.629E-04 | global batch size: 256 | lm loss: 2.077451E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.725 | TFLOPs: 39.45 | 15: iteration 38530/ 125429 | consumed samples: 9863680 | consumed tokens: 20200816640 | elapsed time per iteration (s): 1.08 | learning rate: 1.629E-04 | global batch size: 256 | lm loss: 2.087913E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.806 | TFLOPs: 39.13 | 15: iteration 38540/ 125429 | consumed samples: 9866240 | consumed tokens: 20206059520 | elapsed time per iteration (s): 1.04 | learning rate: 1.628E-04 | global batch size: 256 | lm loss: 2.031850E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.530 | TFLOPs: 40.74 | 15: iteration 38550/ 125429 | consumed samples: 9868800 | consumed tokens: 20211302400 | elapsed time per iteration (s): 1.04 | learning rate: 1.628E-04 | global batch size: 256 | lm loss: 2.063736E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.034 | TFLOPs: 40.49 | 15: iteration 38560/ 125429 | consumed samples: 9871360 | consumed tokens: 20216545280 | elapsed time per iteration (s): 1.04 | learning rate: 1.628E-04 | global batch size: 256 | lm loss: 2.032256E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.229 | TFLOPs: 40.69 | 15: iteration 38570/ 125429 | consumed samples: 9873920 | consumed tokens: 20221788160 | elapsed time per iteration (s): 1.07 | learning rate: 1.628E-04 | global batch size: 256 | lm loss: 2.058752E+00 | grad norm: 0.132 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.599 | TFLOPs: 39.43 | 15: iteration 38580/ 125429 | consumed samples: 9876480 | consumed tokens: 20227031040 | elapsed time per iteration (s): 1.05 | learning rate: 1.628E-04 | global batch size: 256 | lm loss: 2.026123E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.242 | TFLOPs: 40.36 | 15: iteration 38590/ 125429 | consumed samples: 9879040 | consumed tokens: 20232273920 | elapsed time per iteration (s): 1.06 | learning rate: 1.627E-04 | global batch size: 256 | lm loss: 2.059016E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.449 | TFLOPs: 40.07 | 15: iteration 38600/ 125429 | consumed samples: 9881600 | consumed tokens: 20237516800 | elapsed time per iteration (s): 1.09 | learning rate: 1.627E-04 | global batch size: 256 | lm loss: 2.058761E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.048 | TFLOPs: 38.68 | 15: iteration 38610/ 125429 | consumed samples: 9884160 | consumed tokens: 20242759680 | elapsed time per iteration (s): 1.05 | learning rate: 1.627E-04 | global batch size: 256 | lm loss: 2.079954E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.662 | TFLOPs: 40.43 | 15: iteration 38620/ 125429 | consumed samples: 9886720 | consumed tokens: 20248002560 | elapsed time per iteration (s): 1.11 | learning rate: 1.627E-04 | global batch size: 256 | lm loss: 2.024921E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.500 | TFLOPs: 38.26 | 15: iteration 38630/ 125429 | consumed samples: 9889280 | consumed tokens: 20253245440 | elapsed time per iteration (s): 1.06 | learning rate: 
1.627E-04 | global batch size: 256 | lm loss: 2.051817E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.249 | TFLOPs: 39.87 | 15: iteration 38640/ 125429 | consumed samples: 9891840 | consumed tokens: 20258488320 | elapsed time per iteration (s): 1.06 | learning rate: 1.627E-04 | global batch size: 256 | lm loss: 2.054176E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.364 | TFLOPs: 39.89 | 15: iteration 38650/ 125429 | consumed samples: 9894400 | consumed tokens: 20263731200 | elapsed time per iteration (s): 1.04 | learning rate: 1.626E-04 | global batch size: 256 | lm loss: 2.050309E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.393 | TFLOPs: 40.72 | 15: iteration 38660/ 125429 | consumed samples: 9896960 | consumed tokens: 20268974080 | elapsed time per iteration (s): 1.05 | learning rate: 1.626E-04 | global batch size: 256 | lm loss: 2.048638E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.571 | TFLOPs: 40.25 | 15: iteration 38670/ 125429 | consumed samples: 9899520 | consumed tokens: 20274216960 | elapsed time per iteration (s): 1.09 | learning rate: 1.626E-04 | global batch size: 256 | lm loss: 2.053617E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.337 | TFLOPs: 38.73 | 15: iteration 38680/ 125429 | consumed samples: 9902080 | consumed tokens: 20279459840 | elapsed time per iteration (s): 1.07 | learning rate: 1.626E-04 | global batch size: 256 | lm loss: 2.066962E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.991 | TFLOPs: 39.66 | 15: iteration 38690/ 125429 | consumed 
samples: 9904640 | consumed tokens: 20284702720 | elapsed time per iteration (s): 1.05 | learning rate: 1.626E-04 | global batch size: 256 | lm loss: 2.065428E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.471 | TFLOPs: 40.24 | 15: iteration 38700/ 125429 | consumed samples: 9907200 | consumed tokens: 20289945600 | elapsed time per iteration (s): 1.10 | learning rate: 1.625E-04 | global batch size: 256 | lm loss: 2.046424E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.704 | TFLOPs: 38.29 | 15: iteration 38710/ 125429 | consumed samples: 9909760 | consumed tokens: 20295188480 | elapsed time per iteration (s): 1.07 | learning rate: 1.625E-04 | global batch size: 256 | lm loss: 2.064424E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.191 | TFLOPs: 39.36 | 15: iteration 38720/ 125429 | consumed samples: 9912320 | consumed tokens: 20300431360 | elapsed time per iteration (s): 1.06 | learning rate: 1.625E-04 | global batch size: 256 | lm loss: 2.082589E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.575 | TFLOPs: 40.09 | 15: iteration 38730/ 125429 | consumed samples: 9914880 | consumed tokens: 20305674240 | elapsed time per iteration (s): 1.08 | learning rate: 1.625E-04 | global batch size: 256 | lm loss: 2.039249E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.963 | TFLOPs: 39.16 | 15: iteration 38740/ 125429 | consumed samples: 9917440 | consumed tokens: 20310917120 | elapsed time per iteration (s): 1.05 | learning rate: 1.625E-04 | global batch size: 256 | lm loss: 2.020069E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 244.474 | TFLOPs: 40.40 | 15: iteration 38750/ 125429 | consumed samples: 9920000 | consumed tokens: 20316160000 | elapsed time per iteration (s): 1.02 | learning rate: 1.625E-04 | global batch size: 256 | lm loss: 2.045035E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.834 | TFLOPs: 41.29 | 15: iteration 38760/ 125429 | consumed samples: 9922560 | consumed tokens: 20321402880 | elapsed time per iteration (s): 1.05 | learning rate: 1.624E-04 | global batch size: 256 | lm loss: 2.035607E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.739 | TFLOPs: 40.45 | 15: iteration 38770/ 125429 | consumed samples: 9925120 | consumed tokens: 20326645760 | elapsed time per iteration (s): 1.04 | learning rate: 1.624E-04 | global batch size: 256 | lm loss: 2.031354E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.531 | TFLOPs: 40.74 | 15: iteration 38780/ 125429 | consumed samples: 9927680 | consumed tokens: 20331888640 | elapsed time per iteration (s): 1.06 | learning rate: 1.624E-04 | global batch size: 256 | lm loss: 2.043920E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.067 | TFLOPs: 40.00 | 15: iteration 38790/ 125429 | consumed samples: 9930240 | consumed tokens: 20337131520 | elapsed time per iteration (s): 1.06 | learning rate: 1.624E-04 | global batch size: 256 | lm loss: 2.055248E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.690 | TFLOPs: 39.78 | 15: iteration 38800/ 125429 | consumed samples: 9932800 | consumed tokens: 20342374400 | elapsed time per iteration (s): 1.05 | learning rate: 1.624E-04 | global batch size: 256 | lm 
loss: 2.050336E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.276 | TFLOPs: 40.37 | 15: iteration 38810/ 125429 | consumed samples: 9935360 | consumed tokens: 20347617280 | elapsed time per iteration (s): 1.04 | learning rate: 1.623E-04 | global batch size: 256 | lm loss: 2.045077E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.778 | TFLOPs: 40.78 | 15: iteration 38820/ 125429 | consumed samples: 9937920 | consumed tokens: 20352860160 | elapsed time per iteration (s): 1.03 | learning rate: 1.623E-04 | global batch size: 256 | lm loss: 2.059965E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.357 | TFLOPs: 40.88 | 15: iteration 38830/ 125429 | consumed samples: 9940480 | consumed tokens: 20358103040 | elapsed time per iteration (s): 1.08 | learning rate: 1.623E-04 | global batch size: 256 | lm loss: 2.063763E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.434 | TFLOPs: 39.07 | 15: iteration 38840/ 125429 | consumed samples: 9943040 | consumed tokens: 20363345920 | elapsed time per iteration (s): 1.05 | learning rate: 1.623E-04 | global batch size: 256 | lm loss: 2.043389E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.892 | TFLOPs: 40.47 | 15: iteration 38850/ 125429 | consumed samples: 9945600 | consumed tokens: 20368588800 | elapsed time per iteration (s): 1.04 | learning rate: 1.623E-04 | global batch size: 256 | lm loss: 2.035111E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.980 | TFLOPs: 40.82 | 15: iteration 38860/ 125429 | consumed samples: 9948160 | consumed tokens: 
20373831680 | elapsed time per iteration (s): 1.04 | learning rate: 1.622E-04 | global batch size: 256 | lm loss: 2.043361E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.082 | TFLOPs: 40.67 | 15: iteration 38870/ 125429 | consumed samples: 9950720 | consumed tokens: 20379074560 | elapsed time per iteration (s): 1.04 | learning rate: 1.622E-04 | global batch size: 256 | lm loss: 2.070115E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.749 | TFLOPs: 40.61 | 15: iteration 38880/ 125429 | consumed samples: 9953280 | consumed tokens: 20384317440 | elapsed time per iteration (s): 1.04 | learning rate: 1.622E-04 | global batch size: 256 | lm loss: 2.070684E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.698 | TFLOPs: 40.77 | 15: iteration 38890/ 125429 | consumed samples: 9955840 | consumed tokens: 20389560320 | elapsed time per iteration (s): 1.04 | learning rate: 1.622E-04 | global batch size: 256 | lm loss: 2.053555E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.663 | TFLOPs: 40.76 | 15: iteration 38900/ 125429 | consumed samples: 9958400 | consumed tokens: 20394803200 | elapsed time per iteration (s): 1.05 | learning rate: 1.622E-04 | global batch size: 256 | lm loss: 2.044870E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.269 | TFLOPs: 40.37 | 15: iteration 38910/ 125429 | consumed samples: 9960960 | consumed tokens: 20400046080 | elapsed time per iteration (s): 1.02 | learning rate: 1.622E-04 | global batch size: 256 | lm loss: 2.050960E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
251.581 | TFLOPs: 41.58 | 15: iteration 38920/ 125429 | consumed samples: 9963520 | consumed tokens: 20405288960 | elapsed time per iteration (s): 1.04 | learning rate: 1.621E-04 | global batch size: 256 | lm loss: 2.058266E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.515 | TFLOPs: 40.57 | 15: iteration 38930/ 125429 | consumed samples: 9966080 | consumed tokens: 20410531840 | elapsed time per iteration (s): 1.04 | learning rate: 1.621E-04 | global batch size: 256 | lm loss: 2.055392E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.841 | TFLOPs: 40.63 | 15: iteration 38940/ 125429 | consumed samples: 9968640 | consumed tokens: 20415774720 | elapsed time per iteration (s): 1.06 | learning rate: 1.621E-04 | global batch size: 256 | lm loss: 2.048142E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.375 | TFLOPs: 39.89 | 15: iteration 38950/ 125429 | consumed samples: 9971200 | consumed tokens: 20421017600 | elapsed time per iteration (s): 1.04 | learning rate: 1.621E-04 | global batch size: 256 | lm loss: 2.056520E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.138 | TFLOPs: 40.68 | 15: iteration 38960/ 125429 | consumed samples: 9973760 | consumed tokens: 20426260480 | elapsed time per iteration (s): 1.05 | learning rate: 1.621E-04 | global batch size: 256 | lm loss: 2.049096E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.415 | TFLOPs: 40.39 | 15: iteration 38970/ 125429 | consumed samples: 9976320 | consumed tokens: 20431503360 | elapsed time per iteration (s): 1.02 | learning rate: 1.620E-04 | global batch size: 256 | lm loss: 2.062078E+00 | grad norm: 0.130 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.827 | TFLOPs: 41.29 | 15: iteration 38980/ 125429 | consumed samples: 9978880 | consumed tokens: 20436746240 | elapsed time per iteration (s): 1.04 | learning rate: 1.620E-04 | global batch size: 256 | lm loss: 2.048419E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.179 | TFLOPs: 40.68 | 15: iteration 38990/ 125429 | consumed samples: 9981440 | consumed tokens: 20441989120 | elapsed time per iteration (s): 1.43 | learning rate: 1.620E-04 | global batch size: 256 | lm loss: 2.043497E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 178.729 | TFLOPs: 29.54 | 15: iteration 39000/ 125429 | consumed samples: 9984000 | consumed tokens: 20447232000 | elapsed time per iteration (s): 1.03 | learning rate: 1.620E-04 | global batch size: 256 | lm loss: 2.050971E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.547 | TFLOPs: 41.07 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 39000 | lm loss value: 1.991463E+00 | lm loss PPL: 7.326241E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 39000 to checkpoints_1b5 0: [2022-11-26 07:26:53,359] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step39000 is begin to save! 0: [2022-11-26 07:26:53,368] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_01-model_00-model_states.pt... 0: [2022-11-26 07:26:53,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_01-model_00-model_states.pt. 
0: [2022-11-26 07:26:53,624] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_03-model_00-model_states.pt... 0: [2022-11-26 07:26:53,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_03-model_00-model_states.pt. 0: [2022-11-26 07:26:53,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_04-model_00-model_states.pt... 0: [2022-11-26 07:26:53,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_04-model_00-model_states.pt. 0: [2022-11-26 07:26:53,833] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_05-model_00-model_states.pt... 0: [2022-11-26 07:26:53,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_05-model_00-model_states.pt. 0: [2022-11-26 07:26:53,938] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_06-model_00-model_states.pt... 0: [2022-11-26 07:26:54,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_06-model_00-model_states.pt. 0: [2022-11-26 07:26:54,042] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_07-model_00-model_states.pt... 0: [2022-11-26 07:26:54,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_07-model_00-model_states.pt. 0: [2022-11-26 07:26:54,147] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_08-model_00-model_states.pt... 0: [2022-11-26 07:26:54,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_08-model_00-model_states.pt. 
0: [2022-11-26 07:26:54,250] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_09-model_00-model_states.pt... 0: [2022-11-26 07:26:54,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_09-model_00-model_states.pt. 0: [2022-11-26 07:26:54,356] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_10-model_00-model_states.pt... 0: [2022-11-26 07:26:54,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_10-model_00-model_states.pt. 0: [2022-11-26 07:26:54,466] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_11-model_00-model_states.pt... 0: [2022-11-26 07:26:54,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_11-model_00-model_states.pt. 0: [2022-11-26 07:26:54,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_12-model_00-model_states.pt... 0: [2022-11-26 07:26:54,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_12-model_00-model_states.pt. 0: [2022-11-26 07:26:54,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_13-model_00-model_states.pt... 0: [2022-11-26 07:26:54,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_13-model_00-model_states.pt. 0: [2022-11-26 07:26:54,797] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_14-model_00-model_states.pt... 0: [2022-11-26 07:26:54,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_14-model_00-model_states.pt. 
0: [2022-11-26 07:26:54,906] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_15-model_00-model_states.pt... 0: [2022-11-26 07:26:55,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_15-model_00-model_states.pt. 0: [2022-11-26 07:26:55,019] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_16-model_00-model_states.pt... 0: [2022-11-26 07:26:55,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_16-model_00-model_states.pt. 0: [2022-11-26 07:26:55,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_17-model_00-model_states.pt... 0: [2022-11-26 07:26:55,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_17-model_00-model_states.pt. 0: [2022-11-26 07:26:55,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_18-model_00-model_states.pt... 0: [2022-11-26 07:26:55,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_18-model_00-model_states.pt. 0: [2022-11-26 07:26:55,351] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_19-model_00-model_states.pt... 0: [2022-11-26 07:26:55,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_19-model_00-model_states.pt. 0: [2022-11-26 07:26:55,461] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_20-model_00-model_states.pt... 0: [2022-11-26 07:26:55,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_20-model_00-model_states.pt. 
0: [2022-11-26 07:26:55,571] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_21-model_00-model_states.pt... 0: [2022-11-26 07:26:55,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_21-model_00-model_states.pt. 0: [2022-11-26 07:26:55,682] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_22-model_00-model_states.pt... 0: [2022-11-26 07:26:55,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_22-model_00-model_states.pt. 0: [2022-11-26 07:26:55,788] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_23-model_00-model_states.pt... 0: [2022-11-26 07:26:55,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_23-model_00-model_states.pt. 0: [2022-11-26 07:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_24-model_00-model_states.pt... 0: [2022-11-26 07:26:56,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_24-model_00-model_states.pt. 0: [2022-11-26 07:26:56,001] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_25-model_00-model_states.pt... 0: [2022-11-26 07:26:56,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_25-model_00-model_states.pt. 0: [2022-11-26 07:26:56,107] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_26-model_00-model_states.pt... 0: [2022-11-26 07:26:56,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_26-model_00-model_states.pt. 
0: [2022-11-26 07:26:56,213] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_27-model_00-model_states.pt... 0: [2022-11-26 07:26:56,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_27-model_00-model_states.pt. 0: [2022-11-26 07:26:56,317] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_28-model_00-model_states.pt... 0: [2022-11-26 07:26:56,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_28-model_00-model_states.pt. 0: [2022-11-26 07:26:56,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_29-model_00-model_states.pt... 0: [2022-11-26 07:26:56,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_29-model_00-model_states.pt. 0: [2022-11-26 07:26:56,527] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_30-model_00-model_states.pt... 0: [2022-11-26 07:26:56,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_30-model_00-model_states.pt. 0: [2022-11-26 07:26:56,630] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/layer_32-model_00-model_states.pt... 0: [2022-11-26 07:26:56,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/layer_32-model_00-model_states.pt. 0: [2022-11-26 07:26:56,636] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step39000/mp_rank_00_model_states.pt 0: [2022-11-26 07:26:56,636] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/mp_rank_00_model_states.pt... 
0: [2022-11-26 07:26:56,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/mp_rank_00_model_states.pt. 0: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
12: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
9: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
14: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
2: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
10: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
4: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
2: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
5: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:26:56,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step39000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:26:56,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:26:56,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 07:26:56,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
0: [2022-11-26 07:26:56,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:26:56,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:26:56,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 07:26:56,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 12: [2022-11-26 07:26:56,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:26:56,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:26:56,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 5: [2022-11-26 07:26:56,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 12: [2022-11-26 07:26:56,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 5: [2022-11-26 07:26:56,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 10: [2022-11-26 07:26:56,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-26 07:26:56,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 07:26:56,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 5: [2022-11-26 07:26:56,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:26:56,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 10: [2022-11-26 07:26:56,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:26:56,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 10: [2022-11-26 07:26:56,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 07:26:56,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 11: [2022-11-26 07:26:56,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:26:56,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:26:56,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 07:26:56,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
12: [2022-11-26 07:26:56,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:26:56,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 07:26:56,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 2: [2022-11-26 07:26:56,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:26:56,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 07:26:56,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 0: [2022-11-26 07:26:56,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:26:56,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 07:26:56,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 1: [2022-11-26 07:26:56,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:26:56,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 07:26:56,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
5: [2022-11-26 07:26:56,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:26:56,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 07:26:56,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 15: [2022-11-26 07:26:56,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:26:56,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:26:56,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 07:26:56,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:26:56,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 3: [2022-11-26 07:26:56,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 07:26:56,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 2: [2022-11-26 07:26:56,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 07:26:56,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 07:26:56,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 12: [2022-11-26 07:26:56,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:26:56,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 07:26:56,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 0: [2022-11-26 07:26:56,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:26:56,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 07:26:56,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 10: [2022-11-26 07:26:56,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:26:56,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:26:56,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 07:26:56,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 07:26:56,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 10: [2022-11-26 07:26:56,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 07:26:56,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 1: [2022-11-26 07:26:56,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 07:26:56,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 2: [2022-11-26 07:26:56,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:26:56,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 0: [2022-11-26 07:26:56,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:26:56,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
0: [2022-11-26 07:26:56,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 11: [2022-11-26 07:26:56,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 07:26:56,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 0: [2022-11-26 07:26:56,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 11: [2022-11-26 07:26:56,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:26:56,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 07:26:56,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 1: [2022-11-26 07:26:56,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:26:56,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 07:26:56,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 2: [2022-11-26 07:26:56,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 07:26:56,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 07:26:56,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 10: [2022-11-26 07:26:56,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:26:56,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 07:26:56,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 0: [2022-11-26 07:26:56,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:26:56,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 07:26:56,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 2: [2022-11-26 07:26:56,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:26:56,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 07:26:56,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 11: [2022-11-26 07:26:56,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
3: [2022-11-26 07:26:56,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:26:56,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 07:26:56,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 0: [2022-11-26 07:26:56,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:26:56,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 07:26:56,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 7: [2022-11-26 07:26:56,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:26:56,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:26:56,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 07:26:56,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 07:26:56,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 07:26:56,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 07:26:56,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 7: [2022-11-26 07:26:56,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 7: [2022-11-26 07:26:56,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 7: [2022-11-26 07:26:56,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:26:56,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 07:26:56,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 7: [2022-11-26 07:26:56,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:26:56,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 07:26:56,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
9: [2022-11-26 07:26:56,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:26:56,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:26:56,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 07:26:56,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:26:56,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 07:26:56,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 9: [2022-11-26 07:26:56,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 07:26:56,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 9: [2022-11-26 07:26:56,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 5: [2022-11-26 07:26:56,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:26:56,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 07:26:56,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 07:26:56,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 07:26:56,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 5: [2022-11-26 07:26:56,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 10: [2022-11-26 07:26:56,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:26:56,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 07:26:56,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 0: [2022-11-26 07:26:56,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:26:56,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 07:26:56,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 0: [2022-11-26 07:26:56,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 07:26:56,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 07:26:56,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 13: [2022-11-26 07:26:56,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:26:56,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 07:26:56,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 13: [2022-11-26 07:26:56,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:26:56,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 07:26:56,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 13: [2022-11-26 07:26:56,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:26:56,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 07:26:56,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 13: [2022-11-26 07:26:56,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 07:26:56,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 07:26:56,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:26:56,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:26:56,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 13: [2022-11-26 07:26:56,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 07:26:56,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 07:26:56,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 13: [2022-11-26 07:26:56,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 13: [2022-11-26 07:26:56,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:26:56,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 07:26:56,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 1: [2022-11-26 07:26:56,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 07:26:56,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 07:26:56,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 2: [2022-11-26 07:26:56,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:26:56,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 07:26:56,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 5: [2022-11-26 07:26:56,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:26:56,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 07:26:56,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 1: [2022-11-26 07:26:56,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:26:56,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:26:56,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 07:26:56,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 07:26:56,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 07:26:56,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 07:26:56,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 1: [2022-11-26 07:26:56,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 1: [2022-11-26 07:26:56,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 10: [2022-11-26 07:26:56,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:26:56,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 07:26:56,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 5: [2022-11-26 07:26:56,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:26:56,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:26:56,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
13: [2022-11-26 07:26:56,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 5: [2022-11-26 07:26:56,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 13: [2022-11-26 07:26:56,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 1: [2022-11-26 07:26:56,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 5: [2022-11-26 07:26:56,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 1: [2022-11-26 07:26:56,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 5: [2022-11-26 07:26:56,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:26:56,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 07:26:56,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 2: [2022-11-26 07:26:56,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:26:56,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 07:26:56,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
9: [2022-11-26 07:26:56,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:26:56,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 07:26:56,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 3: [2022-11-26 07:26:56,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:26:56,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 07:26:56,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 3: [2022-11-26 07:26:56,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:26:56,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 07:26:56,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 7: [2022-11-26 07:26:56,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:26:56,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 07:26:56,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
3: [2022-11-26 07:26:56,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:26:56,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 7: [2022-11-26 07:26:56,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:26:56,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 7: [2022-11-26 07:26:56,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 07:26:56,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 9: [2022-11-26 07:26:56,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:26:56,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 07:26:56,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 9: [2022-11-26 07:26:56,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:26:56,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 07:26:56,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
9: [2022-11-26 07:26:56,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:26:56,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 07:26:56,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 3: [2022-11-26 07:26:56,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:26:56,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 07:26:56,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 12: [2022-11-26 07:26:56,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:26:56,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:26:56,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 07:26:56,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 07:26:56,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 12: [2022-11-26 07:26:56,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
7: [2022-11-26 07:26:56,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:26:56,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 07:26:56,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 12: [2022-11-26 07:26:56,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:26:56,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 07:26:56,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 12: [2022-11-26 07:26:56,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:26:56,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 07:26:56,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 12: [2022-11-26 07:26:56,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:26:56,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 07:26:56,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
15: [2022-11-26 07:26:56,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 07:26:56,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 15: [2022-11-26 07:26:56,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:26:56,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 07:26:56,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 15: [2022-11-26 07:26:56,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:26:56,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 07:26:56,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 15: [2022-11-26 07:26:56,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:26:56,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 07:26:56,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 15: [2022-11-26 07:26:56,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 07:26:56,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 07:26:56,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 15: [2022-11-26 07:26:56,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:26:56,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 07:26:56,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 15: [2022-11-26 07:26:56,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:26:56,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 07:26:56,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 2: [2022-11-26 07:26:56,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:26:56,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 07:26:56,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 8: [2022-11-26 07:26:56,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 07:26:56,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:26:56,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:26:56,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:26:56,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 07:26:56,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 07:26:56,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 07:26:56,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 07:26:56,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 8: [2022-11-26 07:26:56,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 8: [2022-11-26 07:26:56,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 8: [2022-11-26 07:26:56,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now! 
11: [2022-11-26 07:26:56,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt
11: [2022-11-26 07:26:56,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
11: [2022-11-26 07:26:56,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt.
11: [2022-11-26 07:26:56,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt
11: [2022-11-26 07:26:56,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
11: [2022-11-26 07:26:56,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt.
11: [2022-11-26 07:26:56,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt
11: [2022-11-26 07:26:56,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
11: [2022-11-26 07:26:56,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt.
11: [2022-11-26 07:26:56,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt
11: [2022-11-26 07:26:56,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
11: [2022-11-26 07:26:56,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt.
11: [2022-11-26 07:26:56,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt
11: [2022-11-26 07:26:56,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
11: [2022-11-26 07:26:56,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt.
11: [2022-11-26 07:26:56,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt
11: [2022-11-26 07:26:56,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
8: [2022-11-26 07:26:56,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt.
8: [2022-11-26 07:26:56,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt
8: [2022-11-26 07:26:56,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
8: [2022-11-26 07:26:56,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt.
8: [2022-11-26 07:26:56,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt
8: [2022-11-26 07:26:56,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
8: [2022-11-26 07:26:56,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt.
8: [2022-11-26 07:26:56,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt
8: [2022-11-26 07:26:56,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
8: [2022-11-26 07:26:56,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt.
8: [2022-11-26 07:26:56,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt
8: [2022-11-26 07:26:56,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
0: [2022-11-26 07:26:56,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
0: [2022-11-26 07:26:56,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
14: [2022-11-26 07:26:56,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt.
14: [2022-11-26 07:26:56,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt.
14: [2022-11-26 07:26:56,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt.
14: [2022-11-26 07:26:56,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt.
14: [2022-11-26 07:26:56,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt.
14: [2022-11-26 07:26:56,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt
14: [2022-11-26 07:26:56,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt
14: [2022-11-26 07:26:56,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt
14: [2022-11-26 07:26:56,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt
14: [2022-11-26 07:26:56,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt
14: [2022-11-26 07:26:56,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
14: [2022-11-26 07:26:56,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
14: [2022-11-26 07:26:56,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
14: [2022-11-26 07:26:56,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
14: [2022-11-26 07:26:56,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
14: [2022-11-26 07:26:56,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt.
14: [2022-11-26 07:26:56,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt.
14: [2022-11-26 07:26:56,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt.
14: [2022-11-26 07:26:56,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt
14: [2022-11-26 07:26:56,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt
14: [2022-11-26 07:26:56,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt
14: [2022-11-26 07:26:56,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
14: [2022-11-26 07:26:56,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
14: [2022-11-26 07:26:56,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
15: [2022-11-26 07:26:56,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt.
15: [2022-11-26 07:26:56,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt
15: [2022-11-26 07:26:56,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
4: [2022-11-26 07:26:57,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt.
4: [2022-11-26 07:26:57,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt.
4: [2022-11-26 07:26:57,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt.
4: [2022-11-26 07:26:57,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt
4: [2022-11-26 07:26:57,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt
4: [2022-11-26 07:26:57,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt
4: [2022-11-26 07:26:57,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
4: [2022-11-26 07:26:57,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
4: [2022-11-26 07:26:57,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
4: [2022-11-26 07:26:57,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt.
4: [2022-11-26 07:26:57,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt
4: [2022-11-26 07:26:57,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
4: [2022-11-26 07:26:57,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt.
4: [2022-11-26 07:26:57,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt
4: [2022-11-26 07:26:57,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
4: [2022-11-26 07:26:57,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt.
4: [2022-11-26 07:26:57,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt
4: [2022-11-26 07:26:57,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
4: [2022-11-26 07:26:57,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt.
4: [2022-11-26 07:26:57,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt
4: [2022-11-26 07:26:57,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
4: [2022-11-26 07:26:57,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt.
4: [2022-11-26 07:26:57,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt
4: [2022-11-26 07:26:57,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
6: [2022-11-26 07:26:57,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt.
6: [2022-11-26 07:26:57,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt.
6: [2022-11-26 07:26:57,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt.
6: [2022-11-26 07:26:57,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt
6: [2022-11-26 07:26:57,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt.
6: [2022-11-26 07:26:57,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt.
6: [2022-11-26 07:26:57,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
6: [2022-11-26 07:26:57,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt
6: [2022-11-26 07:26:57,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt
6: [2022-11-26 07:26:57,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt
6: [2022-11-26 07:26:57,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt
6: [2022-11-26 07:26:57,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
6: [2022-11-26 07:26:57,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
6: [2022-11-26 07:26:57,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
6: [2022-11-26 07:26:57,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
6: [2022-11-26 07:26:57,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt.
6: [2022-11-26 07:26:57,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt.
6: [2022-11-26 07:26:57,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt.
6: [2022-11-26 07:26:57,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt
6: [2022-11-26 07:26:57,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt
6: [2022-11-26 07:26:57,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step39000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt
6: [2022-11-26 07:26:57,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
6: [2022-11-26 07:26:57,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
6: [2022-11-26 07:26:57,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step39000 is ready now!
0: successfully saved checkpoint at iteration 39000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3773.56
15: iteration 39010/ 125429 | consumed samples: 9986560 | consumed tokens: 20452474880 | elapsed time per iteration (s): 1.45 | learning rate: 1.620E-04 | global batch size: 256 | lm loss: 2.091282E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 176.938 | TFLOPs: 29.24 |
15: iteration 39020/ 125429 | consumed samples: 9989120 | consumed tokens: 20457717760 | elapsed time per iteration (s): 1.05 | learning rate: 1.620E-04 | global batch size: 256 | lm loss: 2.077710E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.541 | TFLOPs: 40.25 |
15: iteration 39030/ 125429 | consumed samples: 9991680 | consumed tokens: 20462960640 | elapsed time per iteration (s): 1.04 | learning rate: 1.619E-04 | global batch size: 256 | lm loss: 2.019716E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.527 | TFLOPs: 40.74 |
15: iteration 39040/ 125429 | consumed samples: 9994240 | consumed tokens: 20468203520 | elapsed time per iteration (s): 1.04 | learning rate: 1.619E-04 | global batch size: 256 | lm loss: 2.065713E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.236 | TFLOPs: 40.53 |
15: iteration 39050/ 125429 | consumed samples: 9996800 | consumed tokens: 20473446400 | elapsed time per iteration (s): 1.03 | learning rate: 1.619E-04 | global batch size: 256 | lm loss: 2.039900E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.657 | TFLOPs: 40.93 |
15: iteration 39060/ 125429 | consumed samples: 9999360 | consumed tokens: 20478689280 | elapsed time per iteration (s): 1.27 | learning rate: 1.619E-04 | global batch size: 256 | lm loss: 2.059540E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 201.599 | TFLOPs: 33.32 |
15: iteration 39070/ 125429 | consumed samples: 10001920 | consumed tokens: 20483932160 | elapsed time per iteration (s): 1.05 | learning rate: 1.619E-04 | global batch size: 256 | lm loss: 2.018553E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.100 | TFLOPs: 40.17 |
15: iteration 39080/ 125429 | consumed samples: 10004480 | consumed tokens: 20489175040 | elapsed time per iteration (s): 1.05 | learning rate: 1.618E-04 | global batch size: 256 | lm loss: 2.071301E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.836 | TFLOPs: 40.30 |
15: iteration 39090/ 125429 | consumed samples: 10007040 | consumed tokens: 20494417920 | elapsed time per iteration (s): 1.03 | learning rate: 1.618E-04 | global batch size: 256 | lm loss: 2.057330E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.631 | TFLOPs: 41.25 |
15: iteration 39100/ 125429 | consumed samples: 10009600 | consumed tokens: 20499660800 | elapsed time per iteration (s): 1.04 | learning rate: 1.618E-04 | global batch size: 256 | lm loss: 2.044612E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.092 | TFLOPs: 40.50 |
15: iteration 39110/ 125429 | consumed samples: 10012160 | consumed tokens: 20504903680 | elapsed time per iteration (s): 1.02 | learning rate: 1.618E-04 | global batch size: 256 | lm loss: 2.033060E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.526 | TFLOPs: 41.57 |
15: iteration 39120/ 125429 | consumed samples: 10014720 | consumed tokens: 20510146560 | elapsed time per iteration (s): 1.03 | learning rate: 1.618E-04 | global batch size: 256 | lm loss: 2.065134E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.141 | TFLOPs: 41.17 |
15: iteration 39130/ 125429 | consumed samples: 10017280 | consumed tokens: 20515389440 | elapsed time per iteration (s): 1.05 | learning rate: 1.617E-04 | global batch size: 256 | lm loss: 2.058233E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.885 | TFLOPs: 40.30 |
15: iteration 39140/ 125429 | consumed samples: 10019840 | consumed tokens: 20520632320 | elapsed time per iteration (s): 1.02 | learning rate: 1.617E-04 | global batch size: 256 | lm loss: 2.040880E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.380 | TFLOPs: 41.38 |
15: iteration 39150/ 125429 | consumed samples: 10022400 | consumed tokens: 20525875200 | elapsed time per iteration (s): 1.06 | learning rate: 1.617E-04 | global batch size: 256 | lm loss: 2.049458E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.646 | TFLOPs: 40.10 |
15: iteration 39160/ 125429 | consumed samples: 10024960 | consumed tokens: 20531118080 | elapsed time per iteration (s): 1.03 | learning rate: 1.617E-04 | global batch size: 256 | lm loss: 2.042544E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.721 | TFLOPs: 40.94 |
15: iteration 39170/ 125429 | consumed samples: 10027520 | consumed tokens: 20536360960 | elapsed time per iteration (s): 1.04 | learning rate: 1.617E-04 | global batch size: 256 | lm loss: 2.056343E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.598 | TFLOPs: 40.75 |
15: iteration 39180/ 125429 | consumed samples: 10030080 | consumed tokens: 20541603840 | elapsed time per iteration (s): 1.03 | learning rate: 1.617E-04 | global batch size: 256 | lm loss: 2.065672E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.616 | TFLOPs: 41.09 |
15: iteration 39190/ 125429 | consumed samples: 10032640 | consumed tokens: 20546846720 | elapsed time per iteration (s): 1.05 | learning rate: 1.616E-04 | global batch size: 256 | lm loss: 2.034580E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.729 | TFLOPs: 40.44 |
15: iteration 39200/ 125429 | consumed samples: 10035200 | consumed tokens: 20552089600 | elapsed time per iteration (s): 1.06 | learning rate: 1.616E-04 | global batch size: 256 | lm loss: 2.046479E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.382 | TFLOPs: 39.72 |
15: iteration 39210/ 125429 | consumed samples: 10037760 | consumed tokens: 20557332480 | elapsed time per iteration (s): 1.06 | learning rate: 1.616E-04 | global batch size: 256 | lm loss: 2.035375E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.109 | TFLOPs: 40.01 |
15: iteration 39220/ 125429 | consumed samples: 10040320 | consumed tokens: 20562575360 | elapsed time per iteration (s): 1.03 | learning rate: 1.616E-04 | global batch size: 256 | lm loss: 2.055157E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.952 | TFLOPs: 40.98 |
15: iteration 39230/ 125429 | consumed samples: 10042880 | consumed tokens: 20567818240 | elapsed time per iteration (s): 1.04 | learning rate: 1.616E-04 | global batch size: 256 | lm loss: 2.030409E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.153 | TFLOPs: 40.68 |
15: iteration 39240/ 125429 | consumed samples: 10045440 | consumed tokens: 20573061120 | elapsed time per iteration (s): 1.04 | learning rate: 1.615E-04 | global batch size: 256 | lm loss: 2.044112E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.244 | TFLOPs: 40.53 |
15: iteration 39250/ 125429 | consumed samples: 10048000 | consumed tokens: 20578304000 | elapsed time per iteration (s): 1.04 | learning rate: 1.615E-04 | global batch size: 256 | lm loss: 2.015471E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.727 | TFLOPs: 40.61 |
15: iteration 39260/ 125429 | consumed samples: 10050560 | consumed tokens: 20583546880 | elapsed time per iteration (s): 1.04 | learning rate: 1.615E-04 | global batch size: 256 | lm loss: 2.038959E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.150 | TFLOPs: 40.51 |
15: iteration 39270/ 125429 | consumed samples: 10053120 | consumed tokens: 20588789760 | elapsed time per iteration (s): 1.03 | learning rate: 1.615E-04 | global batch size: 256 | lm loss: 2.043478E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.408 | TFLOPs: 40.89 |
15: iteration 39280/ 125429 | consumed samples: 10055680 | consumed tokens: 20594032640 | elapsed time per iteration (s): 1.05 | learning rate: 1.615E-04 | global batch size: 256 | lm loss: 2.058136E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.984 | TFLOPs: 40.32 |
15: iteration 39290/ 125429 | consumed samples: 10058240 | consumed tokens: 20599275520 | elapsed time per iteration (s): 1.06 | learning rate: 1.614E-04 | global batch size: 256 | lm loss: 2.052985E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.449 | TFLOPs: 39.74 |
15: iteration 39300/ 125429 | consumed samples: 10060800 | consumed tokens: 20604518400 | elapsed time per iteration (s): 1.05 | learning rate: 1.614E-04 | global batch size: 256 | lm loss: 2.071805E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.140 | TFLOPs: 40.18 |
15: iteration 39310/ 125429 | consumed samples: 10063360 | consumed tokens: 20609761280 | elapsed time per iteration (s): 1.02 | learning rate: 1.614E-04 | global batch size: 256 | lm loss: 2.091546E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.979 | TFLOPs: 41.31 |
15: iteration 39320/ 125429 | consumed samples: 10065920 | consumed tokens: 20615004160 | elapsed time per iteration (s): 1.03 | learning rate: 1.614E-04 | global batch size: 256 | lm loss: 2.019644E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.141 | TFLOPs: 41.01 |
15: iteration 39330/ 125429 | consumed samples: 10068480 | consumed tokens: 20620247040 | elapsed time per iteration (s): 1.04 | learning rate: 1.614E-04 | global batch size: 256 | lm loss: 2.058808E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.244 | TFLOPs: 40.53 |
15: iteration 39340/ 125429 | consumed samples: 10071040 | consumed tokens: 20625489920 | elapsed time per iteration (s): 1.05 | learning rate: 1.614E-04 | global batch size: 256 | lm loss: 2.053482E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.320 | TFLOPs: 40.38 |
15: iteration 39350/ 125429 | consumed samples: 10073600 | consumed tokens: 20630732800 | elapsed time per iteration (s): 1.04 | learning rate: 1.613E-04 | global batch size: 256 | lm loss: 2.043475E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.300 | TFLOPs: 40.54 |
15: iteration 39360/ 125429 | consumed samples: 10076160 | consumed tokens: 20635975680 | elapsed time per iteration (s): 1.06 | learning rate: 1.613E-04 | global batch size: 256 | lm loss: 2.052881E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.535 | TFLOPs: 40.08 |
15: iteration 39370/ 125429 | consumed samples: 10078720 | consumed tokens: 20641218560 | elapsed time per iteration (s): 1.02 | learning rate: 1.613E-04 | global batch size: 256 | lm loss: 2.029462E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.008 | TFLOPs: 41.32 |
15: iteration 39380/ 125429 | consumed samples: 10081280 | consumed tokens: 20646461440 | elapsed time per iteration (s): 1.07 | learning rate: 1.613E-04 | global batch size: 256 | lm loss: 2.063064E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.359 | TFLOPs: 39.56 |
15: iteration 39390/ 125429 | consumed samples: 10083840 | consumed tokens: 20651704320 | elapsed time per iteration (s): 1.05 | learning rate: 1.613E-04 | global batch size: 256 | lm loss: 2.078586E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.779 | TFLOPs: 40.12 |
15: iteration 39400/ 125429 | consumed samples: 10086400 | consumed tokens: 20656947200 | elapsed time per iteration (s): 1.03 | learning rate: 1.612E-04 | global batch size: 256 | lm loss: 2.028333E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.732 | TFLOPs: 40.94 |
15: iteration 39410/ 125429 | consumed samples: 10088960 | consumed tokens: 20662190080 | elapsed time per iteration (s): 1.04 | learning rate: 1.612E-04 | global batch size: 256 | lm loss: 2.008693E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.842 | TFLOPs: 40.63 |
15: iteration 39420/ 125429 | consumed samples: 10091520 | consumed tokens: 20667432960 | elapsed time per iteration (s): 1.04 | learning rate: 1.612E-04 | global batch size: 256 | lm loss: 2.044714E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.171 | TFLOPs: 40.85 |
15: iteration 39430/ 125429 | consumed samples: 10094080 | consumed tokens: 20672675840 | elapsed time per iteration (s): 1.06 | learning rate: 1.612E-04 | global batch size: 256 | lm loss: 2.061185E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.015 | TFLOPs: 39.83 |
15: iteration 39440/ 125429 | consumed samples: 10096640 | consumed tokens: 20677918720 | elapsed time per iteration (s): 1.04 | learning rate: 1.612E-04 | global batch size: 256 | lm loss: 2.053258E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.076 | TFLOPs: 40.50 |
15: iteration 39450/ 125429 | consumed samples: 10099200 | consumed tokens: 20683161600 | elapsed time per iteration (s): 1.05 | learning rate: 1.611E-04 | global batch size: 256 | lm loss: 2.061539E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.201 | TFLOPs: 40.19 |
15: iteration 39460/ 125429 | consumed samples: 10101760 | consumed tokens: 20688404480 | elapsed time per iteration (s): 1.05 | learning rate: 1.611E-04 | global batch size: 256 | lm loss: 2.076178E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.722 | TFLOPs: 40.44 |
15: iteration 39470/ 125429 | consumed samples: 10104320 | consumed tokens: 20693647360 | elapsed time per iteration (s): 1.05 | learning rate: 1.611E-04 | global batch size: 256 | lm loss: 2.052841E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.191 | TFLOPs: 40.35 |
15: iteration 39480/ 125429 | consumed samples: 10106880 | consumed tokens: 20698890240 | elapsed time per iteration (s): 1.09 | learning rate: 1.611E-04 | global batch size: 256 | lm loss: 2.061648E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.564 | TFLOPs: 38.76 |
15: iteration 39490/ 125429 | consumed samples: 10109440 | consumed tokens: 20704133120 | elapsed time per iteration (s): 1.03 | learning rate: 1.611E-04 | global batch size: 256 | lm loss: 2.073604E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.501 | TFLOPs: 41.07 |
15: iteration 39500/ 125429 | consumed samples: 10112000 | consumed tokens: 20709376000 | elapsed time per iteration (s): 1.03 | learning rate: 1.611E-04 | global batch size: 256 | lm loss: 2.063255E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.616 | TFLOPs: 40.92 |
15: iteration 39510/ 125429 | consumed samples: 10114560 | consumed tokens: 20714618880 | elapsed time per iteration (s): 1.08 | learning rate: 1.610E-04 | global batch size: 256 | lm loss: 2.045597E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.819 | TFLOPs: 39.30 |
15: iteration 39520/ 125429 | consumed samples: 10117120 | consumed tokens: 20719861760 | elapsed time per iteration (s): 1.05 | learning rate: 1.610E-04 | global batch size: 256 | lm loss: 2.032158E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.277 | TFLOPs: 40.20 |
15: iteration 39530/ 125429 | consumed samples: 10119680 | consumed tokens: 20725104640 | elapsed time per iteration (s): 1.03 | learning rate: 1.610E-04 | global batch size: 256 | lm loss: 2.051501E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.702 | TFLOPs: 40.93 |
15: iteration 39540/ 125429 | consumed samples: 10122240 | consumed tokens: 20730347520 | elapsed time per iteration (s): 1.02 | learning rate: 1.610E-04 | global batch size: 256 | lm loss: 2.046644E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.852 | TFLOPs: 41.46 |
15: iteration 39550/ 125429 | consumed samples: 10124800 | consumed tokens: 20735590400 | elapsed time per iteration (s): 1.03 | learning rate: 1.610E-04 | global batch size: 256 | lm loss: 2.037760E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.658 | TFLOPs: 41.09 |
15: iteration 39560/ 125429 | consumed samples: 10127360 | consumed tokens: 20740833280 | elapsed time per iteration (s): 1.02 | learning rate: 1.609E-04 | global batch size: 256 | lm loss: 2.034762E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.820 | TFLOPs: 41.45 |
15: iteration 39570/ 125429 | consumed samples: 10129920 | consumed tokens: 20746076160 | elapsed time per iteration (s): 1.03 | learning rate: 1.609E-04 | global batch size: 256 | lm loss: 2.050169E+00 | grad norm: 0.575 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.728 | TFLOPs: 40.94 |
15: iteration 39580/ 125429 | consumed samples: 10132480 | consumed tokens: 20751319040 | elapsed time per iteration (s): 1.04 | learning rate: 1.609E-04 | global batch size: 256 | lm loss: 2.072962E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.013 | TFLOPs: 40.49 |
15: iteration 39590/ 125429 | consumed samples: 10135040 | consumed tokens: 20756561920 | elapsed time per iteration (s): 1.04 | learning rate: 1.609E-04 | global batch size: 256 | lm loss: 2.047304E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.178 | TFLOPs: 40.85 |
15: iteration 39600/ 125429 | consumed samples: 10137600 | consumed tokens: 20761804800 | elapsed time per iteration (s): 1.03 | learning rate: 1.609E-04 | global batch size: 256 | lm loss: 2.070247E+00 | grad norm: 0.279 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.619 | TFLOPs: 40.92 |
15: iteration 39610/ 125429 | consumed samples: 10140160 | consumed tokens: 20767047680 | elapsed time per iteration (s): 1.04 | learning rate: 1.608E-04 | global batch size: 256 | lm loss: 2.069201E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.559 | TFLOPs: 40.58 |
15: iteration 39620/ 125429 | consumed samples: 10142720 | consumed tokens: 20772290560 | elapsed time per iteration (s): 1.06 | learning rate: 1.608E-04 | global batch size: 256 | lm loss: 2.076613E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.919 | TFLOPs: 39.98 |
15: iteration 39630/ 125429 | consumed samples: 10145280 | consumed tokens: 20777533440 | elapsed time
per iteration (s): 1.04 | learning rate: 1.608E-04 | global batch size: 256 | lm loss: 2.066319E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.301 | TFLOPs: 40.87 | 15: iteration 39640/ 125429 | consumed samples: 10147840 | consumed tokens: 20782776320 | elapsed time per iteration (s): 1.06 | learning rate: 1.608E-04 | global batch size: 256 | lm loss: 2.051823E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.976 | TFLOPs: 39.99 | 15: iteration 39650/ 125429 | consumed samples: 10150400 | consumed tokens: 20788019200 | elapsed time per iteration (s): 1.06 | learning rate: 1.608E-04 | global batch size: 256 | lm loss: 2.042130E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.155 | TFLOPs: 40.02 | 15: iteration 39660/ 125429 | consumed samples: 10152960 | consumed tokens: 20793262080 | elapsed time per iteration (s): 1.07 | learning rate: 1.608E-04 | global batch size: 256 | lm loss: 2.088118E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.249 | TFLOPs: 39.70 | 15: iteration 39670/ 125429 | consumed samples: 10155520 | consumed tokens: 20798504960 | elapsed time per iteration (s): 1.06 | learning rate: 1.607E-04 | global batch size: 256 | lm loss: 2.037057E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.814 | TFLOPs: 39.80 | 15: iteration 39680/ 125429 | consumed samples: 10158080 | consumed tokens: 20803747840 | elapsed time per iteration (s): 1.26 | learning rate: 1.607E-04 | global batch size: 256 | lm loss: 2.038960E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 203.196 | TFLOPs: 
33.58 | 15: iteration 39690/ 125429 | consumed samples: 10160640 | consumed tokens: 20808990720 | elapsed time per iteration (s): 1.04 | learning rate: 1.607E-04 | global batch size: 256 | lm loss: 2.037742E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.960 | TFLOPs: 40.81 | 15: iteration 39700/ 125429 | consumed samples: 10163200 | consumed tokens: 20814233600 | elapsed time per iteration (s): 1.03 | learning rate: 1.607E-04 | global batch size: 256 | lm loss: 2.079214E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.093 | TFLOPs: 41.00 | 15: iteration 39710/ 125429 | consumed samples: 10165760 | consumed tokens: 20819476480 | elapsed time per iteration (s): 1.08 | learning rate: 1.607E-04 | global batch size: 256 | lm loss: 2.065039E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.675 | TFLOPs: 39.11 | 15: iteration 39720/ 125429 | consumed samples: 10168320 | consumed tokens: 20824719360 | elapsed time per iteration (s): 1.05 | learning rate: 1.606E-04 | global batch size: 256 | lm loss: 2.071734E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.840 | TFLOPs: 40.30 | 15: iteration 39730/ 125429 | consumed samples: 10170880 | consumed tokens: 20829962240 | elapsed time per iteration (s): 1.04 | learning rate: 1.606E-04 | global batch size: 256 | lm loss: 2.043115E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.430 | TFLOPs: 40.72 | 15: iteration 39740/ 125429 | consumed samples: 10173440 | consumed tokens: 20835205120 | elapsed time per iteration (s): 1.03 | learning rate: 1.606E-04 | global batch size: 256 | lm loss: 2.053811E+00 | grad norm: 0.138 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.603 | TFLOPs: 41.25 | 15: iteration 39750/ 125429 | consumed samples: 10176000 | consumed tokens: 20840448000 | elapsed time per iteration (s): 1.05 | learning rate: 1.606E-04 | global batch size: 256 | lm loss: 2.028584E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.050 | TFLOPs: 40.33 | 15: iteration 39760/ 125429 | consumed samples: 10178560 | consumed tokens: 20845690880 | elapsed time per iteration (s): 1.04 | learning rate: 1.606E-04 | global batch size: 256 | lm loss: 2.045288E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.970 | TFLOPs: 40.65 | 15: iteration 39770/ 125429 | consumed samples: 10181120 | consumed tokens: 20850933760 | elapsed time per iteration (s): 1.02 | learning rate: 1.605E-04 | global batch size: 256 | lm loss: 2.036987E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.779 | TFLOPs: 41.28 | 15: iteration 39780/ 125429 | consumed samples: 10183680 | consumed tokens: 20856176640 | elapsed time per iteration (s): 1.04 | learning rate: 1.605E-04 | global batch size: 256 | lm loss: 2.055324E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.386 | TFLOPs: 40.55 | 15: iteration 39790/ 125429 | consumed samples: 10186240 | consumed tokens: 20861419520 | elapsed time per iteration (s): 1.03 | learning rate: 1.605E-04 | global batch size: 256 | lm loss: 2.066641E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.643 | TFLOPs: 41.26 | 15: iteration 39800/ 125429 | consumed samples: 10188800 | consumed tokens: 20866662400 | elapsed time per iteration (s): 1.02 | 
learning rate: 1.605E-04 | global batch size: 256 | lm loss: 2.039890E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.603 | TFLOPs: 41.41 | 15: iteration 39810/ 125429 | consumed samples: 10191360 | consumed tokens: 20871905280 | elapsed time per iteration (s): 1.04 | learning rate: 1.605E-04 | global batch size: 256 | lm loss: 2.051147E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.895 | TFLOPs: 40.80 | 15: iteration 39820/ 125429 | consumed samples: 10193920 | consumed tokens: 20877148160 | elapsed time per iteration (s): 1.04 | learning rate: 1.605E-04 | global batch size: 256 | lm loss: 2.038127E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.016 | TFLOPs: 40.66 | 15: iteration 39830/ 125429 | consumed samples: 10196480 | consumed tokens: 20882391040 | elapsed time per iteration (s): 1.03 | learning rate: 1.604E-04 | global batch size: 256 | lm loss: 2.046569E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.661 | TFLOPs: 41.26 | 15: iteration 39840/ 125429 | consumed samples: 10199040 | consumed tokens: 20887633920 | elapsed time per iteration (s): 1.03 | learning rate: 1.604E-04 | global batch size: 256 | lm loss: 2.068356E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.174 | TFLOPs: 41.01 | 15: iteration 39850/ 125429 | consumed samples: 10201600 | consumed tokens: 20892876800 | elapsed time per iteration (s): 1.03 | learning rate: 1.604E-04 | global batch size: 256 | lm loss: 2.055414E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.950 | TFLOPs: 40.98 | 15: iteration 39860/ 
125429 | consumed samples: 10204160 | consumed tokens: 20898119680 | elapsed time per iteration (s): 1.09 | learning rate: 1.604E-04 | global batch size: 256 | lm loss: 2.038512E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.727 | TFLOPs: 38.79 | 15: iteration 39870/ 125429 | consumed samples: 10206720 | consumed tokens: 20903362560 | elapsed time per iteration (s): 1.07 | learning rate: 1.604E-04 | global batch size: 256 | lm loss: 2.049566E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.532 | TFLOPs: 39.42 | 15: iteration 39880/ 125429 | consumed samples: 10209280 | consumed tokens: 20908605440 | elapsed time per iteration (s): 1.03 | learning rate: 1.603E-04 | global batch size: 256 | lm loss: 2.054787E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.146 | TFLOPs: 41.01 | 15: iteration 39890/ 125429 | consumed samples: 10211840 | consumed tokens: 20913848320 | elapsed time per iteration (s): 1.05 | learning rate: 1.603E-04 | global batch size: 256 | lm loss: 2.018897E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.846 | TFLOPs: 40.30 | 15: iteration 39900/ 125429 | consumed samples: 10214400 | consumed tokens: 20919091200 | elapsed time per iteration (s): 1.05 | learning rate: 1.603E-04 | global batch size: 256 | lm loss: 2.022961E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.676 | TFLOPs: 40.43 | 15: iteration 39910/ 125429 | consumed samples: 10216960 | consumed tokens: 20924334080 | elapsed time per iteration (s): 1.03 | learning rate: 1.603E-04 | global batch size: 256 | lm loss: 2.035746E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 248.160 | TFLOPs: 41.01 | 15: iteration 39920/ 125429 | consumed samples: 10219520 | consumed tokens: 20929576960 | elapsed time per iteration (s): 1.03 | learning rate: 1.603E-04 | global batch size: 256 | lm loss: 2.062164E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.530 | TFLOPs: 41.07 | 15: iteration 39930/ 125429 | consumed samples: 10222080 | consumed tokens: 20934819840 | elapsed time per iteration (s): 1.06 | learning rate: 1.602E-04 | global batch size: 256 | lm loss: 2.064469E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.125 | TFLOPs: 40.01 | 15: iteration 39940/ 125429 | consumed samples: 10224640 | consumed tokens: 20940062720 | elapsed time per iteration (s): 1.03 | learning rate: 1.602E-04 | global batch size: 256 | lm loss: 2.027031E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.645 | TFLOPs: 41.26 | 15: iteration 39950/ 125429 | consumed samples: 10227200 | consumed tokens: 20945305600 | elapsed time per iteration (s): 1.03 | learning rate: 1.602E-04 | global batch size: 256 | lm loss: 2.048311E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.621 | TFLOPs: 41.09 | 15: iteration 39960/ 125429 | consumed samples: 10229760 | consumed tokens: 20950548480 | elapsed time per iteration (s): 1.04 | learning rate: 1.602E-04 | global batch size: 256 | lm loss: 2.037764E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.255 | TFLOPs: 40.70 | 15: iteration 39970/ 125429 | consumed samples: 10232320 | consumed tokens: 20955791360 | elapsed time per iteration (s): 1.04 | learning rate: 
1.602E-04 | global batch size: 256 | lm loss: 2.027514E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.118 | TFLOPs: 40.51 | 15: iteration 39980/ 125429 | consumed samples: 10234880 | consumed tokens: 20961034240 | elapsed time per iteration (s): 1.04 | learning rate: 1.602E-04 | global batch size: 256 | lm loss: 2.033538E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.492 | TFLOPs: 40.57 | 15: iteration 39990/ 125429 | consumed samples: 10237440 | consumed tokens: 20966277120 | elapsed time per iteration (s): 1.07 | learning rate: 1.601E-04 | global batch size: 256 | lm loss: 2.076710E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.123 | TFLOPs: 39.52 | 0: [2022-11-26 07:44:24,648] [INFO] [logging.py:68:log_dist] [Rank 0] step=40000, skipped=0, lr=[0.00016011278577013395, 0.00016011278577013395, 0.00016011278577013395], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 40000/ 125429 | consumed samples: 10240000 | consumed tokens: 20971520000 | elapsed time per iteration (s): 1.03 | learning rate: 1.601E-04 | global batch size: 256 | lm loss: 2.041065E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.687 | TFLOPs: 41.10 | 0: steps: 40000 loss: 1.9646 iter time (s): 1.052 samples/sec: 243.462 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 40000 | lm loss value: 1.985541E+00 | lm loss PPL: 7.282984E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 40000 to checkpoints_1b5 0: [2022-11-26 07:44:25,080] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step40000 
is begin to save! 0: [2022-11-26 07:44:25,089] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_01-model_00-model_states.pt... 0: [2022-11-26 07:44:25,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_01-model_00-model_states.pt. 0: [2022-11-26 07:44:25,320] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_03-model_00-model_states.pt... 0: [2022-11-26 07:44:25,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_03-model_00-model_states.pt. 0: [2022-11-26 07:44:25,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_04-model_00-model_states.pt... 0: [2022-11-26 07:44:25,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_04-model_00-model_states.pt. 0: [2022-11-26 07:44:25,520] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_05-model_00-model_states.pt... 0: [2022-11-26 07:44:25,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_05-model_00-model_states.pt. 0: [2022-11-26 07:44:25,622] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_06-model_00-model_states.pt... 0: [2022-11-26 07:44:25,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_06-model_00-model_states.pt. 0: [2022-11-26 07:44:25,722] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_07-model_00-model_states.pt... 0: [2022-11-26 07:44:25,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 07:44:25,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_08-model_00-model_states.pt... 0: [2022-11-26 07:44:25,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_08-model_00-model_states.pt. 0: [2022-11-26 07:44:25,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_09-model_00-model_states.pt... 0: [2022-11-26 07:44:26,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_09-model_00-model_states.pt. 0: [2022-11-26 07:44:26,033] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_10-model_00-model_states.pt... 0: [2022-11-26 07:44:26,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_10-model_00-model_states.pt. 0: [2022-11-26 07:44:26,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_11-model_00-model_states.pt... 0: [2022-11-26 07:44:26,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_11-model_00-model_states.pt. 0: [2022-11-26 07:44:26,240] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_12-model_00-model_states.pt... 0: [2022-11-26 07:44:26,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_12-model_00-model_states.pt. 0: [2022-11-26 07:44:26,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_13-model_00-model_states.pt... 0: [2022-11-26 07:44:26,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 07:44:26,460] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_14-model_00-model_states.pt... 0: [2022-11-26 07:44:26,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_14-model_00-model_states.pt. 0: [2022-11-26 07:44:26,568] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_15-model_00-model_states.pt... 0: [2022-11-26 07:44:26,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_15-model_00-model_states.pt. 0: [2022-11-26 07:44:26,678] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_16-model_00-model_states.pt... 0: [2022-11-26 07:44:26,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_16-model_00-model_states.pt. 0: [2022-11-26 07:44:26,788] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_17-model_00-model_states.pt... 0: [2022-11-26 07:44:26,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_17-model_00-model_states.pt. 0: [2022-11-26 07:44:26,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_18-model_00-model_states.pt... 0: [2022-11-26 07:44:26,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_18-model_00-model_states.pt. 0: [2022-11-26 07:44:26,999] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_19-model_00-model_states.pt... 0: [2022-11-26 07:44:27,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 07:44:27,109] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_20-model_00-model_states.pt... 0: [2022-11-26 07:44:27,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_20-model_00-model_states.pt. 0: [2022-11-26 07:44:27,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_21-model_00-model_states.pt... 0: [2022-11-26 07:44:27,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_21-model_00-model_states.pt. 0: [2022-11-26 07:44:27,319] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_22-model_00-model_states.pt... 0: [2022-11-26 07:44:27,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_22-model_00-model_states.pt. 0: [2022-11-26 07:44:27,427] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_23-model_00-model_states.pt... 0: [2022-11-26 07:44:27,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_23-model_00-model_states.pt. 0: [2022-11-26 07:44:27,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_24-model_00-model_states.pt... 0: [2022-11-26 07:44:27,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_24-model_00-model_states.pt. 0: [2022-11-26 07:44:27,641] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_25-model_00-model_states.pt... 0: [2022-11-26 07:44:27,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 07:44:27,748] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_26-model_00-model_states.pt... 0: [2022-11-26 07:44:27,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_26-model_00-model_states.pt. 0: [2022-11-26 07:44:27,856] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_27-model_00-model_states.pt... 0: [2022-11-26 07:44:27,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_27-model_00-model_states.pt. 0: [2022-11-26 07:44:27,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_28-model_00-model_states.pt... 0: [2022-11-26 07:44:28,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_28-model_00-model_states.pt. 0: [2022-11-26 07:44:28,073] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_29-model_00-model_states.pt... 0: [2022-11-26 07:44:28,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_29-model_00-model_states.pt. 0: [2022-11-26 07:44:28,184] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_30-model_00-model_states.pt... 0: [2022-11-26 07:44:28,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_30-model_00-model_states.pt. 0: [2022-11-26 07:44:28,283] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/layer_32-model_00-model_states.pt... 0: [2022-11-26 07:44:28,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 07:44:28,294] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step40000/mp_rank_00_model_states.pt 0: [2022-11-26 07:44:28,294] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/mp_rank_00_model_states.pt... 0: [2022-11-26 07:44:28,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/mp_rank_00_model_states.pt. 0: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
15: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
4: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
12: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
0: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
15: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
7: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
6: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
11: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 5: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
2: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 9: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 6: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 11: [2022-11-26 07:44:28,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step40000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 15: [2022-11-26 07:44:28,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 07:44:28,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 07:44:28,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 0: [2022-11-26 07:44:28,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:44:28,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 07:44:28,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 8: [2022-11-26 07:44:28,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:44:28,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 07:44:28,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 3: [2022-11-26 07:44:28,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:44:28,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 07:44:28,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 15: [2022-11-26 07:44:28,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
3: [2022-11-26 07:44:28,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:44:28,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 07:44:28,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 9: [2022-11-26 07:44:28,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:44:28,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 07:44:28,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 6: [2022-11-26 07:44:28,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:44:28,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 07:44:28,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 3: [2022-11-26 07:44:28,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:44:28,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 07:44:28,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
7: [2022-11-26 07:44:28,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:44:28,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 07:44:28,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 2: [2022-11-26 07:44:28,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:44:28,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 07:44:28,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 6: [2022-11-26 07:44:28,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:44:28,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 07:44:28,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 6: [2022-11-26 07:44:28,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:44:28,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 07:44:28,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
9: [2022-11-26 07:44:28,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:44:28,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 07:44:28,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 6: [2022-11-26 07:44:28,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:44:28,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 07:44:28,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 14: [2022-11-26 07:44:28,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:44:28,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 07:44:28,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 9: [2022-11-26 07:44:28,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:44:28,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 07:44:28,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
0: [2022-11-26 07:44:28,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:44:28,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:44:28,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 07:44:28,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 14: [2022-11-26 07:44:28,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:44:28,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 07:44:28,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 8: [2022-11-26 07:44:28,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:44:28,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 07:44:28,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 8: [2022-11-26 07:44:28,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 07:44:28,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 07:44:28,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 14: [2022-11-26 07:44:28,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:44:28,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 07:44:28,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 9: [2022-11-26 07:44:28,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:44:28,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 07:44:28,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 15: [2022-11-26 07:44:28,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 07:44:28,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 15: [2022-11-26 07:44:28,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-26 07:44:28,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 07:44:28,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 15: [2022-11-26 07:44:28,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:44:28,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 07:44:28,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 14: [2022-11-26 07:44:28,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:44:28,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:44:28,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 07:44:28,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 07:44:28,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 14: [2022-11-26 07:44:28,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 13: [2022-11-26 07:44:28,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 07:44:28,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 07:44:28,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 15: [2022-11-26 07:44:28,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:44:28,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:44:28,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 07:44:28,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 0: [2022-11-26 07:44:28,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:44:28,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 07:44:28,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 13: [2022-11-26 07:44:28,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:44:28,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 07:44:28,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
11: [2022-11-26 07:44:28,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:44:28,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:44:28,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 07:44:28,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 15: [2022-11-26 07:44:28,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 07:44:28,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 12: [2022-11-26 07:44:28,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:44:28,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:44:28,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 07:44:28,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 07:44:28,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 12: [2022-11-26 07:44:28,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
2: [2022-11-26 07:44:28,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:44:28,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 07:44:28,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 2: [2022-11-26 07:44:28,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:44:28,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 07:44:28,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 2: [2022-11-26 07:44:28,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:44:28,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 14: [2022-11-26 07:44:28,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:44:28,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 14: [2022-11-26 07:44:28,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 07:44:28,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
15: [2022-11-26 07:44:28,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:44:28,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:44:28,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 07:44:28,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 13: [2022-11-26 07:44:28,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:44:28,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 07:44:28,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 8: [2022-11-26 07:44:28,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:44:28,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 07:44:28,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 7: [2022-11-26 07:44:28,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 07:44:28,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 07:44:28,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 11: [2022-11-26 07:44:28,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 07:44:28,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 11: [2022-11-26 07:44:28,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:44:28,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 07:44:28,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 13: [2022-11-26 07:44:28,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:44:28,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:44:28,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 07:44:28,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 07:44:28,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 07:44:28,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 13: [2022-11-26 07:44:28,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 11: [2022-11-26 07:44:28,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 13: [2022-11-26 07:44:28,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 3: [2022-11-26 07:44:28,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:44:28,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 07:44:28,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 7: [2022-11-26 07:44:28,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:44:28,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 07:44:28,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
6: [2022-11-26 07:44:28,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:44:28,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 07:44:28,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 2: [2022-11-26 07:44:28,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:44:28,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 07:44:28,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 7: [2022-11-26 07:44:28,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:44:28,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 07:44:28,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 14: [2022-11-26 07:44:28,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:44:28,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 07:44:28,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
11: [2022-11-26 07:44:28,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:44:28,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 07:44:28,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 5: [2022-11-26 07:44:28,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:44:28,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 07:44:28,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 3: [2022-11-26 07:44:28,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:44:28,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 07:44:28,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 13: [2022-11-26 07:44:28,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:44:28,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 07:44:28,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
14: [2022-11-26 07:44:28,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 07:44:28,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 07:44:28,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 5: [2022-11-26 07:44:28,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:44:28,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 07:44:28,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 11: [2022-11-26 07:44:28,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:44:28,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:44:28,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 5: [2022-11-26 07:44:28,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 11: [2022-11-26 07:44:28,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 5: [2022-11-26 07:44:28,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
12: [2022-11-26 07:44:28,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:44:28,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:44:28,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 11: [2022-11-26 07:44:28,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 12: [2022-11-26 07:44:28,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 11: [2022-11-26 07:44:28,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 15: [2022-11-26 07:44:28,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 07:44:28,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 6: [2022-11-26 07:44:28,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:44:28,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 13: [2022-11-26 07:44:28,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:44:28,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
13: [2022-11-26 07:44:28,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 07:44:28,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 2: [2022-11-26 07:44:28,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:44:28,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 07:44:28,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 12: [2022-11-26 07:44:28,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:44:28,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 07:44:28,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 5: [2022-11-26 07:44:28,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:44:28,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 07:44:28,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 8: [2022-11-26 07:44:28,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 07:44:28,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 07:44:28,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 6: [2022-11-26 07:44:28,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:44:28,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 07:44:28,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 7: [2022-11-26 07:44:28,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:44:28,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:44:28,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 07:44:28,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 07:44:28,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 7: [2022-11-26 07:44:28,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 0: [2022-11-26 07:44:28,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 07:44:28,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 07:44:28,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 13: [2022-11-26 07:44:28,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:44:28,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 07:44:28,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 7: [2022-11-26 07:44:28,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:44:28,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 07:44:28,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 12: [2022-11-26 07:44:28,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:44:28,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 07:44:28,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 0: [2022-11-26 07:44:28,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 07:44:28,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 07:44:28,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 12: [2022-11-26 07:44:28,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:44:28,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 07:44:28,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 15: [2022-11-26 07:44:28,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:44:28,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 07:44:28,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 07:44:28,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 5: [2022-11-26 07:44:28,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:44:28,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 07:44:28,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 07:44:28,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 5: [2022-11-26 07:44:28,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 07:44:28,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 13: [2022-11-26 07:44:28,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 07:44:28,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 07:44:28,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 8: [2022-11-26 07:44:28,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:44:28,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 07:44:28,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 9: [2022-11-26 07:44:28,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:44:28,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 07:44:28,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
5: [2022-11-26 07:44:28,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 07:44:28,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 07:44:28,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 2: [2022-11-26 07:44:28,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:44:28,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 07:44:28,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 12: [2022-11-26 07:44:28,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:44:28,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 07:44:28,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 12: [2022-11-26 07:44:28,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 07:44:28,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 07:44:28,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
8: [2022-11-26 07:44:28,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 07:44:28,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 07:44:28,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 11: [2022-11-26 07:44:28,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 07:44:28,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 07:44:28,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 2: [2022-11-26 07:44:28,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 07:44:28,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 07:44:28,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 6: [2022-11-26 07:44:28,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 07:44:28,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 07:44:28,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
15: [2022-11-26 07:44:28,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 07:44:28,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 15: [2022-11-26 07:44:28,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 07:44:28,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 07:44:28,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 0: [2022-11-26 07:44:28,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:44:28,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 07:44:28,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 1: [2022-11-26 07:44:28,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:44:28,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:44:28,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 07:44:28,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 07:44:28,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 07:44:28,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 07:44:28,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 3: [2022-11-26 07:44:28,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 0: [2022-11-26 07:44:28,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:44:28,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 07:44:28,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 07:44:28,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 0: [2022-11-26 07:44:28,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 07:44:28,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 9: [2022-11-26 07:44:28,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 07:44:28,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 07:44:28,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 9: [2022-11-26 07:44:28,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 07:44:28,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 07:44:28,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 4: [2022-11-26 07:44:28,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:44:28,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:44:28,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:44:28,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:44:28,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:44:28,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:44:28,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 07:44:28,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 07:44:28,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 07:44:28,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 07:44:28,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 07:44:28,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 07:44:28,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 07:44:28,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 07:44:28,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 07:44:28,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 4: [2022-11-26 07:44:28,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 07:44:28,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
4: [2022-11-26 07:44:28,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 4: [2022-11-26 07:44:28,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 4: [2022-11-26 07:44:28,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 4: [2022-11-26 07:44:28,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 4: [2022-11-26 07:44:28,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 4: [2022-11-26 07:44:28,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 0: [2022-11-26 07:44:28,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 07:44:28,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 1: [2022-11-26 07:44:28,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:44:28,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 07:44:28,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 07:44:28,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 07:44:28,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
1: [2022-11-26 07:44:28,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 1: [2022-11-26 07:44:28,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 1: [2022-11-26 07:44:28,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:44:28,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 07:44:28,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 1: [2022-11-26 07:44:28,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:44:28,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:44:28,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 07:44:28,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 07:44:28,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 07:44:28,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 07:44:28,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 07:44:28,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 1: [2022-11-26 07:44:28,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 1: [2022-11-26 07:44:28,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 1: [2022-11-26 07:44:28,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 07:44:28,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 10: [2022-11-26 07:44:28,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:44:28,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:44:28,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:44:28,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:44:28,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 07:44:28,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:44:28,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:44:28,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 07:44:28,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 07:44:28,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 07:44:28,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 07:44:28,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 07:44:28,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 07:44:28,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 10: [2022-11-26 07:44:28,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 07:44:28,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
10: [2022-11-26 07:44:28,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 10: [2022-11-26 07:44:28,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 10: [2022-11-26 07:44:28,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 10: [2022-11-26 07:44:28,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 10: [2022-11-26 07:44:28,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 07:44:28,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 10: [2022-11-26 07:44:28,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step40000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 07:44:28,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step40000 is ready now! 
0: successfully saved checkpoint at iteration 40000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3634.55 15: iteration 40010/ 125429 | consumed samples: 10242560 | consumed tokens: 20976762880 | elapsed time per iteration (s): 1.44 | learning rate: 1.601E-04 | global batch size: 256 | lm loss: 2.075167E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.277 | TFLOPs: 29.30 | 15: iteration 40020/ 125429 | consumed samples: 10245120 | consumed tokens: 20982005760 | elapsed time per iteration (s): 1.05 | learning rate: 1.601E-04 | global batch size: 256 | lm loss: 2.072560E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.704 | TFLOPs: 40.44 | 15: iteration 40030/ 125429 | consumed samples: 10247680 | consumed tokens: 20987248640 | elapsed time per iteration (s): 1.05 | learning rate: 1.601E-04 | global batch size: 256 | lm loss: 2.077654E+00 | grad norm: 3.947 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.179 | TFLOPs: 40.35 | 15: iteration 40040/ 125429 | consumed samples: 10250240 | consumed tokens: 20992491520 | elapsed time per iteration (s): 1.05 | learning rate: 1.600E-04 | global batch size: 256 | lm loss: 2.174783E+00 | grad norm: 0.323 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.615 | TFLOPs: 40.42 | 15: iteration 40050/ 125429 | consumed samples: 10252800 | consumed tokens: 20997734400 | elapsed time per iteration (s): 1.02 | learning rate: 1.600E-04 | global batch size: 256 | lm loss: 2.092316E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.463 | TFLOPs: 41.56 | 15: iteration 40060/ 125429 | consumed samples: 10255360 | consumed tokens: 21002977280 | elapsed time per iteration (s): 1.02 | 
learning rate: 1.600E-04 | global batch size: 256 | lm loss: 2.063527E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.230 | TFLOPs: 41.35 | 15: iteration 40070/ 125429 | consumed samples: 10257920 | consumed tokens: 21008220160 | elapsed time per iteration (s): 1.06 | learning rate: 1.600E-04 | global batch size: 256 | lm loss: 2.049416E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.164 | TFLOPs: 40.02 | 15: iteration 40080/ 125429 | consumed samples: 10260480 | consumed tokens: 21013463040 | elapsed time per iteration (s): 1.05 | learning rate: 1.600E-04 | global batch size: 256 | lm loss: 2.034000E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.759 | TFLOPs: 40.45 | 15: iteration 40090/ 125429 | consumed samples: 10263040 | consumed tokens: 21018705920 | elapsed time per iteration (s): 1.08 | learning rate: 1.599E-04 | global batch size: 256 | lm loss: 2.094315E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.978 | TFLOPs: 39.00 | 15: iteration 40100/ 125429 | consumed samples: 10265600 | consumed tokens: 21023948800 | elapsed time per iteration (s): 1.04 | learning rate: 1.599E-04 | global batch size: 256 | lm loss: 2.054487E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.191 | TFLOPs: 40.69 | 15: iteration 40110/ 125429 | consumed samples: 10268160 | consumed tokens: 21029191680 | elapsed time per iteration (s): 1.02 | learning rate: 1.599E-04 | global batch size: 256 | lm loss: 2.072886E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.025 | TFLOPs: 41.32 | 15: iteration 40120/ 
125429 | consumed samples: 10270720 | consumed tokens: 21034434560 | elapsed time per iteration (s): 1.06 | learning rate: 1.599E-04 | global batch size: 256 | lm loss: 2.048812E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.961 | TFLOPs: 39.99 | 15: iteration 40130/ 125429 | consumed samples: 10273280 | consumed tokens: 21039677440 | elapsed time per iteration (s): 1.05 | learning rate: 1.599E-04 | global batch size: 256 | lm loss: 2.043445E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.261 | TFLOPs: 40.37 | 15: iteration 40140/ 125429 | consumed samples: 10275840 | consumed tokens: 21044920320 | elapsed time per iteration (s): 1.03 | learning rate: 1.598E-04 | global batch size: 256 | lm loss: 2.035131E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.079 | TFLOPs: 41.00 | 15: iteration 40150/ 125429 | consumed samples: 10278400 | consumed tokens: 21050163200 | elapsed time per iteration (s): 1.05 | learning rate: 1.598E-04 | global batch size: 256 | lm loss: 2.042630E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.542 | TFLOPs: 40.25 | 15: iteration 40160/ 125429 | consumed samples: 10280960 | consumed tokens: 21055406080 | elapsed time per iteration (s): 1.04 | learning rate: 1.598E-04 | global batch size: 256 | lm loss: 2.039951E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.584 | TFLOPs: 40.58 | 15: iteration 40170/ 125429 | consumed samples: 10283520 | consumed tokens: 21060648960 | elapsed time per iteration (s): 1.09 | learning rate: 1.598E-04 | global batch size: 256 | lm loss: 2.036607E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 234.291 | TFLOPs: 38.72 | 15: iteration 40180/ 125429 | consumed samples: 10286080 | consumed tokens: 21065891840 | elapsed time per iteration (s): 1.05 | learning rate: 1.598E-04 | global batch size: 256 | lm loss: 2.071326E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.832 | TFLOPs: 40.13 | 15: iteration 40190/ 125429 | consumed samples: 10288640 | consumed tokens: 21071134720 | elapsed time per iteration (s): 1.04 | learning rate: 1.598E-04 | global batch size: 256 | lm loss: 2.048437E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.325 | TFLOPs: 40.71 | 15: iteration 40200/ 125429 | consumed samples: 10291200 | consumed tokens: 21076377600 | elapsed time per iteration (s): 1.03 | learning rate: 1.597E-04 | global batch size: 256 | lm loss: 2.061340E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.715 | TFLOPs: 41.10 | 15: iteration 40210/ 125429 | consumed samples: 10293760 | consumed tokens: 21081620480 | elapsed time per iteration (s): 1.03 | learning rate: 1.597E-04 | global batch size: 256 | lm loss: 2.078401E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.535 | TFLOPs: 41.07 | 15: iteration 40220/ 125429 | consumed samples: 10296320 | consumed tokens: 21086863360 | elapsed time per iteration (s): 1.10 | learning rate: 1.597E-04 | global batch size: 256 | lm loss: 2.034297E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.719 | TFLOPs: 38.46 | 15: iteration 40230/ 125429 | consumed samples: 10298880 | consumed tokens: 21092106240 | elapsed time per iteration (s): 1.04 | learning rate: 
1.597E-04 | global batch size: 256 | lm loss: 2.027164E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.725 | TFLOPs: 40.61 | 15: iteration 40240/ 125429 | consumed samples: 10301440 | consumed tokens: 21097349120 | elapsed time per iteration (s): 1.02 | learning rate: 1.597E-04 | global batch size: 256 | lm loss: 2.056230E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.069 | TFLOPs: 41.49 | 15: iteration 40250/ 125429 | consumed samples: 10304000 | consumed tokens: 21102592000 | elapsed time per iteration (s): 1.04 | learning rate: 1.596E-04 | global batch size: 256 | lm loss: 2.066343E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.445 | TFLOPs: 40.73 | 15: iteration 40260/ 125429 | consumed samples: 10306560 | consumed tokens: 21107834880 | elapsed time per iteration (s): 1.04 | learning rate: 1.596E-04 | global batch size: 256 | lm loss: 2.034901E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.921 | TFLOPs: 40.64 | 15: iteration 40270/ 125429 | consumed samples: 10309120 | consumed tokens: 21113077760 | elapsed time per iteration (s): 1.04 | learning rate: 1.596E-04 | global batch size: 256 | lm loss: 2.041527E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.993 | TFLOPs: 40.82 | 15: iteration 40280/ 125429 | consumed samples: 10311680 | consumed tokens: 21118320640 | elapsed time per iteration (s): 1.03 | learning rate: 1.596E-04 | global batch size: 256 | lm loss: 2.043163E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.269 | TFLOPs: 41.03 | 15: iteration 40290/ 125429 | 
consumed samples: 10314240 | consumed tokens: 21123563520 | elapsed time per iteration (s): 1.04 | learning rate: 1.596E-04 | global batch size: 256 | lm loss: 2.051295E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.302 | TFLOPs: 40.70 | 15: iteration 40300/ 125429 | consumed samples: 10316800 | consumed tokens: 21128806400 | elapsed time per iteration (s): 1.03 | learning rate: 1.595E-04 | global batch size: 256 | lm loss: 2.046914E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.270 | TFLOPs: 41.19 | 15: iteration 40310/ 125429 | consumed samples: 10319360 | consumed tokens: 21134049280 | elapsed time per iteration (s): 1.03 | learning rate: 1.595E-04 | global batch size: 256 | lm loss: 2.053909E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.503 | TFLOPs: 41.23 | 15: iteration 40320/ 125429 | consumed samples: 10321920 | consumed tokens: 21139292160 | elapsed time per iteration (s): 1.06 | learning rate: 1.595E-04 | global batch size: 256 | lm loss: 2.066958E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.094 | TFLOPs: 39.84 | 15: iteration 40330/ 125429 | consumed samples: 10324480 | consumed tokens: 21144535040 | elapsed time per iteration (s): 1.07 | learning rate: 1.595E-04 | global batch size: 256 | lm loss: 2.072698E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.620 | TFLOPs: 39.43 | 15: iteration 40340/ 125429 | consumed samples: 10327040 | consumed tokens: 21149777920 | elapsed time per iteration (s): 1.04 | learning rate: 1.595E-04 | global batch size: 256 | lm loss: 2.040530E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 245.034 | TFLOPs: 40.49 | 15: iteration 40350/ 125429 | consumed samples: 10329600 | consumed tokens: 21155020800 | elapsed time per iteration (s): 1.09 | learning rate: 1.594E-04 | global batch size: 256 | lm loss: 2.042952E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.820 | TFLOPs: 38.81 | 15: iteration 40360/ 125429 | consumed samples: 10332160 | consumed tokens: 21160263680 | elapsed time per iteration (s): 1.04 | learning rate: 1.594E-04 | global batch size: 256 | lm loss: 2.051970E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.309 | TFLOPs: 40.87 | 15: iteration 40370/ 125429 | consumed samples: 10334720 | consumed tokens: 21165506560 | elapsed time per iteration (s): 1.08 | learning rate: 1.594E-04 | global batch size: 256 | lm loss: 2.059363E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.189 | TFLOPs: 39.20 | 15: iteration 40380/ 125429 | consumed samples: 10337280 | consumed tokens: 21170749440 | elapsed time per iteration (s): 1.03 | learning rate: 1.594E-04 | global batch size: 256 | lm loss: 2.028761E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.229 | TFLOPs: 41.19 | 15: iteration 40390/ 125429 | consumed samples: 10339840 | consumed tokens: 21175992320 | elapsed time per iteration (s): 1.05 | learning rate: 1.594E-04 | global batch size: 256 | lm loss: 2.054421E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.151 | TFLOPs: 40.35 | 15: iteration 40400/ 125429 | consumed samples: 10342400 | consumed tokens: 21181235200 | elapsed time per iteration (s): 1.05 | learning rate: 1.594E-04 | global batch 
size: 256 | lm loss: 2.049561E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.748 | TFLOPs: 40.12 | 15: iteration 40410/ 125429 | consumed samples: 10344960 | consumed tokens: 21186478080 | elapsed time per iteration (s): 1.06 | learning rate: 1.593E-04 | global batch size: 256 | lm loss: 2.052354E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.304 | TFLOPs: 39.88 | 15: iteration 40420/ 125429 | consumed samples: 10347520 | consumed tokens: 21191720960 | elapsed time per iteration (s): 1.03 | learning rate: 1.593E-04 | global batch size: 256 | lm loss: 2.061111E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.943 | TFLOPs: 40.97 | 15: iteration 40430/ 125429 | consumed samples: 10350080 | consumed tokens: 21196963840 | elapsed time per iteration (s): 1.04 | learning rate: 1.593E-04 | global batch size: 256 | lm loss: 2.063410E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.930 | TFLOPs: 40.64 | 15: iteration 40440/ 125429 | consumed samples: 10352640 | consumed tokens: 21202206720 | elapsed time per iteration (s): 1.08 | learning rate: 1.593E-04 | global batch size: 256 | lm loss: 2.044300E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.536 | TFLOPs: 39.25 | 15: iteration 40450/ 125429 | consumed samples: 10355200 | consumed tokens: 21207449600 | elapsed time per iteration (s): 1.03 | learning rate: 1.593E-04 | global batch size: 256 | lm loss: 2.006448E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.618 | TFLOPs: 40.92 | 15: iteration 40460/ 125429 | consumed samples: 10357760 | 
consumed tokens: 21212692480 | elapsed time per iteration (s): 1.03 | learning rate: 1.592E-04 | global batch size: 256 | lm loss: 2.038669E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.502 | TFLOPs: 41.07 | 15: iteration 40470/ 125429 | consumed samples: 10360320 | consumed tokens: 21217935360 | elapsed time per iteration (s): 1.03 | learning rate: 1.592E-04 | global batch size: 256 | lm loss: 2.049654E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.487 | TFLOPs: 41.23 | 15: iteration 40480/ 125429 | consumed samples: 10362880 | consumed tokens: 21223178240 | elapsed time per iteration (s): 1.04 | learning rate: 1.592E-04 | global batch size: 256 | lm loss: 2.094220E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.198 | TFLOPs: 40.52 | 15: iteration 40490/ 125429 | consumed samples: 10365440 | consumed tokens: 21228421120 | elapsed time per iteration (s): 1.04 | learning rate: 1.592E-04 | global batch size: 256 | lm loss: 2.038936E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.095 | TFLOPs: 40.67 | 15: iteration 40500/ 125429 | consumed samples: 10368000 | consumed tokens: 21233664000 | elapsed time per iteration (s): 1.05 | learning rate: 1.592E-04 | global batch size: 256 | lm loss: 2.054851E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.834 | TFLOPs: 40.30 | 15: iteration 40510/ 125429 | consumed samples: 10370560 | consumed tokens: 21238906880 | elapsed time per iteration (s): 1.07 | learning rate: 1.591E-04 | global batch size: 256 | lm loss: 2.013688E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 240.013 | TFLOPs: 39.66 | 15: iteration 40520/ 125429 | consumed samples: 10373120 | consumed tokens: 21244149760 | elapsed time per iteration (s): 1.06 | learning rate: 1.591E-04 | global batch size: 256 | lm loss: 2.074335E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.310 | TFLOPs: 39.88 | 15: iteration 40530/ 125429 | consumed samples: 10375680 | consumed tokens: 21249392640 | elapsed time per iteration (s): 1.04 | learning rate: 1.591E-04 | global batch size: 256 | lm loss: 2.084562E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.771 | TFLOPs: 40.62 | 15: iteration 40540/ 125429 | consumed samples: 10378240 | consumed tokens: 21254635520 | elapsed time per iteration (s): 1.03 | learning rate: 1.591E-04 | global batch size: 256 | lm loss: 2.062073E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.043 | TFLOPs: 41.16 | 15: iteration 40550/ 125429 | consumed samples: 10380800 | consumed tokens: 21259878400 | elapsed time per iteration (s): 1.04 | learning rate: 1.591E-04 | global batch size: 256 | lm loss: 2.053672E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.362 | TFLOPs: 40.71 | 15: iteration 40560/ 125429 | consumed samples: 10383360 | consumed tokens: 21265121280 | elapsed time per iteration (s): 1.06 | learning rate: 1.590E-04 | global batch size: 256 | lm loss: 2.067140E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.419 | TFLOPs: 39.73 | 15: iteration 40570/ 125429 | consumed samples: 10385920 | consumed tokens: 21270364160 | elapsed time per iteration (s): 1.05 | learning rate: 1.590E-04 | global batch size: 256 | lm loss: 
2.065028E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.668 | TFLOPs: 40.43 | 15: iteration 40580/ 125429 | consumed samples: 10388480 | consumed tokens: 21275607040 | elapsed time per iteration (s): 1.03 | learning rate: 1.590E-04 | global batch size: 256 | lm loss: 2.047078E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.943 | TFLOPs: 40.97 | 15: iteration 40590/ 125429 | consumed samples: 10391040 | consumed tokens: 21280849920 | elapsed time per iteration (s): 1.12 | learning rate: 1.590E-04 | global batch size: 256 | lm loss: 2.033078E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.401 | TFLOPs: 37.91 | 15: iteration 40600/ 125429 | consumed samples: 10393600 | consumed tokens: 21286092800 | elapsed time per iteration (s): 1.05 | learning rate: 1.590E-04 | global batch size: 256 | lm loss: 2.051917E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.317 | TFLOPs: 40.21 | 15: iteration 40610/ 125429 | consumed samples: 10396160 | consumed tokens: 21291335680 | elapsed time per iteration (s): 1.03 | learning rate: 1.590E-04 | global batch size: 256 | lm loss: 2.054612E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.551 | TFLOPs: 41.07 | 15: iteration 40620/ 125429 | consumed samples: 10398720 | consumed tokens: 21296578560 | elapsed time per iteration (s): 1.08 | learning rate: 1.589E-04 | global batch size: 256 | lm loss: 2.041206E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.563 | TFLOPs: 39.09 | 15: iteration 40630/ 125429 | consumed samples: 10401280 | consumed tokens: 
21301821440 | elapsed time per iteration (s): 1.06 | learning rate: 1.589E-04 | global batch size: 256 | lm loss: 2.059713E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.582 | TFLOPs: 39.92 | 15: iteration 40640/ 125429 | consumed samples: 10403840 | consumed tokens: 21307064320 | elapsed time per iteration (s): 1.06 | learning rate: 1.589E-04 | global batch size: 256 | lm loss: 2.043441E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.429 | TFLOPs: 39.90 | 15: iteration 40650/ 125429 | consumed samples: 10406400 | consumed tokens: 21312307200 | elapsed time per iteration (s): 1.05 | learning rate: 1.589E-04 | global batch size: 256 | lm loss: 2.071177E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.327 | TFLOPs: 40.21 | 15: iteration 40660/ 125429 | consumed samples: 10408960 | consumed tokens: 21317550080 | elapsed time per iteration (s): 1.06 | learning rate: 1.589E-04 | global batch size: 256 | lm loss: 2.044944E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.224 | TFLOPs: 40.03 | 15: iteration 40670/ 125429 | consumed samples: 10411520 | consumed tokens: 21322792960 | elapsed time per iteration (s): 1.05 | learning rate: 1.588E-04 | global batch size: 256 | lm loss: 2.047175E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.612 | TFLOPs: 40.26 | 15: iteration 40680/ 125429 | consumed samples: 10414080 | consumed tokens: 21328035840 | elapsed time per iteration (s): 1.09 | learning rate: 1.588E-04 | global batch size: 256 | lm loss: 2.029589E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 235.933 | TFLOPs: 38.99 | 15: iteration 40690/ 125429 | consumed samples: 10416640 | consumed tokens: 21333278720 | elapsed time per iteration (s): 1.10 | learning rate: 1.588E-04 | global batch size: 256 | lm loss: 2.034476E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.110 | TFLOPs: 38.52 | 15: iteration 40700/ 125429 | consumed samples: 10419200 | consumed tokens: 21338521600 | elapsed time per iteration (s): 1.03 | learning rate: 1.588E-04 | global batch size: 256 | lm loss: 2.042240E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.526 | TFLOPs: 41.07 | 15: iteration 40710/ 125429 | consumed samples: 10421760 | consumed tokens: 21343764480 | elapsed time per iteration (s): 1.07 | learning rate: 1.588E-04 | global batch size: 256 | lm loss: 2.046732E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.387 | TFLOPs: 39.40 | 15: iteration 40720/ 125429 | consumed samples: 10424320 | consumed tokens: 21349007360 | elapsed time per iteration (s): 1.06 | learning rate: 1.587E-04 | global batch size: 256 | lm loss: 2.033709E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.232 | TFLOPs: 40.03 | 15: iteration 40730/ 125429 | consumed samples: 10426880 | consumed tokens: 21354250240 | elapsed time per iteration (s): 1.04 | learning rate: 1.587E-04 | global batch size: 256 | lm loss: 2.048663E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.557 | TFLOPs: 40.75 | 15: iteration 40740/ 125429 | consumed samples: 10429440 | consumed tokens: 21359493120 | elapsed time per iteration (s): 1.05 | learning rate: 1.587E-04 | global batch size: 256 | lm loss: 2.054952E+00 | grad 
norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.519 | TFLOPs: 40.24 | 15: iteration 40750/ 125429 | consumed samples: 10432000 | consumed tokens: 21364736000 | elapsed time per iteration (s): 1.03 | learning rate: 1.587E-04 | global batch size: 256 | lm loss: 2.037976E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.475 | TFLOPs: 41.06 | 15: iteration 40760/ 125429 | consumed samples: 10434560 | consumed tokens: 21369978880 | elapsed time per iteration (s): 1.06 | learning rate: 1.587E-04 | global batch size: 256 | lm loss: 2.024341E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.798 | TFLOPs: 39.96 | 15: iteration 40770/ 125429 | consumed samples: 10437120 | consumed tokens: 21375221760 | elapsed time per iteration (s): 1.05 | learning rate: 1.586E-04 | global batch size: 256 | lm loss: 2.011431E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.705 | TFLOPs: 40.27 | 15: iteration 40780/ 125429 | consumed samples: 10439680 | consumed tokens: 21380464640 | elapsed time per iteration (s): 1.03 | learning rate: 1.586E-04 | global batch size: 256 | lm loss: 2.043315E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.678 | TFLOPs: 40.93 | 15: iteration 40790/ 125429 | consumed samples: 10442240 | consumed tokens: 21385707520 | elapsed time per iteration (s): 1.05 | learning rate: 1.586E-04 | global batch size: 256 | lm loss: 2.078181E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.799 | TFLOPs: 40.45 | 15: iteration 40800/ 125429 | consumed samples: 10444800 | consumed tokens: 21390950400 | elapsed time 
per iteration (s): 1.05 | learning rate: 1.586E-04 | global batch size: 256 | lm loss: 2.039580E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.499 | TFLOPs: 40.24 | 15: iteration 40810/ 125429 | consumed samples: 10447360 | consumed tokens: 21396193280 | elapsed time per iteration (s): 1.07 | learning rate: 1.586E-04 | global batch size: 256 | lm loss: 2.040848E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.728 | TFLOPs: 39.45 | 15: iteration 40820/ 125429 | consumed samples: 10449920 | consumed tokens: 21401436160 | elapsed time per iteration (s): 1.02 | learning rate: 1.586E-04 | global batch size: 256 | lm loss: 2.034683E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.132 | TFLOPs: 41.50 | 15: iteration 40830/ 125429 | consumed samples: 10452480 | consumed tokens: 21406679040 | elapsed time per iteration (s): 1.05 | learning rate: 1.585E-04 | global batch size: 256 | lm loss: 2.071915E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.269 | TFLOPs: 40.20 | 15: iteration 40840/ 125429 | consumed samples: 10455040 | consumed tokens: 21411921920 | elapsed time per iteration (s): 1.03 | learning rate: 1.585E-04 | global batch size: 256 | lm loss: 2.034228E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.856 | TFLOPs: 40.96 | 15: iteration 40850/ 125429 | consumed samples: 10457600 | consumed tokens: 21417164800 | elapsed time per iteration (s): 1.06 | learning rate: 1.585E-04 | global batch size: 256 | lm loss: 2.051385E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.537 | TFLOPs: 
39.92 | 15: iteration 40860/ 125429 | consumed samples: 10460160 | consumed tokens: 21422407680 | elapsed time per iteration (s): 1.06 | learning rate: 1.585E-04 | global batch size: 256 | lm loss: 2.032786E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.311 | TFLOPs: 40.04 | 15: iteration 40870/ 125429 | consumed samples: 10462720 | consumed tokens: 21427650560 | elapsed time per iteration (s): 1.04 | learning rate: 1.585E-04 | global batch size: 256 | lm loss: 2.045871E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.010 | TFLOPs: 40.66 | 15: iteration 40880/ 125429 | consumed samples: 10465280 | consumed tokens: 21432893440 | elapsed time per iteration (s): 1.08 | learning rate: 1.584E-04 | global batch size: 256 | lm loss: 2.063000E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.097 | TFLOPs: 39.35 | 15: iteration 40890/ 125429 | consumed samples: 10467840 | consumed tokens: 21438136320 | elapsed time per iteration (s): 1.06 | learning rate: 1.584E-04 | global batch size: 256 | lm loss: 2.044947E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.282 | TFLOPs: 40.04 | 15: iteration 40900/ 125429 | consumed samples: 10470400 | consumed tokens: 21443379200 | elapsed time per iteration (s): 1.06 | learning rate: 1.584E-04 | global batch size: 256 | lm loss: 2.036265E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.613 | TFLOPs: 40.09 | 15: iteration 40910/ 125429 | consumed samples: 10472960 | consumed tokens: 21448622080 | elapsed time per iteration (s): 1.06 | learning rate: 1.584E-04 | global batch size: 256 | lm loss: 2.032355E+00 | grad norm: 0.131 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.291 | TFLOPs: 39.88 | 15: iteration 40920/ 125429 | consumed samples: 10475520 | consumed tokens: 21453864960 | elapsed time per iteration (s): 1.11 | learning rate: 1.584E-04 | global batch size: 256 | lm loss: 2.067500E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.823 | TFLOPs: 38.15 | 15: iteration 40930/ 125429 | consumed samples: 10478080 | consumed tokens: 21459107840 | elapsed time per iteration (s): 1.04 | learning rate: 1.583E-04 | global batch size: 256 | lm loss: 2.040087E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.654 | TFLOPs: 40.60 | 15: iteration 40940/ 125429 | consumed samples: 10480640 | consumed tokens: 21464350720 | elapsed time per iteration (s): 1.04 | learning rate: 1.583E-04 | global batch size: 256 | lm loss: 2.049300E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.861 | TFLOPs: 40.80 | 15: iteration 40950/ 125429 | consumed samples: 10483200 | consumed tokens: 21469593600 | elapsed time per iteration (s): 1.06 | learning rate: 1.583E-04 | global batch size: 256 | lm loss: 2.064928E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.945 | TFLOPs: 39.82 | 15: iteration 40960/ 125429 | consumed samples: 10485760 | consumed tokens: 21474836480 | elapsed time per iteration (s): 1.04 | learning rate: 1.583E-04 | global batch size: 256 | lm loss: 2.067151E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.651 | TFLOPs: 40.76 | 15: iteration 40970/ 125429 | consumed samples: 10488320 | consumed tokens: 21480079360 | elapsed time per iteration (s): 1.05 | 
learning rate: 1.583E-04 | global batch size: 256 | lm loss: 2.035379E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.963 | TFLOPs: 40.48 | 15: iteration 40980/ 125429 | consumed samples: 10490880 | consumed tokens: 21485322240 | elapsed time per iteration (s): 1.05 | learning rate: 1.582E-04 | global batch size: 256 | lm loss: 2.019227E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.658 | TFLOPs: 40.43 | 15: iteration 40990/ 125429 | consumed samples: 10493440 | consumed tokens: 21490565120 | elapsed time per iteration (s): 1.04 | learning rate: 1.582E-04 | global batch size: 256 | lm loss: 2.032017E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.081 | TFLOPs: 40.50 | 15: iteration 41000/ 125429 | consumed samples: 10496000 | consumed tokens: 21495808000 | elapsed time per iteration (s): 1.04 | learning rate: 1.582E-04 | global batch size: 256 | lm loss: 2.065886E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.740 | TFLOPs: 40.61 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 41000 | lm loss value: 1.964350E+00 | lm loss PPL: 7.130278E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 41000 to checkpoints_1b5 0: [2022-11-26 08:01:59,006] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step41000 is begin to save! 0: [2022-11-26 08:01:59,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 08:01:59,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_01-model_00-model_states.pt. 0: [2022-11-26 08:01:59,243] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_03-model_00-model_states.pt... 0: [2022-11-26 08:01:59,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_03-model_00-model_states.pt. 0: [2022-11-26 08:01:59,343] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_04-model_00-model_states.pt... 0: [2022-11-26 08:01:59,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_04-model_00-model_states.pt. 0: [2022-11-26 08:01:59,441] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_05-model_00-model_states.pt... 0: [2022-11-26 08:01:59,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_05-model_00-model_states.pt. 0: [2022-11-26 08:01:59,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_06-model_00-model_states.pt... 0: [2022-11-26 08:01:59,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_06-model_00-model_states.pt. 0: [2022-11-26 08:01:59,650] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_07-model_00-model_states.pt... 0: [2022-11-26 08:01:59,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_07-model_00-model_states.pt. 0: [2022-11-26 08:01:59,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 08:01:59,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_08-model_00-model_states.pt. 0: [2022-11-26 08:01:59,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_09-model_00-model_states.pt... 0: [2022-11-26 08:01:59,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_09-model_00-model_states.pt. 0: [2022-11-26 08:01:59,960] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_10-model_00-model_states.pt... 0: [2022-11-26 08:02:00,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_10-model_00-model_states.pt. 0: [2022-11-26 08:02:00,065] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_11-model_00-model_states.pt... 0: [2022-11-26 08:02:00,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_11-model_00-model_states.pt. 0: [2022-11-26 08:02:00,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_12-model_00-model_states.pt... 0: [2022-11-26 08:02:00,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_12-model_00-model_states.pt. 0: [2022-11-26 08:02:00,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_13-model_00-model_states.pt... 0: [2022-11-26 08:02:00,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_13-model_00-model_states.pt. 0: [2022-11-26 08:02:00,374] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 08:02:00,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_14-model_00-model_states.pt. 0: [2022-11-26 08:02:00,477] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_15-model_00-model_states.pt... 0: [2022-11-26 08:02:00,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_15-model_00-model_states.pt. 0: [2022-11-26 08:02:00,586] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_16-model_00-model_states.pt... 0: [2022-11-26 08:02:00,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_16-model_00-model_states.pt. 0: [2022-11-26 08:02:00,684] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_17-model_00-model_states.pt... 0: [2022-11-26 08:02:00,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_17-model_00-model_states.pt. 0: [2022-11-26 08:02:00,788] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_18-model_00-model_states.pt... 0: [2022-11-26 08:02:00,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_18-model_00-model_states.pt. 0: [2022-11-26 08:02:00,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_19-model_00-model_states.pt... 0: [2022-11-26 08:02:00,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_19-model_00-model_states.pt. 0: [2022-11-26 08:02:00,995] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 08:02:01,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_20-model_00-model_states.pt. 0: [2022-11-26 08:02:01,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_21-model_00-model_states.pt... 0: [2022-11-26 08:02:01,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_21-model_00-model_states.pt. 0: [2022-11-26 08:02:01,200] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_22-model_00-model_states.pt... 0: [2022-11-26 08:02:01,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_22-model_00-model_states.pt. 0: [2022-11-26 08:02:01,304] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_23-model_00-model_states.pt... 0: [2022-11-26 08:02:01,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_23-model_00-model_states.pt. 0: [2022-11-26 08:02:01,406] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_24-model_00-model_states.pt... 0: [2022-11-26 08:02:01,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_24-model_00-model_states.pt. 0: [2022-11-26 08:02:01,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_25-model_00-model_states.pt... 0: [2022-11-26 08:02:01,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_25-model_00-model_states.pt. 0: [2022-11-26 08:02:01,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 08:02:01,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_26-model_00-model_states.pt. 0: [2022-11-26 08:02:01,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_27-model_00-model_states.pt... 0: [2022-11-26 08:02:01,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_27-model_00-model_states.pt. 0: [2022-11-26 08:02:01,821] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_28-model_00-model_states.pt... 0: [2022-11-26 08:02:01,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_28-model_00-model_states.pt. 0: [2022-11-26 08:02:01,930] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_29-model_00-model_states.pt... 0: [2022-11-26 08:02:02,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_29-model_00-model_states.pt. 0: [2022-11-26 08:02:02,037] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_30-model_00-model_states.pt... 0: [2022-11-26 08:02:02,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_30-model_00-model_states.pt. 0: [2022-11-26 08:02:02,144] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/layer_32-model_00-model_states.pt... 0: [2022-11-26 08:02:02,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 08:02:02,151] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step41000/mp_rank_00_model_states.pt 0: [2022-11-26 08:02:02,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/mp_rank_00_model_states.pt... 0: [2022-11-26 08:02:02,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/mp_rank_00_model_states.pt. 0: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
5: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
14: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
7: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
11: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
8: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
4: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
12: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:02:02,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step41000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:02:02,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 08:02:02,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 08:02:02,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 12: [2022-11-26 08:02:02,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:02:02,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 08:02:02,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 1: [2022-11-26 08:02:02,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:02:02,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 08:02:02,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 12: [2022-11-26 08:02:02,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:02:02,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 08:02:02,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 7: [2022-11-26 08:02:02,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
2: [2022-11-26 08:02:02,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:02:02,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 08:02:02,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 1: [2022-11-26 08:02:02,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:02:02,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:02:02,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 08:02:02,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:02:02,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 3: [2022-11-26 08:02:02,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 1: [2022-11-26 08:02:02,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 3: [2022-11-26 08:02:02,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 1: [2022-11-26 08:02:02,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
7: [2022-11-26 08:02:02,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 08:02:02,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 0: [2022-11-26 08:02:02,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:02:02,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:02:02,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:02:02,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 08:02:02,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 12: [2022-11-26 08:02:02,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:02:02,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:02:02,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 5: [2022-11-26 08:02:02,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:02:02,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
5: [2022-11-26 08:02:02,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 4: [2022-11-26 08:02:02,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 08:02:02,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 5: [2022-11-26 08:02:02,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 10: [2022-11-26 08:02:02,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:02:02,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 08:02:02,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 6: [2022-11-26 08:02:02,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:02:02,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 08:02:02,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 6: [2022-11-26 08:02:02,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 08:02:02,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 08:02:02,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 15: [2022-11-26 08:02:02,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 08:02:02,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 2: [2022-11-26 08:02:02,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:02:02,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 15: [2022-11-26 08:02:02,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:02:02,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 15: [2022-11-26 08:02:02,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 08:02:02,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 6: [2022-11-26 08:02:02,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 08:02:02,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 08:02:02,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 15: [2022-11-26 08:02:02,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:02:02,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 08:02:02,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 12: [2022-11-26 08:02:02,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:02:02,360] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 08:02:02,360] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 7: [2022-11-26 08:02:02,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:02:02,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 08:02:02,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 6: [2022-11-26 08:02:02,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 08:02:02,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 08:02:02,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 15: [2022-11-26 08:02:02,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:02:02,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 08:02:02,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 2: [2022-11-26 08:02:02,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:02:02,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 08:02:02,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 5: [2022-11-26 08:02:02,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:02:02,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 08:02:02,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 08:02:02,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 08:02:02,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 5: [2022-11-26 08:02:02,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 5: [2022-11-26 08:02:02,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:02:02,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:02:02,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 08:02:02,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 3: [2022-11-26 08:02:02,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:02:02,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 5: [2022-11-26 08:02:02,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
3: [2022-11-26 08:02:02,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 08:02:02,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 15: [2022-11-26 08:02:02,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:02:02,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:02:02,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:02:02,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 08:02:02,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 08:02:02,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 10: [2022-11-26 08:02:02,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 14: [2022-11-26 08:02:02,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:02:02,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 08:02:02,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
3: [2022-11-26 08:02:02,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:02:02,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 14: [2022-11-26 08:02:02,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:02:02,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:02:02,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:02:02,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 14: [2022-11-26 08:02:02,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 08:02:02,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 08:02:02,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 08:02:02,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 14: [2022-11-26 08:02:02,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 6: [2022-11-26 08:02:02,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
14: [2022-11-26 08:02:02,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 6: [2022-11-26 08:02:02,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 08:02:02,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 2: [2022-11-26 08:02:02,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:02:02,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 08:02:02,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 15: [2022-11-26 08:02:02,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 08:02:02,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 15: [2022-11-26 08:02:02,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:02:02,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 08:02:02,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 5: [2022-11-26 08:02:02,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 08:02:02,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 08:02:02,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 5: [2022-11-26 08:02:02,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:02:02,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 08:02:02,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 7: [2022-11-26 08:02:02,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:02:02,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 08:02:02,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 12: [2022-11-26 08:02:02,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:02:02,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 08:02:02,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 8: [2022-11-26 08:02:02,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 08:02:02,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 08:02:02,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 8: [2022-11-26 08:02:02,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:02:02,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 08:02:02,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 8: [2022-11-26 08:02:02,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:02:02,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:02:02,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:02:02,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 08:02:02,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 8: [2022-11-26 08:02:02,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
12: [2022-11-26 08:02:02,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 8: [2022-11-26 08:02:02,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 08:02:02,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 12: [2022-11-26 08:02:02,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 0: [2022-11-26 08:02:02,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:02:02,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 08:02:02,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 10: [2022-11-26 08:02:02,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:02:02,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:02:02,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:02:02,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-26 08:02:02,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 14: [2022-11-26 08:02:02,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 10: [2022-11-26 08:02:02,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 10: [2022-11-26 08:02:02,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 08:02:02,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 14: [2022-11-26 08:02:02,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 10: [2022-11-26 08:02:02,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 10: [2022-11-26 08:02:02,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 0: [2022-11-26 08:02:02,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:02:02,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
0: [2022-11-26 08:02:02,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 8: [2022-11-26 08:02:02,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 0: [2022-11-26 08:02:02,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 8: [2022-11-26 08:02:02,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 0: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:02:02,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 14: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:02:02,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 9: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 9: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:02:02,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 08:02:02,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 9: [2022-11-26 08:02:02,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 08:02:02,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 9: [2022-11-26 08:02:02,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 9: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
9: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 14: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:02:02,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 0: [2022-11-26 08:02:02,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 14: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 0: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 6: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:02:02,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 08:02:02,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 2: [2022-11-26 08:02:02,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 08:02:02,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 08:02:02,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 6: [2022-11-26 08:02:02,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:02:02,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 08:02:02,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 6: [2022-11-26 08:02:02,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:02:02,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 08:02:02,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 8: [2022-11-26 08:02:02,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:02:02,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 08:02:02,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 8: [2022-11-26 08:02:02,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 08:02:02,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:02:02,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 08:02:02,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 08:02:02,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 8: [2022-11-26 08:02:02,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 2: [2022-11-26 08:02:02,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:02:02,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 08:02:02,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 2: [2022-11-26 08:02:02,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:02:02,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 08:02:02,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 7: [2022-11-26 08:02:02,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
12: [2022-11-26 08:02:02,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:02:02,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 12: [2022-11-26 08:02:02,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 7: [2022-11-26 08:02:02,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 12: [2022-11-26 08:02:02,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 10: [2022-11-26 08:02:02,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:02:02,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 08:02:02,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 9: [2022-11-26 08:02:02,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:02:02,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 08:02:02,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 9: [2022-11-26 08:02:02,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
7: [2022-11-26 08:02:02,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:02:02,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 7: [2022-11-26 08:02:02,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 9: [2022-11-26 08:02:02,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 7: [2022-11-26 08:02:02,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 7: [2022-11-26 08:02:02,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:02:02,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 08:02:02,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 3: [2022-11-26 08:02:02,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:02:02,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 08:02:02,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
15: [2022-11-26 08:02:02,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 08:02:02,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 15: [2022-11-26 08:02:02,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:02:02,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 08:02:02,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 14: [2022-11-26 08:02:02,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:02:02,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 08:02:02,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 10: [2022-11-26 08:02:02,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:02:02,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 08:02:02,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 4: [2022-11-26 08:02:02,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 08:02:02,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 08:02:02,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 4: [2022-11-26 08:02:02,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:02:02,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 08:02:02,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 4: [2022-11-26 08:02:02,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:02:02,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 08:02:02,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 4: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:02:02,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 08:02:02,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 4: [2022-11-26 08:02:02,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 08:02:02,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 08:02:02,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 4: [2022-11-26 08:02:02,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:02:02,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 08:02:02,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 7: [2022-11-26 08:02:02,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:02:02,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 08:02:02,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 1: [2022-11-26 08:02:02,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:02:02,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 08:02:02,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 4: [2022-11-26 08:02:02,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 08:02:02,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 08:02:02,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 1: [2022-11-26 08:02:02,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:02:02,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:02:02,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 08:02:02,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 08:02:02,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 1: [2022-11-26 08:02:02,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 12: [2022-11-26 08:02:02,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:02:02,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 08:02:02,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 2: [2022-11-26 08:02:02,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 08:02:02,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 08:02:02,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 5: [2022-11-26 08:02:02,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:02:02,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 08:02:02,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 7: [2022-11-26 08:02:02,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:02:02,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 08:02:02,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 3: [2022-11-26 08:02:02,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:02:02,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 08:02:02,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 3: [2022-11-26 08:02:02,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 08:02:02,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 08:02:02,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 3: [2022-11-26 08:02:02,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:02:02,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 08:02:02,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 1: [2022-11-26 08:02:02,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:02:02,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 08:02:02,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 9: [2022-11-26 08:02:02,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:02:02,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 08:02:02,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 11: [2022-11-26 08:02:02,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 08:02:02,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:02:02,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:02:02,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:02:02,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:02:02,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:02:02,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:02:02,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 08:02:02,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:02:02,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 08:02:02,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
11: [2022-11-26 08:02:02,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 08:02:02,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 08:02:02,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 08:02:02,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 08:02:02,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 11: [2022-11-26 08:02:02,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 08:02:02,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 08:02:02,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 11: [2022-11-26 08:02:02,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 11: [2022-11-26 08:02:02,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 11: [2022-11-26 08:02:02,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 11: [2022-11-26 08:02:02,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 11: [2022-11-26 08:02:02,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
0: [2022-11-26 08:02:02,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:02:02,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:02:02,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:02:02,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 08:02:02,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 08:02:02,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 08:02:02,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 0: [2022-11-26 08:02:02,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 0: [2022-11-26 08:02:02,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 13: [2022-11-26 08:02:02,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:02:02,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:02:02,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 08:02:02,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:02:02,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:02:02,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:02:02,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:02:02,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:02:02,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 08:02:02,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 08:02:02,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 08:02:02,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 08:02:02,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 08:02:02,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 08:02:02,556] 
[INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 08:02:02,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 08:02:02,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 13: [2022-11-26 08:02:02,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 13: [2022-11-26 08:02:02,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 13: [2022-11-26 08:02:02,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 13: [2022-11-26 08:02:02,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 13: [2022-11-26 08:02:02,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 13: [2022-11-26 08:02:02,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 13: [2022-11-26 08:02:02,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 0: [2022-11-26 08:02:02,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step41000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 08:02:02,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step41000 is ready now! 
0: successfully saved checkpoint at iteration 41000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3613.79 15: iteration 41010/ 125429 | consumed samples: 10498560 | consumed tokens: 21501050880 | elapsed time per iteration (s): 1.44 | learning rate: 1.582E-04 | global batch size: 256 | lm loss: 2.046822E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 178.309 | TFLOPs: 29.47 | 15: iteration 41020/ 125429 | consumed samples: 10501120 | consumed tokens: 21506293760 | elapsed time per iteration (s): 1.06 | learning rate: 1.582E-04 | global batch size: 256 | lm loss: 2.070861E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.198 | TFLOPs: 40.03 | 15: iteration 41030/ 125429 | consumed samples: 10503680 | consumed tokens: 21511536640 | elapsed time per iteration (s): 1.04 | learning rate: 1.581E-04 | global batch size: 256 | lm loss: 2.044893E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.267 | TFLOPs: 40.70 | 15: iteration 41040/ 125429 | consumed samples: 10506240 | consumed tokens: 21516779520 | elapsed time per iteration (s): 1.07 | learning rate: 1.581E-04 | global batch size: 256 | lm loss: 2.064545E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.535 | TFLOPs: 39.42 | 15: iteration 41050/ 125429 | consumed samples: 10508800 | consumed tokens: 21522022400 | elapsed time per iteration (s): 1.84 | learning rate: 1.581E-04 | global batch size: 256 | lm loss: 2.057333E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 139.489 | TFLOPs: 23.05 | 15: iteration 41060/ 125429 | consumed samples: 10511360 | consumed tokens: 21527265280 | elapsed time per iteration (s): 1.06 | 
learning rate: 1.581E-04 | global batch size: 256 | lm loss: 2.011838E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.452 | TFLOPs: 39.90 | 15: iteration 41070/ 125429 | consumed samples: 10513920 | consumed tokens: 21532508160 | elapsed time per iteration (s): 1.05 | learning rate: 1.581E-04 | global batch size: 256 | lm loss: 2.070364E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.065 | TFLOPs: 40.33 | 15: iteration 41080/ 125429 | consumed samples: 10516480 | consumed tokens: 21537751040 | elapsed time per iteration (s): 1.03 | learning rate: 1.581E-04 | global batch size: 256 | lm loss: 2.048179E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.923 | TFLOPs: 41.14 | 15: iteration 41090/ 125429 | consumed samples: 10519040 | consumed tokens: 21542993920 | elapsed time per iteration (s): 2.76 | learning rate: 1.580E-04 | global batch size: 256 | lm loss: 2.020417E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 92.735 | TFLOPs: 15.33 | 15: iteration 41100/ 125429 | consumed samples: 10521600 | consumed tokens: 21548236800 | elapsed time per iteration (s): 1.05 | learning rate: 1.580E-04 | global batch size: 256 | lm loss: 2.034953E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.877 | TFLOPs: 40.30 | 15: iteration 41110/ 125429 | consumed samples: 10524160 | consumed tokens: 21553479680 | elapsed time per iteration (s): 1.04 | learning rate: 1.580E-04 | global batch size: 256 | lm loss: 2.042109E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.809 | TFLOPs: 40.79 | 15: iteration 41120/ 
125429 | consumed samples: 10526720 | consumed tokens: 21558722560 | elapsed time per iteration (s): 1.05 | learning rate: 1.580E-04 | global batch size: 256 | lm loss: 2.054872E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.716 | TFLOPs: 40.11 | 15: iteration 41130/ 125429 | consumed samples: 10529280 | consumed tokens: 21563965440 | elapsed time per iteration (s): 1.07 | learning rate: 1.580E-04 | global batch size: 256 | lm loss: 2.075155E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.973 | TFLOPs: 39.49 | 15: iteration 41140/ 125429 | consumed samples: 10531840 | consumed tokens: 21569208320 | elapsed time per iteration (s): 1.04 | learning rate: 1.579E-04 | global batch size: 256 | lm loss: 2.045285E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.897 | TFLOPs: 40.80 | 15: iteration 41150/ 125429 | consumed samples: 10534400 | consumed tokens: 21574451200 | elapsed time per iteration (s): 1.05 | learning rate: 1.579E-04 | global batch size: 256 | lm loss: 2.068036E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.829 | TFLOPs: 40.46 | 15: iteration 41160/ 125429 | consumed samples: 10536960 | consumed tokens: 21579694080 | elapsed time per iteration (s): 1.07 | learning rate: 1.579E-04 | global batch size: 256 | lm loss: 2.062844E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.949 | TFLOPs: 39.65 | 15: iteration 41170/ 125429 | consumed samples: 10539520 | consumed tokens: 21584936960 | elapsed time per iteration (s): 1.05 | learning rate: 1.579E-04 | global batch size: 256 | lm loss: 2.029784E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 243.229 | TFLOPs: 40.20 | 15: iteration 41180/ 125429 | consumed samples: 10542080 | consumed tokens: 21590179840 | elapsed time per iteration (s): 1.03 | learning rate: 1.579E-04 | global batch size: 256 | lm loss: 2.003540E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.107 | TFLOPs: 41.00 | 15: iteration 41190/ 125429 | consumed samples: 10544640 | consumed tokens: 21595422720 | elapsed time per iteration (s): 1.04 | learning rate: 1.578E-04 | global batch size: 256 | lm loss: 2.053713E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.998 | TFLOPs: 40.65 | 15: iteration 41200/ 125429 | consumed samples: 10547200 | consumed tokens: 21600665600 | elapsed time per iteration (s): 1.04 | learning rate: 1.578E-04 | global batch size: 256 | lm loss: 2.014408E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.994 | TFLOPs: 40.49 | 15: iteration 41210/ 125429 | consumed samples: 10549760 | consumed tokens: 21605908480 | elapsed time per iteration (s): 1.02 | learning rate: 1.578E-04 | global batch size: 256 | lm loss: 2.049256E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.987 | TFLOPs: 41.31 | 15: iteration 41220/ 125429 | consumed samples: 10552320 | consumed tokens: 21611151360 | elapsed time per iteration (s): 1.05 | learning rate: 1.578E-04 | global batch size: 256 | lm loss: 2.053583E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.272 | TFLOPs: 40.20 | 15: iteration 41230/ 125429 | consumed samples: 10554880 | consumed tokens: 21616394240 | elapsed time per iteration (s): 1.04 | learning rate: 
1.578E-04 | global batch size: 256 | lm loss: 2.053530E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.795 | TFLOPs: 40.62 | 15: iteration 41240/ 125429 | consumed samples: 10557440 | consumed tokens: 21621637120 | elapsed time per iteration (s): 1.11 | learning rate: 1.577E-04 | global batch size: 256 | lm loss: 2.037100E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.156 | TFLOPs: 38.04 | 15: iteration 41250/ 125429 | consumed samples: 10560000 | consumed tokens: 21626880000 | elapsed time per iteration (s): 1.03 | learning rate: 1.577E-04 | global batch size: 256 | lm loss: 2.036379E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.216 | TFLOPs: 41.18 | 15: iteration 41260/ 125429 | consumed samples: 10562560 | consumed tokens: 21632122880 | elapsed time per iteration (s): 1.05 | learning rate: 1.577E-04 | global batch size: 256 | lm loss: 2.047101E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.466 | TFLOPs: 40.23 | 15: iteration 41270/ 125429 | consumed samples: 10565120 | consumed tokens: 21637365760 | elapsed time per iteration (s): 1.07 | learning rate: 1.577E-04 | global batch size: 256 | lm loss: 2.046718E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.603 | TFLOPs: 39.43 | 15: iteration 41280/ 125429 | consumed samples: 10567680 | consumed tokens: 21642608640 | elapsed time per iteration (s): 1.05 | learning rate: 1.577E-04 | global batch size: 256 | lm loss: 2.066470E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.932 | TFLOPs: 40.48 | 15: iteration 41290/ 125429 | 
consumed samples: 10570240 | consumed tokens: 21647851520 | elapsed time per iteration (s): 1.07 | learning rate: 1.576E-04 | global batch size: 256 | lm loss: 2.038137E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.318 | TFLOPs: 39.38 | 15: iteration 41300/ 125429 | consumed samples: 10572800 | consumed tokens: 21653094400 | elapsed time per iteration (s): 1.04 | learning rate: 1.576E-04 | global batch size: 256 | lm loss: 2.048104E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.569 | TFLOPs: 40.58 | 15: iteration 41310/ 125429 | consumed samples: 10575360 | consumed tokens: 21658337280 | elapsed time per iteration (s): 1.05 | learning rate: 1.576E-04 | global batch size: 256 | lm loss: 2.048604E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.909 | TFLOPs: 40.47 | 15: iteration 41320/ 125429 | consumed samples: 10577920 | consumed tokens: 21663580160 | elapsed time per iteration (s): 1.03 | learning rate: 1.576E-04 | global batch size: 256 | lm loss: 2.069700E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.297 | TFLOPs: 41.03 | 15: iteration 41330/ 125429 | consumed samples: 10580480 | consumed tokens: 21668823040 | elapsed time per iteration (s): 1.04 | learning rate: 1.576E-04 | global batch size: 256 | lm loss: 2.047009E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.723 | TFLOPs: 40.61 | 15: iteration 41340/ 125429 | consumed samples: 10583040 | consumed tokens: 21674065920 | elapsed time per iteration (s): 1.09 | learning rate: 1.576E-04 | global batch size: 256 | lm loss: 2.057663E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 233.827 | TFLOPs: 38.64 | 15: iteration 41350/ 125429 | consumed samples: 10585600 | consumed tokens: 21679308800 | elapsed time per iteration (s): 1.17 | learning rate: 1.575E-04 | global batch size: 256 | lm loss: 2.077391E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 219.278 | TFLOPs: 36.24 | 15: iteration 41360/ 125429 | consumed samples: 10588160 | consumed tokens: 21684551680 | elapsed time per iteration (s): 1.17 | learning rate: 1.575E-04 | global batch size: 256 | lm loss: 2.019853E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 219.318 | TFLOPs: 36.24 | 15: iteration 41370/ 125429 | consumed samples: 10590720 | consumed tokens: 21689794560 | elapsed time per iteration (s): 1.14 | learning rate: 1.575E-04 | global batch size: 256 | lm loss: 2.030002E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 224.692 | TFLOPs: 37.13 | 15: iteration 41380/ 125429 | consumed samples: 10593280 | consumed tokens: 21695037440 | elapsed time per iteration (s): 1.05 | learning rate: 1.575E-04 | global batch size: 256 | lm loss: 2.045716E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.461 | TFLOPs: 40.23 | 15: iteration 41390/ 125429 | consumed samples: 10595840 | consumed tokens: 21700280320 | elapsed time per iteration (s): 1.04 | learning rate: 1.575E-04 | global batch size: 256 | lm loss: 2.035726E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.870 | TFLOPs: 40.63 | 15: iteration 41400/ 125429 | consumed samples: 10598400 | consumed tokens: 21705523200 | elapsed time per iteration (s): 1.05 | learning rate: 1.574E-04 | global batch 
size: 256 | lm loss: 2.030951E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.200 | TFLOPs: 40.19 | 15: iteration 41410/ 125429 | consumed samples: 10600960 | consumed tokens: 21710766080 | elapsed time per iteration (s): 1.07 | learning rate: 1.574E-04 | global batch size: 256 | lm loss: 2.028780E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.042 | TFLOPs: 39.67 | 15: iteration 41420/ 125429 | consumed samples: 10603520 | consumed tokens: 21716008960 | elapsed time per iteration (s): 1.04 | learning rate: 1.574E-04 | global batch size: 256 | lm loss: 2.034036E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.076 | TFLOPs: 40.83 | 15: iteration 41430/ 125429 | consumed samples: 10606080 | consumed tokens: 21721251840 | elapsed time per iteration (s): 1.13 | learning rate: 1.574E-04 | global batch size: 256 | lm loss: 2.036660E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.195 | TFLOPs: 37.55 | 15: iteration 41440/ 125429 | consumed samples: 10608640 | consumed tokens: 21726494720 | elapsed time per iteration (s): 1.05 | learning rate: 1.574E-04 | global batch size: 256 | lm loss: 2.029804E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.786 | TFLOPs: 40.45 | 15: iteration 41450/ 125429 | consumed samples: 10611200 | consumed tokens: 21731737600 | elapsed time per iteration (s): 1.05 | learning rate: 1.573E-04 | global batch size: 256 | lm loss: 2.049911E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.745 | TFLOPs: 40.28 | 15: iteration 41460/ 125429 | consumed samples: 10613760 | 
consumed tokens: 21736980480 | elapsed time per iteration (s): 1.06 | learning rate: 1.573E-04 | global batch size: 256 | lm loss: 2.017907E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.387 | TFLOPs: 40.06 | 15: iteration 41470/ 125429 | consumed samples: 10616320 | consumed tokens: 21742223360 | elapsed time per iteration (s): 1.11 | learning rate: 1.573E-04 | global batch size: 256 | lm loss: 2.037568E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.847 | TFLOPs: 38.15 | 15: iteration 41480/ 125429 | consumed samples: 10618880 | consumed tokens: 21747466240 | elapsed time per iteration (s): 1.05 | learning rate: 1.573E-04 | global batch size: 256 | lm loss: 2.037838E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.047 | TFLOPs: 40.17 | 15: iteration 41490/ 125429 | consumed samples: 10621440 | consumed tokens: 21752709120 | elapsed time per iteration (s): 1.04 | learning rate: 1.573E-04 | global batch size: 256 | lm loss: 2.065251E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.701 | TFLOPs: 40.77 | 15: iteration 41500/ 125429 | consumed samples: 10624000 | consumed tokens: 21757952000 | elapsed time per iteration (s): 1.07 | learning rate: 1.572E-04 | global batch size: 256 | lm loss: 2.060841E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.193 | TFLOPs: 39.36 | 15: iteration 41510/ 125429 | consumed samples: 10626560 | consumed tokens: 21763194880 | elapsed time per iteration (s): 1.13 | learning rate: 1.572E-04 | global batch size: 256 | lm loss: 2.070981E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 226.882 | TFLOPs: 37.49 | 15: iteration 41520/ 125429 | consumed samples: 10629120 | consumed tokens: 21768437760 | elapsed time per iteration (s): 1.13 | learning rate: 1.572E-04 | global batch size: 256 | lm loss: 2.058992E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.373 | TFLOPs: 37.58 | 15: iteration 41530/ 125429 | consumed samples: 10631680 | consumed tokens: 21773680640 | elapsed time per iteration (s): 1.14 | learning rate: 1.572E-04 | global batch size: 256 | lm loss: 2.044161E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 224.174 | TFLOPs: 37.05 | 15: iteration 41540/ 125429 | consumed samples: 10634240 | consumed tokens: 21778923520 | elapsed time per iteration (s): 1.11 | learning rate: 1.572E-04 | global batch size: 256 | lm loss: 2.092371E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.893 | TFLOPs: 37.99 | 15: iteration 41550/ 125429 | consumed samples: 10636800 | consumed tokens: 21784166400 | elapsed time per iteration (s): 1.05 | learning rate: 1.571E-04 | global batch size: 256 | lm loss: 2.052207E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.468 | TFLOPs: 40.23 | 15: iteration 41560/ 125429 | consumed samples: 10639360 | consumed tokens: 21789409280 | elapsed time per iteration (s): 1.11 | learning rate: 1.571E-04 | global batch size: 256 | lm loss: 2.041600E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.453 | TFLOPs: 38.08 | 15: iteration 41570/ 125429 | consumed samples: 10641920 | consumed tokens: 21794652160 | elapsed time per iteration (s): 1.06 | learning rate: 1.571E-04 | global batch size: 256 | lm loss: 
2.025402E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.950 | TFLOPs: 39.98 | 15: iteration 41580/ 125429 | consumed samples: 10644480 | consumed tokens: 21799895040 | elapsed time per iteration (s): 1.11 | learning rate: 1.571E-04 | global batch size: 256 | lm loss: 2.026713E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.012 | TFLOPs: 38.01 | 15: iteration 41590/ 125429 | consumed samples: 10647040 | consumed tokens: 21805137920 | elapsed time per iteration (s): 1.05 | learning rate: 1.571E-04 | global batch size: 256 | lm loss: 2.071514E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.593 | TFLOPs: 40.26 | 15: iteration 41600/ 125429 | consumed samples: 10649600 | consumed tokens: 21810380800 | elapsed time per iteration (s): 1.04 | learning rate: 1.570E-04 | global batch size: 256 | lm loss: 2.028571E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.195 | TFLOPs: 40.69 | 15: iteration 41610/ 125429 | consumed samples: 10652160 | consumed tokens: 21815623680 | elapsed time per iteration (s): 1.05 | learning rate: 1.570E-04 | global batch size: 256 | lm loss: 2.071429E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.819 | TFLOPs: 40.29 | 15: iteration 41620/ 125429 | consumed samples: 10654720 | consumed tokens: 21820866560 | elapsed time per iteration (s): 1.08 | learning rate: 1.570E-04 | global batch size: 256 | lm loss: 2.007853E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.702 | TFLOPs: 39.28 | 15: iteration 41630/ 125429 | consumed samples: 10657280 | consumed tokens: 
21826109440 | elapsed time per iteration (s): 1.04 | learning rate: 1.570E-04 | global batch size: 256 | lm loss: 2.026587E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.377 | TFLOPs: 40.55 | 15: iteration 41640/ 125429 | consumed samples: 10659840 | consumed tokens: 21831352320 | elapsed time per iteration (s): 1.03 | learning rate: 1.570E-04 | global batch size: 256 | lm loss: 2.062208E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.971 | TFLOPs: 40.98 | 15: iteration 41650/ 125429 | consumed samples: 10662400 | consumed tokens: 21836595200 | elapsed time per iteration (s): 1.04 | learning rate: 1.569E-04 | global batch size: 256 | lm loss: 2.054442E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.266 | TFLOPs: 40.53 | 15: iteration 41660/ 125429 | consumed samples: 10664960 | consumed tokens: 21841838080 | elapsed time per iteration (s): 1.04 | learning rate: 1.569E-04 | global batch size: 256 | lm loss: 2.067242E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.158 | TFLOPs: 40.84 | 15: iteration 41670/ 125429 | consumed samples: 10667520 | consumed tokens: 21847080960 | elapsed time per iteration (s): 1.05 | learning rate: 1.569E-04 | global batch size: 256 | lm loss: 2.084511E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.240 | TFLOPs: 40.20 | 15: iteration 41680/ 125429 | consumed samples: 10670080 | consumed tokens: 21852323840 | elapsed time per iteration (s): 1.06 | learning rate: 1.569E-04 | global batch size: 256 | lm loss: 2.028568E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 241.038 | TFLOPs: 39.83 | 15: iteration 41690/ 125429 | consumed samples: 10672640 | consumed tokens: 21857566720 | elapsed time per iteration (s): 2.35 | learning rate: 1.569E-04 | global batch size: 256 | lm loss: 2.063496E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 108.805 | TFLOPs: 17.98 | 15: iteration 41700/ 125429 | consumed samples: 10675200 | consumed tokens: 21862809600 | elapsed time per iteration (s): 1.04 | learning rate: 1.569E-04 | global batch size: 256 | lm loss: 2.054477E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.142 | TFLOPs: 40.68 | 15: iteration 41710/ 125429 | consumed samples: 10677760 | consumed tokens: 21868052480 | elapsed time per iteration (s): 1.04 | learning rate: 1.568E-04 | global batch size: 256 | lm loss: 2.041780E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.628 | TFLOPs: 40.59 | 15: iteration 41720/ 125429 | consumed samples: 10680320 | consumed tokens: 21873295360 | elapsed time per iteration (s): 1.05 | learning rate: 1.568E-04 | global batch size: 256 | lm loss: 2.045198E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.234 | TFLOPs: 40.36 | 15: iteration 41730/ 125429 | consumed samples: 10682880 | consumed tokens: 21878538240 | elapsed time per iteration (s): 1.04 | learning rate: 1.568E-04 | global batch size: 256 | lm loss: 2.078771E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.315 | TFLOPs: 40.87 | 15: iteration 41740/ 125429 | consumed samples: 10685440 | consumed tokens: 21883781120 | elapsed time per iteration (s): 1.05 | learning rate: 1.568E-04 | global batch size: 256 | lm loss: 2.032074E+00 | grad 
norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.582 | TFLOPs: 40.42 | 15: iteration 41750/ 125429 | consumed samples: 10688000 | consumed tokens: 21889024000 | elapsed time per iteration (s): 1.04 | learning rate: 1.568E-04 | global batch size: 256 | lm loss: 2.042836E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.383 | TFLOPs: 40.72 | 15: iteration 41760/ 125429 | consumed samples: 10690560 | consumed tokens: 21894266880 | elapsed time per iteration (s): 1.12 | learning rate: 1.567E-04 | global batch size: 256 | lm loss: 2.067326E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.928 | TFLOPs: 37.67 | 15: iteration 41770/ 125429 | consumed samples: 10693120 | consumed tokens: 21899509760 | elapsed time per iteration (s): 1.12 | learning rate: 1.567E-04 | global batch size: 256 | lm loss: 2.057171E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.437 | TFLOPs: 37.75 | 15: iteration 41780/ 125429 | consumed samples: 10695680 | consumed tokens: 21904752640 | elapsed time per iteration (s): 1.03 | learning rate: 1.567E-04 | global batch size: 256 | lm loss: 2.059472E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.346 | TFLOPs: 41.04 | 15: iteration 41790/ 125429 | consumed samples: 10698240 | consumed tokens: 21909995520 | elapsed time per iteration (s): 1.04 | learning rate: 1.567E-04 | global batch size: 256 | lm loss: 2.029627E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.114 | TFLOPs: 40.84 | 15: iteration 41800/ 125429 | consumed samples: 10700800 | consumed tokens: 21915238400 | elapsed time 
per iteration (s): 1.05 | learning rate: 1.567E-04 | global batch size: 256 | lm loss: 2.029989E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.391 | TFLOPs: 40.39 | 15: iteration 41810/ 125429 | consumed samples: 10703360 | consumed tokens: 21920481280 | elapsed time per iteration (s): 1.06 | learning rate: 1.566E-04 | global batch size: 256 | lm loss: 2.060822E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.899 | TFLOPs: 39.98 | 15: iteration 41820/ 125429 | consumed samples: 10705920 | consumed tokens: 21925724160 | elapsed time per iteration (s): 1.04 | learning rate: 1.566E-04 | global batch size: 256 | lm loss: 2.027819E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.885 | TFLOPs: 40.63 | 15: iteration 41830/ 125429 | consumed samples: 10708480 | consumed tokens: 21930967040 | elapsed time per iteration (s): 1.05 | learning rate: 1.566E-04 | global batch size: 256 | lm loss: 2.035254E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.705 | TFLOPs: 40.27 | 15: iteration 41840/ 125429 | consumed samples: 10711040 | consumed tokens: 21936209920 | elapsed time per iteration (s): 1.06 | learning rate: 1.566E-04 | global batch size: 256 | lm loss: 2.038317E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.532 | TFLOPs: 39.91 | 15: iteration 41850/ 125429 | consumed samples: 10713600 | consumed tokens: 21941452800 | elapsed time per iteration (s): 1.07 | learning rate: 1.566E-04 | global batch size: 256 | lm loss: 2.033345E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.238 | TFLOPs: 
39.37 | 15: iteration 41860/ 125429 | consumed samples: 10716160 | consumed tokens: 21946695680 | elapsed time per iteration (s): 1.05 | learning rate: 1.565E-04 | global batch size: 256 | lm loss: 2.057183E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.438 | TFLOPs: 40.40 | 15: iteration 41870/ 125429 | consumed samples: 10718720 | consumed tokens: 21951938560 | elapsed time per iteration (s): 1.04 | learning rate: 1.565E-04 | global batch size: 256 | lm loss: 2.048436E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.220 | TFLOPs: 40.52 | 15: iteration 41880/ 125429 | consumed samples: 10721280 | consumed tokens: 21957181440 | elapsed time per iteration (s): 1.09 | learning rate: 1.565E-04 | global batch size: 256 | lm loss: 2.045148E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.795 | TFLOPs: 38.97 | 15: iteration 41890/ 125429 | consumed samples: 10723840 | consumed tokens: 21962424320 | elapsed time per iteration (s): 1.09 | learning rate: 1.565E-04 | global batch size: 256 | lm loss: 2.077702E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.148 | TFLOPs: 38.86 | 15: iteration 41900/ 125429 | consumed samples: 10726400 | consumed tokens: 21967667200 | elapsed time per iteration (s): 1.05 | learning rate: 1.565E-04 | global batch size: 256 | lm loss: 2.043354E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.203 | TFLOPs: 40.36 | 15: iteration 41910/ 125429 | consumed samples: 10728960 | consumed tokens: 21972910080 | elapsed time per iteration (s): 1.07 | learning rate: 1.564E-04 | global batch size: 256 | lm loss: 2.050745E+00 | grad norm: 0.133 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.158 | TFLOPs: 39.52 | 15: iteration 41920/ 125429 | consumed samples: 10731520 | consumed tokens: 21978152960 | elapsed time per iteration (s): 1.06 | learning rate: 1.564E-04 | global batch size: 256 | lm loss: 2.049129E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.537 | TFLOPs: 39.92 | 15: iteration 41930/ 125429 | consumed samples: 10734080 | consumed tokens: 21983395840 | elapsed time per iteration (s): 1.06 | learning rate: 1.564E-04 | global batch size: 256 | lm loss: 2.032368E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.316 | TFLOPs: 40.04 | 15: iteration 41940/ 125429 | consumed samples: 10736640 | consumed tokens: 21988638720 | elapsed time per iteration (s): 1.08 | learning rate: 1.564E-04 | global batch size: 256 | lm loss: 2.036565E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.898 | TFLOPs: 39.31 | 15: iteration 41950/ 125429 | consumed samples: 10739200 | consumed tokens: 21993881600 | elapsed time per iteration (s): 1.03 | learning rate: 1.564E-04 | global batch size: 256 | lm loss: 2.050533E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.924 | TFLOPs: 40.97 | 15: iteration 41960/ 125429 | consumed samples: 10741760 | consumed tokens: 21999124480 | elapsed time per iteration (s): 1.03 | learning rate: 1.563E-04 | global batch size: 256 | lm loss: 2.072204E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.850 | TFLOPs: 40.96 | 15: iteration 41970/ 125429 | consumed samples: 10744320 | consumed tokens: 22004367360 | elapsed time per iteration (s): 1.04 | 
learning rate: 1.563E-04 | global batch size: 256 | lm loss: 2.025889E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.543 | TFLOPs: 40.58 | 15: iteration 41980/ 125429 | consumed samples: 10746880 | consumed tokens: 22009610240 | elapsed time per iteration (s): 1.09 | learning rate: 1.563E-04 | global batch size: 256 | lm loss: 2.050608E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.646 | TFLOPs: 38.78 | 15: iteration 41990/ 125429 | consumed samples: 10749440 | consumed tokens: 22014853120 | elapsed time per iteration (s): 1.07 | learning rate: 1.563E-04 | global batch size: 256 | lm loss: 2.044115E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.315 | TFLOPs: 39.55 | 0: [2022-11-26 08:20:22,262] [INFO] [logging.py:68:log_dist] [Rank 0] step=42000, skipped=0, lr=[0.00015626755821045252, 0.00015626755821045252, 0.00015626755821045252], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 42000/ 125429 | consumed samples: 10752000 | consumed tokens: 22020096000 | elapsed time per iteration (s): 1.04 | learning rate: 1.563E-04 | global batch size: 256 | lm loss: 2.032934E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.150 | TFLOPs: 40.84 | 0: steps: 42000 loss: 2.0452 iter time (s): 1.072 samples/sec: 238.772 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 42000 | lm loss value: 2.012209E+00 | lm loss PPL: 7.479825E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 42000 to checkpoints_1b5 0: [2022-11-26 08:20:22,640] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step42000 is begin to save! 0: [2022-11-26 08:20:22,647] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_01-model_00-model_states.pt... 0: [2022-11-26 08:20:22,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_01-model_00-model_states.pt. 0: [2022-11-26 08:20:22,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_03-model_00-model_states.pt... 0: [2022-11-26 08:20:23,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_03-model_00-model_states.pt. 0: [2022-11-26 08:20:23,047] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_04-model_00-model_states.pt... 0: [2022-11-26 08:20:23,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_04-model_00-model_states.pt. 0: [2022-11-26 08:20:23,160] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_05-model_00-model_states.pt... 0: [2022-11-26 08:20:23,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_05-model_00-model_states.pt. 0: [2022-11-26 08:20:23,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_06-model_00-model_states.pt... 0: [2022-11-26 08:20:23,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_06-model_00-model_states.pt. 0: [2022-11-26 08:20:23,388] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_07-model_00-model_states.pt... 0: [2022-11-26 08:20:23,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 08:20:23,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_08-model_00-model_states.pt... 0: [2022-11-26 08:20:23,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_08-model_00-model_states.pt. 0: [2022-11-26 08:20:23,616] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_09-model_00-model_states.pt... 0: [2022-11-26 08:20:23,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_09-model_00-model_states.pt. 0: [2022-11-26 08:20:23,730] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_10-model_00-model_states.pt... 0: [2022-11-26 08:20:23,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_10-model_00-model_states.pt. 0: [2022-11-26 08:20:23,843] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_11-model_00-model_states.pt... 0: [2022-11-26 08:20:23,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_11-model_00-model_states.pt. 0: [2022-11-26 08:20:23,951] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_12-model_00-model_states.pt... 0: [2022-11-26 08:20:24,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_12-model_00-model_states.pt. 0: [2022-11-26 08:20:24,054] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_13-model_00-model_states.pt... 0: [2022-11-26 08:20:24,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 08:20:24,158] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_14-model_00-model_states.pt... 0: [2022-11-26 08:20:24,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_14-model_00-model_states.pt. 0: [2022-11-26 08:20:24,262] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_15-model_00-model_states.pt... 0: [2022-11-26 08:20:24,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_15-model_00-model_states.pt. 0: [2022-11-26 08:20:24,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_16-model_00-model_states.pt... 0: [2022-11-26 08:20:24,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_16-model_00-model_states.pt. 0: [2022-11-26 08:20:24,466] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_17-model_00-model_states.pt... 0: [2022-11-26 08:20:24,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_17-model_00-model_states.pt. 0: [2022-11-26 08:20:24,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_18-model_00-model_states.pt... 0: [2022-11-26 08:20:24,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_18-model_00-model_states.pt. 0: [2022-11-26 08:20:24,673] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_19-model_00-model_states.pt... 0: [2022-11-26 08:20:24,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 08:20:24,777] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_20-model_00-model_states.pt... 0: [2022-11-26 08:20:24,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_20-model_00-model_states.pt. 0: [2022-11-26 08:20:24,879] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_21-model_00-model_states.pt... 0: [2022-11-26 08:20:24,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_21-model_00-model_states.pt. 0: [2022-11-26 08:20:24,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_22-model_00-model_states.pt... 0: [2022-11-26 08:20:25,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_22-model_00-model_states.pt. 0: [2022-11-26 08:20:25,085] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_23-model_00-model_states.pt... 0: [2022-11-26 08:20:25,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_23-model_00-model_states.pt. 0: [2022-11-26 08:20:25,186] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_24-model_00-model_states.pt... 0: [2022-11-26 08:20:25,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_24-model_00-model_states.pt. 0: [2022-11-26 08:20:25,290] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_25-model_00-model_states.pt... 0: [2022-11-26 08:20:25,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 08:20:25,390] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_26-model_00-model_states.pt... 0: [2022-11-26 08:20:25,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_26-model_00-model_states.pt. 0: [2022-11-26 08:20:25,493] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_27-model_00-model_states.pt... 0: [2022-11-26 08:20:25,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_27-model_00-model_states.pt. 0: [2022-11-26 08:20:25,592] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_28-model_00-model_states.pt... 0: [2022-11-26 08:20:25,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_28-model_00-model_states.pt. 0: [2022-11-26 08:20:25,691] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_29-model_00-model_states.pt... 0: [2022-11-26 08:20:25,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_29-model_00-model_states.pt. 0: [2022-11-26 08:20:25,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_30-model_00-model_states.pt... 0: [2022-11-26 08:20:25,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_30-model_00-model_states.pt. 0: [2022-11-26 08:20:25,897] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/layer_32-model_00-model_states.pt... 0: [2022-11-26 08:20:25,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 08:20:25,900] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step42000/mp_rank_00_model_states.pt 0: [2022-11-26 08:20:25,900] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/mp_rank_00_model_states.pt... 0: [2022-11-26 08:20:25,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/mp_rank_00_model_states.pt. 0: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
7: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
6: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
2: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
8: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
9: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
2: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
10: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:20:25,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step42000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:20:26,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 08:20:26,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 08:20:26,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 6: [2022-11-26 08:20:26,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:20:26,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 08:20:26,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 6: [2022-11-26 08:20:26,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:20:26,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 08:20:26,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 3: [2022-11-26 08:20:26,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:20:26,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 08:20:26,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 7: [2022-11-26 08:20:26,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 08:20:26,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 08:20:26,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 3: [2022-11-26 08:20:26,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:20:26,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 08:20:26,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 5: [2022-11-26 08:20:26,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:20:26,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 08:20:26,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 5: [2022-11-26 08:20:26,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:20:26,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
5: [2022-11-26 08:20:26,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 2: [2022-11-26 08:20:26,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 5: [2022-11-26 08:20:26,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 2: [2022-11-26 08:20:26,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 11: [2022-11-26 08:20:26,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:20:26,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 08:20:26,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 2: [2022-11-26 08:20:26,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:20:26,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 08:20:26,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 5: [2022-11-26 08:20:26,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 08:20:26,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 08:20:26,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 7: [2022-11-26 08:20:26,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:20:26,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 08:20:26,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 6: [2022-11-26 08:20:26,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:20:26,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 08:20:26,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 7: [2022-11-26 08:20:26,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:20:26,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:20:26,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 08:20:26,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
7: [2022-11-26 08:20:26,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 08:20:26,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 11: [2022-11-26 08:20:26,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:20:26,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 08:20:26,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 13: [2022-11-26 08:20:26,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:20:26,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:20:26,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:20:26,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 08:20:26,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 08:20:26,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 08:20:26,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 08:20:26,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 08:20:26,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 13: [2022-11-26 08:20:26,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 13: [2022-11-26 08:20:26,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 13: [2022-11-26 08:20:26,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 7: [2022-11-26 08:20:26,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:20:26,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 08:20:26,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 1: [2022-11-26 08:20:26,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 08:20:26,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:20:26,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 08:20:26,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:20:26,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 08:20:26,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 1: [2022-11-26 08:20:26,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 1: [2022-11-26 08:20:26,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 08:20:26,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 5: [2022-11-26 08:20:26,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:20:26,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 08:20:26,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 13: [2022-11-26 08:20:26,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 08:20:26,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 08:20:26,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 7: [2022-11-26 08:20:26,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:20:26,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 08:20:26,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 14: [2022-11-26 08:20:26,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:20:26,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 08:20:26,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 3: [2022-11-26 08:20:26,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:20:26,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 08:20:26,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 3: [2022-11-26 08:20:26,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 08:20:26,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 08:20:26,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 6: [2022-11-26 08:20:26,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:20:26,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 08:20:26,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 3: [2022-11-26 08:20:26,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:20:26,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 08:20:26,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 11: [2022-11-26 08:20:26,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:20:26,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:20:26,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 08:20:26,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
6: [2022-11-26 08:20:26,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:20:26,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 08:20:26,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 6: [2022-11-26 08:20:26,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:20:26,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:20:26,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 08:20:26,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 08:20:26,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 6: [2022-11-26 08:20:26,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 7: [2022-11-26 08:20:26,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:20:26,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 08:20:26,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 08:20:26,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 7: [2022-11-26 08:20:26,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 08:20:26,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 6: [2022-11-26 08:20:26,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:20:26,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 08:20:26,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 5: [2022-11-26 08:20:26,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:20:26,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 08:20:26,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 2: [2022-11-26 08:20:26,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:20:26,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 08:20:26,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
2: [2022-11-26 08:20:26,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:20:26,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 08:20:26,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 14: [2022-11-26 08:20:26,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:20:26,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:20:26,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 08:20:26,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 08:20:26,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 14: [2022-11-26 08:20:26,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 0: [2022-11-26 08:20:26,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:20:26,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 08:20:26,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 08:20:26,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 0: [2022-11-26 08:20:26,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 08:20:26,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 5: [2022-11-26 08:20:26,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:20:26,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:20:26,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:20:26,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 14: [2022-11-26 08:20:26,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 08:20:26,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 08:20:26,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 14: [2022-11-26 08:20:26,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 5: [2022-11-26 08:20:26,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
7: [2022-11-26 08:20:26,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:20:26,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 08:20:26,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 13: [2022-11-26 08:20:26,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:20:26,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 11: [2022-11-26 08:20:26,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 13: [2022-11-26 08:20:26,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 11: [2022-11-26 08:20:26,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 13: [2022-11-26 08:20:26,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:20:26,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
13: [2022-11-26 08:20:26,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 11: [2022-11-26 08:20:26,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 0: [2022-11-26 08:20:26,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:20:26,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 11: [2022-11-26 08:20:26,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 13: [2022-11-26 08:20:26,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:20:26,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:20:26,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 11: [2022-11-26 08:20:26,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 13: [2022-11-26 08:20:26,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 11: [2022-11-26 08:20:26,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 11: [2022-11-26 08:20:26,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
0: [2022-11-26 08:20:26,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:20:26,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 11: [2022-11-26 08:20:26,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 0: [2022-11-26 08:20:26,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 11: [2022-11-26 08:20:26,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 0: [2022-11-26 08:20:26,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 11: [2022-11-26 08:20:26,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:20:26,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 08:20:26,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 11: [2022-11-26 08:20:26,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:20:26,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
11: [2022-11-26 08:20:26,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 08:20:26,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 0: [2022-11-26 08:20:26,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:20:26,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 08:20:26,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 4: [2022-11-26 08:20:26,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:20:26,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 08:20:26,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 4: [2022-11-26 08:20:26,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:20:26,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 08:20:26,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 4: [2022-11-26 08:20:26,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 08:20:26,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 08:20:26,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 4: [2022-11-26 08:20:26,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:20:26,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 08:20:26,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 4: [2022-11-26 08:20:26,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:20:26,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 08:20:26,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 4: [2022-11-26 08:20:26,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:20:26,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 08:20:26,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 4: [2022-11-26 08:20:26,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 08:20:26,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 08:20:26,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 4: [2022-11-26 08:20:26,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:20:26,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 08:20:26,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 5: [2022-11-26 08:20:26,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:20:26,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 08:20:26,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 14: [2022-11-26 08:20:26,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:20:26,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 08:20:26,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 14: [2022-11-26 08:20:26,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 08:20:26,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 08:20:26,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 14: [2022-11-26 08:20:26,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:20:26,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 08:20:26,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 0: [2022-11-26 08:20:26,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:20:26,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 08:20:26,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 2: [2022-11-26 08:20:26,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:20:26,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 0: [2022-11-26 08:20:26,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:20:26,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
3: [2022-11-26 08:20:26,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:20:26,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 08:20:26,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 0: [2022-11-26 08:20:26,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:20:26,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 08:20:26,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 3: [2022-11-26 08:20:26,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:20:26,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 2: [2022-11-26 08:20:26,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:20:26,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 2: [2022-11-26 08:20:26,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 08:20:26,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
2: [2022-11-26 08:20:26,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:20:26,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 08:20:26,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:20:26,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 2: [2022-11-26 08:20:26,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 08:20:26,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 1: [2022-11-26 08:20:26,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:20:26,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 08:20:26,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 1: [2022-11-26 08:20:26,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:20:26,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 08:20:26,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
1: [2022-11-26 08:20:26,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:20:26,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 08:20:26,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 1: [2022-11-26 08:20:26,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:20:26,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 08:20:26,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 9: [2022-11-26 08:20:26,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:20:26,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:20:26,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:20:26,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:20:26,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:20:26,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 08:20:26,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:20:26,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 08:20:26,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 08:20:26,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 08:20:26,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 9: [2022-11-26 08:20:26,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 08:20:26,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 9: [2022-11-26 08:20:26,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 08:20:26,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 08:20:26,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 08:20:26,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 9: [2022-11-26 08:20:26,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
9: [2022-11-26 08:20:26,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 9: [2022-11-26 08:20:26,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 9: [2022-11-26 08:20:26,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 9: [2022-11-26 08:20:26,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:20:26,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 08:20:26,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 8: [2022-11-26 08:20:26,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:20:26,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 08:20:26,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:20:26,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 8: [2022-11-26 08:20:26,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 08:20:26,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:20:26,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
8: [2022-11-26 08:20:26,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:20:26,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:20:26,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:20:26,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:20:26,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 08:20:26,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 08:20:26,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 08:20:26,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 08:20:26,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 08:20:26,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 8: [2022-11-26 08:20:26,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 8: [2022-11-26 08:20:26,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
8: [2022-11-26 08:20:26,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 8: [2022-11-26 08:20:26,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 8: [2022-11-26 08:20:26,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:20:26,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 08:20:26,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 10: [2022-11-26 08:20:26,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:20:26,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:20:26,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:20:26,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 08:20:26,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 08:20:26,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 08:20:26,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
10: [2022-11-26 08:20:26,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 10: [2022-11-26 08:20:26,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 10: [2022-11-26 08:20:26,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:20:26,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 08:20:26,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 10: [2022-11-26 08:20:26,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:20:26,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 08:20:26,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 10: [2022-11-26 08:20:26,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:20:26,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 08:20:26,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 10: [2022-11-26 08:20:26,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-26 08:20:26,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 08:20:26,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 10: [2022-11-26 08:20:26,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:20:26,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 08:20:26,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 12: [2022-11-26 08:20:26,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:20:26,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:20:26,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:20:26,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:20:26,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:20:26,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:20:26,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 08:20:26,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:20:26,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 08:20:26,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 08:20:26,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 08:20:26,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 08:20:26,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 08:20:26,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 08:20:26,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 08:20:26,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 08:20:26,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 12: [2022-11-26 08:20:26,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
12: [2022-11-26 08:20:26,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 12: [2022-11-26 08:20:26,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 12: [2022-11-26 08:20:26,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 12: [2022-11-26 08:20:26,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 12: [2022-11-26 08:20:26,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 12: [2022-11-26 08:20:26,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 15: [2022-11-26 08:20:26,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:20:26,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 08:20:26,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 15: [2022-11-26 08:20:26,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:20:26,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:20:26,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:20:26,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 08:20:26,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 08:20:26,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 08:20:26,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 08:20:26,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 15: [2022-11-26 08:20:26,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 08:20:26,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 15: [2022-11-26 08:20:26,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 15: [2022-11-26 08:20:26,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 15: [2022-11-26 08:20:26,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:20:26,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:20:26,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 08:20:26,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 08:20:26,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 08:20:26,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 15: [2022-11-26 08:20:26,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 08:20:26,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 15: [2022-11-26 08:20:26,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 0: [2022-11-26 08:20:26,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step42000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 08:20:26,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step42000 is ready now! 
0: successfully saved checkpoint at iteration 42000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3788.36 15: iteration 42010/ 125429 | consumed samples: 10754560 | consumed tokens: 22025338880 | elapsed time per iteration (s): 1.46 | learning rate: 1.562E-04 | global batch size: 256 | lm loss: 2.054222E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 175.164 | TFLOPs: 28.95 | 15: iteration 42020/ 125429 | consumed samples: 10757120 | consumed tokens: 22030581760 | elapsed time per iteration (s): 1.05 | learning rate: 1.562E-04 | global batch size: 256 | lm loss: 2.049034E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.795 | TFLOPs: 40.29 | 15: iteration 42030/ 125429 | consumed samples: 10759680 | consumed tokens: 22035824640 | elapsed time per iteration (s): 1.03 | learning rate: 1.562E-04 | global batch size: 256 | lm loss: 2.048502E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.257 | TFLOPs: 41.03 | 15: iteration 42040/ 125429 | consumed samples: 10762240 | consumed tokens: 22041067520 | elapsed time per iteration (s): 1.02 | learning rate: 1.562E-04 | global batch size: 256 | lm loss: 2.052151E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.319 | TFLOPs: 41.53 | 15: iteration 42050/ 125429 | consumed samples: 10764800 | consumed tokens: 22046310400 | elapsed time per iteration (s): 1.04 | learning rate: 1.562E-04 | global batch size: 256 | lm loss: 2.058142E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.239 | TFLOPs: 40.86 | 15: iteration 42060/ 125429 | consumed samples: 10767360 | consumed tokens: 22051553280 | elapsed time per iteration (s): 1.03 | 
learning rate: 1.562E-04 | global batch size: 256 | lm loss: 2.044342E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.556 | TFLOPs: 40.91 | 15: iteration 42070/ 125429 | consumed samples: 10769920 | consumed tokens: 22056796160 | elapsed time per iteration (s): 1.06 | learning rate: 1.561E-04 | global batch size: 256 | lm loss: 2.046026E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.515 | TFLOPs: 39.75 | 15: iteration 42080/ 125429 | consumed samples: 10772480 | consumed tokens: 22062039040 | elapsed time per iteration (s): 1.05 | learning rate: 1.561E-04 | global batch size: 256 | lm loss: 2.051388E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.895 | TFLOPs: 40.47 | 15: iteration 42090/ 125429 | consumed samples: 10775040 | consumed tokens: 22067281920 | elapsed time per iteration (s): 1.06 | learning rate: 1.561E-04 | global batch size: 256 | lm loss: 2.054648E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.904 | TFLOPs: 39.81 | 15: iteration 42100/ 125429 | consumed samples: 10777600 | consumed tokens: 22072524800 | elapsed time per iteration (s): 1.06 | learning rate: 1.561E-04 | global batch size: 256 | lm loss: 2.051008E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.502 | TFLOPs: 40.08 | 15: iteration 42110/ 125429 | consumed samples: 10780160 | consumed tokens: 22077767680 | elapsed time per iteration (s): 1.07 | learning rate: 1.561E-04 | global batch size: 256 | lm loss: 2.051723E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.298 | TFLOPs: 39.38 | 15: iteration 42120/ 
125429 | consumed samples: 10782720 | consumed tokens: 22083010560 | elapsed time per iteration (s): 1.03 | learning rate: 1.560E-04 | global batch size: 256 | lm loss: 2.047966E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.522 | TFLOPs: 41.24 | 15: iteration 42130/ 125429 | consumed samples: 10785280 | consumed tokens: 22088253440 | elapsed time per iteration (s): 1.07 | learning rate: 1.560E-04 | global batch size: 256 | lm loss: 2.059832E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.754 | TFLOPs: 39.62 | 15: iteration 42140/ 125429 | consumed samples: 10787840 | consumed tokens: 22093496320 | elapsed time per iteration (s): 1.02 | learning rate: 1.560E-04 | global batch size: 256 | lm loss: 2.039108E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.272 | TFLOPs: 41.52 | 15: iteration 42150/ 125429 | consumed samples: 10790400 | consumed tokens: 22098739200 | elapsed time per iteration (s): 1.06 | learning rate: 1.560E-04 | global batch size: 256 | lm loss: 2.045977E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.761 | TFLOPs: 39.79 | 15: iteration 42160/ 125429 | consumed samples: 10792960 | consumed tokens: 22103982080 | elapsed time per iteration (s): 1.04 | learning rate: 1.560E-04 | global batch size: 256 | lm loss: 2.060349E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.152 | TFLOPs: 40.51 | 15: iteration 42170/ 125429 | consumed samples: 10795520 | consumed tokens: 22109224960 | elapsed time per iteration (s): 1.03 | learning rate: 1.559E-04 | global batch size: 256 | lm loss: 2.045895E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 248.180 | TFLOPs: 41.01 | 15: iteration 42180/ 125429 | consumed samples: 10798080 | consumed tokens: 22114467840 | elapsed time per iteration (s): 1.04 | learning rate: 1.559E-04 | global batch size: 256 | lm loss: 2.032527E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.196 | TFLOPs: 40.85 | 15: iteration 42190/ 125429 | consumed samples: 10800640 | consumed tokens: 22119710720 | elapsed time per iteration (s): 1.04 | learning rate: 1.559E-04 | global batch size: 256 | lm loss: 2.048116E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.109 | TFLOPs: 40.67 | 15: iteration 42200/ 125429 | consumed samples: 10803200 | consumed tokens: 22124953600 | elapsed time per iteration (s): 1.04 | learning rate: 1.559E-04 | global batch size: 256 | lm loss: 2.071865E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.104 | TFLOPs: 40.67 | 15: iteration 42210/ 125429 | consumed samples: 10805760 | consumed tokens: 22130196480 | elapsed time per iteration (s): 1.05 | learning rate: 1.559E-04 | global batch size: 256 | lm loss: 2.020820E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.925 | TFLOPs: 40.15 | 15: iteration 42220/ 125429 | consumed samples: 10808320 | consumed tokens: 22135439360 | elapsed time per iteration (s): 1.04 | learning rate: 1.558E-04 | global batch size: 256 | lm loss: 2.036273E+00 | grad norm: 0.208 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.073 | TFLOPs: 40.50 | 15: iteration 42230/ 125429 | consumed samples: 10810880 | consumed tokens: 22140682240 | elapsed time per iteration (s): 1.03 | learning rate: 
1.558E-04 | global batch size: 256 | lm loss: 2.019450E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.293 | TFLOPs: 41.20 | 15: iteration 42240/ 125429 | consumed samples: 10813440 | consumed tokens: 22145925120 | elapsed time per iteration (s): 1.04 | learning rate: 1.558E-04 | global batch size: 256 | lm loss: 2.044663E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.137 | TFLOPs: 40.51 | 15: iteration 42250/ 125429 | consumed samples: 10816000 | consumed tokens: 22151168000 | elapsed time per iteration (s): 1.05 | learning rate: 1.558E-04 | global batch size: 256 | lm loss: 2.072890E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.593 | TFLOPs: 40.42 | 15: iteration 42260/ 125429 | consumed samples: 10818560 | consumed tokens: 22156410880 | elapsed time per iteration (s): 1.05 | learning rate: 1.558E-04 | global batch size: 256 | lm loss: 2.041163E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.760 | TFLOPs: 40.45 | 15: iteration 42270/ 125429 | consumed samples: 10821120 | consumed tokens: 22161653760 | elapsed time per iteration (s): 1.05 | learning rate: 1.557E-04 | global batch size: 256 | lm loss: 2.034519E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.486 | TFLOPs: 40.24 | 15: iteration 42280/ 125429 | consumed samples: 10823680 | consumed tokens: 22166896640 | elapsed time per iteration (s): 1.04 | learning rate: 1.557E-04 | global batch size: 256 | lm loss: 2.053914E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.935 | TFLOPs: 40.64 | 15: iteration 42290/ 125429 | 
consumed samples: 10826240 | consumed tokens: 22172139520 | elapsed time per iteration (s): 1.06 | learning rate: 1.557E-04 | global batch size: 256 | lm loss: 2.037743E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.956 | TFLOPs: 39.82 | 15: iteration 42300/ 125429 | consumed samples: 10828800 | consumed tokens: 22177382400 | elapsed time per iteration (s): 1.06 | learning rate: 1.557E-04 | global batch size: 256 | lm loss: 2.037189E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.643 | TFLOPs: 39.93 | 15: iteration 42310/ 125429 | consumed samples: 10831360 | consumed tokens: 22182625280 | elapsed time per iteration (s): 1.05 | learning rate: 1.557E-04 | global batch size: 256 | lm loss: 2.033917E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.903 | TFLOPs: 40.31 | 15: iteration 42320/ 125429 | consumed samples: 10833920 | consumed tokens: 22187868160 | elapsed time per iteration (s): 1.05 | learning rate: 1.556E-04 | global batch size: 256 | lm loss: 2.007099E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.789 | TFLOPs: 40.12 | 15: iteration 42330/ 125429 | consumed samples: 10836480 | consumed tokens: 22193111040 | elapsed time per iteration (s): 1.07 | learning rate: 1.556E-04 | global batch size: 256 | lm loss: 2.053693E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.734 | TFLOPs: 39.62 | 15: iteration 42340/ 125429 | consumed samples: 10839040 | consumed tokens: 22198353920 | elapsed time per iteration (s): 1.03 | learning rate: 1.556E-04 | global batch size: 256 | lm loss: 2.037763E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 248.747 | TFLOPs: 41.11 | 15: iteration 42350/ 125429 | consumed samples: 10841600 | consumed tokens: 22203596800 | elapsed time per iteration (s): 1.03 | learning rate: 1.556E-04 | global batch size: 256 | lm loss: 2.031896E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.121 | TFLOPs: 41.00 | 15: iteration 42360/ 125429 | consumed samples: 10844160 | consumed tokens: 22208839680 | elapsed time per iteration (s): 1.05 | learning rate: 1.556E-04 | global batch size: 256 | lm loss: 2.057638E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.837 | TFLOPs: 40.30 | 15: iteration 42370/ 125429 | consumed samples: 10846720 | consumed tokens: 22214082560 | elapsed time per iteration (s): 1.03 | learning rate: 1.555E-04 | global batch size: 256 | lm loss: 2.056255E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.527 | TFLOPs: 40.91 | 15: iteration 42380/ 125429 | consumed samples: 10849280 | consumed tokens: 22219325440 | elapsed time per iteration (s): 1.07 | learning rate: 1.555E-04 | global batch size: 256 | lm loss: 2.027121E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.397 | TFLOPs: 39.56 | 15: iteration 42390/ 125429 | consumed samples: 10851840 | consumed tokens: 22224568320 | elapsed time per iteration (s): 1.03 | learning rate: 1.555E-04 | global batch size: 256 | lm loss: 2.044118E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.986 | TFLOPs: 40.98 | 15: iteration 42400/ 125429 | consumed samples: 10854400 | consumed tokens: 22229811200 | elapsed time per iteration (s): 1.03 | learning rate: 1.555E-04 | global batch 
size: 256 | lm loss: 2.082012E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.787 | TFLOPs: 41.11 | 15: iteration 42410/ 125429 | consumed samples: 10856960 | consumed tokens: 22235054080 | elapsed time per iteration (s): 1.06 | learning rate: 1.555E-04 | global batch size: 256 | lm loss: 2.005442E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.656 | TFLOPs: 39.77 | 15: iteration 42420/ 125429 | consumed samples: 10859520 | consumed tokens: 22240296960 | elapsed time per iteration (s): 1.05 | learning rate: 1.554E-04 | global batch size: 256 | lm loss: 2.020480E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.239 | TFLOPs: 40.36 | 15: iteration 42430/ 125429 | consumed samples: 10862080 | consumed tokens: 22245539840 | elapsed time per iteration (s): 1.05 | learning rate: 1.554E-04 | global batch size: 256 | lm loss: 2.016442E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.316 | TFLOPs: 40.21 | 15: iteration 42440/ 125429 | consumed samples: 10864640 | consumed tokens: 22250782720 | elapsed time per iteration (s): 1.05 | learning rate: 1.554E-04 | global batch size: 256 | lm loss: 2.050652E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.047 | TFLOPs: 40.33 | 15: iteration 42450/ 125429 | consumed samples: 10867200 | consumed tokens: 22256025600 | elapsed time per iteration (s): 1.03 | learning rate: 1.554E-04 | global batch size: 256 | lm loss: 2.040532E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.483 | TFLOPs: 40.90 | 15: iteration 42460/ 125429 | consumed samples: 10869760 | 
consumed tokens: 22261268480 | elapsed time per iteration (s): 1.04 | learning rate: 1.554E-04 | global batch size: 256 | lm loss: 2.058310E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.094 | TFLOPs: 40.50 | 15: iteration 42470/ 125429 | consumed samples: 10872320 | consumed tokens: 22266511360 | elapsed time per iteration (s): 1.05 | learning rate: 1.553E-04 | global batch size: 256 | lm loss: 2.027594E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.566 | TFLOPs: 40.42 | 15: iteration 42480/ 125429 | consumed samples: 10874880 | consumed tokens: 22271754240 | elapsed time per iteration (s): 1.04 | learning rate: 1.553E-04 | global batch size: 256 | lm loss: 2.029462E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.657 | TFLOPs: 40.76 | 15: iteration 42490/ 125429 | consumed samples: 10877440 | consumed tokens: 22276997120 | elapsed time per iteration (s): 1.04 | learning rate: 1.553E-04 | global batch size: 256 | lm loss: 2.075474E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.385 | TFLOPs: 40.72 | 15: iteration 42500/ 125429 | consumed samples: 10880000 | consumed tokens: 22282240000 | elapsed time per iteration (s): 1.03 | learning rate: 1.553E-04 | global batch size: 256 | lm loss: 2.048067E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.235 | TFLOPs: 41.02 | 15: iteration 42510/ 125429 | consumed samples: 10882560 | consumed tokens: 22287482880 | elapsed time per iteration (s): 1.10 | learning rate: 1.553E-04 | global batch size: 256 | lm loss: 2.065711E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 233.401 | TFLOPs: 38.57 | 15: iteration 42520/ 125429 | consumed samples: 10885120 | consumed tokens: 22292725760 | elapsed time per iteration (s): 1.08 | learning rate: 1.552E-04 | global batch size: 256 | lm loss: 2.010948E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.961 | TFLOPs: 39.32 | 15: iteration 42530/ 125429 | consumed samples: 10887680 | consumed tokens: 22297968640 | elapsed time per iteration (s): 1.07 | learning rate: 1.552E-04 | global batch size: 256 | lm loss: 2.040229E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.353 | TFLOPs: 39.39 | 15: iteration 42540/ 125429 | consumed samples: 10890240 | consumed tokens: 22303211520 | elapsed time per iteration (s): 1.03 | learning rate: 1.552E-04 | global batch size: 256 | lm loss: 2.019655E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.344 | TFLOPs: 40.88 | 15: iteration 42550/ 125429 | consumed samples: 10892800 | consumed tokens: 22308454400 | elapsed time per iteration (s): 1.03 | learning rate: 1.552E-04 | global batch size: 256 | lm loss: 2.049826E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.699 | TFLOPs: 41.26 | 15: iteration 42560/ 125429 | consumed samples: 10895360 | consumed tokens: 22313697280 | elapsed time per iteration (s): 1.04 | learning rate: 1.552E-04 | global batch size: 256 | lm loss: 2.032276E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.304 | TFLOPs: 40.87 | 15: iteration 42570/ 125429 | consumed samples: 10897920 | consumed tokens: 22318940160 | elapsed time per iteration (s): 1.03 | learning rate: 1.551E-04 | global batch size: 256 | lm loss: 
2.034969E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.856 | TFLOPs: 40.96 | 15: iteration 42580/ 125429 | consumed samples: 10900480 | consumed tokens: 22324183040 | elapsed time per iteration (s): 1.03 | learning rate: 1.551E-04 | global batch size: 256 | lm loss: 2.046244E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.866 | TFLOPs: 41.13 | 15: iteration 42590/ 125429 | consumed samples: 10903040 | consumed tokens: 22329425920 | elapsed time per iteration (s): 1.04 | learning rate: 1.551E-04 | global batch size: 256 | lm loss: 2.057826E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.451 | TFLOPs: 40.56 | 15: iteration 42600/ 125429 | consumed samples: 10905600 | consumed tokens: 22334668800 | elapsed time per iteration (s): 1.02 | learning rate: 1.551E-04 | global batch size: 256 | lm loss: 2.049778E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.098 | TFLOPs: 41.33 | 15: iteration 42610/ 125429 | consumed samples: 10908160 | consumed tokens: 22339911680 | elapsed time per iteration (s): 1.07 | learning rate: 1.551E-04 | global batch size: 256 | lm loss: 2.028963E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.229 | TFLOPs: 39.70 | 15: iteration 42620/ 125429 | consumed samples: 10910720 | consumed tokens: 22345154560 | elapsed time per iteration (s): 1.03 | learning rate: 1.551E-04 | global batch size: 256 | lm loss: 2.018900E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.373 | TFLOPs: 41.21 | 15: iteration 42630/ 125429 | consumed samples: 10913280 | consumed tokens: 
22350397440 | elapsed time per iteration (s): 1.04 | learning rate: 1.550E-04 | global batch size: 256 | lm loss: 2.015144E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.986 | TFLOPs: 40.65 | 15: iteration 42640/ 125429 | consumed samples: 10915840 | consumed tokens: 22355640320 | elapsed time per iteration (s): 1.04 | learning rate: 1.550E-04 | global batch size: 256 | lm loss: 2.042273E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.653 | TFLOPs: 40.60 | 15: iteration 42650/ 125429 | consumed samples: 10918400 | consumed tokens: 22360883200 | elapsed time per iteration (s): 1.05 | learning rate: 1.550E-04 | global batch size: 256 | lm loss: 2.028690E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.430 | TFLOPs: 40.39 | 15: iteration 42660/ 125429 | consumed samples: 10920960 | consumed tokens: 22366126080 | elapsed time per iteration (s): 1.04 | learning rate: 1.550E-04 | global batch size: 256 | lm loss: 2.021787E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.313 | TFLOPs: 40.87 | 15: iteration 42670/ 125429 | consumed samples: 10923520 | consumed tokens: 22371368960 | elapsed time per iteration (s): 1.04 | learning rate: 1.550E-04 | global batch size: 256 | lm loss: 2.041215E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.584 | TFLOPs: 40.58 | 15: iteration 42680/ 125429 | consumed samples: 10926080 | consumed tokens: 22376611840 | elapsed time per iteration (s): 1.02 | learning rate: 1.549E-04 | global batch size: 256 | lm loss: 2.049109E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 250.981 | TFLOPs: 41.48 | 15: iteration 42690/ 125429 | consumed samples: 10928640 | consumed tokens: 22381854720 | elapsed time per iteration (s): 1.06 | learning rate: 1.549E-04 | global batch size: 256 | lm loss: 2.014076E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.454 | TFLOPs: 39.74 | 15: iteration 42700/ 125429 | consumed samples: 10931200 | consumed tokens: 22387097600 | elapsed time per iteration (s): 1.10 | learning rate: 1.549E-04 | global batch size: 256 | lm loss: 2.059536E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.610 | TFLOPs: 38.61 | 15: iteration 42710/ 125429 | consumed samples: 10933760 | consumed tokens: 22392340480 | elapsed time per iteration (s): 1.03 | learning rate: 1.549E-04 | global batch size: 256 | lm loss: 2.056471E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.573 | TFLOPs: 41.24 | 15: iteration 42720/ 125429 | consumed samples: 10936320 | consumed tokens: 22397583360 | elapsed time per iteration (s): 1.75 | learning rate: 1.549E-04 | global batch size: 256 | lm loss: 2.038558E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 146.661 | TFLOPs: 24.24 | 15: iteration 42730/ 125429 | consumed samples: 10938880 | consumed tokens: 22402826240 | elapsed time per iteration (s): 1.04 | learning rate: 1.548E-04 | global batch size: 256 | lm loss: 2.029937E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.089 | TFLOPs: 40.67 | 15: iteration 42740/ 125429 | consumed samples: 10941440 | consumed tokens: 22408069120 | elapsed time per iteration (s): 1.07 | learning rate: 1.548E-04 | global batch size: 256 | lm loss: 2.030275E+00 | grad 
norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.195 | TFLOPs: 39.69 | 15: iteration 42750/ 125429 | consumed samples: 10944000 | consumed tokens: 22413312000 | elapsed time per iteration (s): 1.03 | learning rate: 1.548E-04 | global batch size: 256 | lm loss: 2.020576E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.022 | TFLOPs: 40.99 | 15: iteration 42760/ 125429 | consumed samples: 10946560 | consumed tokens: 22418554880 | elapsed time per iteration (s): 1.05 | learning rate: 1.548E-04 | global batch size: 256 | lm loss: 2.024794E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.229 | TFLOPs: 40.36 | 15: iteration 42770/ 125429 | consumed samples: 10949120 | consumed tokens: 22423797760 | elapsed time per iteration (s): 1.07 | learning rate: 1.548E-04 | global batch size: 256 | lm loss: 2.048194E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.252 | TFLOPs: 39.54 | 15: iteration 42780/ 125429 | consumed samples: 10951680 | consumed tokens: 22429040640 | elapsed time per iteration (s): 1.03 | learning rate: 1.547E-04 | global batch size: 256 | lm loss: 2.035325E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.468 | TFLOPs: 41.23 | 15: iteration 42790/ 125429 | consumed samples: 10954240 | consumed tokens: 22434283520 | elapsed time per iteration (s): 1.06 | learning rate: 1.547E-04 | global batch size: 256 | lm loss: 2.049533E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.952 | TFLOPs: 39.82 | 15: iteration 42800/ 125429 | consumed samples: 10956800 | consumed tokens: 22439526400 | elapsed time 
per iteration (s): 1.04 | learning rate: 1.547E-04 | global batch size: 256 | lm loss: 2.034636E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.994 | TFLOPs: 40.65 | 15: iteration 42810/ 125429 | consumed samples: 10959360 | consumed tokens: 22444769280 | elapsed time per iteration (s): 1.05 | learning rate: 1.547E-04 | global batch size: 256 | lm loss: 2.011237E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.048 | TFLOPs: 40.33 | 15: iteration 42820/ 125429 | consumed samples: 10961920 | consumed tokens: 22450012160 | elapsed time per iteration (s): 1.06 | learning rate: 1.547E-04 | global batch size: 256 | lm loss: 2.029273E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.424 | TFLOPs: 39.90 | 15: iteration 42830/ 125429 | consumed samples: 10964480 | consumed tokens: 22455255040 | elapsed time per iteration (s): 1.02 | learning rate: 1.546E-04 | global batch size: 256 | lm loss: 2.038818E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.531 | TFLOPs: 41.40 | 15: iteration 42840/ 125429 | consumed samples: 10967040 | consumed tokens: 22460497920 | elapsed time per iteration (s): 1.05 | learning rate: 1.546E-04 | global batch size: 256 | lm loss: 2.069921E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.686 | TFLOPs: 40.27 | 15: iteration 42850/ 125429 | consumed samples: 10969600 | consumed tokens: 22465740800 | elapsed time per iteration (s): 1.02 | learning rate: 1.546E-04 | global batch size: 256 | lm loss: 2.038475E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.769 | TFLOPs: 
41.28 | 15: iteration 42860/ 125429 | consumed samples: 10972160 | consumed tokens: 22470983680 | elapsed time per iteration (s): 1.07 | learning rate: 1.546E-04 | global batch size: 256 | lm loss: 2.039585E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.705 | TFLOPs: 39.45 | 15: iteration 42870/ 125429 | consumed samples: 10974720 | consumed tokens: 22476226560 | elapsed time per iteration (s): 1.08 | learning rate: 1.546E-04 | global batch size: 256 | lm loss: 2.027760E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.774 | TFLOPs: 39.29 | 15: iteration 42880/ 125429 | consumed samples: 10977280 | consumed tokens: 22481469440 | elapsed time per iteration (s): 1.09 | learning rate: 1.545E-04 | global batch size: 256 | lm loss: 2.014191E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.539 | TFLOPs: 38.76 | 15: iteration 42890/ 125429 | consumed samples: 10979840 | consumed tokens: 22486712320 | elapsed time per iteration (s): 1.05 | learning rate: 1.545E-04 | global batch size: 256 | lm loss: 2.033470E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.019 | TFLOPs: 40.33 | 15: iteration 42900/ 125429 | consumed samples: 10982400 | consumed tokens: 22491955200 | elapsed time per iteration (s): 1.03 | learning rate: 1.545E-04 | global batch size: 256 | lm loss: 2.067355E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.490 | TFLOPs: 40.90 | 15: iteration 42910/ 125429 | consumed samples: 10984960 | consumed tokens: 22497198080 | elapsed time per iteration (s): 1.07 | learning rate: 1.545E-04 | global batch size: 256 | lm loss: 2.050430E+00 | grad norm: 0.132 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.582 | TFLOPs: 39.59 | 15: iteration 42920/ 125429 | consumed samples: 10987520 | consumed tokens: 22502440960 | elapsed time per iteration (s): 1.02 | learning rate: 1.545E-04 | global batch size: 256 | lm loss: 2.036408E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.819 | TFLOPs: 41.28 | 15: iteration 42930/ 125429 | consumed samples: 10990080 | consumed tokens: 22507683840 | elapsed time per iteration (s): 1.04 | learning rate: 1.544E-04 | global batch size: 256 | lm loss: 2.059029E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.514 | TFLOPs: 40.57 | 15: iteration 42940/ 125429 | consumed samples: 10992640 | consumed tokens: 22512926720 | elapsed time per iteration (s): 1.04 | learning rate: 1.544E-04 | global batch size: 256 | lm loss: 2.057414E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.675 | TFLOPs: 40.76 | 15: iteration 42950/ 125429 | consumed samples: 10995200 | consumed tokens: 22518169600 | elapsed time per iteration (s): 1.04 | learning rate: 1.544E-04 | global batch size: 256 | lm loss: 2.048808E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.064 | TFLOPs: 40.83 | 15: iteration 42960/ 125429 | consumed samples: 10997760 | consumed tokens: 22523412480 | elapsed time per iteration (s): 1.07 | learning rate: 1.544E-04 | global batch size: 256 | lm loss: 2.023201E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.243 | TFLOPs: 39.70 | 15: iteration 42970/ 125429 | consumed samples: 11000320 | consumed tokens: 22528655360 | elapsed time per iteration (s): 1.03 | 
learning rate: 1.544E-04 | global batch size: 256 | lm loss: 2.014770E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.848 | TFLOPs: 40.96 | 15: iteration 42980/ 125429 | consumed samples: 11002880 | consumed tokens: 22533898240 | elapsed time per iteration (s): 1.02 | learning rate: 1.543E-04 | global batch size: 256 | lm loss: 2.049221E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.493 | TFLOPs: 41.56 | 15: iteration 42990/ 125429 | consumed samples: 11005440 | consumed tokens: 22539141120 | elapsed time per iteration (s): 1.04 | learning rate: 1.543E-04 | global batch size: 256 | lm loss: 2.038894E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.277 | TFLOPs: 40.53 | 15: iteration 43000/ 125429 | consumed samples: 11008000 | consumed tokens: 22544384000 | elapsed time per iteration (s): 1.02 | learning rate: 1.543E-04 | global batch size: 256 | lm loss: 2.045775E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.783 | TFLOPs: 41.28 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 43000 | lm loss value: 2.041719E+00 | lm loss PPL: 7.703844E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 43000 to checkpoints_1b5 0: [2022-11-26 08:37:59,657] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step43000 is begin to save! 0: [2022-11-26 08:37:59,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 08:37:59,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_01-model_00-model_states.pt. 0: [2022-11-26 08:37:59,922] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_03-model_00-model_states.pt... 0: [2022-11-26 08:38:00,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_03-model_00-model_states.pt. 0: [2022-11-26 08:38:00,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_04-model_00-model_states.pt... 0: [2022-11-26 08:38:00,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_04-model_00-model_states.pt. 0: [2022-11-26 08:38:00,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_05-model_00-model_states.pt... 0: [2022-11-26 08:38:00,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_05-model_00-model_states.pt. 0: [2022-11-26 08:38:00,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_06-model_00-model_states.pt... 0: [2022-11-26 08:38:00,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_06-model_00-model_states.pt. 0: [2022-11-26 08:38:00,343] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_07-model_00-model_states.pt... 0: [2022-11-26 08:38:00,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_07-model_00-model_states.pt. 0: [2022-11-26 08:38:00,448] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 08:38:00,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_08-model_00-model_states.pt. 0: [2022-11-26 08:38:00,553] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_09-model_00-model_states.pt... 0: [2022-11-26 08:38:00,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_09-model_00-model_states.pt. 0: [2022-11-26 08:38:00,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_10-model_00-model_states.pt... 0: [2022-11-26 08:38:00,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_10-model_00-model_states.pt. 0: [2022-11-26 08:38:00,765] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_11-model_00-model_states.pt... 0: [2022-11-26 08:38:00,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_11-model_00-model_states.pt. 0: [2022-11-26 08:38:00,867] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_12-model_00-model_states.pt... 0: [2022-11-26 08:38:00,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_12-model_00-model_states.pt. 0: [2022-11-26 08:38:00,972] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_13-model_00-model_states.pt... 0: [2022-11-26 08:38:01,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_13-model_00-model_states.pt. 0: [2022-11-26 08:38:01,076] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 08:38:01,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_14-model_00-model_states.pt. 0: [2022-11-26 08:38:01,178] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_15-model_00-model_states.pt... 0: [2022-11-26 08:38:01,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_15-model_00-model_states.pt. 0: [2022-11-26 08:38:01,285] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_16-model_00-model_states.pt... 0: [2022-11-26 08:38:01,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_16-model_00-model_states.pt. 0: [2022-11-26 08:38:01,388] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_17-model_00-model_states.pt... 0: [2022-11-26 08:38:01,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_17-model_00-model_states.pt. 0: [2022-11-26 08:38:01,495] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_18-model_00-model_states.pt... 0: [2022-11-26 08:38:01,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_18-model_00-model_states.pt. 0: [2022-11-26 08:38:01,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_19-model_00-model_states.pt... 0: [2022-11-26 08:38:01,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_19-model_00-model_states.pt. 0: [2022-11-26 08:38:01,704] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 08:38:01,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_20-model_00-model_states.pt. 0: [2022-11-26 08:38:01,804] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_21-model_00-model_states.pt... 0: [2022-11-26 08:38:01,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_21-model_00-model_states.pt. 0: [2022-11-26 08:38:01,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_22-model_00-model_states.pt... 0: [2022-11-26 08:38:02,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_22-model_00-model_states.pt. 0: [2022-11-26 08:38:02,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_23-model_00-model_states.pt... 0: [2022-11-26 08:38:02,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_23-model_00-model_states.pt. 0: [2022-11-26 08:38:02,118] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_24-model_00-model_states.pt... 0: [2022-11-26 08:38:02,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_24-model_00-model_states.pt. 0: [2022-11-26 08:38:02,218] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_25-model_00-model_states.pt... 0: [2022-11-26 08:38:02,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_25-model_00-model_states.pt. 0: [2022-11-26 08:38:02,328] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 08:38:02,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_26-model_00-model_states.pt. 0: [2022-11-26 08:38:02,430] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_27-model_00-model_states.pt... 0: [2022-11-26 08:38:02,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_27-model_00-model_states.pt. 0: [2022-11-26 08:38:02,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_28-model_00-model_states.pt... 0: [2022-11-26 08:38:02,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_28-model_00-model_states.pt. 0: [2022-11-26 08:38:02,639] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_29-model_00-model_states.pt... 0: [2022-11-26 08:38:02,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_29-model_00-model_states.pt. 0: [2022-11-26 08:38:02,743] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_30-model_00-model_states.pt... 0: [2022-11-26 08:38:02,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_30-model_00-model_states.pt. 0: [2022-11-26 08:38:02,843] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/layer_32-model_00-model_states.pt... 0: [2022-11-26 08:38:02,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 08:38:02,850] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step43000/mp_rank_00_model_states.pt 0: [2022-11-26 08:38:02,850] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/mp_rank_00_model_states.pt... 0: [2022-11-26 08:38:02,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/mp_rank_00_model_states.pt. 0: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
7: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
12: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 
5: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
9: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
6: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
3: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
11: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
15: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
3: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:38:02,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step43000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:38:03,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 08:38:03,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 08:38:03,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 2: [2022-11-26 08:38:03,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:38:03,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 08:38:03,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 10: [2022-11-26 08:38:03,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:38:03,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 08:38:03,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 3: [2022-11-26 08:38:03,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:38:03,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 08:38:03,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 08:38:03,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 08:38:03,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 3: [2022-11-26 08:38:03,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 11: [2022-11-26 08:38:03,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:38:03,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:38:03,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 08:38:03,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 0: [2022-11-26 08:38:03,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:38:03,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 08:38:03,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 12: [2022-11-26 08:38:03,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
3: [2022-11-26 08:38:03,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:38:03,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 08:38:03,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 3: [2022-11-26 08:38:03,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 1: [2022-11-26 08:38:03,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:38:03,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:38:03,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:38:03,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:38:03,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
1: [2022-11-26 08:38:03,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 08:38:03,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 08:38:03,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 08:38:03,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 08:38:03,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 1: [2022-11-26 08:38:03,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 1: [2022-11-26 08:38:03,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 1: [2022-11-26 08:38:03,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 9: [2022-11-26 08:38:03,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:38:03,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 08:38:03,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:38:03,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
9: [2022-11-26 08:38:03,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 08:38:03,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 9: [2022-11-26 08:38:03,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:38:03,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 08:38:03,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 1: [2022-11-26 08:38:03,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:38:03,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 08:38:03,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 13: [2022-11-26 08:38:03,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:38:03,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 08:38:03,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
11: [2022-11-26 08:38:03,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 08:38:03,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 2: [2022-11-26 08:38:03,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:38:03,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 08:38:03,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 0: [2022-11-26 08:38:03,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:38:03,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 08:38:03,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 9: [2022-11-26 08:38:03,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:38:03,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 08:38:03,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 0: [2022-11-26 08:38:03,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 08:38:03,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 08:38:03,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 1: [2022-11-26 08:38:03,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:38:03,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:38:03,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 08:38:03,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 08:38:03,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 1: [2022-11-26 08:38:03,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 0: [2022-11-26 08:38:03,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:38:03,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 08:38:03,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 0: [2022-11-26 08:38:03,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
15: [2022-11-26 08:38:03,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:38:03,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:38:03,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:38:03,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:38:03,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 08:38:03,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:38:03,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 08:38:03,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 3: [2022-11-26 08:38:03,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 08:38:03,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 3: [2022-11-26 08:38:03,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 0: [2022-11-26 08:38:03,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 08:38:03,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 08:38:03,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 2: [2022-11-26 08:38:03,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:38:03,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 08:38:03,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 9: [2022-11-26 08:38:03,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:38:03,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 08:38:03,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 0: [2022-11-26 08:38:03,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:38:03,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:38:03,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 08:38:03,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
12: [2022-11-26 08:38:03,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:38:03,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 08:38:03,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 9: [2022-11-26 08:38:03,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:38:03,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 08:38:03,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 10: [2022-11-26 08:38:03,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:38:03,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 08:38:03,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 9: [2022-11-26 08:38:03,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:38:03,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 08:38:03,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
0: [2022-11-26 08:38:03,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:38:03,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 08:38:03,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 11: [2022-11-26 08:38:03,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:38:03,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 11: [2022-11-26 08:38:03,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 0: [2022-11-26 08:38:03,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 3: [2022-11-26 08:38:03,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:38:03,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:38:03,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 3: [2022-11-26 08:38:03,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 12: [2022-11-26 08:38:03,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 08:38:03,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 3: [2022-11-26 08:38:03,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 08:38:03,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 3: [2022-11-26 08:38:03,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 12: [2022-11-26 08:38:03,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 12: [2022-11-26 08:38:03,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:38:03,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 08:38:03,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 10: [2022-11-26 08:38:03,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:38:03,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:38:03,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:38:03,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
8: [2022-11-26 08:38:03,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 08:38:03,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 10: [2022-11-26 08:38:03,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 08:38:03,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 8: [2022-11-26 08:38:03,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:38:03,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 8: [2022-11-26 08:38:03,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:38:03,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:38:03,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 10: [2022-11-26 08:38:03,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 10: [2022-11-26 08:38:03,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
8: [2022-11-26 08:38:03,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 10: [2022-11-26 08:38:03,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 8: [2022-11-26 08:38:03,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 8: [2022-11-26 08:38:03,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 08:38:03,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 10: [2022-11-26 08:38:03,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:38:03,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 08:38:03,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 10: [2022-11-26 08:38:03,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:38:03,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 08:38:03,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
15: [2022-11-26 08:38:03,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 08:38:03,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 08:38:03,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 15: [2022-11-26 08:38:03,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 15: [2022-11-26 08:38:03,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:38:03,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 08:38:03,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 15: [2022-11-26 08:38:03,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:38:03,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:38:03,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 08:38:03,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
15: [2022-11-26 08:38:03,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 08:38:03,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 15: [2022-11-26 08:38:03,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:38:03,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 08:38:03,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 15: [2022-11-26 08:38:03,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:38:03,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 08:38:03,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 15: [2022-11-26 08:38:03,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:38:03,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 08:38:03,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 10: [2022-11-26 08:38:03,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 08:38:03,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 08:38:03,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 2: [2022-11-26 08:38:03,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:38:03,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 08:38:03,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 2: [2022-11-26 08:38:03,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:38:03,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 08:38:03,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 2: [2022-11-26 08:38:03,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:38:03,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 08:38:03,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 1: [2022-11-26 08:38:03,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 08:38:03,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 08:38:03,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 11: [2022-11-26 08:38:03,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:38:03,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:38:03,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 08:38:03,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 8: [2022-11-26 08:38:03,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 13: [2022-11-26 08:38:03,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:38:03,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 08:38:03,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 13: [2022-11-26 08:38:03,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 11: [2022-11-26 08:38:03,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
8: [2022-11-26 08:38:03,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:38:03,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 11: [2022-11-26 08:38:03,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 8: [2022-11-26 08:38:03,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 13: [2022-11-26 08:38:03,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:38:03,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:38:03,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 8: [2022-11-26 08:38:03,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 13: [2022-11-26 08:38:03,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 08:38:03,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 11: [2022-11-26 08:38:03,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:38:03,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
8: [2022-11-26 08:38:03,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:38:03,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 13: [2022-11-26 08:38:03,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 11: [2022-11-26 08:38:03,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:38:03,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 13: [2022-11-26 08:38:03,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:38:03,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 08:38:03,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 8: [2022-11-26 08:38:03,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 13: [2022-11-26 08:38:03,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 11: [2022-11-26 08:38:03,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 8: [2022-11-26 08:38:03,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
13: [2022-11-26 08:38:03,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 11: [2022-11-26 08:38:03,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 11: [2022-11-26 08:38:03,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 8: [2022-11-26 08:38:03,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 13: [2022-11-26 08:38:03,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:38:03,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 8: [2022-11-26 08:38:03,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 13: [2022-11-26 08:38:03,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 8: [2022-11-26 08:38:03,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:38:03,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 8: [2022-11-26 08:38:03,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 13: [2022-11-26 08:38:03,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:38:03,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
13: [2022-11-26 08:38:03,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 08:38:03,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 13: [2022-11-26 08:38:03,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:38:03,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 08:38:03,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 12: [2022-11-26 08:38:03,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:38:03,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 08:38:03,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 12: [2022-11-26 08:38:03,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:38:03,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 08:38:03,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 2: [2022-11-26 08:38:03,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 08:38:03,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 08:38:03,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 2: [2022-11-26 08:38:03,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:38:03,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 08:38:03,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 0: [2022-11-26 08:38:03,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 08:38:03,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 14: [2022-11-26 08:38:03,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:38:03,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:38:03,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:38:03,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:38:03,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 08:38:03,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:38:03,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:38:03,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:38:03,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 08:38:03,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 08:38:03,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 08:38:03,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 08:38:03,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 08:38:03,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 08:38:03,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 08:38:03,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
14: [2022-11-26 08:38:03,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 08:38:03,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 14: [2022-11-26 08:38:03,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 14: [2022-11-26 08:38:03,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 14: [2022-11-26 08:38:03,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 14: [2022-11-26 08:38:03,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 14: [2022-11-26 08:38:03,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 14: [2022-11-26 08:38:03,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 4: [2022-11-26 08:38:03,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:38:03,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:38:03,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:38:03,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 08:38:03,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 08:38:03,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 08:38:03,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 08:38:03,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 4: [2022-11-26 08:38:03,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 08:38:03,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 4: [2022-11-26 08:38:03,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 4: [2022-11-26 08:38:03,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 4: [2022-11-26 08:38:03,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:38:03,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 08:38:03,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 4: [2022-11-26 08:38:03,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 08:38:03,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 08:38:03,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 4: [2022-11-26 08:38:03,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:38:03,216] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 08:38:03,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 4: [2022-11-26 08:38:03,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:38:03,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 08:38:03,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 6: [2022-11-26 08:38:03,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:38:03,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:38:03,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:38:03,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 08:38:03,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:38:03,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:38:03,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:38:03,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 08:38:03,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 08:38:03,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 08:38:03,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:38:03,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 08:38:03,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 08:38:03,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
6: [2022-11-26 08:38:03,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 08:38:03,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 08:38:03,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 6: [2022-11-26 08:38:03,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 6: [2022-11-26 08:38:03,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 08:38:03,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 6: [2022-11-26 08:38:03,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 6: [2022-11-26 08:38:03,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 6: [2022-11-26 08:38:03,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 6: [2022-11-26 08:38:03,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 7: [2022-11-26 08:38:03,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:38:03,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:38:03,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 08:38:03,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:38:03,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 08:38:03,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 08:38:03,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:38:03,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 08:38:03,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 08:38:03,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 7: [2022-11-26 08:38:03,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 7: [2022-11-26 08:38:03,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 08:38:03,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 7: [2022-11-26 08:38:03,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 7: [2022-11-26 08:38:03,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
7: [2022-11-26 08:38:03,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:38:03,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:38:03,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 08:38:03,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 08:38:03,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 7: [2022-11-26 08:38:03,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 7: [2022-11-26 08:38:03,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:38:03,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 08:38:03,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 5: [2022-11-26 08:38:03,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:38:03,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:38:03,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 08:38:03,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:38:03,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 08:38:03,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 08:38:03,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 08:38:03,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 5: [2022-11-26 08:38:03,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 08:38:03,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 5: [2022-11-26 08:38:03,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 5: [2022-11-26 08:38:03,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 5: [2022-11-26 08:38:03,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:38:03,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:38:03,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 08:38:03,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:38:03,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 08:38:03,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 08:38:03,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 08:38:03,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step43000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 08:38:03,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 5: [2022-11-26 08:38:03,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 5: [2022-11-26 08:38:03,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 5: [2022-11-26 08:38:03,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step43000 is ready now! 
0: successfully saved checkpoint at iteration 43000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3697.20 15: iteration 43010/ 125429 | consumed samples: 11010560 | consumed tokens: 22549626880 | elapsed time per iteration (s): 1.42 | learning rate: 1.543E-04 | global batch size: 256 | lm loss: 2.059174E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 180.889 | TFLOPs: 29.89 | 15: iteration 43020/ 125429 | consumed samples: 11013120 | consumed tokens: 22554869760 | elapsed time per iteration (s): 1.03 | learning rate: 1.543E-04 | global batch size: 256 | lm loss: 2.060857E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.441 | TFLOPs: 41.22 | 15: iteration 43030/ 125429 | consumed samples: 11015680 | consumed tokens: 22560112640 | elapsed time per iteration (s): 1.03 | learning rate: 1.542E-04 | global batch size: 256 | lm loss: 2.026449E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.595 | TFLOPs: 41.08 | 15: iteration 43040/ 125429 | consumed samples: 11018240 | consumed tokens: 22565355520 | elapsed time per iteration (s): 1.04 | learning rate: 1.542E-04 | global batch size: 256 | lm loss: 2.041566E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.900 | TFLOPs: 40.64 | 15: iteration 43050/ 125429 | consumed samples: 11020800 | consumed tokens: 22570598400 | elapsed time per iteration (s): 1.07 | learning rate: 1.542E-04 | global batch size: 256 | lm loss: 2.043829E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.207 | TFLOPs: 39.37 | 15: iteration 43060/ 125429 | consumed samples: 11023360 | consumed tokens: 22575841280 | elapsed time per iteration (s): 1.04 | 
learning rate: 1.542E-04 | global batch size: 256 | lm loss: 2.043952E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.928 | TFLOPs: 40.81 | 15: iteration 43070/ 125429 | consumed samples: 11025920 | consumed tokens: 22581084160 | elapsed time per iteration (s): 1.04 | learning rate: 1.542E-04 | global batch size: 256 | lm loss: 2.044586E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.294 | TFLOPs: 40.70 | 15: iteration 43080/ 125429 | consumed samples: 11028480 | consumed tokens: 22586327040 | elapsed time per iteration (s): 1.03 | learning rate: 1.541E-04 | global batch size: 256 | lm loss: 2.046630E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.542 | TFLOPs: 41.24 | 15: iteration 43090/ 125429 | consumed samples: 11031040 | consumed tokens: 22591569920 | elapsed time per iteration (s): 1.03 | learning rate: 1.541E-04 | global batch size: 256 | lm loss: 2.049555E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.543 | TFLOPs: 41.07 | 15: iteration 43100/ 125429 | consumed samples: 11033600 | consumed tokens: 22596812800 | elapsed time per iteration (s): 1.06 | learning rate: 1.541E-04 | global batch size: 256 | lm loss: 2.050370E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.974 | TFLOPs: 39.82 | 15: iteration 43110/ 125429 | consumed samples: 11036160 | consumed tokens: 22602055680 | elapsed time per iteration (s): 1.06 | learning rate: 1.541E-04 | global batch size: 256 | lm loss: 2.057464E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.039 | TFLOPs: 40.00 | 15: iteration 43120/ 
125429 | consumed samples: 11038720 | consumed tokens: 22607298560 | elapsed time per iteration (s): 1.07 | learning rate: 1.541E-04 | global batch size: 256 | lm loss: 2.028648E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.137 | TFLOPs: 39.68 | 15: iteration 43130/ 125429 | consumed samples: 11041280 | consumed tokens: 22612541440 | elapsed time per iteration (s): 1.05 | learning rate: 1.540E-04 | global batch size: 256 | lm loss: 2.022048E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.486 | TFLOPs: 40.40 | 15: iteration 43140/ 125429 | consumed samples: 11043840 | consumed tokens: 22617784320 | elapsed time per iteration (s): 1.05 | learning rate: 1.540E-04 | global batch size: 256 | lm loss: 2.026971E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.527 | TFLOPs: 40.24 | 15: iteration 43150/ 125429 | consumed samples: 11046400 | consumed tokens: 22623027200 | elapsed time per iteration (s): 1.03 | learning rate: 1.540E-04 | global batch size: 256 | lm loss: 2.033363E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.375 | TFLOPs: 41.05 | 15: iteration 43160/ 125429 | consumed samples: 11048960 | consumed tokens: 22628270080 | elapsed time per iteration (s): 1.05 | learning rate: 1.540E-04 | global batch size: 256 | lm loss: 2.135266E+00 | grad norm: 8.017 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.862 | TFLOPs: 40.13 | 15: iteration 43170/ 125429 | consumed samples: 11051520 | consumed tokens: 22633512960 | elapsed time per iteration (s): 1.05 | learning rate: 1.540E-04 | global batch size: 256 | lm loss: 2.096643E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 242.787 | TFLOPs: 40.12 | 15: iteration 43180/ 125429 | consumed samples: 11054080 | consumed tokens: 22638755840 | elapsed time per iteration (s): 1.03 | learning rate: 1.539E-04 | global batch size: 256 | lm loss: 2.079816E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.530 | TFLOPs: 41.07 | 15: iteration 43190/ 125429 | consumed samples: 11056640 | consumed tokens: 22643998720 | elapsed time per iteration (s): 1.04 | learning rate: 1.539E-04 | global batch size: 256 | lm loss: 2.064534E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.894 | TFLOPs: 40.64 | 15: iteration 43200/ 125429 | consumed samples: 11059200 | consumed tokens: 22649241600 | elapsed time per iteration (s): 1.05 | learning rate: 1.539E-04 | global batch size: 256 | lm loss: 2.028067E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.842 | TFLOPs: 40.13 | 15: iteration 43210/ 125429 | consumed samples: 11061760 | consumed tokens: 22654484480 | elapsed time per iteration (s): 1.04 | learning rate: 1.539E-04 | global batch size: 256 | lm loss: 2.077541E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.743 | TFLOPs: 40.61 | 15: iteration 43220/ 125429 | consumed samples: 11064320 | consumed tokens: 22659727360 | elapsed time per iteration (s): 1.05 | learning rate: 1.539E-04 | global batch size: 256 | lm loss: 2.047356E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.540 | TFLOPs: 40.25 | 15: iteration 43230/ 125429 | consumed samples: 11066880 | consumed tokens: 22664970240 | elapsed time per iteration (s): 1.05 | learning rate: 
1.538E-04 | global batch size: 256 | lm loss: 2.053934E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.014 | TFLOPs: 40.16 | 15: iteration 43240/ 125429 | consumed samples: 11069440 | consumed tokens: 22670213120 | elapsed time per iteration (s): 1.06 | learning rate: 1.538E-04 | global batch size: 256 | lm loss: 2.060386E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.288 | TFLOPs: 39.87 | 15: iteration 43250/ 125429 | consumed samples: 11072000 | consumed tokens: 22675456000 | elapsed time per iteration (s): 1.05 | learning rate: 1.538E-04 | global batch size: 256 | lm loss: 2.047415E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.975 | TFLOPs: 40.48 | 15: iteration 43260/ 125429 | consumed samples: 11074560 | consumed tokens: 22680698880 | elapsed time per iteration (s): 1.05 | learning rate: 1.538E-04 | global batch size: 256 | lm loss: 2.038014E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.936 | TFLOPs: 40.48 | 15: iteration 43270/ 125429 | consumed samples: 11077120 | consumed tokens: 22685941760 | elapsed time per iteration (s): 1.15 | learning rate: 1.538E-04 | global batch size: 256 | lm loss: 2.032066E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.747 | TFLOPs: 36.81 | 15: iteration 43280/ 125429 | consumed samples: 11079680 | consumed tokens: 22691184640 | elapsed time per iteration (s): 1.02 | learning rate: 1.537E-04 | global batch size: 256 | lm loss: 2.047596E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.714 | TFLOPs: 41.60 | 15: iteration 43290/ 125429 | 
consumed samples: 11082240 | consumed tokens: 22696427520 | elapsed time per iteration (s): 1.04 | learning rate: 1.537E-04 | global batch size: 256 | lm loss: 2.038934E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.104 | TFLOPs: 40.67 | 15: iteration 43300/ 125429 | consumed samples: 11084800 | consumed tokens: 22701670400 | elapsed time per iteration (s): 1.04 | learning rate: 1.537E-04 | global batch size: 256 | lm loss: 2.017712E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.508 | TFLOPs: 40.74 | 15: iteration 43310/ 125429 | consumed samples: 11087360 | consumed tokens: 22706913280 | elapsed time per iteration (s): 1.03 | learning rate: 1.537E-04 | global batch size: 256 | lm loss: 2.041617E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.358 | TFLOPs: 41.04 | 15: iteration 43320/ 125429 | consumed samples: 11089920 | consumed tokens: 22712156160 | elapsed time per iteration (s): 1.03 | learning rate: 1.537E-04 | global batch size: 256 | lm loss: 2.032418E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.891 | TFLOPs: 41.13 | 15: iteration 43330/ 125429 | consumed samples: 11092480 | consumed tokens: 22717399040 | elapsed time per iteration (s): 1.04 | learning rate: 1.536E-04 | global batch size: 256 | lm loss: 2.017845E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.324 | TFLOPs: 40.87 | 15: iteration 43340/ 125429 | consumed samples: 11095040 | consumed tokens: 22722641920 | elapsed time per iteration (s): 1.05 | learning rate: 1.536E-04 | global batch size: 256 | lm loss: 2.020107E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 242.953 | TFLOPs: 40.15 | 15: iteration 43350/ 125429 | consumed samples: 11097600 | consumed tokens: 22727884800 | elapsed time per iteration (s): 1.03 | learning rate: 1.536E-04 | global batch size: 256 | lm loss: 2.024777E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.075 | TFLOPs: 41.00 | 15: iteration 43360/ 125429 | consumed samples: 11100160 | consumed tokens: 22733127680 | elapsed time per iteration (s): 1.07 | learning rate: 1.536E-04 | global batch size: 256 | lm loss: 2.050718E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.024 | TFLOPs: 39.67 | 15: iteration 43370/ 125429 | consumed samples: 11102720 | consumed tokens: 22738370560 | elapsed time per iteration (s): 1.05 | learning rate: 1.536E-04 | global batch size: 256 | lm loss: 2.034803E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.271 | TFLOPs: 40.37 | 15: iteration 43380/ 125429 | consumed samples: 11105280 | consumed tokens: 22743613440 | elapsed time per iteration (s): 1.08 | learning rate: 1.535E-04 | global batch size: 256 | lm loss: 2.049272E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.027 | TFLOPs: 39.01 | 15: iteration 43390/ 125429 | consumed samples: 11107840 | consumed tokens: 22748856320 | elapsed time per iteration (s): 1.05 | learning rate: 1.535E-04 | global batch size: 256 | lm loss: 2.039767E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.179 | TFLOPs: 40.35 | 15: iteration 43400/ 125429 | consumed samples: 11110400 | consumed tokens: 22754099200 | elapsed time per iteration (s): 1.09 | learning rate: 1.535E-04 | global batch 
size: 256 | lm loss: 2.053082E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.393 | TFLOPs: 38.74 | 15: iteration 43410/ 125429 | consumed samples: 11112960 | consumed tokens: 22759342080 | elapsed time per iteration (s): 1.07 | learning rate: 1.535E-04 | global batch size: 256 | lm loss: 2.048623E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.354 | TFLOPs: 39.72 | 15: iteration 43420/ 125429 | consumed samples: 11115520 | consumed tokens: 22764584960 | elapsed time per iteration (s): 1.05 | learning rate: 1.535E-04 | global batch size: 256 | lm loss: 2.031409E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.729 | TFLOPs: 40.44 | 15: iteration 43430/ 125429 | consumed samples: 11118080 | consumed tokens: 22769827840 | elapsed time per iteration (s): 1.09 | learning rate: 1.534E-04 | global batch size: 256 | lm loss: 2.034244E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.230 | TFLOPs: 38.87 | 15: iteration 43440/ 125429 | consumed samples: 11120640 | consumed tokens: 22775070720 | elapsed time per iteration (s): 1.04 | learning rate: 1.534E-04 | global batch size: 256 | lm loss: 1.999562E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.242 | TFLOPs: 40.53 | 15: iteration 43450/ 125429 | consumed samples: 11123200 | consumed tokens: 22780313600 | elapsed time per iteration (s): 1.05 | learning rate: 1.534E-04 | global batch size: 256 | lm loss: 2.018539E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.958 | TFLOPs: 40.48 | 15: iteration 43460/ 125429 | consumed samples: 11125760 | 
consumed tokens: 22785556480 | elapsed time per iteration (s): 1.07 | learning rate: 1.534E-04 | global batch size: 256 | lm loss: 2.062787E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.776 | TFLOPs: 39.62 | 15: iteration 43470/ 125429 | consumed samples: 11128320 | consumed tokens: 22790799360 | elapsed time per iteration (s): 1.15 | learning rate: 1.534E-04 | global batch size: 256 | lm loss: 2.060856E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.053 | TFLOPs: 36.70 | 15: iteration 43480/ 125429 | consumed samples: 11130880 | consumed tokens: 22796042240 | elapsed time per iteration (s): 1.07 | learning rate: 1.533E-04 | global batch size: 256 | lm loss: 2.044891E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.865 | TFLOPs: 39.64 | 15: iteration 43490/ 125429 | consumed samples: 11133440 | consumed tokens: 22801285120 | elapsed time per iteration (s): 1.04 | learning rate: 1.533E-04 | global batch size: 256 | lm loss: 2.095219E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.246 | TFLOPs: 40.69 | 15: iteration 43500/ 125429 | consumed samples: 11136000 | consumed tokens: 22806528000 | elapsed time per iteration (s): 1.05 | learning rate: 1.533E-04 | global batch size: 256 | lm loss: 2.045474E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.659 | TFLOPs: 40.27 | 15: iteration 43510/ 125429 | consumed samples: 11138560 | consumed tokens: 22811770880 | elapsed time per iteration (s): 1.03 | learning rate: 1.533E-04 | global batch size: 256 | lm loss: 2.006820E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 248.711 | TFLOPs: 41.10 | 15: iteration 43520/ 125429 | consumed samples: 11141120 | consumed tokens: 22817013760 | elapsed time per iteration (s): 1.07 | learning rate: 1.533E-04 | global batch size: 256 | lm loss: 2.033268E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.699 | TFLOPs: 39.45 | 15: iteration 43530/ 125429 | consumed samples: 11143680 | consumed tokens: 22822256640 | elapsed time per iteration (s): 1.05 | learning rate: 1.532E-04 | global batch size: 256 | lm loss: 2.048499E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.597 | TFLOPs: 40.42 | 15: iteration 43540/ 125429 | consumed samples: 11146240 | consumed tokens: 22827499520 | elapsed time per iteration (s): 1.05 | learning rate: 1.532E-04 | global batch size: 256 | lm loss: 2.060937E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.604 | TFLOPs: 40.26 | 15: iteration 43550/ 125429 | consumed samples: 11148800 | consumed tokens: 22832742400 | elapsed time per iteration (s): 1.04 | learning rate: 1.532E-04 | global batch size: 256 | lm loss: 2.010976E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.734 | TFLOPs: 40.77 | 15: iteration 43560/ 125429 | consumed samples: 11151360 | consumed tokens: 22837985280 | elapsed time per iteration (s): 1.03 | learning rate: 1.532E-04 | global batch size: 256 | lm loss: 2.030642E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.982 | TFLOPs: 41.15 | 15: iteration 43570/ 125429 | consumed samples: 11153920 | consumed tokens: 22843228160 | elapsed time per iteration (s): 1.06 | learning rate: 1.532E-04 | global batch size: 256 | lm loss: 
2.018088E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.549 | TFLOPs: 39.75 | 15: iteration 43580/ 125429 | consumed samples: 11156480 | consumed tokens: 22848471040 | elapsed time per iteration (s): 1.03 | learning rate: 1.531E-04 | global batch size: 256 | lm loss: 2.057050E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.616 | TFLOPs: 41.09 | 15: iteration 43590/ 125429 | consumed samples: 11159040 | consumed tokens: 22853713920 | elapsed time per iteration (s): 1.02 | learning rate: 1.531E-04 | global batch size: 256 | lm loss: 2.043042E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.152 | TFLOPs: 41.34 | 15: iteration 43600/ 125429 | consumed samples: 11161600 | consumed tokens: 22858956800 | elapsed time per iteration (s): 1.04 | learning rate: 1.531E-04 | global batch size: 256 | lm loss: 2.024093E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.283 | TFLOPs: 40.53 | 15: iteration 43610/ 125429 | consumed samples: 11164160 | consumed tokens: 22864199680 | elapsed time per iteration (s): 1.03 | learning rate: 1.531E-04 | global batch size: 256 | lm loss: 2.007812E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.595 | TFLOPs: 41.08 | 15: iteration 43620/ 125429 | consumed samples: 11166720 | consumed tokens: 22869442560 | elapsed time per iteration (s): 1.06 | learning rate: 1.531E-04 | global batch size: 256 | lm loss: 2.034892E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.024 | TFLOPs: 39.83 | 15: iteration 43630/ 125429 | consumed samples: 11169280 | consumed tokens: 
22874685440 | elapsed time per iteration (s): 1.03 | learning rate: 1.530E-04 | global batch size: 256 | lm loss: 2.035040E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.767 | TFLOPs: 41.11 | 15: iteration 43640/ 125429 | consumed samples: 11171840 | consumed tokens: 22879928320 | elapsed time per iteration (s): 1.07 | learning rate: 1.530E-04 | global batch size: 256 | lm loss: 2.049395E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.278 | TFLOPs: 39.38 | 15: iteration 43650/ 125429 | consumed samples: 11174400 | consumed tokens: 22885171200 | elapsed time per iteration (s): 1.05 | learning rate: 1.530E-04 | global batch size: 256 | lm loss: 2.048509E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.509 | TFLOPs: 40.24 | 15: iteration 43660/ 125429 | consumed samples: 11176960 | consumed tokens: 22890414080 | elapsed time per iteration (s): 1.09 | learning rate: 1.530E-04 | global batch size: 256 | lm loss: 2.022223E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.613 | TFLOPs: 38.77 | 15: iteration 43670/ 125429 | consumed samples: 11179520 | consumed tokens: 22895656960 | elapsed time per iteration (s): 1.07 | learning rate: 1.530E-04 | global batch size: 256 | lm loss: 2.038336E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.025 | TFLOPs: 39.50 | 15: iteration 43680/ 125429 | consumed samples: 11182080 | consumed tokens: 22900899840 | elapsed time per iteration (s): 1.06 | learning rate: 1.529E-04 | global batch size: 256 | lm loss: 2.031326E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 241.074 | TFLOPs: 39.84 | 15: iteration 43690/ 125429 | consumed samples: 11184640 | consumed tokens: 22906142720 | elapsed time per iteration (s): 1.07 | learning rate: 1.529E-04 | global batch size: 256 | lm loss: 2.069458E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.062 | TFLOPs: 39.67 | 15: iteration 43700/ 125429 | consumed samples: 11187200 | consumed tokens: 22911385600 | elapsed time per iteration (s): 1.07 | learning rate: 1.529E-04 | global batch size: 256 | lm loss: 2.061373E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.260 | TFLOPs: 39.70 | 15: iteration 43710/ 125429 | consumed samples: 11189760 | consumed tokens: 22916628480 | elapsed time per iteration (s): 1.03 | learning rate: 1.529E-04 | global batch size: 256 | lm loss: 2.070466E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.380 | TFLOPs: 41.21 | 15: iteration 43720/ 125429 | consumed samples: 11192320 | consumed tokens: 22921871360 | elapsed time per iteration (s): 1.08 | learning rate: 1.529E-04 | global batch size: 256 | lm loss: 2.036217E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.463 | TFLOPs: 39.24 | 15: iteration 43730/ 125429 | consumed samples: 11194880 | consumed tokens: 22927114240 | elapsed time per iteration (s): 1.06 | learning rate: 1.528E-04 | global batch size: 256 | lm loss: 2.035455E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.100 | TFLOPs: 39.84 | 15: iteration 43740/ 125429 | consumed samples: 11197440 | consumed tokens: 22932357120 | elapsed time per iteration (s): 1.03 | learning rate: 1.528E-04 | global batch size: 256 | lm loss: 2.032781E+00 | grad 
norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.784 | TFLOPs: 40.95 | 15: iteration 43750/ 125429 | consumed samples: 11200000 | consumed tokens: 22937600000 | elapsed time per iteration (s): 1.02 | learning rate: 1.528E-04 | global batch size: 256 | lm loss: 2.031995E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.720 | TFLOPs: 41.43 | 15: iteration 43760/ 125429 | consumed samples: 11202560 | consumed tokens: 22942842880 | elapsed time per iteration (s): 1.03 | learning rate: 1.528E-04 | global batch size: 256 | lm loss: 2.051183E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.236 | TFLOPs: 41.02 | 15: iteration 43770/ 125429 | consumed samples: 11205120 | consumed tokens: 22948085760 | elapsed time per iteration (s): 1.06 | learning rate: 1.528E-04 | global batch size: 256 | lm loss: 2.034593E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.487 | TFLOPs: 40.07 | 15: iteration 43780/ 125429 | consumed samples: 11207680 | consumed tokens: 22953328640 | elapsed time per iteration (s): 1.03 | learning rate: 1.527E-04 | global batch size: 256 | lm loss: 2.040108E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.679 | TFLOPs: 40.93 | 15: iteration 43790/ 125429 | consumed samples: 11210240 | consumed tokens: 22958571520 | elapsed time per iteration (s): 1.06 | learning rate: 1.527E-04 | global batch size: 256 | lm loss: 2.030982E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.115 | TFLOPs: 40.01 | 15: iteration 43800/ 125429 | consumed samples: 11212800 | consumed tokens: 22963814400 | elapsed time 
per iteration (s): 1.08 | learning rate: 1.527E-04 | global batch size: 256 | lm loss: 2.050334E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.339 | TFLOPs: 39.22 | 15: iteration 43810/ 125429 | consumed samples: 11215360 | consumed tokens: 22969057280 | elapsed time per iteration (s): 1.05 | learning rate: 1.527E-04 | global batch size: 256 | lm loss: 2.067545E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.884 | TFLOPs: 40.14 | 15: iteration 43820/ 125429 | consumed samples: 11217920 | consumed tokens: 22974300160 | elapsed time per iteration (s): 1.05 | learning rate: 1.527E-04 | global batch size: 256 | lm loss: 2.044774E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.295 | TFLOPs: 40.37 | 15: iteration 43830/ 125429 | consumed samples: 11220480 | consumed tokens: 22979543040 | elapsed time per iteration (s): 1.04 | learning rate: 1.526E-04 | global batch size: 256 | lm loss: 2.051781E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.987 | TFLOPs: 40.49 | 15: iteration 43840/ 125429 | consumed samples: 11223040 | consumed tokens: 22984785920 | elapsed time per iteration (s): 1.06 | learning rate: 1.526E-04 | global batch size: 256 | lm loss: 2.026528E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.226 | TFLOPs: 40.03 | 15: iteration 43850/ 125429 | consumed samples: 11225600 | consumed tokens: 22990028800 | elapsed time per iteration (s): 1.04 | learning rate: 1.526E-04 | global batch size: 256 | lm loss: 2.042638E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.055 | TFLOPs: 
40.83 | 15: iteration 43860/ 125429 | consumed samples: 11228160 | consumed tokens: 22995271680 | elapsed time per iteration (s): 1.09 | learning rate: 1.526E-04 | global batch size: 256 | lm loss: 2.027981E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.906 | TFLOPs: 38.99 | 15: iteration 43870/ 125429 | consumed samples: 11230720 | consumed tokens: 23000514560 | elapsed time per iteration (s): 1.03 | learning rate: 1.526E-04 | global batch size: 256 | lm loss: 2.017666E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.573 | TFLOPs: 41.24 | 15: iteration 43880/ 125429 | consumed samples: 11233280 | consumed tokens: 23005757440 | elapsed time per iteration (s): 1.03 | learning rate: 1.525E-04 | global batch size: 256 | lm loss: 2.024623E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.790 | TFLOPs: 41.11 | 15: iteration 43890/ 125429 | consumed samples: 11235840 | consumed tokens: 23011000320 | elapsed time per iteration (s): 1.03 | learning rate: 1.525E-04 | global batch size: 256 | lm loss: 2.053880E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.677 | TFLOPs: 41.26 | 15: iteration 43900/ 125429 | consumed samples: 11238400 | consumed tokens: 23016243200 | elapsed time per iteration (s): 1.05 | learning rate: 1.525E-04 | global batch size: 256 | lm loss: 2.017431E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.646 | TFLOPs: 40.43 | 15: iteration 43910/ 125429 | consumed samples: 11240960 | consumed tokens: 23021486080 | elapsed time per iteration (s): 1.04 | learning rate: 1.525E-04 | global batch size: 256 | lm loss: 2.009615E+00 | grad norm: 0.135 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.993 | TFLOPs: 40.65 | 15: iteration 43920/ 125429 | consumed samples: 11243520 | consumed tokens: 23026728960 | elapsed time per iteration (s): 1.04 | learning rate: 1.525E-04 | global batch size: 256 | lm loss: 2.040950E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.127 | TFLOPs: 40.84 | 15: iteration 43930/ 125429 | consumed samples: 11246080 | consumed tokens: 23031971840 | elapsed time per iteration (s): 1.05 | learning rate: 1.524E-04 | global batch size: 256 | lm loss: 2.062743E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.423 | TFLOPs: 40.23 | 15: iteration 43940/ 125429 | consumed samples: 11248640 | consumed tokens: 23037214720 | elapsed time per iteration (s): 1.07 | learning rate: 1.524E-04 | global batch size: 256 | lm loss: 2.055505E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.242 | TFLOPs: 39.70 | 15: iteration 43950/ 125429 | consumed samples: 11251200 | consumed tokens: 23042457600 | elapsed time per iteration (s): 1.08 | learning rate: 1.524E-04 | global batch size: 256 | lm loss: 2.041518E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.894 | TFLOPs: 39.15 | 15: iteration 43960/ 125429 | consumed samples: 11253760 | consumed tokens: 23047700480 | elapsed time per iteration (s): 1.05 | learning rate: 1.524E-04 | global batch size: 256 | lm loss: 2.061927E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.503 | TFLOPs: 40.24 | 15: iteration 43970/ 125429 | consumed samples: 11256320 | consumed tokens: 23052943360 | elapsed time per iteration (s): 1.08 | 
learning rate: 1.524E-04 | global batch size: 256 | lm loss: 2.029967E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.465 | TFLOPs: 39.24 | 15: iteration 43980/ 125429 | consumed samples: 11258880 | consumed tokens: 23058186240 | elapsed time per iteration (s): 1.03 | learning rate: 1.523E-04 | global batch size: 256 | lm loss: 2.031615E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.744 | TFLOPs: 40.94 | 15: iteration 43990/ 125429 | consumed samples: 11261440 | consumed tokens: 23063429120 | elapsed time per iteration (s): 1.07 | learning rate: 1.523E-04 | global batch size: 256 | lm loss: 2.026994E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.283 | TFLOPs: 39.54 | 0: [2022-11-26 08:55:35,158] [INFO] [logging.py:68:log_dist] [Rank 0] step=44000, skipped=0, lr=[0.00015230389650588115, 0.00015230389650588115, 0.00015230389650588115], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 44000/ 125429 | consumed samples: 11264000 | consumed tokens: 23068672000 | elapsed time per iteration (s): 1.12 | learning rate: 1.523E-04 | global batch size: 256 | lm loss: 2.013952E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.108 | TFLOPs: 37.70 | 0: steps: 44000 loss: 1.9294 iter time (s): 1.050 samples/sec: 243.849 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 44000 | lm loss value: 1.959033E+00 | lm loss PPL: 7.092465E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 44000 to checkpoints_1b5 0: [2022-11-26 08:55:35,540] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step44000 is begin to save! 0: [2022-11-26 08:55:35,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_01-model_00-model_states.pt... 0: [2022-11-26 08:55:35,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_01-model_00-model_states.pt. 0: [2022-11-26 08:55:35,781] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_03-model_00-model_states.pt... 0: [2022-11-26 08:55:35,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_03-model_00-model_states.pt. 0: [2022-11-26 08:55:35,879] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_04-model_00-model_states.pt... 0: [2022-11-26 08:55:35,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_04-model_00-model_states.pt. 0: [2022-11-26 08:55:35,983] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_05-model_00-model_states.pt... 0: [2022-11-26 08:55:36,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_05-model_00-model_states.pt. 0: [2022-11-26 08:55:36,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_06-model_00-model_states.pt... 0: [2022-11-26 08:55:36,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_06-model_00-model_states.pt. 0: [2022-11-26 08:55:36,189] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_07-model_00-model_states.pt... 0: [2022-11-26 08:55:36,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 08:55:36,293] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_08-model_00-model_states.pt... 0: [2022-11-26 08:55:36,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_08-model_00-model_states.pt. 0: [2022-11-26 08:55:36,398] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_09-model_00-model_states.pt... 0: [2022-11-26 08:55:36,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_09-model_00-model_states.pt. 0: [2022-11-26 08:55:36,504] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_10-model_00-model_states.pt... 0: [2022-11-26 08:55:36,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_10-model_00-model_states.pt. 0: [2022-11-26 08:55:36,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_11-model_00-model_states.pt... 0: [2022-11-26 08:55:36,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_11-model_00-model_states.pt. 0: [2022-11-26 08:55:36,722] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_12-model_00-model_states.pt... 0: [2022-11-26 08:55:36,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_12-model_00-model_states.pt. 0: [2022-11-26 08:55:36,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_13-model_00-model_states.pt... 0: [2022-11-26 08:55:36,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 08:55:36,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_14-model_00-model_states.pt... 0: [2022-11-26 08:55:37,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_14-model_00-model_states.pt. 0: [2022-11-26 08:55:37,048] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_15-model_00-model_states.pt... 0: [2022-11-26 08:55:37,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_15-model_00-model_states.pt. 0: [2022-11-26 08:55:37,156] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_16-model_00-model_states.pt... 0: [2022-11-26 08:55:37,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_16-model_00-model_states.pt. 0: [2022-11-26 08:55:37,268] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_17-model_00-model_states.pt... 0: [2022-11-26 08:55:37,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_17-model_00-model_states.pt. 0: [2022-11-26 08:55:37,373] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_18-model_00-model_states.pt... 0: [2022-11-26 08:55:37,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_18-model_00-model_states.pt. 0: [2022-11-26 08:55:37,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_19-model_00-model_states.pt... 0: [2022-11-26 08:55:37,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 08:55:37,594] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_20-model_00-model_states.pt... 0: [2022-11-26 08:55:37,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_20-model_00-model_states.pt. 0: [2022-11-26 08:55:37,702] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_21-model_00-model_states.pt... 0: [2022-11-26 08:55:37,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_21-model_00-model_states.pt. 0: [2022-11-26 08:55:37,810] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_22-model_00-model_states.pt... 0: [2022-11-26 08:55:37,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_22-model_00-model_states.pt. 0: [2022-11-26 08:55:37,919] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_23-model_00-model_states.pt... 0: [2022-11-26 08:55:38,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_23-model_00-model_states.pt. 0: [2022-11-26 08:55:38,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_24-model_00-model_states.pt... 0: [2022-11-26 08:55:38,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_24-model_00-model_states.pt. 0: [2022-11-26 08:55:38,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_25-model_00-model_states.pt... 0: [2022-11-26 08:55:38,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 08:55:38,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_26-model_00-model_states.pt... 0: [2022-11-26 08:55:38,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_26-model_00-model_states.pt. 0: [2022-11-26 08:55:38,354] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_27-model_00-model_states.pt... 0: [2022-11-26 08:55:38,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_27-model_00-model_states.pt. 0: [2022-11-26 08:55:38,462] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_28-model_00-model_states.pt... 0: [2022-11-26 08:55:38,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_28-model_00-model_states.pt. 0: [2022-11-26 08:55:38,568] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_29-model_00-model_states.pt... 0: [2022-11-26 08:55:38,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_29-model_00-model_states.pt. 0: [2022-11-26 08:55:38,676] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_30-model_00-model_states.pt... 0: [2022-11-26 08:55:38,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_30-model_00-model_states.pt. 0: [2022-11-26 08:55:38,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/layer_32-model_00-model_states.pt... 0: [2022-11-26 08:55:38,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 08:55:38,786] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step44000/mp_rank_00_model_states.pt 0: [2022-11-26 08:55:38,786] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/mp_rank_00_model_states.pt... 0: [2022-11-26 08:55:38,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/mp_rank_00_model_states.pt. 0: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
10: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
8: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
15: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
1: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 7: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
6: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 1: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 4: [2022-11-26 08:55:38,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step44000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 10: [2022-11-26 08:55:38,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 08:55:38,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 08:55:38,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 2: [2022-11-26 08:55:38,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:55:38,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 08:55:38,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 1: [2022-11-26 08:55:38,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:55:38,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 08:55:38,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 0: [2022-11-26 08:55:38,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:55:38,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 08:55:38,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 15: [2022-11-26 08:55:38,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-26 08:55:38,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 08:55:38,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 3: [2022-11-26 08:55:38,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:55:38,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 08:55:38,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 13: [2022-11-26 08:55:38,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:55:38,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 08:55:38,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 8: [2022-11-26 08:55:38,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:55:38,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 08:55:38,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 13: [2022-11-26 08:55:38,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 08:55:38,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 15: [2022-11-26 08:55:38,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:55:38,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 14: [2022-11-26 08:55:38,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:55:38,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 08:55:38,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 4: [2022-11-26 08:55:38,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:55:38,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 08:55:38,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 15: [2022-11-26 08:55:38,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 08:55:38,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 14: [2022-11-26 08:55:38,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 08:55:38,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 15: [2022-11-26 08:55:38,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:55:38,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 15: [2022-11-26 08:55:38,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 08:55:38,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 10: [2022-11-26 08:55:38,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:55:38,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 8: [2022-11-26 08:55:38,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:55:38,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 8: [2022-11-26 08:55:38,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 08:55:38,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 15: [2022-11-26 08:55:38,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 08:55:39,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 08:55:39,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 5: [2022-11-26 08:55:39,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:55:39,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 08:55:39,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 8: [2022-11-26 08:55:39,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:55:39,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:55:39,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 6: [2022-11-26 08:55:39,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 8: [2022-11-26 08:55:39,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 0: [2022-11-26 08:55:39,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:55:39,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
4: [2022-11-26 08:55:39,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:55:39,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 8: [2022-11-26 08:55:39,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:55:39,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:55:39,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 0: [2022-11-26 08:55:39,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 1: [2022-11-26 08:55:39,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 4: [2022-11-26 08:55:39,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 8: [2022-11-26 08:55:39,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 1: [2022-11-26 08:55:39,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 8: [2022-11-26 08:55:39,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 6: [2022-11-26 08:55:39,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 08:55:39,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 08:55:39,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 14: [2022-11-26 08:55:39,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:55:39,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 08:55:39,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 0: [2022-11-26 08:55:39,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:55:39,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 08:55:39,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 2: [2022-11-26 08:55:39,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:55:39,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 08:55:39,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 08:55:39,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 08:55:39,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 2: [2022-11-26 08:55:39,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 14: [2022-11-26 08:55:39,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:55:39,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:55:39,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 13: [2022-11-26 08:55:39,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 14: [2022-11-26 08:55:39,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 8: [2022-11-26 08:55:39,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:55:39,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
8: [2022-11-26 08:55:39,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 08:55:39,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 2: [2022-11-26 08:55:39,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:55:39,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 08:55:39,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 10: [2022-11-26 08:55:39,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:55:39,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 08:55:39,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 0: [2022-11-26 08:55:39,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:55:39,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 08:55:39,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 2: [2022-11-26 08:55:39,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 08:55:39,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 10: [2022-11-26 08:55:39,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:55:39,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 10: [2022-11-26 08:55:39,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:55:39,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 08:55:39,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 08:55:39,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 10: [2022-11-26 08:55:39,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 7: [2022-11-26 08:55:39,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:55:39,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:55:39,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 08:55:39,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 08:55:39,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 08:55:39,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 7: [2022-11-26 08:55:39,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 7: [2022-11-26 08:55:39,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 08:55:39,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 7: [2022-11-26 08:55:39,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:55:39,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 08:55:39,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 3: [2022-11-26 08:55:39,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:55:39,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
3: [2022-11-26 08:55:39,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 7: [2022-11-26 08:55:39,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 3: [2022-11-26 08:55:39,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 7: [2022-11-26 08:55:39,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 4: [2022-11-26 08:55:39,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:55:39,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 08:55:39,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 4: [2022-11-26 08:55:39,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:55:39,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 08:55:39,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 7: [2022-11-26 08:55:39,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 08:55:39,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 08:55:39,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 1: [2022-11-26 08:55:39,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:55:39,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:55:39,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 08:55:39,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 08:55:39,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 1: [2022-11-26 08:55:39,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 15: [2022-11-26 08:55:39,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:55:39,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 08:55:39,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 4: [2022-11-26 08:55:39,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 08:55:39,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 08:55:39,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 15: [2022-11-26 08:55:39,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:55:39,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 08:55:39,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 15: [2022-11-26 08:55:39,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:55:39,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 08:55:39,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 14: [2022-11-26 08:55:39,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:55:39,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 08:55:39,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 5: [2022-11-26 08:55:39,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 08:55:39,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 08:55:39,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 1: [2022-11-26 08:55:39,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:55:39,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:55:39,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 5: [2022-11-26 08:55:39,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 1: [2022-11-26 08:55:39,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 5: [2022-11-26 08:55:39,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 5: [2022-11-26 08:55:39,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:55:39,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 08:55:39,010] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 6: [2022-11-26 08:55:39,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 08:55:39,010] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 08:55:39,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 6: [2022-11-26 08:55:39,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:55:39,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 08:55:39,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 0: [2022-11-26 08:55:39,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:55:39,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 08:55:39,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 14: [2022-11-26 08:55:39,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:55:39,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 2: [2022-11-26 08:55:39,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:55:39,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
2: [2022-11-26 08:55:39,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 08:55:39,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 5: [2022-11-26 08:55:39,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:55:39,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:55:39,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:55:39,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 08:55:39,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 08:55:39,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 08:55:39,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 5: [2022-11-26 08:55:39,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 5: [2022-11-26 08:55:39,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 9: [2022-11-26 08:55:39,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
6: [2022-11-26 08:55:39,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:55:39,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 6: [2022-11-26 08:55:39,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 08:55:39,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 9: [2022-11-26 08:55:39,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 3: [2022-11-26 08:55:39,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:55:39,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:55:39,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 3: [2022-11-26 08:55:39,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 0: [2022-11-26 08:55:39,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 3: [2022-11-26 08:55:39,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 0: [2022-11-26 08:55:39,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
13: [2022-11-26 08:55:39,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:55:39,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:55:39,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 08:55:39,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 13: [2022-11-26 08:55:39,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 08:55:39,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 6: [2022-11-26 08:55:39,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:55:39,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:55:39,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 08:55:39,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 9: [2022-11-26 08:55:39,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 08:55:39,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
9: [2022-11-26 08:55:39,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:55:39,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:55:39,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:55:39,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:55:39,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 9: [2022-11-26 08:55:39,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 08:55:39,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 08:55:39,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 08:55:39,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 6: [2022-11-26 08:55:39,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
9: [2022-11-26 08:55:39,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 08:55:39,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 9: [2022-11-26 08:55:39,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 9: [2022-11-26 08:55:39,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 9: [2022-11-26 08:55:39,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 8: [2022-11-26 08:55:39,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:55:39,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 15: [2022-11-26 08:55:39,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:55:39,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 08:55:39,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 15: [2022-11-26 08:55:39,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 8: [2022-11-26 08:55:39,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 8: [2022-11-26 08:55:39,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
15: [2022-11-26 08:55:39,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 2: [2022-11-26 08:55:39,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:55:39,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 08:55:39,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 8: [2022-11-26 08:55:39,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 08:55:39,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 08:55:39,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 14: [2022-11-26 08:55:39,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:55:39,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 08:55:39,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 12: [2022-11-26 08:55:39,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:55:39,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 08:55:39,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:55:39,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 08:55:39,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 08:55:39,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:55:39,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:55:39,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 08:55:39,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 12: [2022-11-26 08:55:39,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 12: [2022-11-26 08:55:39,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 08:55:39,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 08:55:39,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 12: [2022-11-26 08:55:39,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
12: [2022-11-26 08:55:39,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 5: [2022-11-26 08:55:39,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 08:55:39,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 08:55:39,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 0: [2022-11-26 08:55:39,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 08:55:39,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 08:55:39,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 12: [2022-11-26 08:55:39,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:55:39,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 08:55:39,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 9: [2022-11-26 08:55:39,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 08:55:39,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 08:55:39,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 7: [2022-11-26 08:55:39,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:55:39,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 08:55:39,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 1: [2022-11-26 08:55:39,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:55:39,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 08:55:39,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 1: [2022-11-26 08:55:39,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 08:55:39,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 08:55:39,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 7: [2022-11-26 08:55:39,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
1: [2022-11-26 08:55:39,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 7: [2022-11-26 08:55:39,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 1: [2022-11-26 08:55:39,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 7: [2022-11-26 08:55:39,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 1: [2022-11-26 08:55:39,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 13: [2022-11-26 08:55:39,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:55:39,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 08:55:39,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 3: [2022-11-26 08:55:39,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:55:39,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 08:55:39,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 3: [2022-11-26 08:55:39,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 08:55:39,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:55:39,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 08:55:39,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 08:55:39,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 3: [2022-11-26 08:55:39,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 9: [2022-11-26 08:55:39,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:55:39,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:55:39,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 08:55:39,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 08:55:39,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 9: [2022-11-26 08:55:39,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 3: [2022-11-26 08:55:39,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
3: [2022-11-26 08:55:39,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 9: [2022-11-26 08:55:39,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 14: [2022-11-26 08:55:39,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 08:55:39,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 08:55:39,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 12: [2022-11-26 08:55:39,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:55:39,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 08:55:39,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 12: [2022-11-26 08:55:39,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 08:55:39,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 08:55:39,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 10: [2022-11-26 08:55:39,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 08:55:39,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 08:55:39,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 6: [2022-11-26 08:55:39,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 08:55:39,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 08:55:39,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 13: [2022-11-26 08:55:39,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:55:39,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 08:55:39,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 13: [2022-11-26 08:55:39,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 08:55:39,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 08:55:39,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
0: [2022-11-26 08:55:39,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 08:55:39,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 2: [2022-11-26 08:55:39,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 08:55:39,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 08:55:39,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 4: [2022-11-26 08:55:39,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:55:39,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 08:55:39,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 4: [2022-11-26 08:55:39,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 08:55:39,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 08:55:39,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 4: [2022-11-26 08:55:39,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 08:55:39,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 08:55:39,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 10: [2022-11-26 08:55:39,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:55:39,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 08:55:39,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 10: [2022-11-26 08:55:39,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 08:55:39,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 08:55:39,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 11: [2022-11-26 08:55:39,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:55:39,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:55:39,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 08:55:39,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 08:55:39,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:55:39,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 08:55:39,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 08:55:39,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 11: [2022-11-26 08:55:39,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 08:55:39,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 11: [2022-11-26 08:55:39,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 11: [2022-11-26 08:55:39,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 11: [2022-11-26 08:55:39,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:55:39,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:55:39,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 08:55:39,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 08:55:39,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 08:55:39,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 08:55:39,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 11: [2022-11-26 08:55:39,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 11: [2022-11-26 08:55:39,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 11: [2022-11-26 08:55:39,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 08:55:39,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step44000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 08:55:39,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step44000 is ready now! 
0: successfully saved checkpoint at iteration 44000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3776.19 15: iteration 44010/ 125429 | consumed samples: 11266560 | consumed tokens: 23073914880 | elapsed time per iteration (s): 1.43 | learning rate: 1.523E-04 | global batch size: 256 | lm loss: 2.019703E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 178.667 | TFLOPs: 29.53 | 15: iteration 44020/ 125429 | consumed samples: 11269120 | consumed tokens: 23079157760 | elapsed time per iteration (s): 1.04 | learning rate: 1.523E-04 | global batch size: 256 | lm loss: 2.036098E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.072 | TFLOPs: 40.83 | 15: iteration 44030/ 125429 | consumed samples: 11271680 | consumed tokens: 23084400640 | elapsed time per iteration (s): 1.04 | learning rate: 1.522E-04 | global batch size: 256 | lm loss: 2.015736E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.001 | TFLOPs: 40.82 | 15: iteration 44040/ 125429 | consumed samples: 11274240 | consumed tokens: 23089643520 | elapsed time per iteration (s): 1.06 | learning rate: 1.522E-04 | global batch size: 256 | lm loss: 2.074430E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.717 | TFLOPs: 39.95 | 15: iteration 44050/ 125429 | consumed samples: 11276800 | consumed tokens: 23094886400 | elapsed time per iteration (s): 1.04 | learning rate: 1.522E-04 | global batch size: 256 | lm loss: 2.032829E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.879 | TFLOPs: 40.63 | 15: iteration 44060/ 125429 | consumed samples: 11279360 | consumed tokens: 23100129280 | elapsed time per iteration (s): 1.04 | 
learning rate: 1.522E-04 | global batch size: 256 | lm loss: 2.025800E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.235 | TFLOPs: 40.53 | 15: iteration 44070/ 125429 | consumed samples: 11281920 | consumed tokens: 23105372160 | elapsed time per iteration (s): 1.03 | learning rate: 1.522E-04 | global batch size: 256 | lm loss: 2.049527E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.847 | TFLOPs: 40.96 | 15: iteration 44080/ 125429 | consumed samples: 11284480 | consumed tokens: 23110615040 | elapsed time per iteration (s): 1.05 | learning rate: 1.521E-04 | global batch size: 256 | lm loss: 2.026095E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.876 | TFLOPs: 40.14 | 15: iteration 44090/ 125429 | consumed samples: 11287040 | consumed tokens: 23115857920 | elapsed time per iteration (s): 1.04 | learning rate: 1.521E-04 | global batch size: 256 | lm loss: 2.030464E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.159 | TFLOPs: 40.68 | 15: iteration 44100/ 125429 | consumed samples: 11289600 | consumed tokens: 23121100800 | elapsed time per iteration (s): 1.04 | learning rate: 1.521E-04 | global batch size: 256 | lm loss: 2.050555E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.241 | TFLOPs: 40.53 | 15: iteration 44110/ 125429 | consumed samples: 11292160 | consumed tokens: 23126343680 | elapsed time per iteration (s): 1.02 | learning rate: 1.521E-04 | global batch size: 256 | lm loss: 2.062041E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.353 | TFLOPs: 41.37 | 15: iteration 44120/ 
125429 | consumed samples: 11294720 | consumed tokens: 23131586560 | elapsed time per iteration (s): 1.03 | learning rate: 1.521E-04 | global batch size: 256 | lm loss: 2.031392E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.341 | TFLOPs: 41.21 | 15: iteration 44130/ 125429 | consumed samples: 11297280 | consumed tokens: 23136829440 | elapsed time per iteration (s): 1.03 | learning rate: 1.520E-04 | global batch size: 256 | lm loss: 2.030335E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.104 | TFLOPs: 41.17 | 15: iteration 44140/ 125429 | consumed samples: 11299840 | consumed tokens: 23142072320 | elapsed time per iteration (s): 1.04 | learning rate: 1.520E-04 | global batch size: 256 | lm loss: 2.016788E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.198 | TFLOPs: 40.69 | 15: iteration 44150/ 125429 | consumed samples: 11302400 | consumed tokens: 23147315200 | elapsed time per iteration (s): 1.02 | learning rate: 1.520E-04 | global batch size: 256 | lm loss: 2.065081E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.355 | TFLOPs: 41.54 | 15: iteration 44160/ 125429 | consumed samples: 11304960 | consumed tokens: 23152558080 | elapsed time per iteration (s): 1.04 | learning rate: 1.520E-04 | global batch size: 256 | lm loss: 2.032545E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.485 | TFLOPs: 40.57 | 15: iteration 44170/ 125429 | consumed samples: 11307520 | consumed tokens: 23157800960 | elapsed time per iteration (s): 1.06 | learning rate: 1.520E-04 | global batch size: 256 | lm loss: 2.015774E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 240.547 | TFLOPs: 39.75 | 15: iteration 44180/ 125429 | consumed samples: 11310080 | consumed tokens: 23163043840 | elapsed time per iteration (s): 1.04 | learning rate: 1.519E-04 | global batch size: 256 | lm loss: 2.012420E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.049 | TFLOPs: 40.66 | 15: iteration 44190/ 125429 | consumed samples: 11312640 | consumed tokens: 23168286720 | elapsed time per iteration (s): 1.04 | learning rate: 1.519E-04 | global batch size: 256 | lm loss: 2.030714E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.197 | TFLOPs: 40.69 | 15: iteration 44200/ 125429 | consumed samples: 11315200 | consumed tokens: 23173529600 | elapsed time per iteration (s): 1.03 | learning rate: 1.519E-04 | global batch size: 256 | lm loss: 2.008879E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.057 | TFLOPs: 40.99 | 15: iteration 44210/ 125429 | consumed samples: 11317760 | consumed tokens: 23178772480 | elapsed time per iteration (s): 1.04 | learning rate: 1.519E-04 | global batch size: 256 | lm loss: 2.034268E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.068 | TFLOPs: 40.66 | 15: iteration 44220/ 125429 | consumed samples: 11320320 | consumed tokens: 23184015360 | elapsed time per iteration (s): 1.07 | learning rate: 1.519E-04 | global batch size: 256 | lm loss: 2.040649E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.121 | TFLOPs: 39.52 | 15: iteration 44230/ 125429 | consumed samples: 11322880 | consumed tokens: 23189258240 | elapsed time per iteration (s): 1.07 | learning rate: 
1.518E-04 | global batch size: 256 | lm loss: 2.019187E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.986 | TFLOPs: 39.49 | 15: iteration 44240/ 125429 | consumed samples: 11325440 | consumed tokens: 23194501120 | elapsed time per iteration (s): 1.05 | learning rate: 1.518E-04 | global batch size: 256 | lm loss: 2.044135E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.098 | TFLOPs: 40.17 | 15: iteration 44250/ 125429 | consumed samples: 11328000 | consumed tokens: 23199744000 | elapsed time per iteration (s): 1.05 | learning rate: 1.518E-04 | global batch size: 256 | lm loss: 2.038379E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.750 | TFLOPs: 40.45 | 15: iteration 44260/ 125429 | consumed samples: 11330560 | consumed tokens: 23204986880 | elapsed time per iteration (s): 1.04 | learning rate: 1.518E-04 | global batch size: 256 | lm loss: 2.036673E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.494 | TFLOPs: 40.73 | 15: iteration 44270/ 125429 | consumed samples: 11333120 | consumed tokens: 23210229760 | elapsed time per iteration (s): 1.03 | learning rate: 1.518E-04 | global batch size: 256 | lm loss: 2.027522E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.697 | TFLOPs: 41.26 | 15: iteration 44280/ 125429 | consumed samples: 11335680 | consumed tokens: 23215472640 | elapsed time per iteration (s): 1.06 | learning rate: 1.517E-04 | global batch size: 256 | lm loss: 2.041456E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.833 | TFLOPs: 39.80 | 15: iteration 44290/ 125429 | 
consumed samples: 11338240 | consumed tokens: 23220715520 | elapsed time per iteration (s): 1.04 | learning rate: 1.517E-04 | global batch size: 256 | lm loss: 2.044072E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.914 | TFLOPs: 40.64 | 15: iteration 44300/ 125429 | consumed samples: 11340800 | consumed tokens: 23225958400 | elapsed time per iteration (s): 1.03 | learning rate: 1.517E-04 | global batch size: 256 | lm loss: 1.997846E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.554 | TFLOPs: 41.24 | 15: iteration 44310/ 125429 | consumed samples: 11343360 | consumed tokens: 23231201280 | elapsed time per iteration (s): 1.03 | learning rate: 1.517E-04 | global batch size: 256 | lm loss: 2.050464E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.209 | TFLOPs: 41.18 | 15: iteration 44320/ 125429 | consumed samples: 11345920 | consumed tokens: 23236444160 | elapsed time per iteration (s): 1.03 | learning rate: 1.517E-04 | global batch size: 256 | lm loss: 2.014420E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.410 | TFLOPs: 41.05 | 15: iteration 44330/ 125429 | consumed samples: 11348480 | consumed tokens: 23241687040 | elapsed time per iteration (s): 1.05 | learning rate: 1.516E-04 | global batch size: 256 | lm loss: 2.040515E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.501 | TFLOPs: 40.41 | 15: iteration 44340/ 125429 | consumed samples: 11351040 | consumed tokens: 23246929920 | elapsed time per iteration (s): 1.02 | learning rate: 1.516E-04 | global batch size: 256 | lm loss: 2.010942E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 249.837 | TFLOPs: 41.29 | 15: iteration 44350/ 125429 | consumed samples: 11353600 | consumed tokens: 23252172800 | elapsed time per iteration (s): 1.03 | learning rate: 1.516E-04 | global batch size: 256 | lm loss: 2.033913E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.847 | TFLOPs: 40.96 | 15: iteration 44360/ 125429 | consumed samples: 11356160 | consumed tokens: 23257415680 | elapsed time per iteration (s): 1.03 | learning rate: 1.516E-04 | global batch size: 256 | lm loss: 2.033554E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.401 | TFLOPs: 40.88 | 15: iteration 44370/ 125429 | consumed samples: 11358720 | consumed tokens: 23262658560 | elapsed time per iteration (s): 1.06 | learning rate: 1.516E-04 | global batch size: 256 | lm loss: 2.025718E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.264 | TFLOPs: 40.04 | 15: iteration 44380/ 125429 | consumed samples: 11361280 | consumed tokens: 23267901440 | elapsed time per iteration (s): 1.05 | learning rate: 1.515E-04 | global batch size: 256 | lm loss: 2.052693E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.717 | TFLOPs: 40.28 | 15: iteration 44390/ 125429 | consumed samples: 11363840 | consumed tokens: 23273144320 | elapsed time per iteration (s): 1.03 | learning rate: 1.515E-04 | global batch size: 256 | lm loss: 2.044973E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.240 | TFLOPs: 41.02 | 15: iteration 44400/ 125429 | consumed samples: 11366400 | consumed tokens: 23278387200 | elapsed time per iteration (s): 1.07 | learning rate: 1.515E-04 | global batch 
size: 256 | lm loss: 2.026260E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.105 | TFLOPs: 39.68 | 15: iteration 44410/ 125429 | consumed samples: 11368960 | consumed tokens: 23283630080 | elapsed time per iteration (s): 1.03 | learning rate: 1.515E-04 | global batch size: 256 | lm loss: 2.043004E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.028 | TFLOPs: 40.99 | 15: iteration 44420/ 125429 | consumed samples: 11371520 | consumed tokens: 23288872960 | elapsed time per iteration (s): 1.02 | learning rate: 1.515E-04 | global batch size: 256 | lm loss: 2.040713E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.065 | TFLOPs: 41.33 | 15: iteration 44430/ 125429 | consumed samples: 11374080 | consumed tokens: 23294115840 | elapsed time per iteration (s): 1.03 | learning rate: 1.514E-04 | global batch size: 256 | lm loss: 2.003192E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.742 | TFLOPs: 41.11 | 15: iteration 44440/ 125429 | consumed samples: 11376640 | consumed tokens: 23299358720 | elapsed time per iteration (s): 1.03 | learning rate: 1.514E-04 | global batch size: 256 | lm loss: 2.028421E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.592 | TFLOPs: 40.92 | 15: iteration 44450/ 125429 | consumed samples: 11379200 | consumed tokens: 23304601600 | elapsed time per iteration (s): 1.04 | learning rate: 1.514E-04 | global batch size: 256 | lm loss: 2.057906E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.596 | TFLOPs: 40.59 | 15: iteration 44460/ 125429 | consumed samples: 11381760 | 
consumed tokens: 23309844480 | elapsed time per iteration (s): 1.03 | learning rate: 1.514E-04 | global batch size: 256 | lm loss: 2.009726E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.969 | TFLOPs: 41.14 | 15: iteration 44470/ 125429 | consumed samples: 11384320 | consumed tokens: 23315087360 | elapsed time per iteration (s): 1.03 | learning rate: 1.514E-04 | global batch size: 256 | lm loss: 2.024454E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.548 | TFLOPs: 41.07 | 15: iteration 44480/ 125429 | consumed samples: 11386880 | consumed tokens: 23320330240 | elapsed time per iteration (s): 1.06 | learning rate: 1.513E-04 | global batch size: 256 | lm loss: 2.049525E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.841 | TFLOPs: 39.80 | 15: iteration 44490/ 125429 | consumed samples: 11389440 | consumed tokens: 23325573120 | elapsed time per iteration (s): 1.07 | learning rate: 1.513E-04 | global batch size: 256 | lm loss: 2.005742E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.435 | TFLOPs: 39.40 | 15: iteration 44500/ 125429 | consumed samples: 11392000 | consumed tokens: 23330816000 | elapsed time per iteration (s): 1.05 | learning rate: 1.513E-04 | global batch size: 256 | lm loss: 2.032797E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.017 | TFLOPs: 40.33 | 15: iteration 44510/ 125429 | consumed samples: 11394560 | consumed tokens: 23336058880 | elapsed time per iteration (s): 1.05 | learning rate: 1.513E-04 | global batch size: 256 | lm loss: 2.047353E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 243.540 | TFLOPs: 40.25 | 15: iteration 44520/ 125429 | consumed samples: 11397120 | consumed tokens: 23341301760 | elapsed time per iteration (s): 1.05 | learning rate: 1.513E-04 | global batch size: 256 | lm loss: 2.038535E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.027 | TFLOPs: 40.16 | 15: iteration 44530/ 125429 | consumed samples: 11399680 | consumed tokens: 23346544640 | elapsed time per iteration (s): 1.06 | learning rate: 1.512E-04 | global batch size: 256 | lm loss: 2.007548E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.544 | TFLOPs: 40.08 | 15: iteration 44540/ 125429 | consumed samples: 11402240 | consumed tokens: 23351787520 | elapsed time per iteration (s): 1.04 | learning rate: 1.512E-04 | global batch size: 256 | lm loss: 2.050804E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.523 | TFLOPs: 40.74 | 15: iteration 44550/ 125429 | consumed samples: 11404800 | consumed tokens: 23357030400 | elapsed time per iteration (s): 1.03 | learning rate: 1.512E-04 | global batch size: 256 | lm loss: 2.028027E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.447 | TFLOPs: 41.06 | 15: iteration 44560/ 125429 | consumed samples: 11407360 | consumed tokens: 23362273280 | elapsed time per iteration (s): 1.02 | learning rate: 1.512E-04 | global batch size: 256 | lm loss: 2.010766E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.464 | TFLOPs: 41.39 | 15: iteration 44570/ 125429 | consumed samples: 11409920 | consumed tokens: 23367516160 | elapsed time per iteration (s): 1.10 | learning rate: 1.512E-04 | global batch size: 256 | lm loss: 
2.041162E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.702 | TFLOPs: 38.46 | 15: iteration 44580/ 125429 | consumed samples: 11412480 | consumed tokens: 23372759040 | elapsed time per iteration (s): 1.03 | learning rate: 1.511E-04 | global batch size: 256 | lm loss: 2.035644E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.742 | TFLOPs: 41.27 | 15: iteration 44590/ 125429 | consumed samples: 11415040 | consumed tokens: 23378001920 | elapsed time per iteration (s): 1.03 | learning rate: 1.511E-04 | global batch size: 256 | lm loss: 2.021537E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.591 | TFLOPs: 41.08 | 15: iteration 44600/ 125429 | consumed samples: 11417600 | consumed tokens: 23383244800 | elapsed time per iteration (s): 1.03 | learning rate: 1.511E-04 | global batch size: 256 | lm loss: 2.015726E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.497 | TFLOPs: 41.07 | 15: iteration 44610/ 125429 | consumed samples: 11420160 | consumed tokens: 23388487680 | elapsed time per iteration (s): 1.07 | learning rate: 1.511E-04 | global batch size: 256 | lm loss: 2.021144E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.502 | TFLOPs: 39.58 | 15: iteration 44620/ 125429 | consumed samples: 11422720 | consumed tokens: 23393730560 | elapsed time per iteration (s): 1.03 | learning rate: 1.511E-04 | global batch size: 256 | lm loss: 2.037560E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.722 | TFLOPs: 41.27 | 15: iteration 44630/ 125429 | consumed samples: 11425280 | consumed tokens: 
23398973440 | elapsed time per iteration (s): 1.04 | learning rate: 1.510E-04 | global batch size: 256 | lm loss: 2.037091E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.252 | TFLOPs: 40.70 | 15: iteration 44640/ 125429 | consumed samples: 11427840 | consumed tokens: 23404216320 | elapsed time per iteration (s): 1.04 | learning rate: 1.510E-04 | global batch size: 256 | lm loss: 2.031824E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.860 | TFLOPs: 40.63 | 15: iteration 44650/ 125429 | consumed samples: 11430400 | consumed tokens: 23409459200 | elapsed time per iteration (s): 1.07 | learning rate: 1.510E-04 | global batch size: 256 | lm loss: 2.026217E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.301 | TFLOPs: 39.71 | 15: iteration 44660/ 125429 | consumed samples: 11432960 | consumed tokens: 23414702080 | elapsed time per iteration (s): 1.10 | learning rate: 1.510E-04 | global batch size: 256 | lm loss: 1.999497E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.862 | TFLOPs: 38.48 | 15: iteration 44670/ 125429 | consumed samples: 11435520 | consumed tokens: 23419944960 | elapsed time per iteration (s): 1.03 | learning rate: 1.510E-04 | global batch size: 256 | lm loss: 2.032440E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.919 | TFLOPs: 41.14 | 15: iteration 44680/ 125429 | consumed samples: 11438080 | consumed tokens: 23425187840 | elapsed time per iteration (s): 1.05 | learning rate: 1.509E-04 | global batch size: 256 | lm loss: 2.010027E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 244.443 | TFLOPs: 40.40 | 15: iteration 44690/ 125429 | consumed samples: 11440640 | consumed tokens: 23430430720 | elapsed time per iteration (s): 1.04 | learning rate: 1.509E-04 | global batch size: 256 | lm loss: 2.018641E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.186 | TFLOPs: 40.85 | 15: iteration 44700/ 125429 | consumed samples: 11443200 | consumed tokens: 23435673600 | elapsed time per iteration (s): 1.03 | learning rate: 1.509E-04 | global batch size: 256 | lm loss: 2.058666E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.484 | TFLOPs: 40.90 | 15: iteration 44710/ 125429 | consumed samples: 11445760 | consumed tokens: 23440916480 | elapsed time per iteration (s): 1.05 | learning rate: 1.509E-04 | global batch size: 256 | lm loss: 2.051385E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.392 | TFLOPs: 40.22 | 15: iteration 44720/ 125429 | consumed samples: 11448320 | consumed tokens: 23446159360 | elapsed time per iteration (s): 1.06 | learning rate: 1.508E-04 | global batch size: 256 | lm loss: 2.028814E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.080 | TFLOPs: 40.01 | 15: iteration 44730/ 125429 | consumed samples: 11450880 | consumed tokens: 23451402240 | elapsed time per iteration (s): 1.03 | learning rate: 1.508E-04 | global batch size: 256 | lm loss: 2.027246E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.045 | TFLOPs: 40.99 | 15: iteration 44740/ 125429 | consumed samples: 11453440 | consumed tokens: 23456645120 | elapsed time per iteration (s): 1.06 | learning rate: 1.508E-04 | global batch size: 256 | lm loss: 2.037248E+00 | grad 
norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.814 | TFLOPs: 39.96 | 15: iteration 44750/ 125429 | consumed samples: 11456000 | consumed tokens: 23461888000 | elapsed time per iteration (s): 1.05 | learning rate: 1.508E-04 | global batch size: 256 | lm loss: 2.043192E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.212 | TFLOPs: 40.36 | 15: iteration 44760/ 125429 | consumed samples: 11458560 | consumed tokens: 23467130880 | elapsed time per iteration (s): 1.05 | learning rate: 1.508E-04 | global batch size: 256 | lm loss: 2.025479E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.475 | TFLOPs: 40.24 | 15: iteration 44770/ 125429 | consumed samples: 11461120 | consumed tokens: 23472373760 | elapsed time per iteration (s): 1.04 | learning rate: 1.507E-04 | global batch size: 256 | lm loss: 2.051973E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.990 | TFLOPs: 40.49 | 15: iteration 44780/ 125429 | consumed samples: 11463680 | consumed tokens: 23477616640 | elapsed time per iteration (s): 1.49 | learning rate: 1.507E-04 | global batch size: 256 | lm loss: 2.028986E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 172.173 | TFLOPs: 28.45 | 15: iteration 44790/ 125429 | consumed samples: 11466240 | consumed tokens: 23482859520 | elapsed time per iteration (s): 1.08 | learning rate: 1.507E-04 | global batch size: 256 | lm loss: 2.040327E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.829 | TFLOPs: 39.14 | 15: iteration 44800/ 125429 | consumed samples: 11468800 | consumed tokens: 23488102400 | elapsed time 
per iteration (s): 1.07 | learning rate: 1.507E-04 | global batch size: 256 | lm loss: 2.038810E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.294 | TFLOPs: 39.55 | 15: iteration 44810/ 125429 | consumed samples: 11471360 | consumed tokens: 23493345280 | elapsed time per iteration (s): 1.04 | learning rate: 1.507E-04 | global batch size: 256 | lm loss: 2.024391E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.779 | TFLOPs: 40.62 | 15: iteration 44820/ 125429 | consumed samples: 11473920 | consumed tokens: 23498588160 | elapsed time per iteration (s): 1.04 | learning rate: 1.506E-04 | global batch size: 256 | lm loss: 2.036651E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.121 | TFLOPs: 40.84 | 15: iteration 44830/ 125429 | consumed samples: 11476480 | consumed tokens: 23503831040 | elapsed time per iteration (s): 1.09 | learning rate: 1.506E-04 | global batch size: 256 | lm loss: 2.025180E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.034 | TFLOPs: 38.84 | 15: iteration 44840/ 125429 | consumed samples: 11479040 | consumed tokens: 23509073920 | elapsed time per iteration (s): 1.06 | learning rate: 1.506E-04 | global batch size: 256 | lm loss: 2.028068E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.737 | TFLOPs: 39.78 | 15: iteration 44850/ 125429 | consumed samples: 11481600 | consumed tokens: 23514316800 | elapsed time per iteration (s): 1.05 | learning rate: 1.506E-04 | global batch size: 256 | lm loss: 2.016591E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.577 | TFLOPs: 
40.42 | 15: iteration 44860/ 125429 | consumed samples: 11484160 | consumed tokens: 23519559680 | elapsed time per iteration (s): 1.03 | learning rate: 1.506E-04 | global batch size: 256 | lm loss: 2.032930E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.588 | TFLOPs: 40.92 | 15: iteration 44870/ 125429 | consumed samples: 11486720 | consumed tokens: 23524802560 | elapsed time per iteration (s): 1.04 | learning rate: 1.505E-04 | global batch size: 256 | lm loss: 2.053276E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.659 | TFLOPs: 40.76 | 15: iteration 44880/ 125429 | consumed samples: 11489280 | consumed tokens: 23530045440 | elapsed time per iteration (s): 1.04 | learning rate: 1.505E-04 | global batch size: 256 | lm loss: 2.050997E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.578 | TFLOPs: 40.58 | 15: iteration 44890/ 125429 | consumed samples: 11491840 | consumed tokens: 23535288320 | elapsed time per iteration (s): 1.04 | learning rate: 1.505E-04 | global batch size: 256 | lm loss: 2.042354E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.011 | TFLOPs: 40.82 | 15: iteration 44900/ 125429 | consumed samples: 11494400 | consumed tokens: 23540531200 | elapsed time per iteration (s): 1.05 | learning rate: 1.505E-04 | global batch size: 256 | lm loss: 2.042610E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.478 | TFLOPs: 40.24 | 15: iteration 44910/ 125429 | consumed samples: 11496960 | consumed tokens: 23545774080 | elapsed time per iteration (s): 1.13 | learning rate: 1.505E-04 | global batch size: 256 | lm loss: 2.017000E+00 | grad norm: 0.126 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.187 | TFLOPs: 37.38 | 15: iteration 44920/ 125429 | consumed samples: 11499520 | consumed tokens: 23551016960 | elapsed time per iteration (s): 1.04 | learning rate: 1.504E-04 | global batch size: 256 | lm loss: 2.064503E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.697 | TFLOPs: 40.77 | 15: iteration 44930/ 125429 | consumed samples: 11502080 | consumed tokens: 23556259840 | elapsed time per iteration (s): 1.05 | learning rate: 1.504E-04 | global batch size: 256 | lm loss: 2.025583E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.319 | TFLOPs: 40.38 | 15: iteration 44940/ 125429 | consumed samples: 11504640 | consumed tokens: 23561502720 | elapsed time per iteration (s): 1.05 | learning rate: 1.504E-04 | global batch size: 256 | lm loss: 2.050499E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.181 | TFLOPs: 40.35 | 15: iteration 44950/ 125429 | consumed samples: 11507200 | consumed tokens: 23566745600 | elapsed time per iteration (s): 1.05 | learning rate: 1.504E-04 | global batch size: 256 | lm loss: 2.029336E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.759 | TFLOPs: 40.28 | 15: iteration 44960/ 125429 | consumed samples: 11509760 | consumed tokens: 23571988480 | elapsed time per iteration (s): 1.03 | learning rate: 1.504E-04 | global batch size: 256 | lm loss: 2.040507E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.848 | TFLOPs: 40.96 | 15: iteration 44970/ 125429 | consumed samples: 11512320 | consumed tokens: 23577231360 | elapsed time per iteration (s): 1.04 | 
learning rate: 1.503E-04 | global batch size: 256 | lm loss: 2.014065E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.855 | TFLOPs: 40.63 | 15: iteration 44980/ 125429 | consumed samples: 11514880 | consumed tokens: 23582474240 | elapsed time per iteration (s): 1.03 | learning rate: 1.503E-04 | global batch size: 256 | lm loss: 2.031066E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.147 | TFLOPs: 41.01 | 15: iteration 44990/ 125429 | consumed samples: 11517440 | consumed tokens: 23587717120 | elapsed time per iteration (s): 1.05 | learning rate: 1.503E-04 | global batch size: 256 | lm loss: 2.032879E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.402 | TFLOPs: 40.39 | 15: iteration 45000/ 125429 | consumed samples: 11520000 | consumed tokens: 23592960000 | elapsed time per iteration (s): 1.04 | learning rate: 1.503E-04 | global batch size: 256 | lm loss: 2.008032E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.193 | TFLOPs: 40.69 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 45000 | lm loss value: 1.992236E+00 | lm loss PPL: 7.331909E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 45000 to checkpoints_1b5 0: [2022-11-26 09:13:09,007] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step45000 is begin to save! 0: [2022-11-26 09:13:09,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 09:13:09,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_01-model_00-model_states.pt. 0: [2022-11-26 09:13:09,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_03-model_00-model_states.pt... 0: [2022-11-26 09:13:09,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_03-model_00-model_states.pt. 0: [2022-11-26 09:13:09,351] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_04-model_00-model_states.pt... 0: [2022-11-26 09:13:09,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_04-model_00-model_states.pt. 0: [2022-11-26 09:13:09,454] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_05-model_00-model_states.pt... 0: [2022-11-26 09:13:09,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_05-model_00-model_states.pt. 0: [2022-11-26 09:13:09,557] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_06-model_00-model_states.pt... 0: [2022-11-26 09:13:09,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_06-model_00-model_states.pt. 0: [2022-11-26 09:13:09,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_07-model_00-model_states.pt... 0: [2022-11-26 09:13:09,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_07-model_00-model_states.pt. 0: [2022-11-26 09:13:09,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 09:13:09,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_08-model_00-model_states.pt. 0: [2022-11-26 09:13:09,872] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_09-model_00-model_states.pt... 0: [2022-11-26 09:13:09,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_09-model_00-model_states.pt. 0: [2022-11-26 09:13:09,974] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_10-model_00-model_states.pt... 0: [2022-11-26 09:13:10,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_10-model_00-model_states.pt. 0: [2022-11-26 09:13:10,079] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_11-model_00-model_states.pt... 0: [2022-11-26 09:13:10,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_11-model_00-model_states.pt. 0: [2022-11-26 09:13:10,184] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_12-model_00-model_states.pt... 0: [2022-11-26 09:13:10,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_12-model_00-model_states.pt. 0: [2022-11-26 09:13:10,288] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_13-model_00-model_states.pt... 0: [2022-11-26 09:13:10,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_13-model_00-model_states.pt. 0: [2022-11-26 09:13:10,389] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 09:13:10,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_14-model_00-model_states.pt. 0: [2022-11-26 09:13:10,496] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_15-model_00-model_states.pt... 0: [2022-11-26 09:13:10,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_15-model_00-model_states.pt. 0: [2022-11-26 09:13:10,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_16-model_00-model_states.pt... 0: [2022-11-26 09:13:10,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_16-model_00-model_states.pt. 0: [2022-11-26 09:13:10,701] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_17-model_00-model_states.pt... 0: [2022-11-26 09:13:10,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_17-model_00-model_states.pt. 0: [2022-11-26 09:13:10,807] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_18-model_00-model_states.pt... 0: [2022-11-26 09:13:10,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_18-model_00-model_states.pt. 0: [2022-11-26 09:13:10,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_19-model_00-model_states.pt... 0: [2022-11-26 09:13:11,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_19-model_00-model_states.pt. 0: [2022-11-26 09:13:11,020] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 09:13:11,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_20-model_00-model_states.pt. 0: [2022-11-26 09:13:11,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_21-model_00-model_states.pt... 0: [2022-11-26 09:13:11,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_21-model_00-model_states.pt. 0: [2022-11-26 09:13:11,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_22-model_00-model_states.pt... 0: [2022-11-26 09:13:11,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_22-model_00-model_states.pt. 0: [2022-11-26 09:13:11,344] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_23-model_00-model_states.pt... 0: [2022-11-26 09:13:11,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_23-model_00-model_states.pt. 0: [2022-11-26 09:13:11,454] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_24-model_00-model_states.pt... 0: [2022-11-26 09:13:11,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_24-model_00-model_states.pt. 0: [2022-11-26 09:13:11,561] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_25-model_00-model_states.pt... 0: [2022-11-26 09:13:11,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_25-model_00-model_states.pt. 0: [2022-11-26 09:13:11,667] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 09:13:11,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_26-model_00-model_states.pt. 0: [2022-11-26 09:13:11,773] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_27-model_00-model_states.pt... 0: [2022-11-26 09:13:11,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_27-model_00-model_states.pt. 0: [2022-11-26 09:13:11,880] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_28-model_00-model_states.pt... 0: [2022-11-26 09:13:11,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_28-model_00-model_states.pt. 0: [2022-11-26 09:13:11,986] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_29-model_00-model_states.pt... 0: [2022-11-26 09:13:12,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_29-model_00-model_states.pt. 0: [2022-11-26 09:13:12,091] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_30-model_00-model_states.pt... 0: [2022-11-26 09:13:12,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_30-model_00-model_states.pt. 0: [2022-11-26 09:13:12,198] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/layer_32-model_00-model_states.pt... 0: [2022-11-26 09:13:12,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 09:13:12,203] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step45000/mp_rank_00_model_states.pt 0: [2022-11-26 09:13:12,203] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/mp_rank_00_model_states.pt... 0: [2022-11-26 09:13:12,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/mp_rank_00_model_states.pt. 0: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
9: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
10: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
5: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
5: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
4: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
7: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
3: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:13:12,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step45000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:13:12,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 09:13:12,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 09:13:12,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 5: [2022-11-26 09:13:12,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:13:12,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 09:13:12,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 6: [2022-11-26 09:13:12,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:13:12,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 09:13:12,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 9: [2022-11-26 09:13:12,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:13:12,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 09:13:12,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 11: [2022-11-26 09:13:12,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
9: [2022-11-26 09:13:12,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:13:12,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:13:12,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:13:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 7: [2022-11-26 09:13:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 12: [2022-11-26 09:13:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 9: [2022-11-26 09:13:12,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 7: [2022-11-26 09:13:12,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 12: [2022-11-26 09:13:12,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 14: [2022-11-26 09:13:12,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:13:12,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
14: [2022-11-26 09:13:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 9: [2022-11-26 09:13:12,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 14: [2022-11-26 09:13:12,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 9: [2022-11-26 09:13:12,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 6: [2022-11-26 09:13:12,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:13:12,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 09:13:12,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 2: [2022-11-26 09:13:12,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:13:12,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 09:13:12,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 2: [2022-11-26 09:13:12,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 09:13:12,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 09:13:12,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 4: [2022-11-26 09:13:12,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:13:12,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 09:13:12,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 15: [2022-11-26 09:13:12,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:13:12,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 09:13:12,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 10: [2022-11-26 09:13:12,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:13:12,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 09:13:12,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 2: [2022-11-26 09:13:12,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 09:13:12,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 09:13:12,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 0: [2022-11-26 09:13:12,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:13:12,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 09:13:12,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 10: [2022-11-26 09:13:12,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:13:12,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:13:12,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 09:13:12,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 09:13:12,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 10: [2022-11-26 09:13:12,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 2: [2022-11-26 09:13:12,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 09:13:12,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 09:13:12,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 14: [2022-11-26 09:13:12,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:13:12,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 09:13:12,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 2: [2022-11-26 09:13:12,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:13:12,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 09:13:12,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 5: [2022-11-26 09:13:12,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:13:12,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 09:13:12,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 5: [2022-11-26 09:13:12,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 09:13:12,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 09:13:12,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 3: [2022-11-26 09:13:12,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:13:12,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 09:13:12,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 3: [2022-11-26 09:13:12,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:13:12,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 09:13:12,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 3: [2022-11-26 09:13:12,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:13:12,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:13:12,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 15: [2022-11-26 09:13:12,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 09:13:12,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 3: [2022-11-26 09:13:12,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 15: [2022-11-26 09:13:12,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 09:13:12,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 15: [2022-11-26 09:13:12,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 15: [2022-11-26 09:13:12,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:13:12,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:13:12,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 7: [2022-11-26 09:13:12,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 15: [2022-11-26 09:13:12,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 7: [2022-11-26 09:13:12,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 14: [2022-11-26 09:13:12,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-26 09:13:12,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 09:13:12,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 7: [2022-11-26 09:13:12,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:13:12,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:13:12,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 12: [2022-11-26 09:13:12,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 7: [2022-11-26 09:13:12,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 12: [2022-11-26 09:13:12,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 6: [2022-11-26 09:13:12,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:13:12,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 09:13:12,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 5: [2022-11-26 09:13:12,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 09:13:12,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 09:13:12,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 7: [2022-11-26 09:13:12,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:13:12,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 09:13:12,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 10: [2022-11-26 09:13:12,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:13:12,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 09:13:12,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 7: [2022-11-26 09:13:12,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:13:12,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 09:13:12,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 14: [2022-11-26 09:13:12,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 09:13:12,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 09:13:12,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 0: [2022-11-26 09:13:12,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:13:12,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:13:12,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:13:12,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:13:12,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:13:12,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:13:12,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 6: [2022-11-26 09:13:12,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 9: [2022-11-26 09:13:12,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 6: [2022-11-26 09:13:12,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 
3: [2022-11-26 09:13:12,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 6: [2022-11-26 09:13:12,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:13:12,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 6: [2022-11-26 09:13:12,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 09:13:12,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 3: [2022-11-26 09:13:12,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:13:12,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 09:13:12,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 7: [2022-11-26 09:13:12,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:13:12,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:13:12,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
3: [2022-11-26 09:13:12,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 7: [2022-11-26 09:13:12,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 09:13:12,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 3: [2022-11-26 09:13:12,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 7: [2022-11-26 09:13:12,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 7: [2022-11-26 09:13:12,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 8: [2022-11-26 09:13:12,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:13:12,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 09:13:12,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 0: [2022-11-26 09:13:12,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:13:12,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:13:12,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
0: [2022-11-26 09:13:12,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:13:12,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:13:12,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 09:13:12,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 0: [2022-11-26 09:13:12,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:13:12,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 8: [2022-11-26 09:13:12,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 0: [2022-11-26 09:13:12,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 9: [2022-11-26 09:13:12,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
0: [2022-11-26 09:13:12,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 09:13:12,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 09:13:12,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 9: [2022-11-26 09:13:12,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 0: [2022-11-26 09:13:12,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 0: [2022-11-26 09:13:12,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 0: [2022-11-26 09:13:12,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 0: [2022-11-26 09:13:12,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 9: [2022-11-26 09:13:12,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 0: [2022-11-26 09:13:12,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:13:12,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 09:13:12,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 12: [2022-11-26 09:13:12,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 09:13:12,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 09:13:12,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 12: [2022-11-26 09:13:12,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:13:12,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:13:12,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 09:13:12,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 09:13:12,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 12: [2022-11-26 09:13:12,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 5: [2022-11-26 09:13:12,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:13:12,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 09:13:12,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 11: [2022-11-26 09:13:12,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
5: [2022-11-26 09:13:12,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 11: [2022-11-26 09:13:12,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 09:13:12,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 11: [2022-11-26 09:13:12,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:13:12,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 5: [2022-11-26 09:13:12,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 11: [2022-11-26 09:13:12,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 11: [2022-11-26 09:13:12,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:13:12,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 09:13:12,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 5: [2022-11-26 09:13:12,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 09:13:12,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 09:13:12,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 4: [2022-11-26 09:13:12,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 09:13:12,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 4: [2022-11-26 09:13:12,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:13:12,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 09:13:12,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 4: [2022-11-26 09:13:12,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:13:12,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 09:13:12,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 4: [2022-11-26 09:13:12,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 09:13:12,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 09:13:12,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 4: [2022-11-26 09:13:12,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:13:12,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 09:13:12,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 4: [2022-11-26 09:13:12,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:13:12,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 09:13:12,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 10: [2022-11-26 09:13:12,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:13:12,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 09:13:12,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 7: [2022-11-26 09:13:12,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 09:13:12,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 09:13:12,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 12: [2022-11-26 09:13:12,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:13:12,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:13:12,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 9: [2022-11-26 09:13:12,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 12: [2022-11-26 09:13:12,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 9: [2022-11-26 09:13:12,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 12: [2022-11-26 09:13:12,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:13:12,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 09:13:12,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 2: [2022-11-26 09:13:12,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
3: [2022-11-26 09:13:12,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:13:12,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:13:12,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 2: [2022-11-26 09:13:12,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 3: [2022-11-26 09:13:12,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 9: [2022-11-26 09:13:12,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 2: [2022-11-26 09:13:12,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 9: [2022-11-26 09:13:12,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 5: [2022-11-26 09:13:12,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:13:12,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 09:13:12,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 14: [2022-11-26 09:13:12,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 09:13:12,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 09:13:12,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 14: [2022-11-26 09:13:12,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:13:12,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 09:13:12,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 6: [2022-11-26 09:13:12,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:13:12,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 09:13:12,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 0: [2022-11-26 09:13:12,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:13:12,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 09:13:12,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 6: [2022-11-26 09:13:12,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 09:13:12,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 09:13:12,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 10: [2022-11-26 09:13:12,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:13:12,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 09:13:12,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 15: [2022-11-26 09:13:12,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 09:13:12,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 15: [2022-11-26 09:13:12,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:13:12,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 09:13:12,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 15: [2022-11-26 09:13:12,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:13:12,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 09:13:12,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 09:13:12,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 09:13:12,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 15: [2022-11-26 09:13:12,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 1: [2022-11-26 09:13:12,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:13:12,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 09:13:12,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 1: [2022-11-26 09:13:12,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:13:12,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 09:13:12,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 1: [2022-11-26 09:13:12,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:13:12,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 09:13:12,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:13:12,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 09:13:12,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 1: [2022-11-26 09:13:12,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 09:13:12,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 09:13:12,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 1: [2022-11-26 09:13:12,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 14: [2022-11-26 09:13:12,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:13:12,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:13:12,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 2: [2022-11-26 09:13:12,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
14: [2022-11-26 09:13:12,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 2: [2022-11-26 09:13:12,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 14: [2022-11-26 09:13:12,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 14: [2022-11-26 09:13:12,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 2: [2022-11-26 09:13:12,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 3: [2022-11-26 09:13:12,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:13:12,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 09:13:12,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 4: [2022-11-26 09:13:12,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:13:12,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 09:13:12,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 11: [2022-11-26 09:13:12,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
8: [2022-11-26 09:13:12,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:13:12,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 09:13:12,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 8: [2022-11-26 09:13:12,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:13:12,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 09:13:12,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 11: [2022-11-26 09:13:12,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 09:13:12,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 11: [2022-11-26 09:13:12,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:13:12,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-26 09:13:12,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 09:13:12,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 09:13:12,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:13:12,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 11: [2022-11-26 09:13:12,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 11: [2022-11-26 09:13:12,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 09:13:12,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 5: [2022-11-26 09:13:12,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:13:12,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 09:13:12,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 12: [2022-11-26 09:13:12,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 09:13:12,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 09:13:12,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 2: [2022-11-26 09:13:12,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:13:12,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 09:13:12,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 10: [2022-11-26 09:13:12,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:13:12,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:13:12,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 09:13:12,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 09:13:12,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 10: [2022-11-26 09:13:12,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 9: [2022-11-26 09:13:12,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 09:13:12,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 09:13:12,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 8: [2022-11-26 09:13:12,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:13:12,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 09:13:12,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 8: [2022-11-26 09:13:12,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:13:12,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 09:13:12,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 13: [2022-11-26 09:13:12,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:13:12,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:13:12,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:13:12,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 09:13:12,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:13:12,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:13:12,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:13:12,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:13:12,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 09:13:12,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 09:13:12,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 09:13:12,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 09:13:12,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 09:13:12,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 09:13:12,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 
09:13:12,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 09:13:12,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 13: [2022-11-26 09:13:12,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 13: [2022-11-26 09:13:12,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 13: [2022-11-26 09:13:12,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 13: [2022-11-26 09:13:12,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 13: [2022-11-26 09:13:12,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 13: [2022-11-26 09:13:12,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 13: [2022-11-26 09:13:12,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 1: [2022-11-26 09:13:12,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:13:12,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 09:13:12,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 1: [2022-11-26 09:13:12,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:13:12,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
0: [2022-11-26 09:13:12,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 09:13:12,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 1: [2022-11-26 09:13:12,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 09:13:12,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 09:13:12,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 1: [2022-11-26 09:13:12,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 6: [2022-11-26 09:13:12,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:13:12,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step45000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 09:13:12,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step45000 is ready now! 
0: successfully saved checkpoint at iteration 45000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3522.55
15: iteration 45010/ 125429 | consumed samples: 11522560 | consumed tokens: 23598202880 | elapsed time per iteration (s): 1.43 | learning rate: 1.503E-04 | global batch size: 256 | lm loss: 2.027140E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 178.424 | TFLOPs: 29.49 |
15: iteration 45020/ 125429 | consumed samples: 11525120 | consumed tokens: 23603445760 | elapsed time per iteration (s): 1.18 | learning rate: 1.502E-04 | global batch size: 256 | lm loss: 2.025764E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.522 | TFLOPs: 35.78 |
15: iteration 45030/ 125429 | consumed samples: 11527680 | consumed tokens: 23608688640 | elapsed time per iteration (s): 1.07 | learning rate: 1.502E-04 | global batch size: 256 | lm loss: 1.993304E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.918 | TFLOPs: 39.48 |
15: iteration 45040/ 125429 | consumed samples: 11530240 | consumed tokens: 23613931520 | elapsed time per iteration (s): 1.14 | learning rate: 1.502E-04 | global batch size: 256 | lm loss: 2.046853E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 224.715 | TFLOPs: 37.14 |
15: iteration 45050/ 125429 | consumed samples: 11532800 | consumed tokens: 23619174400 | elapsed time per iteration (s): 1.03 | learning rate: 1.502E-04 | global batch size: 256 | lm loss: 2.045080E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.678 | TFLOPs: 40.93 |
15: iteration 45060/ 125429 | consumed samples: 11535360 | consumed tokens: 23624417280 | elapsed time per iteration (s): 1.04 | learning rate: 1.502E-04 | global batch size: 256 | lm loss: 2.032083E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.244 | TFLOPs: 40.69 |
15: iteration 45070/ 125429 | consumed samples: 11537920 | consumed tokens: 23629660160 | elapsed time per iteration (s): 1.06 | learning rate: 1.501E-04 | global batch size: 256 | lm loss: 2.011240E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.976 | TFLOPs: 39.99 |
15: iteration 45080/ 125429 | consumed samples: 11540480 | consumed tokens: 23634903040 | elapsed time per iteration (s): 1.06 | learning rate: 1.501E-04 | global batch size: 256 | lm loss: 2.031102E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.772 | TFLOPs: 39.95 |
15: iteration 45090/ 125429 | consumed samples: 11543040 | consumed tokens: 23640145920 | elapsed time per iteration (s): 1.04 | learning rate: 1.501E-04 | global batch size: 256 | lm loss: 2.030831E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.229 | TFLOPs: 40.69 |
15: iteration 45100/ 125429 | consumed samples: 11545600 | consumed tokens: 23645388800 | elapsed time per iteration (s): 1.03 | learning rate: 1.501E-04 | global batch size: 256 | lm loss: 2.030441E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.486 | TFLOPs: 40.90 |
15: iteration 45110/ 125429 | consumed samples: 11548160 | consumed tokens: 23650631680 | elapsed time per iteration (s): 1.03 | learning rate: 1.501E-04 | global batch size: 256 | lm loss: 2.055601E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.468 | TFLOPs: 40.90 |
15: iteration 45120/ 125429 | consumed samples: 11550720 | consumed tokens: 23655874560 | elapsed time per iteration (s): 1.15 | learning rate: 1.500E-04 | global batch size: 256 | lm loss: 2.033437E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.789 | TFLOPs: 36.82 |
15: iteration 45130/ 125429 | consumed samples: 11553280 | consumed tokens: 23661117440 | elapsed time per iteration (s): 1.08 | learning rate: 1.500E-04 | global batch size: 256 | lm loss: 2.058187E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.932 | TFLOPs: 39.15 |
15: iteration 45140/ 125429 | consumed samples: 11555840 | consumed tokens: 23666360320 | elapsed time per iteration (s): 1.18 | learning rate: 1.500E-04 | global batch size: 256 | lm loss: 2.021495E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.749 | TFLOPs: 35.98 |
15: iteration 45150/ 125429 | consumed samples: 11558400 | consumed tokens: 23671603200 | elapsed time per iteration (s): 1.05 | learning rate: 1.500E-04 | global batch size: 256 | lm loss: 2.046145E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.837 | TFLOPs: 40.46 |
15: iteration 45160/ 125429 | consumed samples: 11560960 | consumed tokens: 23676846080 | elapsed time per iteration (s): 1.07 | learning rate: 1.500E-04 | global batch size: 256 | lm loss: 2.023666E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.875 | TFLOPs: 39.48 |
15: iteration 45170/ 125429 | consumed samples: 11563520 | consumed tokens: 23682088960 | elapsed time per iteration (s): 1.12 | learning rate: 1.499E-04 | global batch size: 256 | lm loss: 2.027052E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.702 | TFLOPs: 37.63 |
15: iteration 45180/ 125429 | consumed samples: 11566080 | consumed tokens: 23687331840 | elapsed time per iteration (s): 1.24 | learning rate: 1.499E-04 | global batch size: 256 | lm loss: 2.040238E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 206.840 | TFLOPs: 34.18 |
15: iteration 45190/ 125429 | consumed samples: 11568640 | consumed tokens: 23692574720 | elapsed time per iteration (s): 1.13 | learning rate: 1.499E-04 | global batch size: 256 | lm loss: 2.018041E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.076 | TFLOPs: 37.36 |
15: iteration 45200/ 125429 | consumed samples: 11571200 | consumed tokens: 23697817600 | elapsed time per iteration (s): 1.16 | learning rate: 1.499E-04 | global batch size: 256 | lm loss: 2.025842E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.091 | TFLOPs: 36.54 |
15: iteration 45210/ 125429 | consumed samples: 11573760 | consumed tokens: 23703060480 | elapsed time per iteration (s): 1.13 | learning rate: 1.499E-04 | global batch size: 256 | lm loss: 2.006500E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.887 | TFLOPs: 37.49 |
15: iteration 45220/ 125429 | consumed samples: 11576320 | consumed tokens: 23708303360 | elapsed time per iteration (s): 1.13 | learning rate: 1.498E-04 | global batch size: 256 | lm loss: 2.048761E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.545 | TFLOPs: 37.60 |
15: iteration 45230/ 125429 | consumed samples: 11578880 | consumed tokens: 23713546240 | elapsed time per iteration (s): 1.04 | learning rate: 1.498E-04 | global batch size: 256 | lm loss: 2.034697E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.966 | TFLOPs: 40.81 |
15: iteration 45240/ 125429 | consumed samples: 11581440 | consumed tokens: 23718789120 | elapsed time per iteration (s): 1.07 | learning rate: 1.498E-04 | global batch size: 256 | lm loss: 2.020286E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.766 | TFLOPs: 39.46 |
15: iteration 45250/ 125429 | consumed samples: 11584000 | consumed tokens: 23724032000 | elapsed time per iteration (s): 1.14 | learning rate: 1.498E-04 | global batch size: 256 | lm loss: 2.014060E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.402 | TFLOPs: 37.25 |
15: iteration 45260/ 125429 | consumed samples: 11586560 | consumed tokens: 23729274880 | elapsed time per iteration (s): 1.04 | learning rate: 1.498E-04 | global batch size: 256 | lm loss: 2.030466E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.161 | TFLOPs: 40.51 |
15: iteration 45270/ 125429 | consumed samples: 11589120 | consumed tokens: 23734517760 | elapsed time per iteration (s): 1.04 | learning rate: 1.497E-04 | global batch size: 256 | lm loss: 2.052206E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.003 | TFLOPs: 40.49 |
15: iteration 45280/ 125429 | consumed samples: 11591680 | consumed tokens: 23739760640 | elapsed time per iteration (s): 1.03 | learning rate: 1.497E-04 | global batch size: 256 | lm loss: 2.014378E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.385 | TFLOPs: 40.88 |
15: iteration 45290/ 125429 | consumed samples: 11594240 | consumed tokens: 23745003520 | elapsed time per iteration (s): 1.09 | learning rate: 1.497E-04 | global batch size: 256 | lm loss: 2.048816E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.235 | TFLOPs: 38.87 |
15: iteration 45300/ 125429 | consumed samples: 11596800 | consumed tokens: 23750246400 | elapsed time per iteration (s): 1.03 | learning rate: 1.497E-04 | global batch size: 256 | lm loss: 2.042970E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.216 | TFLOPs: 41.02 |
15: iteration 45310/ 125429 | consumed samples: 11599360 | consumed tokens: 23755489280 | elapsed time per iteration (s): 1.04 | learning rate: 1.496E-04 | global batch size: 256 | lm loss: 1.995556E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.810 | TFLOPs: 40.79 |
15: iteration 45320/ 125429 | consumed samples: 11601920 | consumed tokens: 23760732160 | elapsed time per iteration (s): 1.05 | learning rate: 1.496E-04 | global batch size: 256 | lm loss: 2.019266E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.349 | TFLOPs: 40.38 |
15: iteration 45330/ 125429 | consumed samples: 11604480 | consumed tokens: 23765975040 | elapsed time per iteration (s): 1.06 | learning rate: 1.496E-04 | global batch size: 256 | lm loss: 2.026013E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.264 | TFLOPs: 40.04 |
15: iteration 45340/ 125429 | consumed samples: 11607040 | consumed tokens: 23771217920 | elapsed time per iteration (s): 1.05 | learning rate: 1.496E-04 | global batch size: 256 | lm loss: 2.032132E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 243.041 | TFLOPs: 40.16 | 15: iteration 45350/ 125429 | consumed samples: 11609600 | consumed tokens: 23776460800 | elapsed time per iteration (s): 1.07 | learning rate: 1.496E-04 | global batch size: 256 | lm loss: 2.012125E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.912 | TFLOPs: 39.48 | 15: iteration 45360/ 125429 | consumed samples: 11612160 | consumed tokens: 23781703680 | elapsed time per iteration (s): 1.07 | learning rate: 1.495E-04 | global batch size: 256 | lm loss: 2.031728E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.721 | TFLOPs: 39.62 | 15: iteration 45370/ 125429 | consumed samples: 11614720 | consumed tokens: 23786946560 | elapsed time per iteration (s): 1.05 | learning rate: 1.495E-04 | global batch size: 256 | lm loss: 2.019396E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.909 | TFLOPs: 40.14 | 15: iteration 45380/ 125429 | consumed samples: 11617280 | consumed tokens: 23792189440 | elapsed time per iteration (s): 1.04 | learning rate: 1.495E-04 | global batch size: 256 | lm loss: 2.057124E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.368 | TFLOPs: 40.71 | 15: iteration 45390/ 125429 | consumed samples: 11619840 | consumed tokens: 23797432320 | elapsed time per iteration (s): 1.05 | learning rate: 1.495E-04 | global batch size: 256 | lm loss: 2.084586E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.889 | TFLOPs: 40.47 | 15: iteration 45400/ 125429 | consumed samples: 11622400 | consumed tokens: 23802675200 | elapsed time per iteration (s): 1.06 | learning rate: 1.495E-04 | global batch 
size: 256 | lm loss: 2.025338E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.456 | TFLOPs: 40.07 | 15: iteration 45410/ 125429 | consumed samples: 11624960 | consumed tokens: 23807918080 | elapsed time per iteration (s): 1.03 | learning rate: 1.494E-04 | global batch size: 256 | lm loss: 2.019365E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.687 | TFLOPs: 41.26 | 15: iteration 45420/ 125429 | consumed samples: 11627520 | consumed tokens: 23813160960 | elapsed time per iteration (s): 1.07 | learning rate: 1.494E-04 | global batch size: 256 | lm loss: 2.034573E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.789 | TFLOPs: 39.63 | 15: iteration 45430/ 125429 | consumed samples: 11630080 | consumed tokens: 23818403840 | elapsed time per iteration (s): 1.06 | learning rate: 1.494E-04 | global batch size: 256 | lm loss: 2.018885E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.110 | TFLOPs: 39.85 | 15: iteration 45440/ 125429 | consumed samples: 11632640 | consumed tokens: 23823646720 | elapsed time per iteration (s): 1.27 | learning rate: 1.494E-04 | global batch size: 256 | lm loss: 2.026022E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 201.188 | TFLOPs: 33.25 | 15: iteration 45450/ 125429 | consumed samples: 11635200 | consumed tokens: 23828889600 | elapsed time per iteration (s): 1.07 | learning rate: 1.494E-04 | global batch size: 256 | lm loss: 2.071383E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.162 | TFLOPs: 39.52 | 15: iteration 45460/ 125429 | consumed samples: 11637760 | 
consumed tokens: 23834132480 | elapsed time per iteration (s): 1.06 | learning rate: 1.493E-04 | global batch size: 256 | lm loss: 2.019565E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.531 | TFLOPs: 39.91 | 15: iteration 45470/ 125429 | consumed samples: 11640320 | consumed tokens: 23839375360 | elapsed time per iteration (s): 1.05 | learning rate: 1.493E-04 | global batch size: 256 | lm loss: 2.015944E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.606 | TFLOPs: 40.42 | 15: iteration 45480/ 125429 | consumed samples: 11642880 | consumed tokens: 23844618240 | elapsed time per iteration (s): 1.08 | learning rate: 1.493E-04 | global batch size: 256 | lm loss: 2.064648E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.635 | TFLOPs: 39.11 | 15: iteration 45490/ 125429 | consumed samples: 11645440 | consumed tokens: 23849861120 | elapsed time per iteration (s): 1.06 | learning rate: 1.493E-04 | global batch size: 256 | lm loss: 2.029102E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.370 | TFLOPs: 40.05 | 15: iteration 45500/ 125429 | consumed samples: 11648000 | consumed tokens: 23855104000 | elapsed time per iteration (s): 1.06 | learning rate: 1.493E-04 | global batch size: 256 | lm loss: 2.032214E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.288 | TFLOPs: 39.87 | 15: iteration 45510/ 125429 | consumed samples: 11650560 | consumed tokens: 23860346880 | elapsed time per iteration (s): 1.09 | learning rate: 1.492E-04 | global batch size: 256 | lm loss: 2.040997E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 234.477 | TFLOPs: 38.75 | 15: iteration 45520/ 125429 | consumed samples: 11653120 | consumed tokens: 23865589760 | elapsed time per iteration (s): 1.06 | learning rate: 1.492E-04 | global batch size: 256 | lm loss: 2.035674E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.375 | TFLOPs: 39.89 | 15: iteration 45530/ 125429 | consumed samples: 11655680 | consumed tokens: 23870832640 | elapsed time per iteration (s): 1.09 | learning rate: 1.492E-04 | global batch size: 256 | lm loss: 2.022523E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.587 | TFLOPs: 38.77 | 15: iteration 45540/ 125429 | consumed samples: 11658240 | consumed tokens: 23876075520 | elapsed time per iteration (s): 1.07 | learning rate: 1.492E-04 | global batch size: 256 | lm loss: 2.007357E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.431 | TFLOPs: 39.57 | 15: iteration 45550/ 125429 | consumed samples: 11660800 | consumed tokens: 23881318400 | elapsed time per iteration (s): 1.04 | learning rate: 1.492E-04 | global batch size: 256 | lm loss: 2.025145E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.151 | TFLOPs: 40.84 | 15: iteration 45560/ 125429 | consumed samples: 11663360 | consumed tokens: 23886561280 | elapsed time per iteration (s): 1.06 | learning rate: 1.491E-04 | global batch size: 256 | lm loss: 2.056208E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.038 | TFLOPs: 40.00 | 15: iteration 45570/ 125429 | consumed samples: 11665920 | consumed tokens: 23891804160 | elapsed time per iteration (s): 1.03 | learning rate: 1.491E-04 | global batch size: 256 | lm loss: 
2.027734E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.562 | TFLOPs: 41.24 | 15: iteration 45580/ 125429 | consumed samples: 11668480 | consumed tokens: 23897047040 | elapsed time per iteration (s): 1.09 | learning rate: 1.491E-04 | global batch size: 256 | lm loss: 2.048976E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.792 | TFLOPs: 38.80 | 15: iteration 45590/ 125429 | consumed samples: 11671040 | consumed tokens: 23902289920 | elapsed time per iteration (s): 1.07 | learning rate: 1.491E-04 | global batch size: 256 | lm loss: 2.033178E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.949 | TFLOPs: 39.65 | 15: iteration 45600/ 125429 | consumed samples: 11673600 | consumed tokens: 23907532800 | elapsed time per iteration (s): 1.05 | learning rate: 1.491E-04 | global batch size: 256 | lm loss: 2.024377E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.947 | TFLOPs: 40.48 | 15: iteration 45610/ 125429 | consumed samples: 11676160 | consumed tokens: 23912775680 | elapsed time per iteration (s): 1.03 | learning rate: 1.490E-04 | global batch size: 256 | lm loss: 2.017756E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.871 | TFLOPs: 41.13 | 15: iteration 45620/ 125429 | consumed samples: 11678720 | consumed tokens: 23918018560 | elapsed time per iteration (s): 1.05 | learning rate: 1.490E-04 | global batch size: 256 | lm loss: 2.007781E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.741 | TFLOPs: 40.28 | 15: iteration 45630/ 125429 | consumed samples: 11681280 | consumed tokens: 
23923261440 | elapsed time per iteration (s): 1.10 | learning rate: 1.490E-04 | global batch size: 256 | lm loss: 2.019240E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.895 | TFLOPs: 38.32 | 15: iteration 45640/ 125429 | consumed samples: 11683840 | consumed tokens: 23928504320 | elapsed time per iteration (s): 1.02 | learning rate: 1.490E-04 | global batch size: 256 | lm loss: 2.057455E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.061 | TFLOPs: 41.49 | 15: iteration 45650/ 125429 | consumed samples: 11686400 | consumed tokens: 23933747200 | elapsed time per iteration (s): 1.04 | learning rate: 1.490E-04 | global batch size: 256 | lm loss: 2.022009E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.012 | TFLOPs: 40.66 | 15: iteration 45660/ 125429 | consumed samples: 11688960 | consumed tokens: 23938990080 | elapsed time per iteration (s): 1.06 | learning rate: 1.489E-04 | global batch size: 256 | lm loss: 2.040638E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.547 | TFLOPs: 39.75 | 15: iteration 45670/ 125429 | consumed samples: 11691520 | consumed tokens: 23944232960 | elapsed time per iteration (s): 1.03 | learning rate: 1.489E-04 | global batch size: 256 | lm loss: 2.039117E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.917 | TFLOPs: 41.14 | 15: iteration 45680/ 125429 | consumed samples: 11694080 | consumed tokens: 23949475840 | elapsed time per iteration (s): 1.03 | learning rate: 1.489E-04 | global batch size: 256 | lm loss: 1.995300E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 248.900 | TFLOPs: 41.13 | 15: iteration 45690/ 125429 | consumed samples: 11696640 | consumed tokens: 23954718720 | elapsed time per iteration (s): 1.05 | learning rate: 1.489E-04 | global batch size: 256 | lm loss: 2.047794E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.101 | TFLOPs: 40.17 | 15: iteration 45700/ 125429 | consumed samples: 11699200 | consumed tokens: 23959961600 | elapsed time per iteration (s): 1.06 | learning rate: 1.488E-04 | global batch size: 256 | lm loss: 2.037210E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.550 | TFLOPs: 39.92 | 15: iteration 45710/ 125429 | consumed samples: 11701760 | consumed tokens: 23965204480 | elapsed time per iteration (s): 1.03 | learning rate: 1.488E-04 | global batch size: 256 | lm loss: 1.996147E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.577 | TFLOPs: 40.91 | 15: iteration 45720/ 125429 | consumed samples: 11704320 | consumed tokens: 23970447360 | elapsed time per iteration (s): 1.08 | learning rate: 1.488E-04 | global batch size: 256 | lm loss: 2.013659E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.365 | TFLOPs: 39.06 | 15: iteration 45730/ 125429 | consumed samples: 11706880 | consumed tokens: 23975690240 | elapsed time per iteration (s): 1.05 | learning rate: 1.488E-04 | global batch size: 256 | lm loss: 2.006121E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.623 | TFLOPs: 40.43 | 15: iteration 45740/ 125429 | consumed samples: 11709440 | consumed tokens: 23980933120 | elapsed time per iteration (s): 1.07 | learning rate: 1.488E-04 | global batch size: 256 | lm loss: 2.013837E+00 | grad 
norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.904 | TFLOPs: 39.65 | 15: iteration 45750/ 125429 | consumed samples: 11712000 | consumed tokens: 23986176000 | elapsed time per iteration (s): 1.02 | learning rate: 1.487E-04 | global batch size: 256 | lm loss: 2.039628E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.150 | TFLOPs: 41.50 | 15: iteration 45760/ 125429 | consumed samples: 11714560 | consumed tokens: 23991418880 | elapsed time per iteration (s): 1.05 | learning rate: 1.487E-04 | global batch size: 256 | lm loss: 2.026986E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.064 | TFLOPs: 40.33 | 15: iteration 45770/ 125429 | consumed samples: 11717120 | consumed tokens: 23996661760 | elapsed time per iteration (s): 1.03 | learning rate: 1.487E-04 | global batch size: 256 | lm loss: 2.014070E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.546 | TFLOPs: 40.91 | 15: iteration 45780/ 125429 | consumed samples: 11719680 | consumed tokens: 24001904640 | elapsed time per iteration (s): 1.10 | learning rate: 1.487E-04 | global batch size: 256 | lm loss: 2.058937E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.722 | TFLOPs: 38.46 | 15: iteration 45790/ 125429 | consumed samples: 11722240 | consumed tokens: 24007147520 | elapsed time per iteration (s): 1.05 | learning rate: 1.487E-04 | global batch size: 256 | lm loss: 2.055960E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.209 | TFLOPs: 40.19 | 15: iteration 45800/ 125429 | consumed samples: 11724800 | consumed tokens: 24012390400 | elapsed time 
per iteration (s): 1.05 | learning rate: 1.486E-04 | global batch size: 256 | lm loss: 2.045601E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.592 | TFLOPs: 40.42 | 15: iteration 45810/ 125429 | consumed samples: 11727360 | consumed tokens: 24017633280 | elapsed time per iteration (s): 1.04 | learning rate: 1.486E-04 | global batch size: 256 | lm loss: 2.018044E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.979 | TFLOPs: 40.48 | 15: iteration 45820/ 125429 | consumed samples: 11729920 | consumed tokens: 24022876160 | elapsed time per iteration (s): 1.04 | learning rate: 1.486E-04 | global batch size: 256 | lm loss: 2.032356E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.952 | TFLOPs: 40.65 | 15: iteration 45830/ 125429 | consumed samples: 11732480 | consumed tokens: 24028119040 | elapsed time per iteration (s): 1.08 | learning rate: 1.486E-04 | global batch size: 256 | lm loss: 2.045601E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.620 | TFLOPs: 39.10 | 15: iteration 45840/ 125429 | consumed samples: 11735040 | consumed tokens: 24033361920 | elapsed time per iteration (s): 1.05 | learning rate: 1.486E-04 | global batch size: 256 | lm loss: 2.016924E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.789 | TFLOPs: 40.12 | 15: iteration 45850/ 125429 | consumed samples: 11737600 | consumed tokens: 24038604800 | elapsed time per iteration (s): 1.07 | learning rate: 1.485E-04 | global batch size: 256 | lm loss: 2.041687E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.291 | TFLOPs: 
39.38 | 15: iteration 45860/ 125429 | consumed samples: 11740160 | consumed tokens: 24043847680 | elapsed time per iteration (s): 1.10 | learning rate: 1.485E-04 | global batch size: 256 | lm loss: 2.009524E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.823 | TFLOPs: 38.31 | 15: iteration 45870/ 125429 | consumed samples: 11742720 | consumed tokens: 24049090560 | elapsed time per iteration (s): 1.04 | learning rate: 1.485E-04 | global batch size: 256 | lm loss: 2.031366E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.510 | TFLOPs: 40.57 | 15: iteration 45880/ 125429 | consumed samples: 11745280 | consumed tokens: 24054333440 | elapsed time per iteration (s): 1.03 | learning rate: 1.485E-04 | global batch size: 256 | lm loss: 2.037662E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.712 | TFLOPs: 41.27 | 15: iteration 45890/ 125429 | consumed samples: 11747840 | consumed tokens: 24059576320 | elapsed time per iteration (s): 1.03 | learning rate: 1.485E-04 | global batch size: 256 | lm loss: 2.066569E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.972 | TFLOPs: 40.98 | 15: iteration 45900/ 125429 | consumed samples: 11750400 | consumed tokens: 24064819200 | elapsed time per iteration (s): 1.04 | learning rate: 1.484E-04 | global batch size: 256 | lm loss: 2.040146E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.563 | TFLOPs: 40.58 | 15: iteration 45910/ 125429 | consumed samples: 11752960 | consumed tokens: 24070062080 | elapsed time per iteration (s): 1.06 | learning rate: 1.484E-04 | global batch size: 256 | lm loss: 2.043488E+00 | grad norm: 0.136 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.422 | TFLOPs: 39.90 | 15: iteration 45920/ 125429 | consumed samples: 11755520 | consumed tokens: 24075304960 | elapsed time per iteration (s): 1.08 | learning rate: 1.484E-04 | global batch size: 256 | lm loss: 2.031415E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.039 | TFLOPs: 39.34 | 15: iteration 45930/ 125429 | consumed samples: 11758080 | consumed tokens: 24080547840 | elapsed time per iteration (s): 1.05 | learning rate: 1.484E-04 | global batch size: 256 | lm loss: 2.024165E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.840 | TFLOPs: 40.46 | 15: iteration 45940/ 125429 | consumed samples: 11760640 | consumed tokens: 24085790720 | elapsed time per iteration (s): 1.10 | learning rate: 1.484E-04 | global batch size: 256 | lm loss: 2.036046E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.273 | TFLOPs: 38.55 | 15: iteration 45950/ 125429 | consumed samples: 11763200 | consumed tokens: 24091033600 | elapsed time per iteration (s): 1.03 | learning rate: 1.483E-04 | global batch size: 256 | lm loss: 2.041870E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.590 | TFLOPs: 40.92 | 15: iteration 45960/ 125429 | consumed samples: 11765760 | consumed tokens: 24096276480 | elapsed time per iteration (s): 1.08 | learning rate: 1.483E-04 | global batch size: 256 | lm loss: 2.041193E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.799 | TFLOPs: 39.13 | 15: iteration 45970/ 125429 | consumed samples: 11768320 | consumed tokens: 24101519360 | elapsed time per iteration (s): 1.10 | 
learning rate: 1.483E-04 | global batch size: 256 | lm loss: 2.016197E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.654 | TFLOPs: 38.45 | 15: iteration 45980/ 125429 | consumed samples: 11770880 | consumed tokens: 24106762240 | elapsed time per iteration (s): 1.10 | learning rate: 1.483E-04 | global batch size: 256 | lm loss: 2.029878E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.566 | TFLOPs: 38.60 | 15: iteration 45990/ 125429 | consumed samples: 11773440 | consumed tokens: 24112005120 | elapsed time per iteration (s): 1.07 | learning rate: 1.483E-04 | global batch size: 256 | lm loss: 2.024808E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.130 | TFLOPs: 39.52 | 0: [2022-11-26 09:31:01,257] [INFO] [logging.py:68:log_dist] [Rank 0] step=46000, skipped=0, lr=[0.00014823194670481252, 0.00014823194670481252, 0.00014823194670481252], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 46000/ 125429 | consumed samples: 11776000 | consumed tokens: 24117248000 | elapsed time per iteration (s): 1.04 | learning rate: 1.482E-04 | global batch size: 256 | lm loss: 2.057278E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.399 | TFLOPs: 40.72 | 0: steps: 46000 loss: 2.1614 iter time (s): 1.056 samples/sec: 242.351 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 46000 | lm loss value: 2.036188E+00 | lm loss PPL: 7.661351E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 46000 to checkpoints_1b5 0: [2022-11-26 09:31:01,607] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step46000 is begin to save! 0: [2022-11-26 09:31:01,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_01-model_00-model_states.pt... 0: [2022-11-26 09:31:01,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_01-model_00-model_states.pt. 0: [2022-11-26 09:31:01,911] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_03-model_00-model_states.pt... 0: [2022-11-26 09:31:02,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_03-model_00-model_states.pt. 0: [2022-11-26 09:31:02,014] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_04-model_00-model_states.pt... 0: [2022-11-26 09:31:02,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_04-model_00-model_states.pt. 0: [2022-11-26 09:31:02,128] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_05-model_00-model_states.pt... 0: [2022-11-26 09:31:02,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_05-model_00-model_states.pt. 0: [2022-11-26 09:31:02,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_06-model_00-model_states.pt... 0: [2022-11-26 09:31:02,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_06-model_00-model_states.pt. 0: [2022-11-26 09:31:02,347] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_07-model_00-model_states.pt... 0: [2022-11-26 09:31:02,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 09:31:02,458] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_08-model_00-model_states.pt... 0: [2022-11-26 09:31:02,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_08-model_00-model_states.pt. 0: [2022-11-26 09:31:02,569] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_09-model_00-model_states.pt... 0: [2022-11-26 09:31:02,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_09-model_00-model_states.pt. 0: [2022-11-26 09:31:02,682] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_10-model_00-model_states.pt... 0: [2022-11-26 09:31:02,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_10-model_00-model_states.pt. 0: [2022-11-26 09:31:02,795] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_11-model_00-model_states.pt... 0: [2022-11-26 09:31:02,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_11-model_00-model_states.pt. 0: [2022-11-26 09:31:02,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_12-model_00-model_states.pt... 0: [2022-11-26 09:31:03,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_12-model_00-model_states.pt. 0: [2022-11-26 09:31:03,017] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_13-model_00-model_states.pt... 0: [2022-11-26 09:31:03,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 09:31:03,128] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_14-model_00-model_states.pt... 0: [2022-11-26 09:31:03,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_14-model_00-model_states.pt. 0: [2022-11-26 09:31:03,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_15-model_00-model_states.pt... 0: [2022-11-26 09:31:03,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_15-model_00-model_states.pt. 0: [2022-11-26 09:31:03,339] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_16-model_00-model_states.pt... 0: [2022-11-26 09:31:03,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_16-model_00-model_states.pt. 0: [2022-11-26 09:31:03,441] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_17-model_00-model_states.pt... 0: [2022-11-26 09:31:03,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_17-model_00-model_states.pt. 0: [2022-11-26 09:31:03,544] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_18-model_00-model_states.pt... 0: [2022-11-26 09:31:03,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_18-model_00-model_states.pt. 0: [2022-11-26 09:31:03,651] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_19-model_00-model_states.pt... 0: [2022-11-26 09:31:03,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 09:31:03,752] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_20-model_00-model_states.pt... 0: [2022-11-26 09:31:03,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_20-model_00-model_states.pt. 0: [2022-11-26 09:31:03,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_21-model_00-model_states.pt... 0: [2022-11-26 09:31:03,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_21-model_00-model_states.pt. 0: [2022-11-26 09:31:03,965] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_22-model_00-model_states.pt... 0: [2022-11-26 09:31:04,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_22-model_00-model_states.pt. 0: [2022-11-26 09:31:04,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_23-model_00-model_states.pt... 0: [2022-11-26 09:31:04,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_23-model_00-model_states.pt. 0: [2022-11-26 09:31:04,177] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_24-model_00-model_states.pt... 0: [2022-11-26 09:31:04,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_24-model_00-model_states.pt. 0: [2022-11-26 09:31:04,279] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_25-model_00-model_states.pt... 0: [2022-11-26 09:31:04,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 09:31:04,387] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_26-model_00-model_states.pt... 0: [2022-11-26 09:31:04,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_26-model_00-model_states.pt. 0: [2022-11-26 09:31:04,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_27-model_00-model_states.pt... 0: [2022-11-26 09:31:04,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_27-model_00-model_states.pt. 0: [2022-11-26 09:31:04,597] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_28-model_00-model_states.pt... 0: [2022-11-26 09:31:04,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_28-model_00-model_states.pt. 0: [2022-11-26 09:31:04,699] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_29-model_00-model_states.pt... 0: [2022-11-26 09:31:04,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_29-model_00-model_states.pt. 0: [2022-11-26 09:31:04,803] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_30-model_00-model_states.pt... 0: [2022-11-26 09:31:04,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_30-model_00-model_states.pt. 0: [2022-11-26 09:31:04,908] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/layer_32-model_00-model_states.pt... 0: [2022-11-26 09:31:04,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 09:31:04,913] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step46000/mp_rank_00_model_states.pt 0: [2022-11-26 09:31:04,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/mp_rank_00_model_states.pt... 0: [2022-11-26 09:31:04,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/mp_rank_00_model_states.pt. 0: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
14: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
8: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
1: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
7: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
3: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
11: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
13: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:31:04,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step46000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:31:05,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:31:05,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 09:31:05,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 09:31:05,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 0: [2022-11-26 09:31:05,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:31:05,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 09:31:05,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 0: [2022-11-26 09:31:05,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:31:05,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 09:31:05,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 7: [2022-11-26 09:31:05,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:31:05,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 09:31:05,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 09:31:05,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 09:31:05,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 7: [2022-11-26 09:31:05,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 3: [2022-11-26 09:31:05,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:31:05,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 09:31:05,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 1: [2022-11-26 09:31:05,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:31:05,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:31:05,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:31:05,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 09:31:05,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
7: [2022-11-26 09:31:05,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:31:05,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:31:05,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 09:31:05,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 09:31:05,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 7: [2022-11-26 09:31:05,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 4: [2022-11-26 09:31:05,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:31:05,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:31:05,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 09:31:05,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 09:31:05,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 4: [2022-11-26 09:31:05,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
11: [2022-11-26 09:31:05,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:31:05,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 09:31:05,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 13: [2022-11-26 09:31:05,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 09:31:05,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 13: [2022-11-26 09:31:05,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:31:05,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:31:05,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 09:31:05,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 09:31:05,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 13: [2022-11-26 09:31:05,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 4: [2022-11-26 09:31:05,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 09:31:05,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 09:31:05,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 4: [2022-11-26 09:31:05,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:31:05,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 09:31:05,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 3: [2022-11-26 09:31:05,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:31:05,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 09:31:05,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 3: [2022-11-26 09:31:05,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:31:05,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:31:05,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 09:31:05,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
3: [2022-11-26 09:31:05,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 09:31:05,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 13: [2022-11-26 09:31:05,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:31:05,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 09:31:05,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 3: [2022-11-26 09:31:05,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:31:05,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 09:31:05,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 9: [2022-11-26 09:31:05,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:31:05,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 09:31:05,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 09:31:05,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 09:31:05,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 9: [2022-11-26 09:31:05,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 10: [2022-11-26 09:31:05,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:31:05,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 09:31:05,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 4: [2022-11-26 09:31:05,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:31:05,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 5: [2022-11-26 09:31:05,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:31:05,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
5: [2022-11-26 09:31:05,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 09:31:05,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 5: [2022-11-26 09:31:05,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:31:05,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 09:31:05,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 0: [2022-11-26 09:31:05,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:31:05,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 09:31:05,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 0: [2022-11-26 09:31:05,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:31:05,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 09:31:05,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 13: [2022-11-26 09:31:05,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 09:31:05,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 09:31:05,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 7: [2022-11-26 09:31:05,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:31:05,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:31:05,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 09:31:05,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 7: [2022-11-26 09:31:05,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 09:31:05,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 10: [2022-11-26 09:31:05,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:31:05,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:31:05,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 09:31:05,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
5: [2022-11-26 09:31:05,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:31:05,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 09:31:05,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 11: [2022-11-26 09:31:05,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:31:05,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:31:05,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 09:31:05,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 09:31:05,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 11: [2022-11-26 09:31:05,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 0: [2022-11-26 09:31:05,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:31:05,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 09:31:05,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
10: [2022-11-26 09:31:05,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 09:31:05,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 10: [2022-11-26 09:31:05,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:31:05,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 09:31:05,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 10: [2022-11-26 09:31:05,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:31:05,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 09:31:05,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 9: [2022-11-26 09:31:05,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:31:05,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
9: [2022-11-26 09:31:05,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 3: [2022-11-26 09:31:05,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 9: [2022-11-26 09:31:05,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 3: [2022-11-26 09:31:05,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 9: [2022-11-26 09:31:05,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:31:05,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 09:31:05,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 10: [2022-11-26 09:31:05,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:31:05,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 09:31:05,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 1: [2022-11-26 09:31:05,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 09:31:05,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
7: [2022-11-26 09:31:05,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:31:05,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:31:05,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 09:31:05,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 09:31:05,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 7: [2022-11-26 09:31:05,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 1: [2022-11-26 09:31:05,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:31:05,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 09:31:05,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 1: [2022-11-26 09:31:05,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:31:05,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 09:31:05,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
1: [2022-11-26 09:31:05,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:31:05,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 09:31:05,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 8: [2022-11-26 09:31:05,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:31:05,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 09:31:05,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 8: [2022-11-26 09:31:05,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:31:05,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 09:31:05,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 8: [2022-11-26 09:31:05,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:31:05,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 09:31:05,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
4: [2022-11-26 09:31:05,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:31:05,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 09:31:05,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 5: [2022-11-26 09:31:05,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:31:05,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:31:05,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 09:31:05,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 09:31:05,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 5: [2022-11-26 09:31:05,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 5: [2022-11-26 09:31:05,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:31:05,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 09:31:05,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
13: [2022-11-26 09:31:05,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:31:05,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 09:31:05,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 5: [2022-11-26 09:31:05,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:31:05,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 09:31:05,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 3: [2022-11-26 09:31:05,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:31:05,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 09:31:05,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 1: [2022-11-26 09:31:05,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:31:05,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 09:31:05,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
4: [2022-11-26 09:31:05,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:31:05,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 09:31:05,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 10: [2022-11-26 09:31:05,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:31:05,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:31:05,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:31:05,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:31:05,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 09:31:05,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
10: [2022-11-26 09:31:05,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 09:31:05,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 09:31:05,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 09:31:05,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 10: [2022-11-26 09:31:05,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 10: [2022-11-26 09:31:05,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 1: [2022-11-26 09:31:05,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:31:05,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 09:31:05,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 9: [2022-11-26 09:31:05,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:31:05,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 09:31:05,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
11: [2022-11-26 09:31:05,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:31:05,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:31:05,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 09:31:05,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 8: [2022-11-26 09:31:05,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:31:05,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:31:05,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:31:05,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 09:31:05,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 8: [2022-11-26 09:31:05,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 09:31:05,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 09:31:05,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
8: [2022-11-26 09:31:05,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 9: [2022-11-26 09:31:05,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:31:05,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 09:31:05,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 6: [2022-11-26 09:31:05,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:31:05,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:31:05,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:31:05,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 09:31:05,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 09:31:05,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 6: [2022-11-26 09:31:05,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 09:31:05,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
6: [2022-11-26 09:31:05,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 13: [2022-11-26 09:31:05,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:31:05,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:31:05,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 09:31:05,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 09:31:05,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 13: [2022-11-26 09:31:05,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 6: [2022-11-26 09:31:05,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:31:05,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:31:05,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:31:05,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:31:05,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 09:31:05,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 09:31:05,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 09:31:05,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 09:31:05,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 09:31:05,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 6: [2022-11-26 09:31:05,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 6: [2022-11-26 09:31:05,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 09:31:05,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 6: [2022-11-26 09:31:05,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 6: [2022-11-26 09:31:05,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 11: [2022-11-26 09:31:05,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 09:31:05,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 09:31:05,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 11: [2022-11-26 09:31:05,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 09:31:05,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 11: [2022-11-26 09:31:05,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:31:05,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 09:31:05,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 11: [2022-11-26 09:31:05,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:31:05,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:31:05,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 09:31:05,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 09:31:05,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 11: [2022-11-26 09:31:05,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
1: [2022-11-26 09:31:05,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:31:05,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:31:05,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 09:31:05,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 9: [2022-11-26 09:31:05,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:31:05,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 09:31:05,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 1: [2022-11-26 09:31:05,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 09:31:05,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 1: [2022-11-26 09:31:05,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:31:05,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 09:31:05,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
8: [2022-11-26 09:31:05,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:31:05,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 09:31:05,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 8: [2022-11-26 09:31:05,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:31:05,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 09:31:05,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 0: [2022-11-26 09:31:05,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 09:31:05,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 14: [2022-11-26 09:31:05,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:31:05,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:31:05,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:31:05,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 09:31:05,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 09:31:05,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 09:31:05,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 09:31:05,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 09:31:05,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 14: [2022-11-26 09:31:05,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 14: [2022-11-26 09:31:05,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 14: [2022-11-26 09:31:05,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 14: [2022-11-26 09:31:05,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:31:05,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 09:31:05,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 2: [2022-11-26 09:31:05,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 09:31:05,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:31:05,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:31:05,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:31:05,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:31:05,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:31:05,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:31:05,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 09:31:05,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 09:31:05,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 09:31:05,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 09:31:05,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 09:31:05,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 09:31:05,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 09:31:05,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 09:31:05,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 2: [2022-11-26 09:31:05,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 2: [2022-11-26 09:31:05,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 2: [2022-11-26 09:31:05,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 09:31:05,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
2: [2022-11-26 09:31:05,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 2: [2022-11-26 09:31:05,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 2: [2022-11-26 09:31:05,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 2: [2022-11-26 09:31:05,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 14: [2022-11-26 09:31:05,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:31:05,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 09:31:05,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 14: [2022-11-26 09:31:05,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:31:05,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 09:31:05,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 14: [2022-11-26 09:31:05,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:31:05,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 09:31:05,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
12: [2022-11-26 09:31:05,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:31:05,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:31:05,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:31:05,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:31:05,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:31:05,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:31:05,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:31:05,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 09:31:05,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 09:31:05,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 09:31:05,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 09:31:05,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 09:31:05,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 09:31:05,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 09:31:05,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 09:31:05,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 09:31:05,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 12: [2022-11-26 09:31:05,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 12: [2022-11-26 09:31:05,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 12: [2022-11-26 09:31:05,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
12: [2022-11-26 09:31:05,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 12: [2022-11-26 09:31:05,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 12: [2022-11-26 09:31:05,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 12: [2022-11-26 09:31:05,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 15: [2022-11-26 09:31:05,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:31:05,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:31:05,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:31:05,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 0: successfully saved checkpoint at iteration 46000 to checkpoints_1b5 15: [2022-11-26 09:31:05,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:31:05,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:31:05,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:31:05,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 09:31:05,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 09:31:05,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 09:31:05,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 09:31:05,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 09:31:05,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 09:31:05,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 09:31:05,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 15: [2022-11-26 09:31:05,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 15: [2022-11-26 09:31:05,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 09:31:05,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 15: [2022-11-26 09:31:05,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step46000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 09:31:05,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 
15: [2022-11-26 09:31:05,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 15: [2022-11-26 09:31:05,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 15: [2022-11-26 09:31:05,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 15: [2022-11-26 09:31:05,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step46000 is ready now! 15: time (ms) | save-checkpoint: 3654.28 15: iteration 46010/ 125429 | consumed samples: 11778560 | consumed tokens: 24122490880 | elapsed time per iteration (s): 1.44 | learning rate: 1.482E-04 | global batch size: 256 | lm loss: 2.005749E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.483 | TFLOPs: 29.33 | 15: iteration 46020/ 125429 | consumed samples: 11781120 | consumed tokens: 24127733760 | elapsed time per iteration (s): 1.03 | learning rate: 1.482E-04 | global batch size: 256 | lm loss: 2.028920E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.449 | TFLOPs: 41.06 | 15: iteration 46030/ 125429 | consumed samples: 11783680 | consumed tokens: 24132976640 | elapsed time per iteration (s): 1.05 | learning rate: 1.482E-04 | global batch size: 256 | lm loss: 2.018533E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.723 | TFLOPs: 40.44 | 15: iteration 46040/ 125429 | consumed samples: 11786240 | consumed tokens: 24138219520 | elapsed time per iteration (s): 1.07 | learning rate: 1.481E-04 | global batch size: 256 | lm loss: 2.045680E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.168 | TFLOPs: 39.69 | 15: iteration 46050/ 125429 | consumed samples: 
11788800 | consumed tokens: 24143462400 | elapsed time per iteration (s): 1.09 | learning rate: 1.481E-04 | global batch size: 256 | lm loss: 2.028230E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.862 | TFLOPs: 38.81 | 15: iteration 46060/ 125429 | consumed samples: 11791360 | consumed tokens: 24148705280 | elapsed time per iteration (s): 1.03 | learning rate: 1.481E-04 | global batch size: 256 | lm loss: 2.044910E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.464 | TFLOPs: 41.06 | 15: iteration 46070/ 125429 | consumed samples: 11793920 | consumed tokens: 24153948160 | elapsed time per iteration (s): 1.07 | learning rate: 1.481E-04 | global batch size: 256 | lm loss: 2.000430E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.045 | TFLOPs: 39.67 | 15: iteration 46080/ 125429 | consumed samples: 11796480 | consumed tokens: 24159191040 | elapsed time per iteration (s): 1.07 | learning rate: 1.481E-04 | global batch size: 256 | lm loss: 2.024589E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.173 | TFLOPs: 39.36 | 15: iteration 46090/ 125429 | consumed samples: 11799040 | consumed tokens: 24164433920 | elapsed time per iteration (s): 1.06 | learning rate: 1.480E-04 | global batch size: 256 | lm loss: 2.026439E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.267 | TFLOPs: 40.04 | 15: iteration 46100/ 125429 | consumed samples: 11801600 | consumed tokens: 24169676800 | elapsed time per iteration (s): 1.05 | learning rate: 1.480E-04 | global batch size: 256 | lm loss: 2.030406E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 244.388 | TFLOPs: 40.39 | 15: iteration 46110/ 125429 | consumed samples: 11804160 | consumed tokens: 24174919680 | elapsed time per iteration (s): 1.04 | learning rate: 1.480E-04 | global batch size: 256 | lm loss: 2.025622E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.024 | TFLOPs: 40.49 | 15: iteration 46120/ 125429 | consumed samples: 11806720 | consumed tokens: 24180162560 | elapsed time per iteration (s): 2.27 | learning rate: 1.480E-04 | global batch size: 256 | lm loss: 2.043637E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 112.756 | TFLOPs: 18.63 | 15: iteration 46130/ 125429 | consumed samples: 11809280 | consumed tokens: 24185405440 | elapsed time per iteration (s): 1.05 | learning rate: 1.480E-04 | global batch size: 256 | lm loss: 2.043355E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.821 | TFLOPs: 40.46 | 15: iteration 46140/ 125429 | consumed samples: 11811840 | consumed tokens: 24190648320 | elapsed time per iteration (s): 1.06 | learning rate: 1.479E-04 | global batch size: 256 | lm loss: 2.034147E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.615 | TFLOPs: 39.76 | 15: iteration 46150/ 125429 | consumed samples: 11814400 | consumed tokens: 24195891200 | elapsed time per iteration (s): 1.06 | learning rate: 1.479E-04 | global batch size: 256 | lm loss: 2.024730E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.583 | TFLOPs: 39.76 | 15: iteration 46160/ 125429 | consumed samples: 11816960 | consumed tokens: 24201134080 | elapsed time per iteration (s): 1.04 | learning rate: 1.479E-04 | global batch size: 256 | 
lm loss: 2.027546E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.567 | TFLOPs: 40.58 | 15: iteration 46170/ 125429 | consumed samples: 11819520 | consumed tokens: 24206376960 | elapsed time per iteration (s): 1.05 | learning rate: 1.479E-04 | global batch size: 256 | lm loss: 2.025952E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.561 | TFLOPs: 40.42 | 15: iteration 46180/ 125429 | consumed samples: 11822080 | consumed tokens: 24211619840 | elapsed time per iteration (s): 1.03 | learning rate: 1.479E-04 | global batch size: 256 | lm loss: 2.010182E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.539 | TFLOPs: 40.91 | 15: iteration 46190/ 125429 | consumed samples: 11824640 | consumed tokens: 24216862720 | elapsed time per iteration (s): 1.04 | learning rate: 1.478E-04 | global batch size: 256 | lm loss: 2.028136E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.270 | TFLOPs: 40.53 | 15: iteration 46200/ 125429 | consumed samples: 11827200 | consumed tokens: 24222105600 | elapsed time per iteration (s): 1.03 | learning rate: 1.478E-04 | global batch size: 256 | lm loss: 2.020110E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.675 | TFLOPs: 41.10 | 15: iteration 46210/ 125429 | consumed samples: 11829760 | consumed tokens: 24227348480 | elapsed time per iteration (s): 1.05 | learning rate: 1.478E-04 | global batch size: 256 | lm loss: 2.074905E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.322 | TFLOPs: 40.38 | 15: iteration 46220/ 125429 | consumed samples: 11832320 | consumed 
tokens: 24232591360 | elapsed time per iteration (s): 1.03 | learning rate: 1.478E-04 | global batch size: 256 | lm loss: 2.017731E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.625 | TFLOPs: 40.92 | 15: iteration 46230/ 125429 | consumed samples: 11834880 | consumed tokens: 24237834240 | elapsed time per iteration (s): 1.04 | learning rate: 1.478E-04 | global batch size: 256 | lm loss: 2.013280E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.160 | TFLOPs: 40.68 | 15: iteration 46240/ 125429 | consumed samples: 11837440 | consumed tokens: 24243077120 | elapsed time per iteration (s): 1.07 | learning rate: 1.477E-04 | global batch size: 256 | lm loss: 2.049356E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.985 | TFLOPs: 39.66 | 15: iteration 46250/ 125429 | consumed samples: 11840000 | consumed tokens: 24248320000 | elapsed time per iteration (s): 1.05 | learning rate: 1.477E-04 | global batch size: 256 | lm loss: 2.011786E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.761 | TFLOPs: 40.28 | 15: iteration 46260/ 125429 | consumed samples: 11842560 | consumed tokens: 24253562880 | elapsed time per iteration (s): 1.07 | learning rate: 1.477E-04 | global batch size: 256 | lm loss: 2.027357E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.535 | TFLOPs: 39.42 | 15: iteration 46270/ 125429 | consumed samples: 11845120 | consumed tokens: 24258805760 | elapsed time per iteration (s): 1.05 | learning rate: 1.477E-04 | global batch size: 256 | lm loss: 2.007176E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 244.748 | TFLOPs: 40.45 | 15: iteration 46280/ 125429 | consumed samples: 11847680 | consumed tokens: 24264048640 | elapsed time per iteration (s): 1.03 | learning rate: 1.477E-04 | global batch size: 256 | lm loss: 2.020603E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.684 | TFLOPs: 41.10 | 15: iteration 46290/ 125429 | consumed samples: 11850240 | consumed tokens: 24269291520 | elapsed time per iteration (s): 1.04 | learning rate: 1.476E-04 | global batch size: 256 | lm loss: 2.041019E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.124 | TFLOPs: 40.51 | 15: iteration 46300/ 125429 | consumed samples: 11852800 | consumed tokens: 24274534400 | elapsed time per iteration (s): 1.04 | learning rate: 1.476E-04 | global batch size: 256 | lm loss: 1.976930E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.397 | TFLOPs: 40.55 | 15: iteration 46310/ 125429 | consumed samples: 11855360 | consumed tokens: 24279777280 | elapsed time per iteration (s): 1.05 | learning rate: 1.476E-04 | global batch size: 256 | lm loss: 2.000951E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.415 | TFLOPs: 40.39 | 15: iteration 46320/ 125429 | consumed samples: 11857920 | consumed tokens: 24285020160 | elapsed time per iteration (s): 1.06 | learning rate: 1.476E-04 | global batch size: 256 | lm loss: 2.022418E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.527 | TFLOPs: 39.75 | 15: iteration 46330/ 125429 | consumed samples: 11860480 | consumed tokens: 24290263040 | elapsed time per iteration (s): 1.08 | learning rate: 1.476E-04 | global batch size: 256 | lm loss: 2.025348E+00 | 
grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.979 | TFLOPs: 39.16 | 15: iteration 46340/ 125429 | consumed samples: 11863040 | consumed tokens: 24295505920 | elapsed time per iteration (s): 1.08 | learning rate: 1.475E-04 | global batch size: 256 | lm loss: 2.030022E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.294 | TFLOPs: 39.05 | 15: iteration 46350/ 125429 | consumed samples: 11865600 | consumed tokens: 24300748800 | elapsed time per iteration (s): 1.06 | learning rate: 1.475E-04 | global batch size: 256 | lm loss: 2.032184E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.698 | TFLOPs: 39.78 | 15: iteration 46360/ 125429 | consumed samples: 11868160 | consumed tokens: 24305991680 | elapsed time per iteration (s): 1.05 | learning rate: 1.475E-04 | global batch size: 256 | lm loss: 2.043590E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.607 | TFLOPs: 40.26 | 15: iteration 46370/ 125429 | consumed samples: 11870720 | consumed tokens: 24311234560 | elapsed time per iteration (s): 1.04 | learning rate: 1.475E-04 | global batch size: 256 | lm loss: 2.027624E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.825 | TFLOPs: 40.79 | 15: iteration 46380/ 125429 | consumed samples: 11873280 | consumed tokens: 24316477440 | elapsed time per iteration (s): 1.05 | learning rate: 1.474E-04 | global batch size: 256 | lm loss: 2.013892E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.799 | TFLOPs: 40.29 | 15: iteration 46390/ 125429 | consumed samples: 11875840 | consumed tokens: 24321720320 | elapsed 
time per iteration (s): 1.02 | learning rate: 1.474E-04 | global batch size: 256 | lm loss: 2.022939E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.414 | TFLOPs: 41.38 | 15: iteration 46400/ 125429 | consumed samples: 11878400 | consumed tokens: 24326963200 | elapsed time per iteration (s): 1.05 | learning rate: 1.474E-04 | global batch size: 256 | lm loss: 2.052982E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.840 | TFLOPs: 40.13 | 15: iteration 46410/ 125429 | consumed samples: 11880960 | consumed tokens: 24332206080 | elapsed time per iteration (s): 1.05 | learning rate: 1.474E-04 | global batch size: 256 | lm loss: 2.042276E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.411 | TFLOPs: 40.23 | 15: iteration 46420/ 125429 | consumed samples: 11883520 | consumed tokens: 24337448960 | elapsed time per iteration (s): 1.03 | learning rate: 1.474E-04 | global batch size: 256 | lm loss: 2.016843E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.349 | TFLOPs: 40.88 | 15: iteration 46430/ 125429 | consumed samples: 11886080 | consumed tokens: 24342691840 | elapsed time per iteration (s): 1.03 | learning rate: 1.473E-04 | global batch size: 256 | lm loss: 2.053314E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.528 | TFLOPs: 41.24 | 15: iteration 46440/ 125429 | consumed samples: 11888640 | consumed tokens: 24347934720 | elapsed time per iteration (s): 1.06 | learning rate: 1.473E-04 | global batch size: 256 | lm loss: 2.036553E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.398 | TFLOPs: 
39.89 | 15: iteration 46450/ 125429 | consumed samples: 11891200 | consumed tokens: 24353177600 | elapsed time per iteration (s): 1.04 | learning rate: 1.473E-04 | global batch size: 256 | lm loss: 2.026785E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.738 | TFLOPs: 40.61 | 15: iteration 46460/ 125429 | consumed samples: 11893760 | consumed tokens: 24358420480 | elapsed time per iteration (s): 1.06 | learning rate: 1.473E-04 | global batch size: 256 | lm loss: 2.048243E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.831 | TFLOPs: 39.80 | 15: iteration 46470/ 125429 | consumed samples: 11896320 | consumed tokens: 24363663360 | elapsed time per iteration (s): 1.07 | learning rate: 1.473E-04 | global batch size: 256 | lm loss: 2.064709E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.170 | TFLOPs: 39.36 | 15: iteration 46480/ 125429 | consumed samples: 11898880 | consumed tokens: 24368906240 | elapsed time per iteration (s): 1.04 | learning rate: 1.472E-04 | global batch size: 256 | lm loss: 2.043049E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.199 | TFLOPs: 40.69 | 15: iteration 46490/ 125429 | consumed samples: 11901440 | consumed tokens: 24374149120 | elapsed time per iteration (s): 1.03 | learning rate: 1.472E-04 | global batch size: 256 | lm loss: 2.046822E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.348 | TFLOPs: 40.88 | 15: iteration 46500/ 125429 | consumed samples: 11904000 | consumed tokens: 24379392000 | elapsed time per iteration (s): 1.02 | learning rate: 1.472E-04 | global batch size: 256 | lm loss: 2.032811E+00 | grad norm: 0.129 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.577 | TFLOPs: 41.41 | 15: iteration 46510/ 125429 | consumed samples: 11906560 | consumed tokens: 24384634880 | elapsed time per iteration (s): 1.03 | learning rate: 1.472E-04 | global batch size: 256 | lm loss: 2.027033E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.089 | TFLOPs: 41.00 | 15: iteration 46520/ 125429 | consumed samples: 11909120 | consumed tokens: 24389877760 | elapsed time per iteration (s): 1.05 | learning rate: 1.472E-04 | global batch size: 256 | lm loss: 2.054583E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.655 | TFLOPs: 40.10 | 15: iteration 46530/ 125429 | consumed samples: 11911680 | consumed tokens: 24395120640 | elapsed time per iteration (s): 1.02 | learning rate: 1.471E-04 | global batch size: 256 | lm loss: 2.042216E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.576 | TFLOPs: 41.41 | 15: iteration 46540/ 125429 | consumed samples: 11914240 | consumed tokens: 24400363520 | elapsed time per iteration (s): 1.03 | learning rate: 1.471E-04 | global batch size: 256 | lm loss: 2.052851E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.490 | TFLOPs: 40.90 | 15: iteration 46550/ 125429 | consumed samples: 11916800 | consumed tokens: 24405606400 | elapsed time per iteration (s): 1.04 | learning rate: 1.471E-04 | global batch size: 256 | lm loss: 2.030040E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.342 | TFLOPs: 40.88 | 15: iteration 46560/ 125429 | consumed samples: 11919360 | consumed tokens: 24410849280 | elapsed time per iteration (s): 1.08 | 
learning rate: 1.471E-04 | global batch size: 256 | lm loss: 2.018099E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.193 | TFLOPs: 39.03 | 15: iteration 46570/ 125429 | consumed samples: 11921920 | consumed tokens: 24416092160 | elapsed time per iteration (s): 1.04 | learning rate: 1.471E-04 | global batch size: 256 | lm loss: 2.029665E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.004 | TFLOPs: 40.65 | 15: iteration 46580/ 125429 | consumed samples: 11924480 | consumed tokens: 24421335040 | elapsed time per iteration (s): 1.05 | learning rate: 1.470E-04 | global batch size: 256 | lm loss: 2.045284E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.443 | TFLOPs: 40.40 | 15: iteration 46590/ 125429 | consumed samples: 11927040 | consumed tokens: 24426577920 | elapsed time per iteration (s): 1.08 | learning rate: 1.470E-04 | global batch size: 256 | lm loss: 2.030277E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.927 | TFLOPs: 39.32 | 15: iteration 46600/ 125429 | consumed samples: 11929600 | consumed tokens: 24431820800 | elapsed time per iteration (s): 1.05 | learning rate: 1.470E-04 | global batch size: 256 | lm loss: 2.026891E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.881 | TFLOPs: 40.14 | 15: iteration 46610/ 125429 | consumed samples: 11932160 | consumed tokens: 24437063680 | elapsed time per iteration (s): 1.03 | learning rate: 1.470E-04 | global batch size: 256 | lm loss: 2.015552E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.596 | TFLOPs: 40.92 | 15: iteration 46620/ 
125429 | consumed samples: 11934720 | consumed tokens: 24442306560 | elapsed time per iteration (s): 1.05 | learning rate: 1.469E-04 | global batch size: 256 | lm loss: 2.031073E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.739 | TFLOPs: 40.11 | 15: iteration 46630/ 125429 | consumed samples: 11937280 | consumed tokens: 24447549440 | elapsed time per iteration (s): 1.03 | learning rate: 1.469E-04 | global batch size: 256 | lm loss: 2.035481E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.829 | TFLOPs: 40.96 | 15: iteration 46640/ 125429 | consumed samples: 11939840 | consumed tokens: 24452792320 | elapsed time per iteration (s): 1.04 | learning rate: 1.469E-04 | global batch size: 256 | lm loss: 2.039567E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.367 | TFLOPs: 40.55 | 15: iteration 46650/ 125429 | consumed samples: 11942400 | consumed tokens: 24458035200 | elapsed time per iteration (s): 1.02 | learning rate: 1.469E-04 | global batch size: 256 | lm loss: 2.004842E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.966 | TFLOPs: 41.47 | 15: iteration 46660/ 125429 | consumed samples: 11944960 | consumed tokens: 24463278080 | elapsed time per iteration (s): 1.05 | learning rate: 1.469E-04 | global batch size: 256 | lm loss: 2.021443E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.764 | TFLOPs: 40.28 | 15: iteration 46670/ 125429 | consumed samples: 11947520 | consumed tokens: 24468520960 | elapsed time per iteration (s): 1.03 | learning rate: 1.468E-04 | global batch size: 256 | lm loss: 1.997304E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 247.962 | TFLOPs: 40.98 | 15: iteration 46680/ 125429 | consumed samples: 11950080 | consumed tokens: 24473763840 | elapsed time per iteration (s): 1.05 | learning rate: 1.468E-04 | global batch size: 256 | lm loss: 2.023706E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.045 | TFLOPs: 40.17 | 15: iteration 46690/ 125429 | consumed samples: 11952640 | consumed tokens: 24479006720 | elapsed time per iteration (s): 1.08 | learning rate: 1.468E-04 | global batch size: 256 | lm loss: 2.047998E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.718 | TFLOPs: 39.12 | 15: iteration 46700/ 125429 | consumed samples: 11955200 | consumed tokens: 24484249600 | elapsed time per iteration (s): 1.04 | learning rate: 1.468E-04 | global batch size: 256 | lm loss: 2.013251E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.856 | TFLOPs: 40.79 | 15: iteration 46710/ 125429 | consumed samples: 11957760 | consumed tokens: 24489492480 | elapsed time per iteration (s): 1.09 | learning rate: 1.468E-04 | global batch size: 256 | lm loss: 2.015147E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.852 | TFLOPs: 38.81 | 15: iteration 46720/ 125429 | consumed samples: 11960320 | consumed tokens: 24494735360 | elapsed time per iteration (s): 1.06 | learning rate: 1.467E-04 | global batch size: 256 | lm loss: 1.997056E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.400 | TFLOPs: 39.89 | 15: iteration 46730/ 125429 | consumed samples: 11962880 | consumed tokens: 24499978240 | elapsed time per iteration (s): 1.03 | learning rate: 
1.467E-04 | global batch size: 256 | lm loss: 2.040962E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.239 | TFLOPs: 41.19 | 15: iteration 46740/ 125429 | consumed samples: 11965440 | consumed tokens: 24505221120 | elapsed time per iteration (s): 1.04 | learning rate: 1.467E-04 | global batch size: 256 | lm loss: 2.018349E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.143 | TFLOPs: 40.68 | 15: iteration 46750/ 125429 | consumed samples: 11968000 | consumed tokens: 24510464000 | elapsed time per iteration (s): 1.03 | learning rate: 1.467E-04 | global batch size: 256 | lm loss: 2.022650E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.544 | TFLOPs: 40.91 | 15: iteration 46760/ 125429 | consumed samples: 11970560 | consumed tokens: 24515706880 | elapsed time per iteration (s): 1.05 | learning rate: 1.467E-04 | global batch size: 256 | lm loss: 1.998926E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.777 | TFLOPs: 40.29 | 15: iteration 46770/ 125429 | consumed samples: 11973120 | consumed tokens: 24520949760 | elapsed time per iteration (s): 1.07 | learning rate: 1.466E-04 | global batch size: 256 | lm loss: 2.043670E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.846 | TFLOPs: 39.64 | 15: iteration 46780/ 125429 | consumed samples: 11975680 | consumed tokens: 24526192640 | elapsed time per iteration (s): 1.04 | learning rate: 1.466E-04 | global batch size: 256 | lm loss: 2.027787E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.409 | TFLOPs: 40.56 | 15: iteration 46790/ 125429 | 
consumed samples: 11978240 | consumed tokens: 24531435520 | elapsed time per iteration (s): 1.07 | learning rate: 1.466E-04 | global batch size: 256 | lm loss: 2.026654E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.082 | TFLOPs: 39.68 | 15: iteration 46800/ 125429 | consumed samples: 11980800 | consumed tokens: 24536678400 | elapsed time per iteration (s): 1.07 | learning rate: 1.466E-04 | global batch size: 256 | lm loss: 2.021900E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.771 | TFLOPs: 39.62 | 15: iteration 46810/ 125429 | consumed samples: 11983360 | consumed tokens: 24541921280 | elapsed time per iteration (s): 1.12 | learning rate: 1.466E-04 | global batch size: 256 | lm loss: 2.035784E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.883 | TFLOPs: 37.82 | 15: iteration 46820/ 125429 | consumed samples: 11985920 | consumed tokens: 24547164160 | elapsed time per iteration (s): 1.10 | learning rate: 1.465E-04 | global batch size: 256 | lm loss: 2.043114E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.402 | TFLOPs: 38.57 | 15: iteration 46830/ 125429 | consumed samples: 11988480 | consumed tokens: 24552407040 | elapsed time per iteration (s): 1.07 | learning rate: 1.465E-04 | global batch size: 256 | lm loss: 2.045322E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.460 | TFLOPs: 39.57 | 15: iteration 46840/ 125429 | consumed samples: 11991040 | consumed tokens: 24557649920 | elapsed time per iteration (s): 1.13 | learning rate: 1.465E-04 | global batch size: 256 | lm loss: 2.013735E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 226.044 | TFLOPs: 37.36 | 15: iteration 46850/ 125429 | consumed samples: 11993600 | consumed tokens: 24562892800 | elapsed time per iteration (s): 1.02 | learning rate: 1.465E-04 | global batch size: 256 | lm loss: 2.048751E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.437 | TFLOPs: 41.39 | 15: iteration 46860/ 125429 | consumed samples: 11996160 | consumed tokens: 24568135680 | elapsed time per iteration (s): 1.03 | learning rate: 1.465E-04 | global batch size: 256 | lm loss: 2.010461E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.400 | TFLOPs: 41.22 | 15: iteration 46870/ 125429 | consumed samples: 11998720 | consumed tokens: 24573378560 | elapsed time per iteration (s): 1.04 | learning rate: 1.464E-04 | global batch size: 256 | lm loss: 2.027430E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.134 | TFLOPs: 40.51 | 15: iteration 46880/ 125429 | consumed samples: 12001280 | consumed tokens: 24578621440 | elapsed time per iteration (s): 1.05 | learning rate: 1.464E-04 | global batch size: 256 | lm loss: 2.047605E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.807 | TFLOPs: 40.13 | 15: iteration 46890/ 125429 | consumed samples: 12003840 | consumed tokens: 24583864320 | elapsed time per iteration (s): 1.06 | learning rate: 1.464E-04 | global batch size: 256 | lm loss: 2.012674E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.244 | TFLOPs: 40.03 | 15: iteration 46900/ 125429 | consumed samples: 12006400 | consumed tokens: 24589107200 | elapsed time per iteration (s): 1.05 | learning rate: 1.464E-04 | global batch 
size: 256 | lm loss: 2.033101E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.491 | TFLOPs: 40.40 | 15: iteration 46910/ 125429 | consumed samples: 12008960 | consumed tokens: 24594350080 | elapsed time per iteration (s): 1.05 | learning rate: 1.463E-04 | global batch size: 256 | lm loss: 2.043393E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.866 | TFLOPs: 40.47 | 15: iteration 46920/ 125429 | consumed samples: 12011520 | consumed tokens: 24599592960 | elapsed time per iteration (s): 1.07 | learning rate: 1.463E-04 | global batch size: 256 | lm loss: 2.065782E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.483 | TFLOPs: 39.58 | 15: iteration 46930/ 125429 | consumed samples: 12014080 | consumed tokens: 24604835840 | elapsed time per iteration (s): 1.06 | learning rate: 1.463E-04 | global batch size: 256 | lm loss: 2.020864E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.626 | TFLOPs: 40.10 | 15: iteration 46940/ 125429 | consumed samples: 12016640 | consumed tokens: 24610078720 | elapsed time per iteration (s): 1.06 | learning rate: 1.463E-04 | global batch size: 256 | lm loss: 2.045653E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.474 | TFLOPs: 39.74 | 15: iteration 46950/ 125429 | consumed samples: 12019200 | consumed tokens: 24615321600 | elapsed time per iteration (s): 1.10 | learning rate: 1.463E-04 | global batch size: 256 | lm loss: 2.020078E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.045 | TFLOPs: 38.35 | 15: iteration 46960/ 125429 | consumed samples: 12021760 | 
consumed tokens: 24620564480 | elapsed time per iteration (s): 1.05 | learning rate: 1.462E-04 | global batch size: 256 | lm loss: 2.020059E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.302 | TFLOPs: 40.21 | 15: iteration 46970/ 125429 | consumed samples: 12024320 | consumed tokens: 24625807360 | elapsed time per iteration (s): 1.05 | learning rate: 1.462E-04 | global batch size: 256 | lm loss: 2.042747E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.418 | TFLOPs: 40.39 | 15: iteration 46980/ 125429 | consumed samples: 12026880 | consumed tokens: 24631050240 | elapsed time per iteration (s): 1.09 | learning rate: 1.462E-04 | global batch size: 256 | lm loss: 2.056432E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.169 | TFLOPs: 38.86 | 15: iteration 46990/ 125429 | consumed samples: 12029440 | consumed tokens: 24636293120 | elapsed time per iteration (s): 1.04 | learning rate: 1.462E-04 | global batch size: 256 | lm loss: 2.034976E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.161 | TFLOPs: 40.51 | 15: iteration 47000/ 125429 | consumed samples: 12032000 | consumed tokens: 24641536000 | elapsed time per iteration (s): 1.05 | learning rate: 1.462E-04 | global batch size: 256 | lm loss: 2.037570E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.708 | TFLOPs: 40.44 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 47000 | lm loss value: 1.891361E+00 | lm loss PPL: 6.628381E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving 
checkpoint at iteration 47000 to checkpoints_1b5 0: [2022-11-26 09:48:50,234] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step47000 is begin to save! 0: [2022-11-26 09:48:50,240] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_01-model_00-model_states.pt... 0: [2022-11-26 09:48:50,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_01-model_00-model_states.pt. 0: [2022-11-26 09:48:50,502] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_03-model_00-model_states.pt... 0: [2022-11-26 09:48:50,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_03-model_00-model_states.pt. 0: [2022-11-26 09:48:50,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_04-model_00-model_states.pt... 0: [2022-11-26 09:48:50,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_04-model_00-model_states.pt. 0: [2022-11-26 09:48:50,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_05-model_00-model_states.pt... 0: [2022-11-26 09:48:50,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_05-model_00-model_states.pt. 0: [2022-11-26 09:48:50,822] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_06-model_00-model_states.pt... 0: [2022-11-26 09:48:50,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_06-model_00-model_states.pt. 0: [2022-11-26 09:48:50,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_07-model_00-model_states.pt... 
0: [2022-11-26 09:48:51,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_07-model_00-model_states.pt. 0: [2022-11-26 09:48:51,040] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_08-model_00-model_states.pt... 0: [2022-11-26 09:48:51,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_08-model_00-model_states.pt. 0: [2022-11-26 09:48:51,149] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_09-model_00-model_states.pt... 0: [2022-11-26 09:48:51,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_09-model_00-model_states.pt. 0: [2022-11-26 09:48:51,253] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_10-model_00-model_states.pt... 0: [2022-11-26 09:48:51,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_10-model_00-model_states.pt. 0: [2022-11-26 09:48:51,358] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_11-model_00-model_states.pt... 0: [2022-11-26 09:48:51,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_11-model_00-model_states.pt. 0: [2022-11-26 09:48:51,464] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_12-model_00-model_states.pt... 0: [2022-11-26 09:48:51,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_12-model_00-model_states.pt. 0: [2022-11-26 09:48:51,573] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_13-model_00-model_states.pt... 
0: [2022-11-26 09:48:51,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_13-model_00-model_states.pt. 0: [2022-11-26 09:48:51,675] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_14-model_00-model_states.pt... 0: [2022-11-26 09:48:51,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_14-model_00-model_states.pt. 0: [2022-11-26 09:48:51,780] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_15-model_00-model_states.pt... 0: [2022-11-26 09:48:51,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_15-model_00-model_states.pt. 0: [2022-11-26 09:48:51,881] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_16-model_00-model_states.pt... 0: [2022-11-26 09:48:51,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_16-model_00-model_states.pt. 0: [2022-11-26 09:48:51,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_17-model_00-model_states.pt... 0: [2022-11-26 09:48:52,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_17-model_00-model_states.pt. 0: [2022-11-26 09:48:52,092] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_18-model_00-model_states.pt... 0: [2022-11-26 09:48:52,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_18-model_00-model_states.pt. 0: [2022-11-26 09:48:52,195] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_19-model_00-model_states.pt... 
0: [2022-11-26 09:48:52,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_19-model_00-model_states.pt. 0: [2022-11-26 09:48:52,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_20-model_00-model_states.pt... 0: [2022-11-26 09:48:52,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_20-model_00-model_states.pt. 0: [2022-11-26 09:48:52,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_21-model_00-model_states.pt... 0: [2022-11-26 09:48:52,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_21-model_00-model_states.pt. 0: [2022-11-26 09:48:52,506] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_22-model_00-model_states.pt... 0: [2022-11-26 09:48:52,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_22-model_00-model_states.pt. 0: [2022-11-26 09:48:52,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_23-model_00-model_states.pt... 0: [2022-11-26 09:48:52,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_23-model_00-model_states.pt. 0: [2022-11-26 09:48:52,716] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_24-model_00-model_states.pt... 0: [2022-11-26 09:48:52,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_24-model_00-model_states.pt. 0: [2022-11-26 09:48:52,824] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_25-model_00-model_states.pt... 
0: [2022-11-26 09:48:52,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_25-model_00-model_states.pt. 0: [2022-11-26 09:48:52,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_26-model_00-model_states.pt... 0: [2022-11-26 09:48:53,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_26-model_00-model_states.pt. 0: [2022-11-26 09:48:53,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_27-model_00-model_states.pt... 0: [2022-11-26 09:48:53,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_27-model_00-model_states.pt. 0: [2022-11-26 09:48:53,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_28-model_00-model_states.pt... 0: [2022-11-26 09:48:53,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_28-model_00-model_states.pt. 0: [2022-11-26 09:48:53,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_29-model_00-model_states.pt... 0: [2022-11-26 09:48:53,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_29-model_00-model_states.pt. 0: [2022-11-26 09:48:53,340] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_30-model_00-model_states.pt... 0: [2022-11-26 09:48:53,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_30-model_00-model_states.pt. 0: [2022-11-26 09:48:53,446] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/layer_32-model_00-model_states.pt... 
0: [2022-11-26 09:48:53,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/layer_32-model_00-model_states.pt. 0: [2022-11-26 09:48:53,451] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step47000/mp_rank_00_model_states.pt 0: [2022-11-26 09:48:53,451] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/mp_rank_00_model_states.pt... 0: [2022-11-26 09:48:53,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/mp_rank_00_model_states.pt. 0: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 
15: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
14: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
13: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
8: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 
9: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
2: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
3: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
10: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 4: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 15: [2022-11-26 09:48:53,493] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 9: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 7: [2022-11-26 09:48:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step47000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 12: [2022-11-26 09:48:53,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 09:48:53,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 09:48:53,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 12: [2022-11-26 09:48:53,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:48:53,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 09:48:53,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 15: [2022-11-26 09:48:53,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:48:53,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:48:53,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:48:53,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 09:48:53,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 10: [2022-11-26 09:48:53,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 09:48:53,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 09:48:53,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 15: [2022-11-26 09:48:53,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 09:48:53,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 15: [2022-11-26 09:48:53,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:48:53,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 09:48:53,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 15: [2022-11-26 09:48:53,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:48:53,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:48:53,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 09:48:53,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 12: [2022-11-26 09:48:53,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 09:48:53,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 09:48:53,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 9: [2022-11-26 09:48:53,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:48:53,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 09:48:53,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 0: [2022-11-26 09:48:53,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:48:53,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 09:48:53,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 7: [2022-11-26 09:48:53,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:48:53,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
7: [2022-11-26 09:48:53,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 9: [2022-11-26 09:48:53,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 7: [2022-11-26 09:48:53,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 9: [2022-11-26 09:48:53,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 9: [2022-11-26 09:48:53,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:48:53,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:48:53,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 10: [2022-11-26 09:48:53,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 09:48:53,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 9: [2022-11-26 09:48:53,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 10: [2022-11-26 09:48:53,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 09:48:53,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 09:48:53,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 11: [2022-11-26 09:48:53,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:48:53,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:48:53,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:48:53,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 09:48:53,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 12: [2022-11-26 09:48:53,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:48:53,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:48:53,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 09:48:53,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 09:48:53,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
12: [2022-11-26 09:48:53,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 2: [2022-11-26 09:48:53,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:48:53,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:48:53,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 09:48:53,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 09:48:53,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 2: [2022-11-26 09:48:53,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 9: [2022-11-26 09:48:53,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:48:53,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 09:48:53,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 2: [2022-11-26 09:48:53,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 09:48:53,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 09:48:53,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 11: [2022-11-26 09:48:53,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 09:48:53,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 11: [2022-11-26 09:48:53,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:48:53,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 09:48:53,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 11: [2022-11-26 09:48:53,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:48:53,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 09:48:53,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 2: [2022-11-26 09:48:53,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 09:48:53,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 09:48:53,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 2: [2022-11-26 09:48:53,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:48:53,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 09:48:53,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 8: [2022-11-26 09:48:53,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:48:53,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:48:53,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:48:53,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 09:48:53,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 09:48:53,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 09:48:53,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
8: [2022-11-26 09:48:53,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 8: [2022-11-26 09:48:53,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 8: [2022-11-26 09:48:53,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:48:53,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 09:48:53,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 8: [2022-11-26 09:48:53,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:48:53,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 09:48:53,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 8: [2022-11-26 09:48:53,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:48:53,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 09:48:53,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 0: [2022-11-26 09:48:53,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 09:48:53,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 09:48:53,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 0: [2022-11-26 09:48:53,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:48:53,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:48:53,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 09:48:53,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 09:48:53,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 0: [2022-11-26 09:48:53,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 2: [2022-11-26 09:48:53,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:48:53,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 09:48:53,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 9: [2022-11-26 09:48:53,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 09:48:53,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 09:48:53,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 0: [2022-11-26 09:48:53,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:48:53,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 09:48:53,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 14: [2022-11-26 09:48:53,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:48:53,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:48:53,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:48:53,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:48:53,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 09:48:53,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 09:48:53,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 09:48:53,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 09:48:53,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 09:48:53,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 09:48:53,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 14: [2022-11-26 09:48:53,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 14: [2022-11-26 09:48:53,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 14: [2022-11-26 09:48:53,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 14: [2022-11-26 09:48:53,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 14: [2022-11-26 09:48:53,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:48:53,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
14: [2022-11-26 09:48:53,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 9: [2022-11-26 09:48:53,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 14: [2022-11-26 09:48:53,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 9: [2022-11-26 09:48:53,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 11: [2022-11-26 09:48:53,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:48:53,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:48:53,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:48:53,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 09:48:53,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 09:48:53,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 09:48:53,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 11: [2022-11-26 09:48:53,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
11: [2022-11-26 09:48:53,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 8: [2022-11-26 09:48:53,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:48:53,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 09:48:53,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 12: [2022-11-26 09:48:53,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:48:53,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 09:48:53,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 12: [2022-11-26 09:48:53,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:48:53,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 09:48:53,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 09:48:53,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 09:48:53,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
12: [2022-11-26 09:48:53,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 9: [2022-11-26 09:48:53,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:48:53,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 09:48:53,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 7: [2022-11-26 09:48:53,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:48:53,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 09:48:53,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 7: [2022-11-26 09:48:53,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:48:53,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 09:48:53,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 1: [2022-11-26 09:48:53,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 09:48:53,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
1: [2022-11-26 09:48:53,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:48:53,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 09:48:53,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 1: [2022-11-26 09:48:53,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:48:53,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 09:48:53,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 5: [2022-11-26 09:48:53,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:48:53,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:48:53,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:48:53,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:48:53,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:48:53,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 09:48:53,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:48:53,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 09:48:53,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 09:48:53,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 09:48:53,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 09:48:53,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 09:48:53,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 09:48:53,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 09:48:53,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 09:48:53,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 09:48:53,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
5: [2022-11-26 09:48:53,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 5: [2022-11-26 09:48:53,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 5: [2022-11-26 09:48:53,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 5: [2022-11-26 09:48:53,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 5: [2022-11-26 09:48:53,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 5: [2022-11-26 09:48:53,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 5: [2022-11-26 09:48:53,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 10: [2022-11-26 09:48:53,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:48:53,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:48:53,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 09:48:53,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 09:48:53,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:48:53,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
10: [2022-11-26 09:48:53,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 09:48:53,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 10: [2022-11-26 09:48:53,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 9: [2022-11-26 09:48:53,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 09:48:53,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 09:48:53,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 2: [2022-11-26 09:48:53,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:48:53,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 09:48:53,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 3: [2022-11-26 09:48:53,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:48:53,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 09:48:53,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
3: [2022-11-26 09:48:53,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:48:53,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 09:48:53,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 3: [2022-11-26 09:48:53,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:48:53,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 09:48:53,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 14: [2022-11-26 09:48:53,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:48:53,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 09:48:53,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 3: [2022-11-26 09:48:53,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:48:53,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 09:48:53,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
7: [2022-11-26 09:48:53,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:48:53,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:48:53,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 09:48:53,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 09:48:53,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 09:48:53,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 09:48:53,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 7: [2022-11-26 09:48:53,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 7: [2022-11-26 09:48:53,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 0: [2022-11-26 09:48:53,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 09:48:53,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 09:48:53,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
8: [2022-11-26 09:48:53,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 09:48:53,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 09:48:53,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 11: [2022-11-26 09:48:53,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 09:48:53,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:48:53,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 09:48:53,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 09:48:53,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 11: [2022-11-26 09:48:53,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 09:48:53,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 09:48:53,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 11: [2022-11-26 09:48:53,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
14: [2022-11-26 09:48:53,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 09:48:53,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 09:48:53,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 1: [2022-11-26 09:48:53,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:48:53,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 09:48:53,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 1: [2022-11-26 09:48:53,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:48:53,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 09:48:53,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 15: [2022-11-26 09:48:53,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:48:53,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 09:48:53,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
15: [2022-11-26 09:48:53,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:48:53,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 09:48:53,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 15: [2022-11-26 09:48:53,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:48:53,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 09:48:53,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 7: [2022-11-26 09:48:53,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:48:53,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 09:48:53,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 09:48:53,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 15: [2022-11-26 09:48:53,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
7: [2022-11-26 09:48:53,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 15: [2022-11-26 09:48:53,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 09:48:53,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 7: [2022-11-26 09:48:53,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 3: [2022-11-26 09:48:53,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:48:53,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 09:48:53,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 3: [2022-11-26 09:48:53,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:48:53,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 09:48:53,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 3: [2022-11-26 09:48:53,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 09:48:53,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 09:48:53,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 3: [2022-11-26 09:48:53,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 09:48:53,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 09:48:53,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 10: [2022-11-26 09:48:53,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 09:48:53,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 09:48:53,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 1: [2022-11-26 09:48:53,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 09:48:53,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 1: [2022-11-26 09:48:53,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 09:48:53,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 09:48:53,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 1: [2022-11-26 09:48:53,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:48:53,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 09:48:53,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 1: [2022-11-26 09:48:53,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 09:48:53,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 09:48:53,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 13: [2022-11-26 09:48:53,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:48:53,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 09:48:53,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 13: [2022-11-26 09:48:53,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 09:48:53,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:48:53,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 09:48:53,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 13: [2022-11-26 09:48:53,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 09:48:53,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 13: [2022-11-26 09:48:53,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:48:53,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 09:48:53,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 13: [2022-11-26 09:48:53,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:48:53,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 09:48:53,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 13: [2022-11-26 09:48:53,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 09:48:53,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 09:48:53,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 13: [2022-11-26 09:48:53,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:48:53,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 09:48:53,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 13: [2022-11-26 09:48:53,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 09:48:53,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 09:48:53,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 0: [2022-11-26 09:48:53,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 09:48:53,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 4: [2022-11-26 09:48:53,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:48:53,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 09:48:53,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:48:53,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 09:48:53,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 09:48:53,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 09:48:53,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 4: [2022-11-26 09:48:53,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 4: [2022-11-26 09:48:53,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 6: [2022-11-26 09:48:53,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:48:53,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:48:53,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:48:53,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:48:53,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 09:48:53,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:48:53,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:48:53,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 09:48:53,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 09:48:53,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 09:48:53,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 09:48:53,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 09:48:53,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 09:48:53,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 09:48:53,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 09:48:53,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 09:48:53,829] 
[INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 6: [2022-11-26 09:48:53,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 6: [2022-11-26 09:48:53,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 6: [2022-11-26 09:48:53,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 6: [2022-11-26 09:48:53,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 6: [2022-11-26 09:48:53,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 6: [2022-11-26 09:48:53,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 6: [2022-11-26 09:48:53,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 4: [2022-11-26 09:48:53,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:48:53,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 09:48:53,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 4: [2022-11-26 09:48:53,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:48:53,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 09:48:53,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
4: [2022-11-26 09:48:53,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:48:53,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 09:48:53,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 4: [2022-11-26 09:48:53,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:48:53,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 09:48:53,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 4: [2022-11-26 09:48:53,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 09:48:53,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step47000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 09:48:53,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step47000 is ready now! 
0: successfully saved checkpoint at iteration 47000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3625.21 15: iteration 47010/ 125429 | consumed samples: 12034560 | consumed tokens: 24646778880 | elapsed time per iteration (s): 1.44 | learning rate: 1.461E-04 | global batch size: 256 | lm loss: 2.023662E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.189 | TFLOPs: 29.28 | 15: iteration 47020/ 125429 | consumed samples: 12037120 | consumed tokens: 24652021760 | elapsed time per iteration (s): 1.09 | learning rate: 1.461E-04 | global batch size: 256 | lm loss: 2.034899E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.470 | TFLOPs: 38.75 | 15: iteration 47030/ 125429 | consumed samples: 12039680 | consumed tokens: 24657264640 | elapsed time per iteration (s): 1.07 | learning rate: 1.461E-04 | global batch size: 256 | lm loss: 2.012551E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.138 | TFLOPs: 39.68 | 15: iteration 47040/ 125429 | consumed samples: 12042240 | consumed tokens: 24662507520 | elapsed time per iteration (s): 1.04 | learning rate: 1.461E-04 | global batch size: 256 | lm loss: 2.045739E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.264 | TFLOPs: 40.86 | 15: iteration 47050/ 125429 | consumed samples: 12044800 | consumed tokens: 24667750400 | elapsed time per iteration (s): 1.05 | learning rate: 1.461E-04 | global batch size: 256 | lm loss: 2.030909E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.819 | TFLOPs: 40.29 | 15: iteration 47060/ 125429 | consumed samples: 12047360 | consumed tokens: 24672993280 | elapsed time per iteration (s): 1.03 | 
learning rate: 1.460E-04 | global batch size: 256 | lm loss: 2.035506E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.893 | TFLOPs: 41.13 | 15: iteration 47070/ 125429 | consumed samples: 12049920 | consumed tokens: 24678236160 | elapsed time per iteration (s): 1.03 | learning rate: 1.460E-04 | global batch size: 256 | lm loss: 2.022641E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.394 | TFLOPs: 41.21 | 15: iteration 47080/ 125429 | consumed samples: 12052480 | consumed tokens: 24683479040 | elapsed time per iteration (s): 1.04 | learning rate: 1.460E-04 | global batch size: 256 | lm loss: 2.031777E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.896 | TFLOPs: 40.80 | 15: iteration 47090/ 125429 | consumed samples: 12055040 | consumed tokens: 24688721920 | elapsed time per iteration (s): 1.04 | learning rate: 1.460E-04 | global batch size: 256 | lm loss: 2.001606E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.896 | TFLOPs: 40.80 | 15: iteration 47100/ 125429 | consumed samples: 12057600 | consumed tokens: 24693964800 | elapsed time per iteration (s): 4.00 | learning rate: 1.459E-04 | global batch size: 256 | lm loss: 2.016547E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 64.070 | TFLOPs: 10.59 | 15: iteration 47110/ 125429 | consumed samples: 12060160 | consumed tokens: 24699207680 | elapsed time per iteration (s): 1.03 | learning rate: 1.459E-04 | global batch size: 256 | lm loss: 2.029181E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.858 | TFLOPs: 40.96 | 15: iteration 47120/ 
125429 | consumed samples: 12062720 | consumed tokens: 24704450560 | elapsed time per iteration (s): 1.03 | learning rate: 1.459E-04 | global batch size: 256 | lm loss: 2.033388E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.644 | TFLOPs: 41.26 | 15: iteration 47130/ 125429 | consumed samples: 12065280 | consumed tokens: 24709693440 | elapsed time per iteration (s): 1.08 | learning rate: 1.459E-04 | global batch size: 256 | lm loss: 2.025176E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.034 | TFLOPs: 39.34 | 15: iteration 47140/ 125429 | consumed samples: 12067840 | consumed tokens: 24714936320 | elapsed time per iteration (s): 1.03 | learning rate: 1.459E-04 | global batch size: 256 | lm loss: 2.012783E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.402 | TFLOPs: 41.22 | 15: iteration 47150/ 125429 | consumed samples: 12070400 | consumed tokens: 24720179200 | elapsed time per iteration (s): 1.02 | learning rate: 1.458E-04 | global batch size: 256 | lm loss: 2.012734E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.125 | TFLOPs: 41.50 | 15: iteration 47160/ 125429 | consumed samples: 12072960 | consumed tokens: 24725422080 | elapsed time per iteration (s): 1.05 | learning rate: 1.458E-04 | global batch size: 256 | lm loss: 2.049318E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.008 | TFLOPs: 40.32 | 15: iteration 47170/ 125429 | consumed samples: 12075520 | consumed tokens: 24730664960 | elapsed time per iteration (s): 5.58 | learning rate: 1.458E-04 | global batch size: 256 | lm loss: 2.028612E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 45.902 | TFLOPs: 7.59 | 15: iteration 47180/ 125429 | consumed samples: 12078080 | consumed tokens: 24735907840 | elapsed time per iteration (s): 1.02 | learning rate: 1.458E-04 | global batch size: 256 | lm loss: 2.046337E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.979 | TFLOPs: 41.31 | 15: iteration 47190/ 125429 | consumed samples: 12080640 | consumed tokens: 24741150720 | elapsed time per iteration (s): 1.02 | learning rate: 1.458E-04 | global batch size: 256 | lm loss: 2.028552E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.896 | TFLOPs: 41.30 | 15: iteration 47200/ 125429 | consumed samples: 12083200 | consumed tokens: 24746393600 | elapsed time per iteration (s): 1.02 | learning rate: 1.457E-04 | global batch size: 256 | lm loss: 2.035877E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.290 | TFLOPs: 41.36 | 15: iteration 47210/ 125429 | consumed samples: 12085760 | consumed tokens: 24751636480 | elapsed time per iteration (s): 2.07 | learning rate: 1.457E-04 | global batch size: 256 | lm loss: 2.058860E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 123.891 | TFLOPs: 20.47 | 15: iteration 47220/ 125429 | consumed samples: 12088320 | consumed tokens: 24756879360 | elapsed time per iteration (s): 1.04 | learning rate: 1.457E-04 | global batch size: 256 | lm loss: 2.013451E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.622 | TFLOPs: 40.59 | 15: iteration 47230/ 125429 | consumed samples: 12090880 | consumed tokens: 24762122240 | elapsed time per iteration (s): 1.03 | learning rate: 1.457E-04 
| global batch size: 256 | lm loss: 2.012015E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.697 | TFLOPs: 40.93 | 15: iteration 47240/ 125429 | consumed samples: 12093440 | consumed tokens: 24767365120 | elapsed time per iteration (s): 1.02 | learning rate: 1.457E-04 | global batch size: 256 | lm loss: 1.992795E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.860 | TFLOPs: 41.46 | 15: iteration 47250/ 125429 | consumed samples: 12096000 | consumed tokens: 24772608000 | elapsed time per iteration (s): 1.05 | learning rate: 1.456E-04 | global batch size: 256 | lm loss: 2.027880E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.070 | TFLOPs: 40.17 | 15: iteration 47260/ 125429 | consumed samples: 12098560 | consumed tokens: 24777850880 | elapsed time per iteration (s): 1.05 | learning rate: 1.456E-04 | global batch size: 256 | lm loss: 2.024481E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.961 | TFLOPs: 40.32 | 15: iteration 47270/ 125429 | consumed samples: 12101120 | consumed tokens: 24783093760 | elapsed time per iteration (s): 1.03 | learning rate: 1.456E-04 | global batch size: 256 | lm loss: 1.989484E+00 | grad norm: 0.121 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.174 | TFLOPs: 41.01 | 15: iteration 47280/ 125429 | consumed samples: 12103680 | consumed tokens: 24788336640 | elapsed time per iteration (s): 1.07 | learning rate: 1.456E-04 | global batch size: 256 | lm loss: 2.033774E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.921 | TFLOPs: 39.65 | 15: iteration 47290/ 125429 | consumed samples: 
12106240 | consumed tokens: 24793579520 | elapsed time per iteration (s): 1.04 | learning rate: 1.456E-04 | global batch size: 256 | lm loss: 2.029638E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.896 | TFLOPs: 40.64 | 15: iteration 47300/ 125429 | consumed samples: 12108800 | consumed tokens: 24798822400 | elapsed time per iteration (s): 1.03 | learning rate: 1.455E-04 | global batch size: 256 | lm loss: 1.990789E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.370 | TFLOPs: 40.88 | 15: iteration 47310/ 125429 | consumed samples: 12111360 | consumed tokens: 24804065280 | elapsed time per iteration (s): 1.02 | learning rate: 1.455E-04 | global batch size: 256 | lm loss: 2.055423E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.689 | TFLOPs: 41.43 | 15: iteration 47320/ 125429 | consumed samples: 12113920 | consumed tokens: 24809308160 | elapsed time per iteration (s): 1.11 | learning rate: 1.455E-04 | global batch size: 256 | lm loss: 2.019692E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.406 | TFLOPs: 38.08 | 15: iteration 47330/ 125429 | consumed samples: 12116480 | consumed tokens: 24814551040 | elapsed time per iteration (s): 1.05 | learning rate: 1.455E-04 | global batch size: 256 | lm loss: 2.049196E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.957 | TFLOPs: 40.15 | 15: iteration 47340/ 125429 | consumed samples: 12119040 | consumed tokens: 24819793920 | elapsed time per iteration (s): 1.03 | learning rate: 1.454E-04 | global batch size: 256 | lm loss: 2.025382E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 249.323 | TFLOPs: 41.20 | 15: iteration 47350/ 125429 | consumed samples: 12121600 | consumed tokens: 24825036800 | elapsed time per iteration (s): 1.02 | learning rate: 1.454E-04 | global batch size: 256 | lm loss: 2.025079E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.055 | TFLOPs: 41.49 | 15: iteration 47360/ 125429 | consumed samples: 12124160 | consumed tokens: 24830279680 | elapsed time per iteration (s): 1.03 | learning rate: 1.454E-04 | global batch size: 256 | lm loss: 2.040149E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.170 | TFLOPs: 41.18 | 15: iteration 47370/ 125429 | consumed samples: 12126720 | consumed tokens: 24835522560 | elapsed time per iteration (s): 1.06 | learning rate: 1.454E-04 | global batch size: 256 | lm loss: 1.999047E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.193 | TFLOPs: 40.02 | 15: iteration 47380/ 125429 | consumed samples: 12129280 | consumed tokens: 24840765440 | elapsed time per iteration (s): 1.06 | learning rate: 1.454E-04 | global batch size: 256 | lm loss: 2.011466E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.688 | TFLOPs: 39.94 | 15: iteration 47390/ 125429 | consumed samples: 12131840 | consumed tokens: 24846008320 | elapsed time per iteration (s): 1.05 | learning rate: 1.453E-04 | global batch size: 256 | lm loss: 2.036629E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.711 | TFLOPs: 40.28 | 15: iteration 47400/ 125429 | consumed samples: 12134400 | consumed tokens: 24851251200 | elapsed time per iteration (s): 1.05 | learning rate: 1.453E-04 | global batch size: 256 | 
lm loss: 2.023157E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.690 | TFLOPs: 40.27 | 15: iteration 47410/ 125429 | consumed samples: 12136960 | consumed tokens: 24856494080 | elapsed time per iteration (s): 1.10 | learning rate: 1.453E-04 | global batch size: 256 | lm loss: 2.025450E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.719 | TFLOPs: 38.29 | 15: iteration 47420/ 125429 | consumed samples: 12139520 | consumed tokens: 24861736960 | elapsed time per iteration (s): 1.05 | learning rate: 1.453E-04 | global batch size: 256 | lm loss: 2.028538E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.710 | TFLOPs: 40.11 | 15: iteration 47430/ 125429 | consumed samples: 12142080 | consumed tokens: 24866979840 | elapsed time per iteration (s): 1.05 | learning rate: 1.453E-04 | global batch size: 256 | lm loss: 2.032943E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.415 | TFLOPs: 40.23 | 15: iteration 47440/ 125429 | consumed samples: 12144640 | consumed tokens: 24872222720 | elapsed time per iteration (s): 1.03 | learning rate: 1.452E-04 | global batch size: 256 | lm loss: 2.005757E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.843 | TFLOPs: 40.96 | 15: iteration 47450/ 125429 | consumed samples: 12147200 | consumed tokens: 24877465600 | elapsed time per iteration (s): 1.05 | learning rate: 1.452E-04 | global batch size: 256 | lm loss: 2.007720E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.330 | TFLOPs: 40.38 | 15: iteration 47460/ 125429 | consumed samples: 12149760 | consumed 
tokens: 24882708480 | elapsed time per iteration (s): 1.04 | learning rate: 1.452E-04 | global batch size: 256 | lm loss: 2.035825E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.921 | TFLOPs: 40.81 | 15: iteration 47470/ 125429 | consumed samples: 12152320 | consumed tokens: 24887951360 | elapsed time per iteration (s): 1.04 | learning rate: 1.452E-04 | global batch size: 256 | lm loss: 2.053484E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.627 | TFLOPs: 40.59 | 15: iteration 47480/ 125429 | consumed samples: 12154880 | consumed tokens: 24893194240 | elapsed time per iteration (s): 1.03 | learning rate: 1.452E-04 | global batch size: 256 | lm loss: 2.025209E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.335 | TFLOPs: 41.20 | 15: iteration 47490/ 125429 | consumed samples: 12157440 | consumed tokens: 24898437120 | elapsed time per iteration (s): 1.08 | learning rate: 1.451E-04 | global batch size: 256 | lm loss: 2.017105E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.121 | TFLOPs: 39.35 | 15: iteration 47500/ 125429 | consumed samples: 12160000 | consumed tokens: 24903680000 | elapsed time per iteration (s): 1.06 | learning rate: 1.451E-04 | global batch size: 256 | lm loss: 2.026826E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.760 | TFLOPs: 39.79 | 15: iteration 47510/ 125429 | consumed samples: 12162560 | consumed tokens: 24908922880 | elapsed time per iteration (s): 1.05 | learning rate: 1.451E-04 | global batch size: 256 | lm loss: 2.015593E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 243.401 | TFLOPs: 40.22 | 15: iteration 47520/ 125429 | consumed samples: 12165120 | consumed tokens: 24914165760 | elapsed time per iteration (s): 1.05 | learning rate: 1.451E-04 | global batch size: 256 | lm loss: 2.017161E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.861 | TFLOPs: 40.47 | 15: iteration 47530/ 125429 | consumed samples: 12167680 | consumed tokens: 24919408640 | elapsed time per iteration (s): 1.06 | learning rate: 1.451E-04 | global batch size: 256 | lm loss: 2.022285E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.592 | TFLOPs: 40.09 | 15: iteration 47540/ 125429 | consumed samples: 12170240 | consumed tokens: 24924651520 | elapsed time per iteration (s): 1.03 | learning rate: 1.450E-04 | global batch size: 256 | lm loss: 2.039771E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.441 | TFLOPs: 41.06 | 15: iteration 47550/ 125429 | consumed samples: 12172800 | consumed tokens: 24929894400 | elapsed time per iteration (s): 1.05 | learning rate: 1.450E-04 | global batch size: 256 | lm loss: 2.028064E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.569 | TFLOPs: 40.42 | 15: iteration 47560/ 125429 | consumed samples: 12175360 | consumed tokens: 24935137280 | elapsed time per iteration (s): 1.03 | learning rate: 1.450E-04 | global batch size: 256 | lm loss: 2.042729E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.561 | TFLOPs: 41.08 | 15: iteration 47570/ 125429 | consumed samples: 12177920 | consumed tokens: 24940380160 | elapsed time per iteration (s): 1.05 | learning rate: 1.450E-04 | global batch size: 256 | lm loss: 2.019201E+00 | 
grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.569 | TFLOPs: 40.42 | 15: iteration 47580/ 125429 | consumed samples: 12180480 | consumed tokens: 24945623040 | elapsed time per iteration (s): 1.04 | learning rate: 1.449E-04 | global batch size: 256 | lm loss: 2.015174E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.278 | TFLOPs: 40.86 | 15: iteration 47590/ 125429 | consumed samples: 12183040 | consumed tokens: 24950865920 | elapsed time per iteration (s): 1.03 | learning rate: 1.449E-04 | global batch size: 256 | lm loss: 2.046046E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.590 | TFLOPs: 40.92 | 15: iteration 47600/ 125429 | consumed samples: 12185600 | consumed tokens: 24956108800 | elapsed time per iteration (s): 1.04 | learning rate: 1.449E-04 | global batch size: 256 | lm loss: 2.009451E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.432 | TFLOPs: 40.56 | 15: iteration 47610/ 125429 | consumed samples: 12188160 | consumed tokens: 24961351680 | elapsed time per iteration (s): 1.03 | learning rate: 1.449E-04 | global batch size: 256 | lm loss: 2.018548E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.309 | TFLOPs: 41.04 | 15: iteration 47620/ 125429 | consumed samples: 12190720 | consumed tokens: 24966594560 | elapsed time per iteration (s): 1.04 | learning rate: 1.449E-04 | global batch size: 256 | lm loss: 2.015755E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.764 | TFLOPs: 40.78 | 15: iteration 47630/ 125429 | consumed samples: 12193280 | consumed tokens: 24971837440 | elapsed 
time per iteration (s): 1.07 | learning rate: 1.448E-04 | global batch size: 256 | lm loss: 2.021545E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.161 | TFLOPs: 39.52 | 15: iteration 47640/ 125429 | consumed samples: 12195840 | consumed tokens: 24977080320 | elapsed time per iteration (s): 1.04 | learning rate: 1.448E-04 | global batch size: 256 | lm loss: 2.014828E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.074 | TFLOPs: 40.67 | 15: iteration 47650/ 125429 | consumed samples: 12198400 | consumed tokens: 24982323200 | elapsed time per iteration (s): 1.03 | learning rate: 1.448E-04 | global batch size: 256 | lm loss: 2.001350E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.099 | TFLOPs: 41.00 | 15: iteration 47660/ 125429 | consumed samples: 12200960 | consumed tokens: 24987566080 | elapsed time per iteration (s): 1.05 | learning rate: 1.448E-04 | global batch size: 256 | lm loss: 2.045086E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.373 | TFLOPs: 40.38 | 15: iteration 47670/ 125429 | consumed samples: 12203520 | consumed tokens: 24992808960 | elapsed time per iteration (s): 1.06 | learning rate: 1.448E-04 | global batch size: 256 | lm loss: 2.064278E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.609 | TFLOPs: 39.76 | 15: iteration 47680/ 125429 | consumed samples: 12206080 | consumed tokens: 24998051840 | elapsed time per iteration (s): 1.08 | learning rate: 1.447E-04 | global batch size: 256 | lm loss: 2.014638E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.674 | TFLOPs: 
39.11 | 15: iteration 47690/ 125429 | consumed samples: 12208640 | consumed tokens: 25003294720 | elapsed time per iteration (s): 1.04 | learning rate: 1.447E-04 | global batch size: 256 | lm loss: 2.020013E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.009 | TFLOPs: 40.82 | 15: iteration 47700/ 125429 | consumed samples: 12211200 | consumed tokens: 25008537600 | elapsed time per iteration (s): 1.06 | learning rate: 1.447E-04 | global batch size: 256 | lm loss: 2.025345E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.717 | TFLOPs: 39.95 | 15: iteration 47710/ 125429 | consumed samples: 12213760 | consumed tokens: 25013780480 | elapsed time per iteration (s): 1.05 | learning rate: 1.447E-04 | global batch size: 256 | lm loss: 2.013179E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.126 | TFLOPs: 40.34 | 15: iteration 47720/ 125429 | consumed samples: 12216320 | consumed tokens: 25019023360 | elapsed time per iteration (s): 1.03 | learning rate: 1.447E-04 | global batch size: 256 | lm loss: 2.036930E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.580 | TFLOPs: 41.24 | 15: iteration 47730/ 125429 | consumed samples: 12218880 | consumed tokens: 25024266240 | elapsed time per iteration (s): 2.40 | learning rate: 1.446E-04 | global batch size: 256 | lm loss: 1.997077E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 106.652 | TFLOPs: 17.63 | 15: iteration 47740/ 125429 | consumed samples: 12221440 | consumed tokens: 25029509120 | elapsed time per iteration (s): 1.02 | learning rate: 1.446E-04 | global batch size: 256 | lm loss: 2.000547E+00 | grad norm: 0.127 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.692 | TFLOPs: 41.43 | 15: iteration 47750/ 125429 | consumed samples: 12224000 | consumed tokens: 25034752000 | elapsed time per iteration (s): 1.05 | learning rate: 1.446E-04 | global batch size: 256 | lm loss: 2.007910E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.268 | TFLOPs: 40.20 | 15: iteration 47760/ 125429 | consumed samples: 12226560 | consumed tokens: 25039994880 | elapsed time per iteration (s): 1.03 | learning rate: 1.446E-04 | global batch size: 256 | lm loss: 1.997812E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.666 | TFLOPs: 41.09 | 15: iteration 47770/ 125429 | consumed samples: 12229120 | consumed tokens: 25045237760 | elapsed time per iteration (s): 1.04 | learning rate: 1.445E-04 | global batch size: 256 | lm loss: 2.032211E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.707 | TFLOPs: 40.77 | 15: iteration 47780/ 125429 | consumed samples: 12231680 | consumed tokens: 25050480640 | elapsed time per iteration (s): 1.04 | learning rate: 1.445E-04 | global batch size: 256 | lm loss: 2.024786E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.103 | TFLOPs: 40.67 | 15: iteration 47790/ 125429 | consumed samples: 12234240 | consumed tokens: 25055723520 | elapsed time per iteration (s): 1.04 | learning rate: 1.445E-04 | global batch size: 256 | lm loss: 2.028445E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.329 | TFLOPs: 40.54 | 15: iteration 47800/ 125429 | consumed samples: 12236800 | consumed tokens: 25060966400 | elapsed time per iteration (s): 1.03 | 
learning rate: 1.445E-04 | global batch size: 256 | lm loss: 2.042101E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.474 | TFLOPs: 41.06 | 15: iteration 47810/ 125429 | consumed samples: 12239360 | consumed tokens: 25066209280 | elapsed time per iteration (s): 1.02 | learning rate: 1.445E-04 | global batch size: 256 | lm loss: 2.007520E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.953 | TFLOPs: 41.31 | 15: iteration 47820/ 125429 | consumed samples: 12241920 | consumed tokens: 25071452160 | elapsed time per iteration (s): 1.03 | learning rate: 1.444E-04 | global batch size: 256 | lm loss: 2.008739E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.905 | TFLOPs: 41.13 | 15: iteration 47830/ 125429 | consumed samples: 12244480 | consumed tokens: 25076695040 | elapsed time per iteration (s): 1.03 | learning rate: 1.444E-04 | global batch size: 256 | lm loss: 2.034766E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.081 | TFLOPs: 41.00 | 15: iteration 47840/ 125429 | consumed samples: 12247040 | consumed tokens: 25081937920 | elapsed time per iteration (s): 1.04 | learning rate: 1.444E-04 | global batch size: 256 | lm loss: 2.029179E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.081 | TFLOPs: 40.83 | 15: iteration 47850/ 125429 | consumed samples: 12249600 | consumed tokens: 25087180800 | elapsed time per iteration (s): 1.04 | learning rate: 1.444E-04 | global batch size: 256 | lm loss: 2.029879E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.320 | TFLOPs: 40.71 | 15: iteration 47860/ 
125429 | consumed samples: 12252160 | consumed tokens: 25092423680 | elapsed time per iteration (s): 1.04 | learning rate: 1.444E-04 | global batch size: 256 | lm loss: 2.006392E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.107 | TFLOPs: 40.67 | 15: iteration 47870/ 125429 | consumed samples: 12254720 | consumed tokens: 25097666560 | elapsed time per iteration (s): 1.05 | learning rate: 1.443E-04 | global batch size: 256 | lm loss: 2.034625E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.777 | TFLOPs: 40.29 | 15: iteration 47880/ 125429 | consumed samples: 12257280 | consumed tokens: 25102909440 | elapsed time per iteration (s): 1.05 | learning rate: 1.443E-04 | global batch size: 256 | lm loss: 2.031663E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.812 | TFLOPs: 40.29 | 15: iteration 47890/ 125429 | consumed samples: 12259840 | consumed tokens: 25108152320 | elapsed time per iteration (s): 1.04 | learning rate: 1.443E-04 | global batch size: 256 | lm loss: 2.029882E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.879 | TFLOPs: 40.80 | 15: iteration 47900/ 125429 | consumed samples: 12262400 | consumed tokens: 25113395200 | elapsed time per iteration (s): 1.02 | learning rate: 1.443E-04 | global batch size: 256 | lm loss: 2.011523E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.472 | TFLOPs: 41.39 | 15: iteration 47910/ 125429 | consumed samples: 12264960 | consumed tokens: 25118638080 | elapsed time per iteration (s): 1.03 | learning rate: 1.443E-04 | global batch size: 256 | lm loss: 2.042605E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 248.772 | TFLOPs: 41.11 | 15: iteration 47920/ 125429 | consumed samples: 12267520 | consumed tokens: 25123880960 | elapsed time per iteration (s): 1.05 | learning rate: 1.442E-04 | global batch size: 256 | lm loss: 2.016311E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.837 | TFLOPs: 40.13 | 15: iteration 47930/ 125429 | consumed samples: 12270080 | consumed tokens: 25129123840 | elapsed time per iteration (s): 1.07 | learning rate: 1.442E-04 | global batch size: 256 | lm loss: 2.020469E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.603 | TFLOPs: 39.60 | 15: iteration 47940/ 125429 | consumed samples: 12272640 | consumed tokens: 25134366720 | elapsed time per iteration (s): 1.07 | learning rate: 1.442E-04 | global batch size: 256 | lm loss: 2.040749E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.750 | TFLOPs: 39.46 | 15: iteration 47950/ 125429 | consumed samples: 12275200 | consumed tokens: 25139609600 | elapsed time per iteration (s): 1.05 | learning rate: 1.442E-04 | global batch size: 256 | lm loss: 2.025016E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.850 | TFLOPs: 40.30 | 15: iteration 47960/ 125429 | consumed samples: 12277760 | consumed tokens: 25144852480 | elapsed time per iteration (s): 1.07 | learning rate: 1.441E-04 | global batch size: 256 | lm loss: 2.015683E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.180 | TFLOPs: 39.53 | 15: iteration 47970/ 125429 | consumed samples: 12280320 | consumed tokens: 25150095360 | elapsed time per iteration (s): 1.03 | learning rate: 
1.441E-04 | global batch size: 256 | lm loss: 2.033299E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.860 | TFLOPs: 41.13 | 15: iteration 47980/ 125429 | consumed samples: 12282880 | consumed tokens: 25155338240 | elapsed time per iteration (s): 1.04 | learning rate: 1.441E-04 | global batch size: 256 | lm loss: 2.040983E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.054 | TFLOPs: 40.50 | 15: iteration 47990/ 125429 | consumed samples: 12285440 | consumed tokens: 25160581120 | elapsed time per iteration (s): 1.02 | learning rate: 1.441E-04 | global batch size: 256 | lm loss: 2.031903E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.908 | TFLOPs: 41.46 | 0: [2022-11-26 10:07:56,168] [INFO] [logging.py:68:log_dist] [Rank 0] step=48000, skipped=0, lr=[0.00014406213204787955, 0.00014406213204787955, 0.00014406213204787955], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 48000/ 125429 | consumed samples: 12288000 | consumed tokens: 25165824000 | elapsed time per iteration (s): 1.04 | learning rate: 1.441E-04 | global batch size: 256 | lm loss: 2.016200E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.015 | TFLOPs: 40.66 | 0: steps: 48000 loss: 2.0438 iter time (s): 1.101 samples/sec: 232.526 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 48000 | lm loss value: 2.038142E+00 | lm loss PPL: 7.676331E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 48000 to checkpoints_1b5 0: [2022-11-26 10:07:56,545] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step48000 
is begin to save! 0: [2022-11-26 10:07:56,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_01-model_00-model_states.pt... 0: [2022-11-26 10:07:56,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_01-model_00-model_states.pt. 0: [2022-11-26 10:07:56,784] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_03-model_00-model_states.pt... 0: [2022-11-26 10:07:56,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_03-model_00-model_states.pt. 0: [2022-11-26 10:07:56,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_04-model_00-model_states.pt... 0: [2022-11-26 10:07:56,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_04-model_00-model_states.pt. 0: [2022-11-26 10:07:56,985] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_05-model_00-model_states.pt... 0: [2022-11-26 10:07:57,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_05-model_00-model_states.pt. 0: [2022-11-26 10:07:57,080] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_06-model_00-model_states.pt... 0: [2022-11-26 10:07:57,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_06-model_00-model_states.pt. 0: [2022-11-26 10:07:57,182] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_07-model_00-model_states.pt... 0: [2022-11-26 10:07:57,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 10:07:57,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_08-model_00-model_states.pt... 0: [2022-11-26 10:07:57,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_08-model_00-model_states.pt. 0: [2022-11-26 10:07:57,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_09-model_00-model_states.pt... 0: [2022-11-26 10:07:57,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_09-model_00-model_states.pt. 0: [2022-11-26 10:07:57,485] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_10-model_00-model_states.pt... 0: [2022-11-26 10:07:57,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_10-model_00-model_states.pt. 0: [2022-11-26 10:07:57,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_11-model_00-model_states.pt... 0: [2022-11-26 10:07:57,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_11-model_00-model_states.pt. 0: [2022-11-26 10:07:57,686] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_12-model_00-model_states.pt... 0: [2022-11-26 10:07:57,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_12-model_00-model_states.pt. 0: [2022-11-26 10:07:57,794] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_13-model_00-model_states.pt... 0: [2022-11-26 10:07:57,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 10:07:57,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_14-model_00-model_states.pt... 0: [2022-11-26 10:07:58,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_14-model_00-model_states.pt. 0: [2022-11-26 10:07:58,011] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_15-model_00-model_states.pt... 0: [2022-11-26 10:07:58,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_15-model_00-model_states.pt. 0: [2022-11-26 10:07:58,120] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_16-model_00-model_states.pt... 0: [2022-11-26 10:07:58,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_16-model_00-model_states.pt. 0: [2022-11-26 10:07:58,228] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_17-model_00-model_states.pt... 0: [2022-11-26 10:07:58,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_17-model_00-model_states.pt. 0: [2022-11-26 10:07:58,336] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_18-model_00-model_states.pt... 0: [2022-11-26 10:07:58,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_18-model_00-model_states.pt. 0: [2022-11-26 10:07:58,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_19-model_00-model_states.pt... 0: [2022-11-26 10:07:58,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 10:07:58,552] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_20-model_00-model_states.pt... 0: [2022-11-26 10:07:58,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_20-model_00-model_states.pt. 0: [2022-11-26 10:07:58,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_21-model_00-model_states.pt... 0: [2022-11-26 10:07:58,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_21-model_00-model_states.pt. 0: [2022-11-26 10:07:58,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_22-model_00-model_states.pt... 0: [2022-11-26 10:07:58,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_22-model_00-model_states.pt. 0: [2022-11-26 10:07:58,871] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_23-model_00-model_states.pt... 0: [2022-11-26 10:07:58,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_23-model_00-model_states.pt. 0: [2022-11-26 10:07:58,980] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_24-model_00-model_states.pt... 0: [2022-11-26 10:07:59,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_24-model_00-model_states.pt. 0: [2022-11-26 10:07:59,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_25-model_00-model_states.pt... 0: [2022-11-26 10:07:59,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 10:07:59,197] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_26-model_00-model_states.pt... 0: [2022-11-26 10:07:59,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_26-model_00-model_states.pt. 0: [2022-11-26 10:07:59,304] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_27-model_00-model_states.pt... 0: [2022-11-26 10:07:59,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_27-model_00-model_states.pt. 0: [2022-11-26 10:07:59,409] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_28-model_00-model_states.pt... 0: [2022-11-26 10:07:59,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_28-model_00-model_states.pt. 0: [2022-11-26 10:07:59,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_29-model_00-model_states.pt... 0: [2022-11-26 10:07:59,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_29-model_00-model_states.pt. 0: [2022-11-26 10:07:59,624] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_30-model_00-model_states.pt... 0: [2022-11-26 10:07:59,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_30-model_00-model_states.pt. 0: [2022-11-26 10:07:59,732] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/layer_32-model_00-model_states.pt... 0: [2022-11-26 10:07:59,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 10:07:59,737] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step48000/mp_rank_00_model_states.pt 0: [2022-11-26 10:07:59,737] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/mp_rank_00_model_states.pt... 0: [2022-11-26 10:07:59,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/mp_rank_00_model_states.pt. 0: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 
12: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
8: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
10: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
15: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
14: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
7: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
6: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
2: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:07:59,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:07:59,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:07:59,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:07:59,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:07:59,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:07:59,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
11: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
15: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
13: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:07:59,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:07:59,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:07:59,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step48000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:07:59,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 10:07:59,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 10:07:59,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 6: [2022-11-26 10:07:59,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:07:59,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 10:07:59,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 10: [2022-11-26 10:07:59,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:07:59,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 10:07:59,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 15: [2022-11-26 10:07:59,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:07:59,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 10:07:59,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 10:07:59,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 10:07:59,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 15: [2022-11-26 10:07:59,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 10: [2022-11-26 10:07:59,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:07:59,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 10:07:59,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 3: [2022-11-26 10:07:59,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:07:59,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 10:07:59,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 2: [2022-11-26 10:07:59,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 10:07:59,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 10:07:59,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 1: [2022-11-26 10:07:59,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:07:59,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 10:07:59,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 10: [2022-11-26 10:07:59,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:07:59,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:07:59,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 10:07:59,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 10:07:59,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 10: [2022-11-26 10:07:59,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 5: [2022-11-26 10:07:59,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 10:07:59,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 12: [2022-11-26 10:07:59,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:07:59,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:07:59,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 10:07:59,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 5: [2022-11-26 10:07:59,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 5: [2022-11-26 10:07:59,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 10:07:59,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 5: [2022-11-26 10:07:59,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:07:59,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 10:07:59,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 3: [2022-11-26 10:07:59,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 10:07:59,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 10:07:59,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 15: [2022-11-26 10:07:59,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:07:59,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:07:59,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 3: [2022-11-26 10:07:59,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:07:59,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 3: [2022-11-26 10:07:59,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 15: [2022-11-26 10:07:59,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 15: [2022-11-26 10:07:59,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 3: [2022-11-26 10:07:59,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 6: [2022-11-26 10:07:59,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
3: [2022-11-26 10:07:59,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:07:59,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 6: [2022-11-26 10:07:59,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 3: [2022-11-26 10:07:59,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 6: [2022-11-26 10:07:59,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 6: [2022-11-26 10:07:59,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:07:59,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 10:07:59,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 14: [2022-11-26 10:07:59,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:07:59,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:07:59,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
14: [2022-11-26 10:07:59,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 6: [2022-11-26 10:07:59,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 5: [2022-11-26 10:07:59,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 14: [2022-11-26 10:07:59,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 6: [2022-11-26 10:07:59,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 5: [2022-11-26 10:07:59,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 2: [2022-11-26 10:07:59,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:07:59,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:07:59,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 10:07:59,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 5: [2022-11-26 10:07:59,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 10:07:59,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
5: [2022-11-26 10:07:59,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:07:59,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 10:07:59,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 14: [2022-11-26 10:07:59,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:07:59,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 10:07:59,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 1: [2022-11-26 10:07:59,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:07:59,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:07:59,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 10:07:59,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 10:07:59,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 1: [2022-11-26 10:07:59,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
1: [2022-11-26 10:07:59,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:07:59,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 10:07:59,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 12: [2022-11-26 10:07:59,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:07:59,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:07:59,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 10:07:59,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:07:59,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 10:07:59,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 12: [2022-11-26 10:07:59,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 10:07:59,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 12: [2022-11-26 10:07:59,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
1: [2022-11-26 10:07:59,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:07:59,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:07:59,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 10:07:59,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 10: [2022-11-26 10:07:59,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:07:59,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 10:07:59,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 5: [2022-11-26 10:07:59,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:07:59,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:07:59,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 10: [2022-11-26 10:07:59,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 5: [2022-11-26 10:07:59,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
10: [2022-11-26 10:07:59,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 15: [2022-11-26 10:07:59,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:07:59,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 10:07:59,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:07:59,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 15: [2022-11-26 10:07:59,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:07:59,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 10:07:59,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 15: [2022-11-26 10:07:59,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 10:07:59,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 6: [2022-11-26 10:07:59,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 10:07:59,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 10:07:59,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 3: [2022-11-26 10:07:59,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:07:59,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 10:07:59,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 9: [2022-11-26 10:07:59,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:07:59,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 10:07:59,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 2: [2022-11-26 10:07:59,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:07:59,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 10:07:59,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 5: [2022-11-26 10:07:59,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 10:07:59,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 10:07:59,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 2: [2022-11-26 10:07:59,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:07:59,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 10:07:59,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 12: [2022-11-26 10:07:59,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:07:59,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 10:07:59,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 15: [2022-11-26 10:07:59,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:07:59,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:07:59,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 10:07:59,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
9: [2022-11-26 10:07:59,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:07:59,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 9: [2022-11-26 10:07:59,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 15: [2022-11-26 10:07:59,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 9: [2022-11-26 10:07:59,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 12: [2022-11-26 10:07:59,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:07:59,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 10:07:59,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 2: [2022-11-26 10:07:59,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:07:59,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 10:07:59,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 10: [2022-11-26 10:07:59,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
2: [2022-11-26 10:07:59,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:07:59,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 2: [2022-11-26 10:07:59,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 10:07:59,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 10: [2022-11-26 10:07:59,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 2: [2022-11-26 10:07:59,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:07:59,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 10:07:59,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 14: [2022-11-26 10:07:59,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:07:59,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 10:07:59,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 14: [2022-11-26 10:07:59,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 10:07:59,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 10:07:59,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 14: [2022-11-26 10:07:59,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:07:59,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 10:07:59,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 14: [2022-11-26 10:07:59,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:07:59,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 10:07:59,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 14: [2022-11-26 10:07:59,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:07:59,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 10:07:59,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 13: [2022-11-26 10:07:59,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 10:07:59,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:07:59,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:07:59,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:07:59,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:07:59,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:07:59,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:07:59,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 10:07:59,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 10:07:59,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:07:59,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 10:07:59,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 9: [2022-11-26 10:07:59,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
9: [2022-11-26 10:07:59,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 10:07:59,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 9: [2022-11-26 10:07:59,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 1: [2022-11-26 10:07:59,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 10:07:59,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 1: [2022-11-26 10:07:59,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:07:59,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 10:07:59,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 1: [2022-11-26 10:07:59,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:07:59,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 10:07:59,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 3: [2022-11-26 10:07:59,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 10:07:59,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 10:07:59,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 13: [2022-11-26 10:07:59,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 10:07:59,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 10:07:59,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 10:07:59,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 10:07:59,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 13: [2022-11-26 10:07:59,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 13: [2022-11-26 10:07:59,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 13: [2022-11-26 10:07:59,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 13: [2022-11-26 10:07:59,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 10:07:59,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 10:07:59,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 13: [2022-11-26 10:07:59,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:07:59,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 10:07:59,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:07:59,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:07:59,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 10:07:59,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 13: [2022-11-26 10:07:59,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 10:07:59,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 13: [2022-11-26 10:07:59,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 0: [2022-11-26 10:07:59,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 10:07:59,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:07:59,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:07:59,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 10:07:59,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 0: [2022-11-26 10:07:59,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 10:07:59,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 0: [2022-11-26 10:07:59,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:07:59,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 10:07:59,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 9: [2022-11-26 10:08:00,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:08:00,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 10:08:00,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 10:08:00,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 10:08:00,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 9: [2022-11-26 10:08:00,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 12: [2022-11-26 10:08:00,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:08:00,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 10:08:00,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 12: [2022-11-26 10:08:00,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:08:00,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 10:08:00,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 14: [2022-11-26 10:08:00,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 10:08:00,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 10:08:00,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 3: [2022-11-26 10:08:00,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:08:00,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 10:08:00,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 3: [2022-11-26 10:08:00,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:08:00,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 10:08:00,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 6: [2022-11-26 10:08:00,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:08:00,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 10:08:00,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 6: [2022-11-26 10:08:00,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 10:08:00,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 10:08:00,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 2: [2022-11-26 10:08:00,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:08:00,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 10:08:00,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 0: [2022-11-26 10:08:00,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:08:00,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 10:08:00,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:08:00,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 0: [2022-11-26 10:08:00,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 10:08:00,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:08:00,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
0: [2022-11-26 10:08:00,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 10:08:00,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 0: [2022-11-26 10:08:00,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:08:00,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 10:08:00,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 8: [2022-11-26 10:08:00,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:08:00,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:08:00,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:08:00,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:08:00,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:08:00,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:08:00,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 10:08:00,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 10:08:00,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 10:08:00,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 10:08:00,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 10:08:00,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:08:00,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 10:08:00,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 8: [2022-11-26 10:08:00,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 8: [2022-11-26 10:08:00,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 10:08:00,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 10:08:00,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 8: [2022-11-26 10:08:00,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
8: [2022-11-26 10:08:00,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 8: [2022-11-26 10:08:00,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 10:08:00,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 8: [2022-11-26 10:08:00,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 8: [2022-11-26 10:08:00,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 7: [2022-11-26 10:08:00,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:08:00,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:08:00,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:08:00,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:08:00,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:08:00,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:08:00,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 10:08:00,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 10:08:00,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:08:00,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 10:08:00,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 10:08:00,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 10:08:00,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 10:08:00,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 10:08:00,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 7: [2022-11-26 10:08:00,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 10:08:00,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
7: [2022-11-26 10:08:00,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 10:08:00,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 7: [2022-11-26 10:08:00,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 7: [2022-11-26 10:08:00,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 7: [2022-11-26 10:08:00,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 7: [2022-11-26 10:08:00,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 7: [2022-11-26 10:08:00,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 0: [2022-11-26 10:08:00,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 10:08:00,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 4: [2022-11-26 10:08:00,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:08:00,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:08:00,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:08:00,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 10:08:00,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:08:00,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:08:00,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:08:00,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:08:00,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 10:08:00,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 10:08:00,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 10:08:00,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 10:08:00,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 10:08:00,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
4: [2022-11-26 10:08:00,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 10:08:00,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 10:08:00,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 10:08:00,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 4: [2022-11-26 10:08:00,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 4: [2022-11-26 10:08:00,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 4: [2022-11-26 10:08:00,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 4: [2022-11-26 10:08:00,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 4: [2022-11-26 10:08:00,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 4: [2022-11-26 10:08:00,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 11: [2022-11-26 10:08:00,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:08:00,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:08:00,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 10:08:00,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:08:00,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:08:00,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 10:08:00,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 10:08:00,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:08:00,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:08:00,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 10:08:00,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 10:08:00,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 10:08:00,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 11: [2022-11-26 10:08:00,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 11: [2022-11-26 10:08:00,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
11: [2022-11-26 10:08:00,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 10:08:00,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 10:08:00,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 11: [2022-11-26 10:08:00,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 11: [2022-11-26 10:08:00,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 11: [2022-11-26 10:08:00,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 11: [2022-11-26 10:08:00,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:08:00,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step48000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 10:08:00,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step48000 is ready now! 
0: successfully saved checkpoint at iteration 48000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3764.10 15: iteration 48010/ 125429 | consumed samples: 12290560 | consumed tokens: 25171066880 | elapsed time per iteration (s): 1.46 | learning rate: 1.440E-04 | global batch size: 256 | lm loss: 2.023023E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 175.157 | TFLOPs: 28.95 | 15: iteration 48020/ 125429 | consumed samples: 12293120 | consumed tokens: 25176309760 | elapsed time per iteration (s): 1.06 | learning rate: 1.440E-04 | global batch size: 256 | lm loss: 2.068700E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.583 | TFLOPs: 39.92 | 15: iteration 48030/ 125429 | consumed samples: 12295680 | consumed tokens: 25181552640 | elapsed time per iteration (s): 1.04 | learning rate: 1.440E-04 | global batch size: 256 | lm loss: 2.024379E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.607 | TFLOPs: 40.75 | 15: iteration 48040/ 125429 | consumed samples: 12298240 | consumed tokens: 25186795520 | elapsed time per iteration (s): 1.04 | learning rate: 1.440E-04 | global batch size: 256 | lm loss: 2.027109E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.249 | TFLOPs: 40.69 | 15: iteration 48050/ 125429 | consumed samples: 12300800 | consumed tokens: 25192038400 | elapsed time per iteration (s): 1.06 | learning rate: 1.440E-04 | global batch size: 256 | lm loss: 2.010865E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.709 | TFLOPs: 39.78 | 15: iteration 48060/ 125429 | consumed samples: 12303360 | consumed tokens: 25197281280 | elapsed time per iteration (s): 1.06 | 
learning rate: 1.439E-04 | global batch size: 256 | lm loss: 2.052983E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.204 | TFLOPs: 39.86 | 15: iteration 48070/ 125429 | consumed samples: 12305920 | consumed tokens: 25202524160 | elapsed time per iteration (s): 1.41 | learning rate: 1.439E-04 | global batch size: 256 | lm loss: 2.010847E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 182.130 | TFLOPs: 30.10 | 15: iteration 48080/ 125429 | consumed samples: 12308480 | consumed tokens: 25207767040 | elapsed time per iteration (s): 1.20 | learning rate: 1.439E-04 | global batch size: 256 | lm loss: 2.036890E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.212 | TFLOPs: 35.40 | 15: iteration 48090/ 125429 | consumed samples: 12311040 | consumed tokens: 25213009920 | elapsed time per iteration (s): 1.06 | learning rate: 1.439E-04 | global batch size: 256 | lm loss: 2.013242E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.445 | TFLOPs: 40.07 | 15: iteration 48100/ 125429 | consumed samples: 12313600 | consumed tokens: 25218252800 | elapsed time per iteration (s): 1.08 | learning rate: 1.439E-04 | global batch size: 256 | lm loss: 2.013795E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.836 | TFLOPs: 39.30 | 15: iteration 48110/ 125429 | consumed samples: 12316160 | consumed tokens: 25223495680 | elapsed time per iteration (s): 1.04 | learning rate: 1.438E-04 | global batch size: 256 | lm loss: 2.014615E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.372 | TFLOPs: 40.71 | 15: iteration 48120/ 
125429 | consumed samples: 12318720 | consumed tokens: 25228738560 | elapsed time per iteration (s): 1.05 | learning rate: 1.438E-04 | global batch size: 256 | lm loss: 2.003400E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.988 | TFLOPs: 40.32 | 15: iteration 48130/ 125429 | consumed samples: 12321280 | consumed tokens: 25233981440 | elapsed time per iteration (s): 1.03 | learning rate: 1.438E-04 | global batch size: 256 | lm loss: 2.017036E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.457 | TFLOPs: 40.89 | 15: iteration 48140/ 125429 | consumed samples: 12323840 | consumed tokens: 25239224320 | elapsed time per iteration (s): 1.03 | learning rate: 1.438E-04 | global batch size: 256 | lm loss: 2.027227E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.391 | TFLOPs: 40.88 | 15: iteration 48150/ 125429 | consumed samples: 12326400 | consumed tokens: 25244467200 | elapsed time per iteration (s): 1.08 | learning rate: 1.437E-04 | global batch size: 256 | lm loss: 2.016250E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.351 | TFLOPs: 39.06 | 15: iteration 48160/ 125429 | consumed samples: 12328960 | consumed tokens: 25249710080 | elapsed time per iteration (s): 1.05 | learning rate: 1.437E-04 | global batch size: 256 | lm loss: 2.029502E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.963 | TFLOPs: 40.15 | 15: iteration 48170/ 125429 | consumed samples: 12331520 | consumed tokens: 25254952960 | elapsed time per iteration (s): 1.09 | learning rate: 1.437E-04 | global batch size: 256 | lm loss: 2.053662E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 235.702 | TFLOPs: 38.95 | 15: iteration 48180/ 125429 | consumed samples: 12334080 | consumed tokens: 25260195840 | elapsed time per iteration (s): 1.03 | learning rate: 1.437E-04 | global batch size: 256 | lm loss: 2.004586E+00 | grad norm: 0.207 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.664 | TFLOPs: 40.93 | 15: iteration 48190/ 125429 | consumed samples: 12336640 | consumed tokens: 25265438720 | elapsed time per iteration (s): 1.04 | learning rate: 1.437E-04 | global batch size: 256 | lm loss: 2.077353E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.208 | TFLOPs: 40.69 | 15: iteration 48200/ 125429 | consumed samples: 12339200 | consumed tokens: 25270681600 | elapsed time per iteration (s): 1.08 | learning rate: 1.436E-04 | global batch size: 256 | lm loss: 2.027663E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.528 | TFLOPs: 39.09 | 15: iteration 48210/ 125429 | consumed samples: 12341760 | consumed tokens: 25275924480 | elapsed time per iteration (s): 1.02 | learning rate: 1.436E-04 | global batch size: 256 | lm loss: 2.023590E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.927 | TFLOPs: 41.30 | 15: iteration 48220/ 125429 | consumed samples: 12344320 | consumed tokens: 25281167360 | elapsed time per iteration (s): 1.03 | learning rate: 1.436E-04 | global batch size: 256 | lm loss: 2.027421E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.891 | TFLOPs: 41.13 | 15: iteration 48230/ 125429 | consumed samples: 12346880 | consumed tokens: 25286410240 | elapsed time per iteration (s): 1.02 | learning rate: 
1.436E-04 | global batch size: 256 | lm loss: 2.011818E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.947 | TFLOPs: 41.47 | 15: iteration 48240/ 125429 | consumed samples: 12349440 | consumed tokens: 25291653120 | elapsed time per iteration (s): 1.04 | learning rate: 1.436E-04 | global batch size: 256 | lm loss: 2.015701E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.834 | TFLOPs: 40.63 | 15: iteration 48250/ 125429 | consumed samples: 12352000 | consumed tokens: 25296896000 | elapsed time per iteration (s): 1.05 | learning rate: 1.435E-04 | global batch size: 256 | lm loss: 2.038593E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.791 | TFLOPs: 40.29 | 15: iteration 48260/ 125429 | consumed samples: 12354560 | consumed tokens: 25302138880 | elapsed time per iteration (s): 1.04 | learning rate: 1.435E-04 | global batch size: 256 | lm loss: 1.978012E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.909 | TFLOPs: 40.80 | 15: iteration 48270/ 125429 | consumed samples: 12357120 | consumed tokens: 25307381760 | elapsed time per iteration (s): 1.05 | learning rate: 1.435E-04 | global batch size: 256 | lm loss: 2.046242E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.485 | TFLOPs: 40.40 | 15: iteration 48280/ 125429 | consumed samples: 12359680 | consumed tokens: 25312624640 | elapsed time per iteration (s): 1.06 | learning rate: 1.435E-04 | global batch size: 256 | lm loss: 2.035667E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.627 | TFLOPs: 40.10 | 15: iteration 48290/ 125429 | 
consumed samples: 12362240 | consumed tokens: 25317867520 | elapsed time per iteration (s): 1.07 | learning rate: 1.435E-04 | global batch size: 256 | lm loss: 2.035474E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.677 | TFLOPs: 39.61 | 15: iteration 48300/ 125429 | consumed samples: 12364800 | consumed tokens: 25323110400 | elapsed time per iteration (s): 1.04 | learning rate: 1.434E-04 | global batch size: 256 | lm loss: 2.039185E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.472 | TFLOPs: 40.73 | 15: iteration 48310/ 125429 | consumed samples: 12367360 | consumed tokens: 25328353280 | elapsed time per iteration (s): 1.04 | learning rate: 1.434E-04 | global batch size: 256 | lm loss: 2.055984E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.190 | TFLOPs: 40.68 | 15: iteration 48320/ 125429 | consumed samples: 12369920 | consumed tokens: 25333596160 | elapsed time per iteration (s): 1.08 | learning rate: 1.434E-04 | global batch size: 256 | lm loss: 2.018120E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.998 | TFLOPs: 39.00 | 15: iteration 48330/ 125429 | consumed samples: 12372480 | consumed tokens: 25338839040 | elapsed time per iteration (s): 1.06 | learning rate: 1.434E-04 | global batch size: 256 | lm loss: 2.013297E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.312 | TFLOPs: 40.04 | 15: iteration 48340/ 125429 | consumed samples: 12375040 | consumed tokens: 25344081920 | elapsed time per iteration (s): 1.05 | learning rate: 1.433E-04 | global batch size: 256 | lm loss: 2.008464E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 243.764 | TFLOPs: 40.28 | 15: iteration 48350/ 125429 | consumed samples: 12377600 | consumed tokens: 25349324800 | elapsed time per iteration (s): 1.11 | learning rate: 1.433E-04 | global batch size: 256 | lm loss: 2.013485E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.174 | TFLOPs: 38.20 | 15: iteration 48360/ 125429 | consumed samples: 12380160 | consumed tokens: 25354567680 | elapsed time per iteration (s): 1.03 | learning rate: 1.433E-04 | global batch size: 256 | lm loss: 2.026772E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.955 | TFLOPs: 40.98 | 15: iteration 48370/ 125429 | consumed samples: 12382720 | consumed tokens: 25359810560 | elapsed time per iteration (s): 1.07 | learning rate: 1.433E-04 | global batch size: 256 | lm loss: 2.041999E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.225 | TFLOPs: 39.53 | 15: iteration 48380/ 125429 | consumed samples: 12385280 | consumed tokens: 25365053440 | elapsed time per iteration (s): 1.06 | learning rate: 1.433E-04 | global batch size: 256 | lm loss: 2.024706E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.564 | TFLOPs: 39.76 | 15: iteration 48390/ 125429 | consumed samples: 12387840 | consumed tokens: 25370296320 | elapsed time per iteration (s): 1.05 | learning rate: 1.432E-04 | global batch size: 256 | lm loss: 2.010219E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.631 | TFLOPs: 40.43 | 15: iteration 48400/ 125429 | consumed samples: 12390400 | consumed tokens: 25375539200 | elapsed time per iteration (s): 1.03 | learning rate: 1.432E-04 | global batch 
size: 256 | lm loss: 2.019835E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.635 | TFLOPs: 41.09 | 15: iteration 48410/ 125429 | consumed samples: 12392960 | consumed tokens: 25380782080 | elapsed time per iteration (s): 1.04 | learning rate: 1.432E-04 | global batch size: 256 | lm loss: 2.022623E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.895 | TFLOPs: 40.64 | 15: iteration 48420/ 125429 | consumed samples: 12395520 | consumed tokens: 25386024960 | elapsed time per iteration (s): 1.06 | learning rate: 1.432E-04 | global batch size: 256 | lm loss: 2.014387E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.342 | TFLOPs: 39.88 | 15: iteration 48430/ 125429 | consumed samples: 12398080 | consumed tokens: 25391267840 | elapsed time per iteration (s): 1.04 | learning rate: 1.432E-04 | global batch size: 256 | lm loss: 2.039186E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.165 | TFLOPs: 40.52 | 15: iteration 48440/ 125429 | consumed samples: 12400640 | consumed tokens: 25396510720 | elapsed time per iteration (s): 1.03 | learning rate: 1.431E-04 | global batch size: 256 | lm loss: 2.010694E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.917 | TFLOPs: 41.14 | 15: iteration 48450/ 125429 | consumed samples: 12403200 | consumed tokens: 25401753600 | elapsed time per iteration (s): 1.05 | learning rate: 1.431E-04 | global batch size: 256 | lm loss: 1.997048E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.593 | TFLOPs: 40.42 | 15: iteration 48460/ 125429 | consumed samples: 12405760 | 
consumed tokens: 25406996480 | elapsed time per iteration (s): 1.08 | learning rate: 1.431E-04 | global batch size: 256 | lm loss: 2.043758E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.249 | TFLOPs: 39.04 | 15: iteration 48470/ 125429 | consumed samples: 12408320 | consumed tokens: 25412239360 | elapsed time per iteration (s): 1.06 | learning rate: 1.431E-04 | global batch size: 256 | lm loss: 2.005628E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.479 | TFLOPs: 39.91 | 15: iteration 48480/ 125429 | consumed samples: 12410880 | consumed tokens: 25417482240 | elapsed time per iteration (s): 1.03 | learning rate: 1.430E-04 | global batch size: 256 | lm loss: 2.009535E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.083 | TFLOPs: 41.00 | 15: iteration 48490/ 125429 | consumed samples: 12413440 | consumed tokens: 25422725120 | elapsed time per iteration (s): 1.14 | learning rate: 1.430E-04 | global batch size: 256 | lm loss: 2.047329E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 223.646 | TFLOPs: 36.96 | 15: iteration 48500/ 125429 | consumed samples: 12416000 | consumed tokens: 25427968000 | elapsed time per iteration (s): 1.07 | learning rate: 1.430E-04 | global batch size: 256 | lm loss: 2.028841E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.988 | TFLOPs: 39.49 | 15: iteration 48510/ 125429 | consumed samples: 12418560 | consumed tokens: 25433210880 | elapsed time per iteration (s): 1.06 | learning rate: 1.430E-04 | global batch size: 256 | lm loss: 2.035212E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 241.920 | TFLOPs: 39.98 | 15: iteration 48520/ 125429 | consumed samples: 12421120 | consumed tokens: 25438453760 | elapsed time per iteration (s): 1.03 | learning rate: 1.430E-04 | global batch size: 256 | lm loss: 2.008769E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.789 | TFLOPs: 40.95 | 15: iteration 48530/ 125429 | consumed samples: 12423680 | consumed tokens: 25443696640 | elapsed time per iteration (s): 1.04 | learning rate: 1.429E-04 | global batch size: 256 | lm loss: 2.021308E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.382 | TFLOPs: 40.72 | 15: iteration 48540/ 125429 | consumed samples: 12426240 | consumed tokens: 25448939520 | elapsed time per iteration (s): 1.04 | learning rate: 1.429E-04 | global batch size: 256 | lm loss: 2.014009E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.048 | TFLOPs: 40.50 | 15: iteration 48550/ 125429 | consumed samples: 12428800 | consumed tokens: 25454182400 | elapsed time per iteration (s): 1.06 | learning rate: 1.429E-04 | global batch size: 256 | lm loss: 2.039227E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.556 | TFLOPs: 39.92 | 15: iteration 48560/ 125429 | consumed samples: 12431360 | consumed tokens: 25459425280 | elapsed time per iteration (s): 1.05 | learning rate: 1.429E-04 | global batch size: 256 | lm loss: 2.026258E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.340 | TFLOPs: 40.21 | 15: iteration 48570/ 125429 | consumed samples: 12433920 | consumed tokens: 25464668160 | elapsed time per iteration (s): 1.03 | learning rate: 1.429E-04 | global batch size: 256 | lm loss: 
2.034904E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.680 | TFLOPs: 41.10 | 15: iteration 48580/ 125429 | consumed samples: 12436480 | consumed tokens: 25469911040 | elapsed time per iteration (s): 1.05 | learning rate: 1.428E-04 | global batch size: 256 | lm loss: 2.005548E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.398 | TFLOPs: 40.22 | 15: iteration 48590/ 125429 | consumed samples: 12439040 | consumed tokens: 25475153920 | elapsed time per iteration (s): 1.07 | learning rate: 1.428E-04 | global batch size: 256 | lm loss: 2.037292E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.403 | TFLOPs: 39.56 | 15: iteration 48600/ 125429 | consumed samples: 12441600 | consumed tokens: 25480396800 | elapsed time per iteration (s): 1.16 | learning rate: 1.428E-04 | global batch size: 256 | lm loss: 1.998022E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.371 | TFLOPs: 36.58 | 15: iteration 48610/ 125429 | consumed samples: 12444160 | consumed tokens: 25485639680 | elapsed time per iteration (s): 1.14 | learning rate: 1.428E-04 | global batch size: 256 | lm loss: 2.011630E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 224.097 | TFLOPs: 37.03 | 15: iteration 48620/ 125429 | consumed samples: 12446720 | consumed tokens: 25490882560 | elapsed time per iteration (s): 1.12 | learning rate: 1.428E-04 | global batch size: 256 | lm loss: 2.042354E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.632 | TFLOPs: 37.78 | 15: iteration 48630/ 125429 | consumed samples: 12449280 | consumed tokens: 
25496125440 | elapsed time per iteration (s): 1.05 | learning rate: 1.427E-04 | global batch size: 256 | lm loss: 2.032801E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.369 | TFLOPs: 40.22 | 15: iteration 48640/ 125429 | consumed samples: 12451840 | consumed tokens: 25501368320 | elapsed time per iteration (s): 1.04 | learning rate: 1.427E-04 | global batch size: 256 | lm loss: 2.019591E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.061 | TFLOPs: 40.66 | 15: iteration 48650/ 125429 | consumed samples: 12454400 | consumed tokens: 25506611200 | elapsed time per iteration (s): 1.03 | learning rate: 1.427E-04 | global batch size: 256 | lm loss: 2.016424E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.137 | TFLOPs: 41.17 | 15: iteration 48660/ 125429 | consumed samples: 12456960 | consumed tokens: 25511854080 | elapsed time per iteration (s): 1.07 | learning rate: 1.427E-04 | global batch size: 256 | lm loss: 2.049922E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.806 | TFLOPs: 39.46 | 15: iteration 48670/ 125429 | consumed samples: 12459520 | consumed tokens: 25517096960 | elapsed time per iteration (s): 1.07 | learning rate: 1.426E-04 | global batch size: 256 | lm loss: 2.027543E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.185 | TFLOPs: 39.36 | 15: iteration 48680/ 125429 | consumed samples: 12462080 | consumed tokens: 25522339840 | elapsed time per iteration (s): 1.03 | learning rate: 1.426E-04 | global batch size: 256 | lm loss: 2.009141E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 249.617 | TFLOPs: 41.25 | 15: iteration 48690/ 125429 | consumed samples: 12464640 | consumed tokens: 25527582720 | elapsed time per iteration (s): 1.14 | learning rate: 1.426E-04 | global batch size: 256 | lm loss: 2.014544E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.391 | TFLOPs: 37.25 | 15: iteration 48700/ 125429 | consumed samples: 12467200 | consumed tokens: 25532825600 | elapsed time per iteration (s): 1.06 | learning rate: 1.426E-04 | global batch size: 256 | lm loss: 2.016085E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.101 | TFLOPs: 39.84 | 15: iteration 48710/ 125429 | consumed samples: 12469760 | consumed tokens: 25538068480 | elapsed time per iteration (s): 1.06 | learning rate: 1.426E-04 | global batch size: 256 | lm loss: 2.013698E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.683 | TFLOPs: 39.94 | 15: iteration 48720/ 125429 | consumed samples: 12472320 | consumed tokens: 25543311360 | elapsed time per iteration (s): 1.19 | learning rate: 1.425E-04 | global batch size: 256 | lm loss: 2.045693E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.971 | TFLOPs: 35.69 | 15: iteration 48730/ 125429 | consumed samples: 12474880 | consumed tokens: 25548554240 | elapsed time per iteration (s): 1.07 | learning rate: 1.425E-04 | global batch size: 256 | lm loss: 2.038239E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.616 | TFLOPs: 39.43 | 15: iteration 48740/ 125429 | consumed samples: 12477440 | consumed tokens: 25553797120 | elapsed time per iteration (s): 1.03 | learning rate: 1.425E-04 | global batch size: 256 | lm loss: 1.987170E+00 | grad 
norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.001 | TFLOPs: 41.15 | 15: iteration 48750/ 125429 | consumed samples: 12480000 | consumed tokens: 25559040000 | elapsed time per iteration (s): 1.04 | learning rate: 1.425E-04 | global batch size: 256 | lm loss: 2.021738E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.703 | TFLOPs: 40.77 | 15: iteration 48760/ 125429 | consumed samples: 12482560 | consumed tokens: 25564282880 | elapsed time per iteration (s): 1.42 | learning rate: 1.425E-04 | global batch size: 256 | lm loss: 2.016210E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 180.438 | TFLOPs: 29.82 | 15: iteration 48770/ 125429 | consumed samples: 12485120 | consumed tokens: 25569525760 | elapsed time per iteration (s): 1.08 | learning rate: 1.424E-04 | global batch size: 256 | lm loss: 2.013721E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.938 | TFLOPs: 39.32 | 15: iteration 48780/ 125429 | consumed samples: 12487680 | consumed tokens: 25574768640 | elapsed time per iteration (s): 1.13 | learning rate: 1.424E-04 | global batch size: 256 | lm loss: 2.037430E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.083 | TFLOPs: 37.53 | 15: iteration 48790/ 125429 | consumed samples: 12490240 | consumed tokens: 25580011520 | elapsed time per iteration (s): 1.04 | learning rate: 1.424E-04 | global batch size: 256 | lm loss: 2.018893E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.897 | TFLOPs: 40.80 | 15: iteration 48800/ 125429 | consumed samples: 12492800 | consumed tokens: 25585254400 | elapsed time 
per iteration (s): 1.02 | learning rate: 1.424E-04 | global batch size: 256 | lm loss: 2.035863E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.059 | TFLOPs: 41.49 | 15: iteration 48810/ 125429 | consumed samples: 12495360 | consumed tokens: 25590497280 | elapsed time per iteration (s): 1.05 | learning rate: 1.423E-04 | global batch size: 256 | lm loss: 2.023141E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.715 | TFLOPs: 40.44 | 15: iteration 48820/ 125429 | consumed samples: 12497920 | consumed tokens: 25595740160 | elapsed time per iteration (s): 1.05 | learning rate: 1.423E-04 | global batch size: 256 | lm loss: 2.027054E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.482 | TFLOPs: 40.24 | 15: iteration 48830/ 125429 | consumed samples: 12500480 | consumed tokens: 25600983040 | elapsed time per iteration (s): 1.12 | learning rate: 1.423E-04 | global batch size: 256 | lm loss: 2.046479E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.583 | TFLOPs: 37.94 | 15: iteration 48840/ 125429 | consumed samples: 12503040 | consumed tokens: 25606225920 | elapsed time per iteration (s): 1.04 | learning rate: 1.423E-04 | global batch size: 256 | lm loss: 2.005551E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.071 | TFLOPs: 40.67 | 15: iteration 48850/ 125429 | consumed samples: 12505600 | consumed tokens: 25611468800 | elapsed time per iteration (s): 1.04 | learning rate: 1.423E-04 | global batch size: 256 | lm loss: 2.023370E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.643 | TFLOPs: 
40.59 | 15: iteration 48860/ 125429 | consumed samples: 12508160 | consumed tokens: 25616711680 | elapsed time per iteration (s): 1.03 | learning rate: 1.422E-04 | global batch size: 256 | lm loss: 2.011456E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.614 | TFLOPs: 40.92 | 15: iteration 48870/ 125429 | consumed samples: 12510720 | consumed tokens: 25621954560 | elapsed time per iteration (s): 1.03 | learning rate: 1.422E-04 | global batch size: 256 | lm loss: 2.000303E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.745 | TFLOPs: 40.94 | 15: iteration 48880/ 125429 | consumed samples: 12513280 | consumed tokens: 25627197440 | elapsed time per iteration (s): 1.04 | learning rate: 1.422E-04 | global batch size: 256 | lm loss: 2.018336E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.464 | TFLOPs: 40.73 | 15: iteration 48890/ 125429 | consumed samples: 12515840 | consumed tokens: 25632440320 | elapsed time per iteration (s): 1.05 | learning rate: 1.422E-04 | global batch size: 256 | lm loss: 2.021309E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.182 | TFLOPs: 40.19 | 15: iteration 48900/ 125429 | consumed samples: 12518400 | consumed tokens: 25637683200 | elapsed time per iteration (s): 1.03 | learning rate: 1.422E-04 | global batch size: 256 | lm loss: 2.012288E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.427 | TFLOPs: 41.05 | 15: iteration 48910/ 125429 | consumed samples: 12520960 | consumed tokens: 25642926080 | elapsed time per iteration (s): 1.04 | learning rate: 1.421E-04 | global batch size: 256 | lm loss: 2.038471E+00 | grad norm: 0.127 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.770 | TFLOPs: 40.78 | 15: iteration 48920/ 125429 | consumed samples: 12523520 | consumed tokens: 25648168960 | elapsed time per iteration (s): 1.07 | learning rate: 1.421E-04 | global batch size: 256 | lm loss: 1.997963E+00 | grad norm: 0.120 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.274 | TFLOPs: 39.71 | 15: iteration 48930/ 125429 | consumed samples: 12526080 | consumed tokens: 25653411840 | elapsed time per iteration (s): 1.04 | learning rate: 1.421E-04 | global batch size: 256 | lm loss: 2.033680E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.755 | TFLOPs: 40.61 | 15: iteration 48940/ 125429 | consumed samples: 12528640 | consumed tokens: 25658654720 | elapsed time per iteration (s): 1.04 | learning rate: 1.421E-04 | global batch size: 256 | lm loss: 2.000715E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.189 | TFLOPs: 40.52 | 15: iteration 48950/ 125429 | consumed samples: 12531200 | consumed tokens: 25663897600 | elapsed time per iteration (s): 1.06 | learning rate: 1.421E-04 | global batch size: 256 | lm loss: 2.014349E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.684 | TFLOPs: 39.77 | 15: iteration 48960/ 125429 | consumed samples: 12533760 | consumed tokens: 25669140480 | elapsed time per iteration (s): 1.04 | learning rate: 1.420E-04 | global batch size: 256 | lm loss: 2.034634E+00 | grad norm: 0.122 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.933 | TFLOPs: 40.81 | 15: iteration 48970/ 125429 | consumed samples: 12536320 | consumed tokens: 25674383360 | elapsed time per iteration (s): 1.03 | 
learning rate: 1.420E-04 | global batch size: 256 | lm loss: 2.006250E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.567 | TFLOPs: 41.24 |
15: iteration 48980/ 125429 | consumed samples: 12538880 | consumed tokens: 25679626240 | elapsed time per iteration (s): 1.05 | learning rate: 1.420E-04 | global batch size: 256 | lm loss: 2.058150E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.749 | TFLOPs: 40.45 |
15: iteration 48990/ 125429 | consumed samples: 12541440 | consumed tokens: 25684869120 | elapsed time per iteration (s): 1.02 | learning rate: 1.420E-04 | global batch size: 256 | lm loss: 2.031852E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.807 | TFLOPs: 41.28 |
15: iteration 49000/ 125429 | consumed samples: 12544000 | consumed tokens: 25690112000 | elapsed time per iteration (s): 1.04 | learning rate: 1.419E-04 | global batch size: 256 | lm loss: 2.008131E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.796 | TFLOPs: 40.78 |
15: -------------------------------------------------------------------------------------------
15: valid loss at iteration 49000 | lm loss value: 1.998795E+00 | lm loss PPL: 7.380159E+00 |
15: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 49000 to checkpoints_1b5
0: [2022-11-26 10:25:45,633] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step49000 is begin to save!
0: [2022-11-26 10:25:45,639] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_01-model_00-model_states.pt...
0: [2022-11-26 10:25:45,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_01-model_00-model_states.pt. 0: [2022-11-26 10:25:45,899] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_03-model_00-model_states.pt... 0: [2022-11-26 10:25:46,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_03-model_00-model_states.pt. 0: [2022-11-26 10:25:46,003] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_04-model_00-model_states.pt... 0: [2022-11-26 10:25:46,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_04-model_00-model_states.pt. 0: [2022-11-26 10:25:46,134] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_05-model_00-model_states.pt... 0: [2022-11-26 10:25:46,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_05-model_00-model_states.pt. 0: [2022-11-26 10:25:46,242] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_06-model_00-model_states.pt... 0: [2022-11-26 10:25:46,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_06-model_00-model_states.pt. 0: [2022-11-26 10:25:46,351] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_07-model_00-model_states.pt... 0: [2022-11-26 10:25:46,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_07-model_00-model_states.pt. 0: [2022-11-26 10:25:46,462] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 10:25:46,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_08-model_00-model_states.pt. 0: [2022-11-26 10:25:46,571] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_09-model_00-model_states.pt... 0: [2022-11-26 10:25:46,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_09-model_00-model_states.pt. 0: [2022-11-26 10:25:46,679] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_10-model_00-model_states.pt... 0: [2022-11-26 10:25:46,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_10-model_00-model_states.pt. 0: [2022-11-26 10:25:46,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_11-model_00-model_states.pt... 0: [2022-11-26 10:25:46,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_11-model_00-model_states.pt. 0: [2022-11-26 10:25:46,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_12-model_00-model_states.pt... 0: [2022-11-26 10:25:46,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_12-model_00-model_states.pt. 0: [2022-11-26 10:25:47,000] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_13-model_00-model_states.pt... 0: [2022-11-26 10:25:47,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_13-model_00-model_states.pt. 0: [2022-11-26 10:25:47,105] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 10:25:47,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_14-model_00-model_states.pt. 0: [2022-11-26 10:25:47,205] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_15-model_00-model_states.pt... 0: [2022-11-26 10:25:47,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_15-model_00-model_states.pt. 0: [2022-11-26 10:25:47,311] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_16-model_00-model_states.pt... 0: [2022-11-26 10:25:47,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_16-model_00-model_states.pt. 0: [2022-11-26 10:25:47,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_17-model_00-model_states.pt... 0: [2022-11-26 10:25:47,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_17-model_00-model_states.pt. 0: [2022-11-26 10:25:47,513] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_18-model_00-model_states.pt... 0: [2022-11-26 10:25:47,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_18-model_00-model_states.pt. 0: [2022-11-26 10:25:47,619] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_19-model_00-model_states.pt... 0: [2022-11-26 10:25:47,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_19-model_00-model_states.pt. 0: [2022-11-26 10:25:47,717] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 10:25:47,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_20-model_00-model_states.pt. 0: [2022-11-26 10:25:47,819] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_21-model_00-model_states.pt... 0: [2022-11-26 10:25:47,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_21-model_00-model_states.pt. 0: [2022-11-26 10:25:47,920] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_22-model_00-model_states.pt... 0: [2022-11-26 10:25:48,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_22-model_00-model_states.pt. 0: [2022-11-26 10:25:48,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_23-model_00-model_states.pt... 0: [2022-11-26 10:25:48,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_23-model_00-model_states.pt. 0: [2022-11-26 10:25:48,123] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_24-model_00-model_states.pt... 0: [2022-11-26 10:25:48,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_24-model_00-model_states.pt. 0: [2022-11-26 10:25:48,225] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_25-model_00-model_states.pt... 0: [2022-11-26 10:25:48,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_25-model_00-model_states.pt. 0: [2022-11-26 10:25:48,325] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 10:25:48,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_26-model_00-model_states.pt. 0: [2022-11-26 10:25:48,426] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_27-model_00-model_states.pt... 0: [2022-11-26 10:25:48,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_27-model_00-model_states.pt. 0: [2022-11-26 10:25:48,527] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_28-model_00-model_states.pt... 0: [2022-11-26 10:25:48,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_28-model_00-model_states.pt. 0: [2022-11-26 10:25:48,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_29-model_00-model_states.pt... 0: [2022-11-26 10:25:48,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_29-model_00-model_states.pt. 0: [2022-11-26 10:25:48,730] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_30-model_00-model_states.pt... 0: [2022-11-26 10:25:48,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_30-model_00-model_states.pt. 0: [2022-11-26 10:25:48,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/layer_32-model_00-model_states.pt... 0: [2022-11-26 10:25:48,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 10:25:48,836] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step49000/mp_rank_00_model_states.pt 0: [2022-11-26 10:25:48,836] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/mp_rank_00_model_states.pt... 0: [2022-11-26 10:25:48,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/mp_rank_00_model_states.pt. 0: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 
0: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
10: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
15: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
13: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
11: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
13: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:25:49,025] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step49000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:25:49,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 10:25:49,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 10:25:49,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 2: [2022-11-26 10:25:49,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:25:49,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 10:25:49,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 5: [2022-11-26 10:25:49,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:25:49,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 11: [2022-11-26 10:25:49,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:25:49,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:25:49,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 11: [2022-11-26 10:25:49,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 10:25:49,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
8: [2022-11-26 10:25:49,195] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 10:25:49,195] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 9: [2022-11-26 10:25:49,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:25:49,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 10:25:49,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 4: [2022-11-26 10:25:49,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:25:49,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 0: [2022-11-26 10:25:49,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:25:49,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 0: [2022-11-26 10:25:49,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 10:25:49,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 11: [2022-11-26 10:25:49,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-26 10:25:49,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 10:25:49,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 7: [2022-11-26 10:25:49,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:25:49,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 10:25:49,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 0: [2022-11-26 10:25:49,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:25:49,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:25:49,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 10:25:49,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 10:25:49,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 0: [2022-11-26 10:25:49,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 14: [2022-11-26 10:25:49,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 10:25:49,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 10:25:49,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 14: [2022-11-26 10:25:49,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:25:49,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 10:25:49,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 2: [2022-11-26 10:25:49,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:25:49,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 10:25:49,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 3: [2022-11-26 10:25:49,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:25:49,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 6: [2022-11-26 10:25:49,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:25:49,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
3: [2022-11-26 10:25:49,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 6: [2022-11-26 10:25:49,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 10:25:49,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 10:25:49,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 6: [2022-11-26 10:25:49,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 7: [2022-11-26 10:25:49,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:25:49,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 10:25:49,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 11: [2022-11-26 10:25:49,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:25:49,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 10:25:49,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 0: [2022-11-26 10:25:49,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 10:25:49,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:25:49,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 10:25:49,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 10:25:49,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 0: [2022-11-26 10:25:49,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 7: [2022-11-26 10:25:49,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:25:49,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 5: [2022-11-26 10:25:49,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:25:49,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 5: [2022-11-26 10:25:49,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 10:25:49,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 5: [2022-11-26 10:25:49,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 10:25:49,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 10:25:49,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:25:49,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 5: [2022-11-26 10:25:49,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 10:25:49,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 2: [2022-11-26 10:25:49,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:25:49,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 10:25:49,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 7: [2022-11-26 10:25:49,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:25:49,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 10:25:49,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 7: [2022-11-26 10:25:49,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 10:25:49,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 10:25:49,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 2: [2022-11-26 10:25:49,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:25:49,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 3: [2022-11-26 10:25:49,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:25:49,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:25:49,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 11: [2022-11-26 10:25:49,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 3: [2022-11-26 10:25:49,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 10:25:49,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 11: [2022-11-26 10:25:49,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 4: [2022-11-26 10:25:49,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
11: [2022-11-26 10:25:49,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:25:49,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 10:25:49,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 11: [2022-11-26 10:25:49,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 10:25:49,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 4: [2022-11-26 10:25:49,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:25:49,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 10:25:49,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 11: [2022-11-26 10:25:49,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:25:49,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 10:25:49,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 4: [2022-11-26 10:25:49,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 10:25:49,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:25:49,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 10:25:49,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 10:25:49,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 4: [2022-11-26 10:25:49,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 4: [2022-11-26 10:25:49,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:25:49,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 10:25:49,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 7: [2022-11-26 10:25:49,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:25:49,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 10:25:49,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 5: [2022-11-26 10:25:49,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 10:25:49,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 10:25:49,216] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 5: [2022-11-26 10:25:49,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:25:49,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 10:25:49,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 9: [2022-11-26 10:25:49,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:25:49,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 10:25:49,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 4: [2022-11-26 10:25:49,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 10:25:49,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 10:25:49,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 3: [2022-11-26 10:25:49,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 10:25:49,218] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 10:25:49,218] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 0: [2022-11-26 10:25:49,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:25:49,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:25:49,218] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:25:49,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 14: [2022-11-26 10:25:49,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 10:25:49,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 0: [2022-11-26 10:25:49,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 14: [2022-11-26 10:25:49,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 14: [2022-11-26 10:25:49,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 9: [2022-11-26 10:25:49,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 10:25:49,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 10:25:49,219] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 0: [2022-11-26 10:25:49,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:25:49,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:25:49,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 8: [2022-11-26 10:25:49,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:25:49,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 8: [2022-11-26 10:25:49,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:25:49,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 10:25:49,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 8: [2022-11-26 10:25:49,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 10:25:49,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-26 10:25:49,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 7: [2022-11-26 10:25:49,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:25:49,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 9: [2022-11-26 10:25:49,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:25:49,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 8: [2022-11-26 10:25:49,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 9: [2022-11-26 10:25:49,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 7: [2022-11-26 10:25:49,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 9: [2022-11-26 10:25:49,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 9: [2022-11-26 10:25:49,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:25:49,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 10:25:49,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
6: [2022-11-26 10:25:49,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:25:49,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:25:49,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 8: [2022-11-26 10:25:49,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 6: [2022-11-26 10:25:49,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 8: [2022-11-26 10:25:49,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 8: [2022-11-26 10:25:49,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:25:49,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 10:25:49,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 6: [2022-11-26 10:25:49,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:25:49,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 10:25:49,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 6: [2022-11-26 10:25:49,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 3: [2022-11-26 10:25:49,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 5: [2022-11-26 10:25:49,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:25:49,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 5: [2022-11-26 10:25:49,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 10:25:49,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 8: [2022-11-26 10:25:49,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:25:49,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:25:49,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 6: [2022-11-26 10:25:49,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 8: [2022-11-26 10:25:49,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
6: [2022-11-26 10:25:49,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 3: [2022-11-26 10:25:49,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:25:49,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 10:25:49,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 3: [2022-11-26 10:25:49,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:25:49,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 10:25:49,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 9: [2022-11-26 10:25:49,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:25:49,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 10:25:49,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 4: [2022-11-26 10:25:49,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 10:25:49,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 10:25:49,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 11: [2022-11-26 10:25:49,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:25:49,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 10:25:49,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 5: [2022-11-26 10:25:49,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:25:49,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 10:25:49,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 11: [2022-11-26 10:25:49,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:25:49,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 10:25:49,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 3: [2022-11-26 10:25:49,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 10:25:49,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 10:25:49,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 3: [2022-11-26 10:25:49,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:25:49,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 10:25:49,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 2: [2022-11-26 10:25:49,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:25:49,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 10:25:49,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 9: [2022-11-26 10:25:49,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:25:49,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 10:25:49,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 6: [2022-11-26 10:25:49,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 10:25:49,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 10:25:49,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 6: [2022-11-26 10:25:49,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:25:49,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 10:25:49,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 6: [2022-11-26 10:25:49,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:25:49,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 7: [2022-11-26 10:25:49,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:25:49,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 7: [2022-11-26 10:25:49,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 10:25:49,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 2: [2022-11-26 10:25:49,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 10:25:49,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 10:25:49,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 2: [2022-11-26 10:25:49,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:25:49,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 10:25:49,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 2: [2022-11-26 10:25:49,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 10:25:49,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 10:25:49,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 9: [2022-11-26 10:25:49,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:25:49,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 10:25:49,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 14: [2022-11-26 10:25:49,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-26 10:25:49,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 10:25:49,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 1: [2022-11-26 10:25:49,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:25:49,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:25:49,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:25:49,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:25:49,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:25:49,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 10:25:49,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:25:49,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:25:49,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 10:25:49,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 10:25:49,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 10:25:49,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 1: [2022-11-26 10:25:49,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 10:25:49,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 10:25:49,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 10:25:49,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 10:25:49,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 10:25:49,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 1: [2022-11-26 10:25:49,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 1: [2022-11-26 10:25:49,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 1: [2022-11-26 10:25:49,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
1: [2022-11-26 10:25:49,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 1: [2022-11-26 10:25:49,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 1: [2022-11-26 10:25:49,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 8: [2022-11-26 10:25:49,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 10:25:49,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 10:25:49,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 14: [2022-11-26 10:25:49,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:25:49,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 10:25:49,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 14: [2022-11-26 10:25:49,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:25:49,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 10:25:49,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
0: [2022-11-26 10:25:49,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 10:25:49,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 10: [2022-11-26 10:25:49,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:25:49,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 10:25:49,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:25:49,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:25:49,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 10: [2022-11-26 10:25:49,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 10:25:49,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 10:25:49,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 10: [2022-11-26 10:25:49,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:25:49,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 10:25:49,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 10: [2022-11-26 10:25:49,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 10:25:49,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 10:25:49,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 10: [2022-11-26 10:25:49,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 10: [2022-11-26 10:25:49,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:25:49,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 10:25:49,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 10: [2022-11-26 10:25:49,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:25:49,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 10:25:49,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 10:25:49,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 10:25:49,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 10: [2022-11-26 10:25:49,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 15: [2022-11-26 10:25:49,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:25:49,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:25:49,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:25:49,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:25:49,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:25:49,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:25:49,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:25:49,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 10:25:49,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 10:25:49,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 10:25:49,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 10:25:49,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 10:25:49,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 10:25:49,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 10:25:49,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 10:25:49,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 10:25:49,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 15: [2022-11-26 10:25:49,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 15: [2022-11-26 10:25:49,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 15: [2022-11-26 10:25:49,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
15: [2022-11-26 10:25:49,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 15: [2022-11-26 10:25:49,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 15: [2022-11-26 10:25:49,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 15: [2022-11-26 10:25:49,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 12: [2022-11-26 10:25:49,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:25:49,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:25:49,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:25:49,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:25:49,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 10:25:49,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 10:25:49,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 10:25:49,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 10:25:49,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 10:25:49,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 12: [2022-11-26 10:25:49,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 12: [2022-11-26 10:25:49,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 12: [2022-11-26 10:25:49,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 12: [2022-11-26 10:25:49,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 10:25:49,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 12: [2022-11-26 10:25:49,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:25:49,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:25:49,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 10:25:49,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 10:25:49,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 12: [2022-11-26 10:25:49,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 
12: [2022-11-26 10:25:49,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:25:49,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 10:25:49,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 13: [2022-11-26 10:25:49,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:25:49,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:25:49,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:25:49,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:25:49,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:25:49,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:25:49,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 10:25:49,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 10:25:49,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 10:25:49,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 10:25:49,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 10:25:49,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 10:25:49,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 10:25:49,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 10:25:49,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 13: [2022-11-26 10:25:49,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 13: [2022-11-26 10:25:49,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 13: [2022-11-26 10:25:49,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 13: [2022-11-26 10:25:49,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 10:25:49,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 13: [2022-11-26 10:25:49,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 13: [2022-11-26 10:25:49,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 13: [2022-11-26 10:25:49,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step49000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 10:25:49,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step49000 is ready now! 0: successfully saved checkpoint at iteration 49000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3882.15 15: iteration 49010/ 125429 | consumed samples: 12546560 | consumed tokens: 25695354880 | elapsed time per iteration (s): 1.44 | learning rate: 1.419E-04 | global batch size: 256 | lm loss: 2.012658E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 178.289 | TFLOPs: 29.46 | 15: iteration 49020/ 125429 | consumed samples: 12549120 | consumed tokens: 25700597760 | elapsed time per iteration (s): 1.04 | learning rate: 1.419E-04 | global batch size: 256 | lm loss: 2.021970E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.991 | TFLOPs: 40.65 | 15: iteration 49030/ 125429 | consumed samples: 12551680 | consumed tokens: 25705840640 | elapsed time per iteration (s): 1.04 | learning rate: 1.419E-04 | global batch size: 256 | lm loss: 2.037944E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.570 | TFLOPs: 40.58 | 15: iteration 49040/ 125429 | consumed samples: 12554240 | consumed tokens: 25711083520 | elapsed time per iteration (s): 1.03 | learning rate: 
1.419E-04 | global batch size: 256 | lm loss: 1.997227E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.640 | TFLOPs: 40.92 | 15: iteration 49050/ 125429 | consumed samples: 12556800 | consumed tokens: 25716326400 | elapsed time per iteration (s): 1.03 | learning rate: 1.418E-04 | global batch size: 256 | lm loss: 2.032669E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.745 | TFLOPs: 40.94 | 15: iteration 49060/ 125429 | consumed samples: 12559360 | consumed tokens: 25721569280 | elapsed time per iteration (s): 1.07 | learning rate: 1.418E-04 | global batch size: 256 | lm loss: 2.026065E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.106 | TFLOPs: 39.68 | 15: iteration 49070/ 125429 | consumed samples: 12561920 | consumed tokens: 25726812160 | elapsed time per iteration (s): 1.03 | learning rate: 1.418E-04 | global batch size: 256 | lm loss: 1.990287E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.456 | TFLOPs: 41.06 | 15: iteration 49080/ 125429 | consumed samples: 12564480 | consumed tokens: 25732055040 | elapsed time per iteration (s): 1.02 | learning rate: 1.418E-04 | global batch size: 256 | lm loss: 2.032635E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.377 | TFLOPs: 41.38 | 15: iteration 49090/ 125429 | consumed samples: 12567040 | consumed tokens: 25737297920 | elapsed time per iteration (s): 1.02 | learning rate: 1.418E-04 | global batch size: 256 | lm loss: 2.056493E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.083 | TFLOPs: 41.33 | 15: iteration 49100/ 125429 | 
consumed samples: 12569600 | consumed tokens: 25742540800 | elapsed time per iteration (s): 1.04 | learning rate: 1.417E-04 | global batch size: 256 | lm loss: 2.026146E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.467 | TFLOPs: 40.73 | 15: iteration 49110/ 125429 | consumed samples: 12572160 | consumed tokens: 25747783680 | elapsed time per iteration (s): 1.03 | learning rate: 1.417E-04 | global batch size: 256 | lm loss: 2.000637E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.465 | TFLOPs: 41.23 | 15: iteration 49120/ 125429 | consumed samples: 12574720 | consumed tokens: 25753026560 | elapsed time per iteration (s): 1.04 | learning rate: 1.417E-04 | global batch size: 256 | lm loss: 2.013625E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.162 | TFLOPs: 40.85 | 15: iteration 49130/ 125429 | consumed samples: 12577280 | consumed tokens: 25758269440 | elapsed time per iteration (s): 1.04 | learning rate: 1.417E-04 | global batch size: 256 | lm loss: 2.037624E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.221 | TFLOPs: 40.52 | 15: iteration 49140/ 125429 | consumed samples: 12579840 | consumed tokens: 25763512320 | elapsed time per iteration (s): 1.05 | learning rate: 1.416E-04 | global batch size: 256 | lm loss: 1.998798E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.945 | TFLOPs: 40.31 | 15: iteration 49150/ 125429 | consumed samples: 12582400 | consumed tokens: 25768755200 | elapsed time per iteration (s): 1.06 | learning rate: 1.416E-04 | global batch size: 256 | lm loss: 2.026528E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 241.453 | TFLOPs: 39.90 | 15: iteration 49160/ 125429 | consumed samples: 12584960 | consumed tokens: 25773998080 | elapsed time per iteration (s): 1.03 | learning rate: 1.416E-04 | global batch size: 256 | lm loss: 2.016789E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.493 | TFLOPs: 40.90 | 15: iteration 49170/ 125429 | consumed samples: 12587520 | consumed tokens: 25779240960 | elapsed time per iteration (s): 1.04 | learning rate: 1.416E-04 | global batch size: 256 | lm loss: 2.011627E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.616 | TFLOPs: 40.76 | 15: iteration 49180/ 125429 | consumed samples: 12590080 | consumed tokens: 25784483840 | elapsed time per iteration (s): 1.05 | learning rate: 1.416E-04 | global batch size: 256 | lm loss: 2.038468E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.352 | TFLOPs: 40.38 | 15: iteration 49190/ 125429 | consumed samples: 12592640 | consumed tokens: 25789726720 | elapsed time per iteration (s): 1.03 | learning rate: 1.415E-04 | global batch size: 256 | lm loss: 2.013156E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.786 | TFLOPs: 40.95 | 15: iteration 49200/ 125429 | consumed samples: 12595200 | consumed tokens: 25794969600 | elapsed time per iteration (s): 1.06 | learning rate: 1.415E-04 | global batch size: 256 | lm loss: 2.019756E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.083 | TFLOPs: 40.01 | 15: iteration 49210/ 125429 | consumed samples: 12597760 | consumed tokens: 25800212480 | elapsed time per iteration (s): 1.03 | learning rate: 1.415E-04 | global batch 
size: 256 | lm loss: 2.044178E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.051 | TFLOPs: 40.99 | 15: iteration 49220/ 125429 | consumed samples: 12600320 | consumed tokens: 25805455360 | elapsed time per iteration (s): 1.02 | learning rate: 1.415E-04 | global batch size: 256 | lm loss: 1.993418E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.238 | TFLOPs: 41.52 | 15: iteration 49230/ 125429 | consumed samples: 12602880 | consumed tokens: 25810698240 | elapsed time per iteration (s): 1.04 | learning rate: 1.415E-04 | global batch size: 256 | lm loss: 2.038142E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.643 | TFLOPs: 40.59 | 15: iteration 49240/ 125429 | consumed samples: 12605440 | consumed tokens: 25815941120 | elapsed time per iteration (s): 1.02 | learning rate: 1.414E-04 | global batch size: 256 | lm loss: 2.021116E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.221 | TFLOPs: 41.35 | 15: iteration 49250/ 125429 | consumed samples: 12608000 | consumed tokens: 25821184000 | elapsed time per iteration (s): 1.02 | learning rate: 1.414E-04 | global batch size: 256 | lm loss: 2.000980E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.248 | TFLOPs: 41.36 | 15: iteration 49260/ 125429 | consumed samples: 12610560 | consumed tokens: 25826426880 | elapsed time per iteration (s): 1.03 | learning rate: 1.414E-04 | global batch size: 256 | lm loss: 2.015203E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.812 | TFLOPs: 41.12 | 15: iteration 49270/ 125429 | consumed samples: 12613120 | 
consumed tokens: 25831669760 | elapsed time per iteration (s): 1.05 | learning rate: 1.414E-04 | global batch size: 256 | lm loss: 1.975526E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.592 | TFLOPs: 40.42 | 15: iteration 49280/ 125429 | consumed samples: 12615680 | consumed tokens: 25836912640 | elapsed time per iteration (s): 1.03 | learning rate: 1.413E-04 | global batch size: 256 | lm loss: 2.045172E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.443 | TFLOPs: 41.22 | 15: iteration 49290/ 125429 | consumed samples: 12618240 | consumed tokens: 25842155520 | elapsed time per iteration (s): 1.03 | learning rate: 1.413E-04 | global batch size: 256 | lm loss: 2.019377E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.686 | TFLOPs: 41.26 | 15: iteration 49300/ 125429 | consumed samples: 12620800 | consumed tokens: 25847398400 | elapsed time per iteration (s): 1.03 | learning rate: 1.413E-04 | global batch size: 256 | lm loss: 2.040742E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.733 | TFLOPs: 40.94 | 15: iteration 49310/ 125429 | consumed samples: 12623360 | consumed tokens: 25852641280 | elapsed time per iteration (s): 1.02 | learning rate: 1.413E-04 | global batch size: 256 | lm loss: 2.018550E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.963 | TFLOPs: 41.31 | 15: iteration 49320/ 125429 | consumed samples: 12625920 | consumed tokens: 25857884160 | elapsed time per iteration (s): 1.06 | learning rate: 1.413E-04 | global batch size: 256 | lm loss: 2.040422E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 242.610 | TFLOPs: 40.09 | 15: iteration 49330/ 125429 | consumed samples: 12628480 | consumed tokens: 25863127040 | elapsed time per iteration (s): 1.02 | learning rate: 1.412E-04 | global batch size: 256 | lm loss: 2.026645E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.835 | TFLOPs: 41.29 | 15: iteration 49340/ 125429 | consumed samples: 12631040 | consumed tokens: 25868369920 | elapsed time per iteration (s): 1.05 | learning rate: 1.412E-04 | global batch size: 256 | lm loss: 2.014441E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.917 | TFLOPs: 40.31 | 15: iteration 49350/ 125429 | consumed samples: 12633600 | consumed tokens: 25873612800 | elapsed time per iteration (s): 1.02 | learning rate: 1.412E-04 | global batch size: 256 | lm loss: 2.018355E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 252.071 | TFLOPs: 41.66 | 15: iteration 49360/ 125429 | consumed samples: 12636160 | consumed tokens: 25878855680 | elapsed time per iteration (s): 1.02 | learning rate: 1.412E-04 | global batch size: 256 | lm loss: 2.035383E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.683 | TFLOPs: 41.59 | 15: iteration 49370/ 125429 | consumed samples: 12638720 | consumed tokens: 25884098560 | elapsed time per iteration (s): 1.02 | learning rate: 1.412E-04 | global batch size: 256 | lm loss: 2.026040E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.715 | TFLOPs: 41.43 | 15: iteration 49380/ 125429 | consumed samples: 12641280 | consumed tokens: 25889341440 | elapsed time per iteration (s): 1.04 | learning rate: 1.411E-04 | global batch size: 256 | lm loss: 
2.038915E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.089 | TFLOPs: 40.50 | 15: iteration 49390/ 125429 | consumed samples: 12643840 | consumed tokens: 25894584320 | elapsed time per iteration (s): 1.05 | learning rate: 1.411E-04 | global batch size: 256 | lm loss: 2.032115E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.446 | TFLOPs: 40.40 | 15: iteration 49400/ 125429 | consumed samples: 12646400 | consumed tokens: 25899827200 | elapsed time per iteration (s): 1.03 | learning rate: 1.411E-04 | global batch size: 256 | lm loss: 2.018892E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.382 | TFLOPs: 41.05 | 15: iteration 49410/ 125429 | consumed samples: 12648960 | consumed tokens: 25905070080 | elapsed time per iteration (s): 1.05 | learning rate: 1.411E-04 | global batch size: 256 | lm loss: 2.021012E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.903 | TFLOPs: 40.31 | 15: iteration 49420/ 125429 | consumed samples: 12651520 | consumed tokens: 25910312960 | elapsed time per iteration (s): 1.04 | learning rate: 1.410E-04 | global batch size: 256 | lm loss: 1.992999E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.325 | TFLOPs: 40.71 | 15: iteration 49430/ 125429 | consumed samples: 12654080 | consumed tokens: 25915555840 | elapsed time per iteration (s): 1.03 | learning rate: 1.410E-04 | global batch size: 256 | lm loss: 2.049375E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.977 | TFLOPs: 41.15 | 15: iteration 49440/ 125429 | consumed samples: 12656640 | consumed tokens: 
25920798720 | elapsed time per iteration (s): 1.03 | learning rate: 1.410E-04 | global batch size: 256 | lm loss: 2.046230E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.052 | TFLOPs: 40.99 | 15: iteration 49450/ 125429 | consumed samples: 12659200 | consumed tokens: 25926041600 | elapsed time per iteration (s): 1.02 | learning rate: 1.410E-04 | global batch size: 256 | lm loss: 2.012989E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.746 | TFLOPs: 41.44 | 15: iteration 49460/ 125429 | consumed samples: 12661760 | consumed tokens: 25931284480 | elapsed time per iteration (s): 1.04 | learning rate: 1.410E-04 | global batch size: 256 | lm loss: 2.018170E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.333 | TFLOPs: 40.87 | 15: iteration 49470/ 125429 | consumed samples: 12664320 | consumed tokens: 25936527360 | elapsed time per iteration (s): 1.07 | learning rate: 1.409E-04 | global batch size: 256 | lm loss: 2.028541E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.155 | TFLOPs: 39.52 | 15: iteration 49480/ 125429 | consumed samples: 12666880 | consumed tokens: 25941770240 | elapsed time per iteration (s): 1.06 | learning rate: 1.409E-04 | global batch size: 256 | lm loss: 2.013748E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.208 | TFLOPs: 39.86 | 15: iteration 49490/ 125429 | consumed samples: 12669440 | consumed tokens: 25947013120 | elapsed time per iteration (s): 1.04 | learning rate: 1.409E-04 | global batch size: 256 | lm loss: 2.023398E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 246.237 | TFLOPs: 40.69 | 15: iteration 49500/ 125429 | consumed samples: 12672000 | consumed tokens: 25952256000 | elapsed time per iteration (s): 1.03 | learning rate: 1.409E-04 | global batch size: 256 | lm loss: 2.042422E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.983 | TFLOPs: 40.98 | 15: iteration 49510/ 125429 | consumed samples: 12674560 | consumed tokens: 25957498880 | elapsed time per iteration (s): 1.03 | learning rate: 1.409E-04 | global batch size: 256 | lm loss: 2.044778E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.195 | TFLOPs: 41.18 | 15: iteration 49520/ 125429 | consumed samples: 12677120 | consumed tokens: 25962741760 | elapsed time per iteration (s): 1.08 | learning rate: 1.408E-04 | global batch size: 256 | lm loss: 2.017989E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.013 | TFLOPs: 39.17 | 15: iteration 49530/ 125429 | consumed samples: 12679680 | consumed tokens: 25967984640 | elapsed time per iteration (s): 1.02 | learning rate: 1.408E-04 | global batch size: 256 | lm loss: 2.028204E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.365 | TFLOPs: 41.54 | 15: iteration 49540/ 125429 | consumed samples: 12682240 | consumed tokens: 25973227520 | elapsed time per iteration (s): 1.02 | learning rate: 1.408E-04 | global batch size: 256 | lm loss: 2.022256E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.777 | TFLOPs: 41.44 | 15: iteration 49550/ 125429 | consumed samples: 12684800 | consumed tokens: 25978470400 | elapsed time per iteration (s): 1.06 | learning rate: 1.408E-04 | global batch size: 256 | lm loss: 2.032029E+00 | grad 
norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.954 | TFLOPs: 39.98 | 15: iteration 49560/ 125429 | consumed samples: 12687360 | consumed tokens: 25983713280 | elapsed time per iteration (s): 1.03 | learning rate: 1.407E-04 | global batch size: 256 | lm loss: 2.029877E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.458 | TFLOPs: 41.22 | 15: iteration 49570/ 125429 | consumed samples: 12689920 | consumed tokens: 25988956160 | elapsed time per iteration (s): 1.03 | learning rate: 1.407E-04 | global batch size: 256 | lm loss: 2.003825E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.923 | TFLOPs: 41.14 | 15: iteration 49580/ 125429 | consumed samples: 12692480 | consumed tokens: 25994199040 | elapsed time per iteration (s): 1.02 | learning rate: 1.407E-04 | global batch size: 256 | lm loss: 2.032995E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.920 | TFLOPs: 41.30 | 15: iteration 49590/ 125429 | consumed samples: 12695040 | consumed tokens: 25999441920 | elapsed time per iteration (s): 1.04 | learning rate: 1.407E-04 | global batch size: 256 | lm loss: 2.011313E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.094 | TFLOPs: 40.67 | 15: iteration 49600/ 125429 | consumed samples: 12697600 | consumed tokens: 26004684800 | elapsed time per iteration (s): 1.04 | learning rate: 1.407E-04 | global batch size: 256 | lm loss: 2.030192E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.071 | TFLOPs: 40.83 | 15: iteration 49610/ 125429 | consumed samples: 12700160 | consumed tokens: 26009927680 | elapsed time 
per iteration (s): 1.02 | learning rate: 1.406E-04 | global batch size: 256 | lm loss: 2.017905E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.719 | TFLOPs: 41.43 | 15: iteration 49620/ 125429 | consumed samples: 12702720 | consumed tokens: 26015170560 | elapsed time per iteration (s): 1.03 | learning rate: 1.406E-04 | global batch size: 256 | lm loss: 2.002210E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.378 | TFLOPs: 41.05 | 15: iteration 49630/ 125429 | consumed samples: 12705280 | consumed tokens: 26020413440 | elapsed time per iteration (s): 1.05 | learning rate: 1.406E-04 | global batch size: 256 | lm loss: 2.049755E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.108 | TFLOPs: 40.18 | 15: iteration 49640/ 125429 | consumed samples: 12707840 | consumed tokens: 26025656320 | elapsed time per iteration (s): 1.06 | learning rate: 1.406E-04 | global batch size: 256 | lm loss: 2.011695E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.555 | TFLOPs: 39.75 | 15: iteration 49650/ 125429 | consumed samples: 12710400 | consumed tokens: 26030899200 | elapsed time per iteration (s): 1.03 | learning rate: 1.406E-04 | global batch size: 256 | lm loss: 2.057435E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.886 | TFLOPs: 41.13 | 15: iteration 49660/ 125429 | consumed samples: 12712960 | consumed tokens: 26036142080 | elapsed time per iteration (s): 1.03 | learning rate: 1.405E-04 | global batch size: 256 | lm loss: 2.027440E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.470 | TFLOPs: 
41.23 | 15: iteration 49670/ 125429 | consumed samples: 12715520 | consumed tokens: 26041384960 | elapsed time per iteration (s): 1.03 | learning rate: 1.405E-04 | global batch size: 256 | lm loss: 2.023992E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.752 | TFLOPs: 41.11 | 15: iteration 49680/ 125429 | consumed samples: 12718080 | consumed tokens: 26046627840 | elapsed time per iteration (s): 1.03 | learning rate: 1.405E-04 | global batch size: 256 | lm loss: 2.030634E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.825 | TFLOPs: 41.12 | 15: iteration 49690/ 125429 | consumed samples: 12720640 | consumed tokens: 26051870720 | elapsed time per iteration (s): 1.04 | learning rate: 1.405E-04 | global batch size: 256 | lm loss: 2.023802E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.969 | TFLOPs: 40.81 | 15: iteration 49700/ 125429 | consumed samples: 12723200 | consumed tokens: 26057113600 | elapsed time per iteration (s): 1.02 | learning rate: 1.404E-04 | global batch size: 256 | lm loss: 2.007598E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.660 | TFLOPs: 41.42 | 15: iteration 49710/ 125429 | consumed samples: 12725760 | consumed tokens: 26062356480 | elapsed time per iteration (s): 1.03 | learning rate: 1.404E-04 | global batch size: 256 | lm loss: 2.014773E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.228 | TFLOPs: 41.19 | 15: iteration 49720/ 125429 | consumed samples: 12728320 | consumed tokens: 26067599360 | elapsed time per iteration (s): 1.03 | learning rate: 1.404E-04 | global batch size: 256 | lm loss: 2.043340E+00 | grad norm: 0.141 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.225 | TFLOPs: 41.19 | 15: iteration 49730/ 125429 | consumed samples: 12730880 | consumed tokens: 26072842240 | elapsed time per iteration (s): 1.02 | learning rate: 1.404E-04 | global batch size: 256 | lm loss: 1.990701E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.372 | TFLOPs: 41.38 | 15: iteration 49740/ 125429 | consumed samples: 12733440 | consumed tokens: 26078085120 | elapsed time per iteration (s): 1.03 | learning rate: 1.404E-04 | global batch size: 256 | lm loss: 1.990901E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.482 | TFLOPs: 41.06 | 15: iteration 49750/ 125429 | consumed samples: 12736000 | consumed tokens: 26083328000 | elapsed time per iteration (s): 1.04 | learning rate: 1.403E-04 | global batch size: 256 | lm loss: 2.030810E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.272 | TFLOPs: 40.86 | 15: iteration 49760/ 125429 | consumed samples: 12738560 | consumed tokens: 26088570880 | elapsed time per iteration (s): 1.04 | learning rate: 1.403E-04 | global batch size: 256 | lm loss: 2.018882E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.146 | TFLOPs: 40.68 | 15: iteration 49770/ 125429 | consumed samples: 12741120 | consumed tokens: 26093813760 | elapsed time per iteration (s): 1.02 | learning rate: 1.403E-04 | global batch size: 256 | lm loss: 2.011784E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.618 | TFLOPs: 41.58 | 15: iteration 49780/ 125429 | consumed samples: 12743680 | consumed tokens: 26099056640 | elapsed time per iteration (s): 1.06 | 
learning rate: 1.403E-04 | global batch size: 256 | lm loss: 2.024884E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.395 | TFLOPs: 40.06 | 15: iteration 49790/ 125429 | consumed samples: 12746240 | consumed tokens: 26104299520 | elapsed time per iteration (s): 1.11 | learning rate: 1.403E-04 | global batch size: 256 | lm loss: 2.061602E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.589 | TFLOPs: 38.27 | 15: iteration 49800/ 125429 | consumed samples: 12748800 | consumed tokens: 26109542400 | elapsed time per iteration (s): 1.02 | learning rate: 1.402E-04 | global batch size: 256 | lm loss: 1.990990E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.187 | TFLOPs: 41.35 | 15: iteration 49810/ 125429 | consumed samples: 12751360 | consumed tokens: 26114785280 | elapsed time per iteration (s): 1.03 | learning rate: 1.402E-04 | global batch size: 256 | lm loss: 2.023467E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.794 | TFLOPs: 41.12 | 15: iteration 49820/ 125429 | consumed samples: 12753920 | consumed tokens: 26120028160 | elapsed time per iteration (s): 1.02 | learning rate: 1.402E-04 | global batch size: 256 | lm loss: 2.029075E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.383 | TFLOPs: 41.54 | 15: iteration 49830/ 125429 | consumed samples: 12756480 | consumed tokens: 26125271040 | elapsed time per iteration (s): 1.03 | learning rate: 1.402E-04 | global batch size: 256 | lm loss: 2.018544E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.588 | TFLOPs: 41.08 | 15: iteration 49840/ 
125429 | consumed samples: 12759040 | consumed tokens: 26130513920 | elapsed time per iteration (s): 1.04 | learning rate: 1.401E-04 | global batch size: 256 | lm loss: 2.035559E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.143 | TFLOPs: 40.68 | 15: iteration 49850/ 125429 | consumed samples: 12761600 | consumed tokens: 26135756800 | elapsed time per iteration (s): 1.09 | learning rate: 1.401E-04 | global batch size: 256 | lm loss: 2.012029E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.624 | TFLOPs: 38.94 | 15: iteration 49860/ 125429 | consumed samples: 12764160 | consumed tokens: 26140999680 | elapsed time per iteration (s): 1.03 | learning rate: 1.401E-04 | global batch size: 256 | lm loss: 2.012314E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.874 | TFLOPs: 40.96 | 15: iteration 49870/ 125429 | consumed samples: 12766720 | consumed tokens: 26146242560 | elapsed time per iteration (s): 1.04 | learning rate: 1.401E-04 | global batch size: 256 | lm loss: 2.041036E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.174 | TFLOPs: 40.85 | 15: iteration 49880/ 125429 | consumed samples: 12769280 | consumed tokens: 26151485440 | elapsed time per iteration (s): 1.06 | learning rate: 1.401E-04 | global batch size: 256 | lm loss: 2.011990E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.243 | TFLOPs: 40.03 | 15: iteration 49890/ 125429 | consumed samples: 12771840 | consumed tokens: 26156728320 | elapsed time per iteration (s): 1.02 | learning rate: 1.400E-04 | global batch size: 256 | lm loss: 1.995932E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 251.465 | TFLOPs: 41.56 | 15: iteration 49900/ 125429 | consumed samples: 12774400 | consumed tokens: 26161971200 | elapsed time per iteration (s): 1.05 | learning rate: 1.400E-04 | global batch size: 256 | lm loss: 2.022781E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.934 | TFLOPs: 40.15 | 15: iteration 49910/ 125429 | consumed samples: 12776960 | consumed tokens: 26167214080 | elapsed time per iteration (s): 1.03 | learning rate: 1.400E-04 | global batch size: 256 | lm loss: 2.043127E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.163 | TFLOPs: 41.18 | 15: iteration 49920/ 125429 | consumed samples: 12779520 | consumed tokens: 26172456960 | elapsed time per iteration (s): 1.02 | learning rate: 1.400E-04 | global batch size: 256 | lm loss: 2.011946E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.101 | TFLOPs: 41.33 | 15: iteration 49930/ 125429 | consumed samples: 12782080 | consumed tokens: 26177699840 | elapsed time per iteration (s): 1.04 | learning rate: 1.400E-04 | global batch size: 256 | lm loss: 1.990931E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.355 | TFLOPs: 40.71 | 15: iteration 49940/ 125429 | consumed samples: 12784640 | consumed tokens: 26182942720 | elapsed time per iteration (s): 1.05 | learning rate: 1.399E-04 | global batch size: 256 | lm loss: 2.017611E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.762 | TFLOPs: 40.28 | 15: iteration 49950/ 125429 | consumed samples: 12787200 | consumed tokens: 26188185600 | elapsed time per iteration (s): 1.03 | learning rate: 
1.399E-04 | global batch size: 256 | lm loss: 2.031970E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.242 | TFLOPs: 41.19 | 15: iteration 49960/ 125429 | consumed samples: 12789760 | consumed tokens: 26193428480 | elapsed time per iteration (s): 1.04 | learning rate: 1.399E-04 | global batch size: 256 | lm loss: 2.052251E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.456 | TFLOPs: 40.56 | 15: iteration 49970/ 125429 | consumed samples: 12792320 | consumed tokens: 26198671360 | elapsed time per iteration (s): 1.04 | learning rate: 1.399E-04 | global batch size: 256 | lm loss: 2.033703E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.302 | TFLOPs: 40.87 | 15: iteration 49980/ 125429 | consumed samples: 12794880 | consumed tokens: 26203914240 | elapsed time per iteration (s): 1.05 | learning rate: 1.398E-04 | global batch size: 256 | lm loss: 2.013020E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.330 | TFLOPs: 40.21 | 15: iteration 49990/ 125429 | consumed samples: 12797440 | consumed tokens: 26209157120 | elapsed time per iteration (s): 1.03 | learning rate: 1.398E-04 | global batch size: 256 | lm loss: 2.027226E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.913 | TFLOPs: 40.97 | 0: [2022-11-26 10:43:06,391] [INFO] [logging.py:68:log_dist] [Rank 0] step=50000, skipped=0, lr=[0.0001398051262868933, 0.0001398051262868933, 0.0001398051262868933], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 50000/ 125429 | consumed samples: 12800000 | consumed tokens: 26214400000 | elapsed time per iteration (s): 1.06 | learning rate: 1.398E-04 | global batch size: 
256 | lm loss: 2.011481E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.937 | TFLOPs: 39.82 | 0: steps: 50000 loss: 2.0274 iter time (s): 1.048 samples/sec: 244.216 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 50000 | lm loss value: 1.975261E+00 | lm loss PPL: 7.208501E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 50000 to checkpoints_1b5 0: [2022-11-26 10:43:06,739] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step50000 is begin to save! 0: [2022-11-26 10:43:06,745] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_01-model_00-model_states.pt... 0: [2022-11-26 10:43:07,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_01-model_00-model_states.pt. 0: [2022-11-26 10:43:07,043] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_03-model_00-model_states.pt... 0: [2022-11-26 10:43:07,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_03-model_00-model_states.pt. 0: [2022-11-26 10:43:07,158] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_04-model_00-model_states.pt... 0: [2022-11-26 10:43:07,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_04-model_00-model_states.pt. 0: [2022-11-26 10:43:07,267] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_05-model_00-model_states.pt... 0: [2022-11-26 10:43:07,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_05-model_00-model_states.pt. 
0: [2022-11-26 10:43:07,379] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_06-model_00-model_states.pt... 0: [2022-11-26 10:43:07,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_06-model_00-model_states.pt. 0: [2022-11-26 10:43:07,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_07-model_00-model_states.pt... 0: [2022-11-26 10:43:07,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_07-model_00-model_states.pt. 0: [2022-11-26 10:43:07,606] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_08-model_00-model_states.pt... 0: [2022-11-26 10:43:07,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_08-model_00-model_states.pt. 0: [2022-11-26 10:43:07,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_09-model_00-model_states.pt... 0: [2022-11-26 10:43:07,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_09-model_00-model_states.pt. 0: [2022-11-26 10:43:07,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_10-model_00-model_states.pt... 0: [2022-11-26 10:43:07,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_10-model_00-model_states.pt. 0: [2022-11-26 10:43:07,967] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_11-model_00-model_states.pt... 0: [2022-11-26 10:43:08,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_11-model_00-model_states.pt. 
0: [2022-11-26 10:43:08,079] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_12-model_00-model_states.pt... 0: [2022-11-26 10:43:08,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_12-model_00-model_states.pt. 0: [2022-11-26 10:43:08,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_13-model_00-model_states.pt... 0: [2022-11-26 10:43:08,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_13-model_00-model_states.pt. 0: [2022-11-26 10:43:08,322] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_14-model_00-model_states.pt... 0: [2022-11-26 10:43:08,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_14-model_00-model_states.pt. 0: [2022-11-26 10:43:08,438] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_15-model_00-model_states.pt... 0: [2022-11-26 10:43:08,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_15-model_00-model_states.pt. 0: [2022-11-26 10:43:08,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_16-model_00-model_states.pt... 0: [2022-11-26 10:43:08,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_16-model_00-model_states.pt. 0: [2022-11-26 10:43:08,665] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_17-model_00-model_states.pt... 0: [2022-11-26 10:43:08,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_17-model_00-model_states.pt. 
0: [2022-11-26 10:43:08,791] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_18-model_00-model_states.pt... 0: [2022-11-26 10:43:08,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_18-model_00-model_states.pt. 0: [2022-11-26 10:43:08,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_19-model_00-model_states.pt... 0: [2022-11-26 10:43:09,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_19-model_00-model_states.pt. 0: [2022-11-26 10:43:09,142] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_20-model_00-model_states.pt... 0: [2022-11-26 10:43:09,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_20-model_00-model_states.pt. 0: [2022-11-26 10:43:09,275] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_21-model_00-model_states.pt... 0: [2022-11-26 10:43:09,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_21-model_00-model_states.pt. 0: [2022-11-26 10:43:09,386] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_22-model_00-model_states.pt... 0: [2022-11-26 10:43:09,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_22-model_00-model_states.pt. 0: [2022-11-26 10:43:09,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_23-model_00-model_states.pt... 0: [2022-11-26 10:43:09,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_23-model_00-model_states.pt. 
0: [2022-11-26 10:43:09,604] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_24-model_00-model_states.pt... 0: [2022-11-26 10:43:09,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_24-model_00-model_states.pt. 0: [2022-11-26 10:43:09,723] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_25-model_00-model_states.pt... 0: [2022-11-26 10:43:09,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_25-model_00-model_states.pt. 0: [2022-11-26 10:43:09,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_26-model_00-model_states.pt... 0: [2022-11-26 10:43:09,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_26-model_00-model_states.pt. 0: [2022-11-26 10:43:09,946] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_27-model_00-model_states.pt... 0: [2022-11-26 10:43:10,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_27-model_00-model_states.pt. 0: [2022-11-26 10:43:10,049] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_28-model_00-model_states.pt... 0: [2022-11-26 10:43:10,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_28-model_00-model_states.pt. 0: [2022-11-26 10:43:10,159] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_29-model_00-model_states.pt... 0: [2022-11-26 10:43:10,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_29-model_00-model_states.pt. 
0: [2022-11-26 10:43:10,264] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_30-model_00-model_states.pt... 0: [2022-11-26 10:43:10,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_30-model_00-model_states.pt. 0: [2022-11-26 10:43:10,370] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/layer_32-model_00-model_states.pt... 0: [2022-11-26 10:43:10,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/layer_32-model_00-model_states.pt. 0: [2022-11-26 10:43:10,376] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step50000/mp_rank_00_model_states.pt 0: [2022-11-26 10:43:10,376] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/mp_rank_00_model_states.pt... 0: [2022-11-26 10:43:10,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/mp_rank_00_model_states.pt. 0: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
8: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
1: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
5: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
9: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
6: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
11: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
15: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
13: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 1: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 10: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 9: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
13: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 10:43:10,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step50000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 12: [2022-11-26 10:43:10,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:43:10,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 10:43:10,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 0: [2022-11-26 10:43:10,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:43:10,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 10:43:10,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 12: [2022-11-26 10:43:10,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:43:10,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 10:43:10,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 9: [2022-11-26 10:43:10,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 10:43:10,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 10:43:10,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 7: [2022-11-26 10:43:10,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:43:10,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 10:43:10,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 9: [2022-11-26 10:43:10,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:43:10,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 10:43:10,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 12: [2022-11-26 10:43:10,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:43:10,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 10:43:10,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 1: [2022-11-26 10:43:10,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 10:43:10,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 10:43:10,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 1: [2022-11-26 10:43:10,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:43:10,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 10:43:10,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 1: [2022-11-26 10:43:10,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:43:10,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 10:43:10,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 6: [2022-11-26 10:43:10,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:43:10,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 10:43:10,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 4: [2022-11-26 10:43:10,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 10:43:10,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 10:43:10,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 10: [2022-11-26 10:43:10,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:43:10,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 10:43:10,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 11: [2022-11-26 10:43:10,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:43:10,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:43:10,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:43:10,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:43:10,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 10:43:10,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 10:43:10,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
9: [2022-11-26 10:43:10,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 10: [2022-11-26 10:43:10,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:43:10,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 10:43:10,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 0: [2022-11-26 10:43:10,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:43:10,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:43:10,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 10:43:10,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 10:43:10,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 0: [2022-11-26 10:43:10,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 10: [2022-11-26 10:43:10,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 10:43:10,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 10:43:10,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 1: [2022-11-26 10:43:10,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:43:10,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 3: [2022-11-26 10:43:10,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:43:10,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 3: [2022-11-26 10:43:10,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 10:43:10,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 3: [2022-11-26 10:43:10,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:43:10,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 10:43:10,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 7: [2022-11-26 10:43:10,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 10:43:10,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 10:43:10,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 1: [2022-11-26 10:43:10,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:43:10,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 10:43:10,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 1: [2022-11-26 10:43:10,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:43:10,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 10:43:10,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 6: [2022-11-26 10:43:10,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:43:10,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 10:43:10,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 15: [2022-11-26 10:43:10,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
0: [2022-11-26 10:43:10,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:43:10,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 10:43:10,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 0: [2022-11-26 10:43:10,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:43:10,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 10:43:10,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 0: [2022-11-26 10:43:10,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:43:10,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:43:10,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 10:43:10,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 7: [2022-11-26 10:43:10,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 10:43:10,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 10:43:10,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 10: [2022-11-26 10:43:10,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:43:10,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 10:43:10,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 3: [2022-11-26 10:43:10,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:43:10,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:43:10,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 10:43:10,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 3: [2022-11-26 10:43:10,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 10:43:10,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 10: [2022-11-26 10:43:10,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-26 10:43:10,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 10:43:10,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 7: [2022-11-26 10:43:10,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:43:10,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 10:43:10,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 10: [2022-11-26 10:43:10,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:43:10,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 10:43:10,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 12: [2022-11-26 10:43:10,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:43:10,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 10:43:10,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 7: [2022-11-26 10:43:10,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 10:43:10,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:43:10,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 10:43:10,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 7: [2022-11-26 10:43:10,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 15: [2022-11-26 10:43:10,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 7: [2022-11-26 10:43:10,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 15: [2022-11-26 10:43:10,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 15: [2022-11-26 10:43:10,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:43:10,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 10:43:10,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 15: [2022-11-26 10:43:10,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:43:10,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 10:43:10,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:43:10,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 10:43:10,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 15: [2022-11-26 10:43:10,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 10:43:10,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 10:43:10,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 15: [2022-11-26 10:43:10,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 9: [2022-11-26 10:43:10,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:43:10,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:43:10,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 10:43:10,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
9: [2022-11-26 10:43:10,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 10:43:10,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 7: [2022-11-26 10:43:10,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 10:43:10,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 10:43:10,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 11: [2022-11-26 10:43:10,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 10:43:10,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 10:43:10,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 11: [2022-11-26 10:43:10,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 11: [2022-11-26 10:43:10,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:43:10,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 10:43:10,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
11: [2022-11-26 10:43:10,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:43:10,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:43:10,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 10:43:10,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 10:43:10,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 11: [2022-11-26 10:43:10,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 11: [2022-11-26 10:43:10,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:43:10,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:43:10,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 10:43:10,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 10:43:10,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 11: [2022-11-26 10:43:10,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
11: [2022-11-26 10:43:10,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 10:43:10,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 10:43:10,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 15: [2022-11-26 10:43:10,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 10:43:10,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 10:43:10,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 12: [2022-11-26 10:43:10,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:43:10,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 10:43:10,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 12: [2022-11-26 10:43:10,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:43:10,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 10:43:10,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
14: [2022-11-26 10:43:10,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:43:10,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 10:43:10,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 14: [2022-11-26 10:43:10,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:43:10,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:43:10,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:43:10,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 10:43:10,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 10:43:10,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 10:43:10,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 14: [2022-11-26 10:43:10,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 14: [2022-11-26 10:43:10,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
9: [2022-11-26 10:43:10,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:43:10,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 10:43:10,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 10:43:10,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 10:43:10,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 9: [2022-11-26 10:43:10,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 0: [2022-11-26 10:43:10,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 10:43:10,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 10:43:10,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 6: [2022-11-26 10:43:10,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:43:10,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 10:43:10,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
6: [2022-11-26 10:43:10,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:43:10,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 10:43:10,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 1: [2022-11-26 10:43:10,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 10:43:10,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 10:43:10,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 13: [2022-11-26 10:43:10,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:43:10,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:43:10,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 10:43:10,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 10:43:10,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 7: [2022-11-26 10:43:10,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 10:43:10,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 10:43:10,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 14: [2022-11-26 10:43:10,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:43:10,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:43:10,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 10:43:10,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 10:43:10,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 14: [2022-11-26 10:43:10,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 5: [2022-11-26 10:43:10,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:43:10,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:43:10,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:43:10,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 10:43:10,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 10:43:10,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 10:43:10,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 10:43:10,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 5: [2022-11-26 10:43:10,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 5: [2022-11-26 10:43:10,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 10:43:10,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 5: [2022-11-26 10:43:10,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 14: [2022-11-26 10:43:10,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:43:10,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 10:43:10,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 14: [2022-11-26 10:43:10,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 10:43:10,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 6: [2022-11-26 10:43:10,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 14: [2022-11-26 10:43:10,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 6: [2022-11-26 10:43:10,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 10:43:10,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 6: [2022-11-26 10:43:10,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:43:10,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 3: [2022-11-26 10:43:10,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:43:10,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 3: [2022-11-26 10:43:10,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 10:43:10,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:43:10,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
6: [2022-11-26 10:43:10,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 3: [2022-11-26 10:43:10,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 10:43:10,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 6: [2022-11-26 10:43:10,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 10: [2022-11-26 10:43:10,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 10:43:10,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 10:43:10,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 6: [2022-11-26 10:43:10,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 6: [2022-11-26 10:43:10,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 10:43:10,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 10:43:10,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 5: [2022-11-26 10:43:10,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 10:43:10,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 10:43:10,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 15: [2022-11-26 10:43:10,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:43:10,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:43:10,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 10:43:10,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 5: [2022-11-26 10:43:10,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:43:10,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 10:43:10,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 5: [2022-11-26 10:43:10,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 10:43:10,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 10:43:10,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 
13: [2022-11-26 10:43:10,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:43:10,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 10:43:10,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 10:43:10,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 13: [2022-11-26 10:43:10,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 10:43:10,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 13: [2022-11-26 10:43:10,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 13: [2022-11-26 10:43:10,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:43:10,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 10:43:10,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now! 13: [2022-11-26 10:43:10,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 10:43:10,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 10:43:10,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt.
13: [2022-11-26 10:43:10,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt
13: [2022-11-26 10:43:10,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt
13: [2022-11-26 10:43:10,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt
13: [2022-11-26 10:43:10,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
13: [2022-11-26 10:43:10,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
13: [2022-11-26 10:43:10,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
13: [2022-11-26 10:43:10,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt.
13: [2022-11-26 10:43:10,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt
13: [2022-11-26 10:43:10,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
10: [2022-11-26 10:43:10,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt.
10: [2022-11-26 10:43:10,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt
10: [2022-11-26 10:43:10,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
3: [2022-11-26 10:43:10,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt.
3: [2022-11-26 10:43:10,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt.
3: [2022-11-26 10:43:10,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt
3: [2022-11-26 10:43:10,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt
3: [2022-11-26 10:43:10,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
3: [2022-11-26 10:43:10,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
15: [2022-11-26 10:43:10,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt
15: [2022-11-26 10:43:10,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
15: [2022-11-26 10:43:10,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt.
15: [2022-11-26 10:43:10,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt
15: [2022-11-26 10:43:10,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
8: [2022-11-26 10:43:10,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt.
8: [2022-11-26 10:43:10,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt.
8: [2022-11-26 10:43:10,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt.
8: [2022-11-26 10:43:10,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt
8: [2022-11-26 10:43:10,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt
8: [2022-11-26 10:43:10,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt
8: [2022-11-26 10:43:10,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
8: [2022-11-26 10:43:10,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
8: [2022-11-26 10:43:10,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
8: [2022-11-26 10:43:10,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt.
8: [2022-11-26 10:43:10,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt.
8: [2022-11-26 10:43:10,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt.
8: [2022-11-26 10:43:10,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt.
8: [2022-11-26 10:43:10,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt
8: [2022-11-26 10:43:10,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt
8: [2022-11-26 10:43:10,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt
8: [2022-11-26 10:43:10,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt
8: [2022-11-26 10:43:10,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
8: [2022-11-26 10:43:10,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
8: [2022-11-26 10:43:10,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
8: [2022-11-26 10:43:10,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
8: [2022-11-26 10:43:10,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt.
8: [2022-11-26 10:43:10,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt
8: [2022-11-26 10:43:10,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
4: [2022-11-26 10:43:10,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt.
4: [2022-11-26 10:43:10,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt
4: [2022-11-26 10:43:10,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
4: [2022-11-26 10:43:10,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt.
4: [2022-11-26 10:43:10,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt
4: [2022-11-26 10:43:10,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
4: [2022-11-26 10:43:10,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt.
4: [2022-11-26 10:43:10,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt.
4: [2022-11-26 10:43:10,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt.
4: [2022-11-26 10:43:10,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt
4: [2022-11-26 10:43:10,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt
4: [2022-11-26 10:43:10,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt
4: [2022-11-26 10:43:10,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
4: [2022-11-26 10:43:10,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
4: [2022-11-26 10:43:10,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
4: [2022-11-26 10:43:10,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt.
4: [2022-11-26 10:43:10,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt
4: [2022-11-26 10:43:10,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
4: [2022-11-26 10:43:10,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt.
4: [2022-11-26 10:43:10,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt
4: [2022-11-26 10:43:10,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
0: [2022-11-26 10:43:10,650] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
0: [2022-11-26 10:43:10,650] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
12: [2022-11-26 10:43:10,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt.
12: [2022-11-26 10:43:10,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt
12: [2022-11-26 10:43:10,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
1: [2022-11-26 10:43:10,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt.
1: [2022-11-26 10:43:10,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt
1: [2022-11-26 10:43:10,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
2: [2022-11-26 10:43:10,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt.
2: [2022-11-26 10:43:10,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt.
2: [2022-11-26 10:43:10,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt.
2: [2022-11-26 10:43:10,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt.
2: [2022-11-26 10:43:10,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt.
2: [2022-11-26 10:43:10,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt.
2: [2022-11-26 10:43:10,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt
2: [2022-11-26 10:43:10,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt.
2: [2022-11-26 10:43:10,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt.
2: [2022-11-26 10:43:10,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt
2: [2022-11-26 10:43:10,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt
2: [2022-11-26 10:43:10,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt
2: [2022-11-26 10:43:10,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
2: [2022-11-26 10:43:10,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt
2: [2022-11-26 10:43:10,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
2: [2022-11-26 10:43:10,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt
2: [2022-11-26 10:43:10,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt
2: [2022-11-26 10:43:10,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step50000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt
2: [2022-11-26 10:43:10,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
2: [2022-11-26 10:43:10,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
2: [2022-11-26 10:43:10,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
2: [2022-11-26 10:43:10,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
2: [2022-11-26 10:43:10,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
2: [2022-11-26 10:43:10,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step50000 is ready now!
0: successfully saved checkpoint at iteration 50000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3992.77
15: iteration 50010/ 125429 | consumed samples: 12802560 | consumed tokens: 26219642880 | elapsed time per iteration (s): 1.49 | learning rate: 1.398E-04 | global batch size: 256 | lm loss: 2.025216E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 171.736 | TFLOPs: 28.38 |
15: iteration 50020/ 125429 | consumed samples: 12805120 | consumed tokens: 26224885760 | elapsed time per iteration (s): 1.05 | learning rate: 1.398E-04 | global batch size: 256 | lm loss: 2.028773E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.298 | TFLOPs: 40.21 |
15: iteration 50030/ 125429 | consumed samples: 12807680 | consumed tokens: 26230128640 | elapsed time per iteration (s): 1.04 | learning rate: 1.397E-04 | global batch size: 256 | lm loss: 2.003742E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.217 | TFLOPs: 40.52 |
15: iteration 50040/ 125429 | consumed samples: 12810240 | consumed tokens: 26235371520 | elapsed time per iteration (s): 1.04 | learning rate: 1.397E-04 | global batch size: 256 | lm loss: 2.010974E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.489 | TFLOPs: 40.57 |
15: iteration 50050/ 125429 | consumed samples: 12812800 | consumed tokens: 26240614400 | elapsed time per iteration (s): 1.08 | learning rate: 1.397E-04 | global batch size: 256 | lm loss: 2.001874E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.208 | TFLOPs: 39.04 |
15: iteration 50060/ 125429 | consumed samples: 12815360 | consumed tokens: 26245857280 | elapsed time per iteration (s): 1.03 | learning rate: 1.397E-04 | global batch size: 256 | lm loss: 2.034513E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.212 | TFLOPs: 41.18 |
15: iteration 50070/ 125429 | consumed samples: 12817920 | consumed tokens: 26251100160 | elapsed time per iteration (s): 1.02 | learning rate: 1.397E-04 | global batch size: 256 | lm loss: 2.006444E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.414 | TFLOPs: 41.38 |
15: iteration 50080/ 125429 | consumed samples: 12820480 | consumed tokens: 26256343040 | elapsed time per iteration (s): 1.04 | learning rate: 1.396E-04 | global batch size: 256 | lm loss: 2.004954E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.135 | TFLOPs: 40.51 |
15: iteration 50090/ 125429 | consumed samples: 12823040 | consumed tokens: 26261585920 | elapsed time per iteration (s): 1.02 | learning rate: 1.396E-04 | global batch size: 256 | lm loss: 2.065910E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.851 | TFLOPs: 41.46 |
15: iteration 50100/ 125429 | consumed samples: 12825600 | consumed tokens: 26266828800 | elapsed time per iteration (s): 1.04 | learning rate: 1.396E-04 | global batch size: 256 | lm loss: 2.022598E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.288 | TFLOPs: 40.87 |
15: iteration 50110/ 125429 | consumed samples: 12828160 | consumed tokens: 26272071680 | elapsed time per iteration (s): 1.03 | learning rate: 1.396E-04 | global batch size: 256 | lm loss: 1.990172E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.416 | TFLOPs: 41.22 |
15: iteration 50120/ 125429 | consumed samples: 12830720 | consumed tokens: 26277314560 | elapsed time per iteration (s): 1.02 | learning rate: 1.395E-04 | global batch size: 256 | lm loss: 2.003178E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.598 | TFLOPs: 41.41 |
15: iteration 50130/ 125429 | consumed samples: 12833280 | consumed tokens: 26282557440 | elapsed time per iteration (s): 1.04 | learning rate: 1.395E-04 | global batch size: 256 | lm loss: 2.041938E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.731 | TFLOPs: 40.61 |
15: iteration 50140/ 125429 | consumed samples: 12835840 | consumed tokens: 26287800320 | elapsed time per iteration (s): 1.05 | learning rate: 1.395E-04 | global batch size: 256 | lm loss: 2.014436E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.783 | TFLOPs: 40.45 |
15: iteration 50150/ 125429 | consumed samples: 12838400 | consumed tokens: 26293043200 | elapsed time per iteration (s): 1.05 | learning rate: 1.395E-04 | global batch size: 256 | lm loss: 2.027030E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.157 | TFLOPs: 40.18 |
15: iteration 50160/ 125429 | consumed samples: 12840960 | consumed tokens: 26298286080 | elapsed time per iteration (s): 1.15 | learning rate: 1.395E-04 | global batch size: 256 | lm loss: 2.012768E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.861 | TFLOPs: 36.66 |
15: iteration 50170/ 125429 | consumed samples: 12843520 | consumed tokens: 26303528960 | elapsed time per iteration (s): 1.03 | learning rate: 1.394E-04 | global batch size: 256 | lm loss: 1.991714E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.463 | TFLOPs: 41.23 |
15: iteration 50180/ 125429 | consumed samples: 12846080 | consumed tokens: 26308771840 | elapsed time per iteration (s): 1.03 | learning rate: 1.394E-04 | global batch size: 256 | lm loss: 2.009548E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.237 | TFLOPs: 41.19 |
15: iteration 50190/ 125429 | consumed samples: 12848640 | consumed tokens: 26314014720 | elapsed time per iteration (s): 1.04 | learning rate: 1.394E-04 | global batch size: 256 | lm loss: 1.995678E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.900 | TFLOPs: 40.80 |
15: iteration 50200/ 125429 | consumed samples: 12851200 | consumed tokens: 26319257600 | elapsed time per iteration (s): 1.04 | learning rate: 1.394E-04 | global batch size: 256 | lm loss: 2.021051E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.139 | TFLOPs: 40.84 |
15: iteration 50210/ 125429 | consumed samples: 12853760 | consumed tokens: 26324500480 | elapsed time per iteration (s): 1.03 | learning rate: 1.394E-04 | global batch size: 256 | lm loss: 2.013128E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.702 | TFLOPs: 41.10 |
15: iteration 50220/ 125429 | consumed samples: 12856320 | consumed tokens: 26329743360 | elapsed time per iteration (s): 1.03 | learning rate: 1.393E-04 | global batch size: 256 | lm loss: 2.007116E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.881 | TFLOPs: 40.96 |
15: iteration 50230/ 125429 | consumed samples: 12858880 | consumed tokens: 26334986240 | elapsed time per iteration (s): 1.03 | learning rate: 1.393E-04 | global batch size: 256 | lm loss: 2.031888E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.949 | TFLOPs: 40.98 |
15: iteration 50240/ 125429 | consumed samples: 12861440 | consumed tokens: 26340229120 | elapsed time per iteration (s): 1.05 | learning rate: 1.393E-04 | global batch size: 256 | lm loss: 2.018559E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.797 | TFLOPs: 40.29 |
15: iteration 50250/ 125429 | consumed samples: 12864000 | consumed tokens: 26345472000 | elapsed time per iteration (s): 1.05 | learning rate: 1.393E-04 | global batch size: 256 | lm loss: 2.027319E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.883 | TFLOPs: 40.14 |
15: iteration 50260/ 125429 | consumed samples: 12866560 | consumed tokens: 26350714880 | elapsed time per iteration (s): 1.02 | learning rate: 1.392E-04 | global batch size: 256 | lm loss: 2.000919E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.369 | TFLOPs: 41.54 |
15: iteration 50270/ 125429 | consumed samples: 12869120 | consumed tokens: 26355957760 | elapsed time per iteration (s): 1.02 | learning rate: 1.392E-04 | global batch size: 256 | lm loss: 2.020222E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.996 | TFLOPs: 41.31 |
15: iteration 50280/ 125429 | consumed samples: 12871680 | consumed tokens: 26361200640 | elapsed time per iteration (s): 1.07 | learning rate: 1.392E-04 | global batch size: 256 | lm loss: 2.033040E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.144 | TFLOPs: 39.52 |
15: iteration 50290/ 125429 | consumed samples: 12874240 | consumed tokens: 26366443520 | elapsed time per iteration (s): 1.02 | learning rate: 1.392E-04 | global batch size: 256 | lm loss: 2.017920E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.525 | TFLOPs: 41.40 |
15: iteration 50300/ 125429 | consumed samples: 12876800 | consumed tokens: 26371686400 | elapsed time per iteration (s): 1.03 | learning rate: 1.392E-04 | global batch size: 256 | lm loss: 1.993904E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.619 | TFLOPs: 41.09 |
15: iteration 50310/ 125429 | consumed samples: 12879360 | consumed tokens: 26376929280 | elapsed time per iteration (s): 1.04 | learning rate: 1.391E-04 | global batch size: 256 | lm loss: 2.019577E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.036 | TFLOPs: 40.82 |
15: iteration 50320/ 125429 | consumed samples: 12881920 | consumed tokens: 26382172160 | elapsed time per iteration (s): 1.06 | learning rate: 1.391E-04 | global batch size: 256 | lm loss: 2.030434E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.591 | TFLOPs: 40.09 |
15: iteration 50330/ 125429 | consumed samples: 12884480 | consumed tokens: 26387415040 | elapsed time per iteration (s): 1.06 | learning rate: 1.391E-04 | global batch size: 256 | lm loss: 1.996594E+00 | grad norm: 0.219 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.822 | TFLOPs: 39.96 |
15: iteration 50340/ 125429 | consumed samples: 12887040 | consumed tokens: 26392657920 | elapsed time per iteration (s): 1.05 | learning rate: 1.391E-04 | global batch size: 256 | lm loss: 2.006672E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.180 | TFLOPs: 40.19 |
15: iteration 50350/ 125429 | consumed samples: 12889600 | consumed tokens: 26397900800 | elapsed time per iteration (s): 1.05 | learning rate: 1.391E-04 | global batch size: 256 | lm loss: 2.001454E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.381 | TFLOPs: 40.22 |
15: iteration 50360/ 125429 | consumed samples: 12892160 | consumed tokens: 26403143680 | elapsed time per iteration (s): 1.06 | learning rate: 1.390E-04 | global batch size: 256 | lm loss: 2.051637E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.212 | TFLOPs: 40.03 |
15: iteration 50370/ 125429 | consumed samples: 12894720 | consumed tokens: 26408386560 | elapsed time per iteration (s): 1.04 | learning rate: 1.390E-04 | global batch size: 256 | lm loss: 2.022181E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.718 | TFLOPs: 40.77 |
15: iteration 50380/ 125429 | consumed samples: 12897280 | consumed tokens: 26413629440 | elapsed time per iteration (s): 1.03 | learning rate: 1.390E-04 | global batch size: 256 | lm loss: 2.055980E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.111 | TFLOPs: 41.17 |
15: iteration 50390/ 125429 | consumed samples: 12899840 | consumed tokens: 26418872320 | elapsed time per iteration (s): 1.04 | learning rate: 1.390E-04 | global batch size: 256 | lm loss: 2.008394E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.572 | TFLOPs: 40.58 |
15: iteration 50400/ 125429 | consumed samples: 12902400 | consumed tokens: 26424115200 | elapsed time per iteration (s): 1.03 | learning rate: 1.389E-04 | global batch size: 256 | lm loss: 2.034159E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.143 | TFLOPs: 41.17 |
15: iteration 50410/ 125429 | consumed samples: 12904960 | consumed tokens: 26429358080 | elapsed time per iteration (s): 1.04 | learning rate: 1.389E-04 | global batch size: 256 | lm loss: 2.024839E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.781 | TFLOPs: 40.62 |
15: iteration 50420/ 125429 | consumed samples: 12907520 | consumed tokens: 26434600960 | elapsed time per iteration (s): 1.03 | learning rate: 1.389E-04 | global batch size: 256 | lm loss: 2.027555E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.891 | TFLOPs: 41.13 |
15: iteration 50430/ 125429 | consumed samples: 12910080 | consumed tokens: 26439843840 | elapsed time per iteration (s): 1.04 | learning rate: 1.389E-04 | global batch size: 256 | lm loss: 2.034295E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.420 | TFLOPs: 40.72 |
15: iteration 50440/ 125429 | consumed samples: 12912640 | consumed tokens: 26445086720 | elapsed time per iteration (s): 1.03 | learning rate: 1.389E-04 | global batch size: 256 | lm loss: 2.019005E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.841 | TFLOPs: 40.96 |
15: iteration 50450/ 125429 | consumed samples: 12915200 | consumed tokens: 26450329600 | elapsed time per iteration (s): 1.04 | learning rate: 1.388E-04 | global batch size: 256 | lm loss: 2.011261E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.687 | TFLOPs: 40.60 |
15: iteration 50460/ 125429 | consumed samples: 12917760 | consumed tokens: 26455572480 | elapsed time per iteration (s): 1.04 | learning rate: 1.388E-04 | global batch size: 256 | lm loss: 2.040567E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.096 | TFLOPs: 40.83 |
15: iteration 50470/ 125429 | consumed samples: 12920320 | consumed tokens: 26460815360 | elapsed time per iteration (s): 1.04 | learning rate: 1.388E-04 | global batch size: 256 | lm loss: 2.048125E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.638 | TFLOPs: 40.76 |
15: iteration 50480/ 125429 | consumed samples: 12922880 | consumed tokens: 26466058240 | elapsed time per iteration (s): 1.05 | learning rate: 1.388E-04 | global batch size: 256 | lm loss: 2.042490E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.327 | TFLOPs: 40.38 |
15: iteration 50490/ 125429 | consumed samples: 12925440 | consumed tokens: 26471301120 | elapsed time per iteration (s): 1.03 | learning rate: 1.388E-04 | global batch size: 256 | lm loss: 2.007146E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.677 | TFLOPs: 40.93 |
15: iteration 50500/ 125429 | consumed samples: 12928000 | consumed tokens: 26476544000 | elapsed time per iteration (s): 1.03 | learning rate: 1.387E-04 | global batch size: 256 | lm loss: 2.005005E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.905 | TFLOPs: 41.13 |
15: iteration 50510/ 125429 | consumed samples: 12930560 | consumed tokens: 26481786880 | elapsed time per iteration (s): 1.07 | learning rate: 1.387E-04 | global batch size: 256 | lm loss: 2.022161E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.387 | TFLOPs: 39.40 |
15: iteration 50520/ 125429 | consumed samples: 12933120 | consumed tokens: 26487029760 | elapsed time per iteration (s): 1.06 | learning rate: 1.387E-04 | global batch size: 256 | lm loss: 2.027478E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.256 | TFLOPs: 39.87 |
15: iteration 50530/ 125429 | consumed samples: 12935680 | consumed tokens: 26492272640 | elapsed time per iteration (s): 1.04 | learning rate: 1.387E-04 | global batch size: 256 | lm loss: 2.045599E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.327 | TFLOPs: 40.54 |
15: iteration 50540/ 125429 | consumed samples: 12938240 | consumed tokens: 26497515520 | elapsed time per iteration (s): 1.05 | learning rate: 1.386E-04 | global batch size: 256 | lm loss: 2.005008E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.839 | TFLOPs: 40.30 |
15: iteration 50550/ 125429 | consumed samples: 12940800 | consumed tokens: 26502758400 | elapsed time per iteration (s): 1.08 | learning rate: 1.386E-04 | global batch size: 256 | lm loss: 1.997902E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.122 | TFLOPs: 39.35 |
15: iteration 50560/ 125429 | consumed samples: 12943360 | consumed tokens: 26508001280 | elapsed time per iteration (s): 1.06 | learning rate: 1.386E-04 | global batch size: 256 | lm loss: 2.025439E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.228 | TFLOPs: 40.03 |
15: iteration 50570/ 125429 | consumed samples: 12945920 | consumed tokens: 26513244160 | elapsed time per iteration (s): 1.04 | learning rate: 1.386E-04 | global batch size: 256 | lm loss: 2.024803E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.013 | TFLOPs: 40.66 |
15: iteration 50580/ 125429 | consumed samples: 12948480 | consumed tokens: 26518487040 | elapsed time per iteration (s): 1.04 | learning rate: 1.386E-04 | global batch size: 256 | lm loss: 2.015051E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.981 | TFLOPs: 40.48 |
15: iteration 50590/ 125429 | consumed samples: 12951040 | consumed tokens: 26523729920 | elapsed time per iteration (s): 1.06 | learning rate: 1.385E-04 | global batch size: 256 | lm loss: 2.032127E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.443 | TFLOPs: 39.90 |
15: iteration 50600/ 125429 | consumed samples: 12953600 | consumed tokens: 26528972800 | elapsed time per iteration (s): 1.05 | learning rate: 1.385E-04 | global batch size: 256 | lm loss: 2.037300E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.779 | TFLOPs: 40.12 |
15: iteration 50610/ 125429 | consumed samples: 12956160 | consumed tokens: 26534215680 | elapsed time per iteration (s): 1.07 | learning rate: 1.385E-04 | global batch size: 256 | lm loss: 2.008384E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.177 | TFLOPs: 39.53 |
15: iteration 50620/ 125429 | consumed samples: 12958720 | consumed tokens: 26539458560 | elapsed time per iteration (s): 1.11 | learning rate: 1.385E-04 | global batch size: 256 | lm loss: 2.021167E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.871 | TFLOPs: 37.99 |
15: iteration 50630/ 125429 | consumed samples: 12961280 | consumed tokens: 26544701440 | elapsed time per iteration (s): 1.06 | learning rate: 1.384E-04 | global batch size: 256 | lm loss: 2.020401E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.866 | TFLOPs: 39.80 |
15: iteration 50640/ 125429 | consumed samples: 12963840 | consumed tokens: 26549944320 | elapsed time per iteration (s): 1.07 | learning rate: 1.384E-04 | global batch size: 256 | lm loss: 2.033255E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.522 | TFLOPs: 39.58 |
15: iteration 50650/ 125429 | consumed samples: 12966400 | consumed tokens: 26555187200 | elapsed time per iteration (s): 1.06 | learning rate: 1.384E-04 | global batch size: 256 | lm loss: 2.024214E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.856 | TFLOPs: 39.97 |
15: iteration 50660/ 125429 | consumed samples: 12968960 | consumed tokens: 26560430080 | elapsed time per iteration (s): 1.04 | learning rate: 1.384E-04 | global batch size: 256 | lm loss: 2.007741E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.067 | TFLOPs: 40.66 |
15: iteration 50670/ 125429 | consumed samples: 12971520 | consumed tokens: 26565672960 | elapsed time per iteration (s): 1.08 | learning rate: 1.384E-04 | global batch size: 256 | lm loss: 2.015265E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.014 | TFLOPs: 39.17 |
15: iteration 50680/ 125429 | consumed samples: 12974080 | consumed tokens: 26570915840 | elapsed time per iteration (s): 1.05 | learning rate: 1.383E-04 | global batch size: 256 | lm loss: 2.012232E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per
second: 243.327 | TFLOPs: 40.21 | 15: iteration 50690/ 125429 | consumed samples: 12976640 | consumed tokens: 26576158720 | elapsed time per iteration (s): 1.06 | learning rate: 1.383E-04 | global batch size: 256 | lm loss: 1.989259E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.724 | TFLOPs: 39.95 | 15: iteration 50700/ 125429 | consumed samples: 12979200 | consumed tokens: 26581401600 | elapsed time per iteration (s): 1.08 | learning rate: 1.383E-04 | global batch size: 256 | lm loss: 2.005170E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.967 | TFLOPs: 39.16 | 15: iteration 50710/ 125429 | consumed samples: 12981760 | consumed tokens: 26586644480 | elapsed time per iteration (s): 1.07 | learning rate: 1.383E-04 | global batch size: 256 | lm loss: 2.012069E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.751 | TFLOPs: 39.62 | 15: iteration 50720/ 125429 | consumed samples: 12984320 | consumed tokens: 26591887360 | elapsed time per iteration (s): 1.03 | learning rate: 1.383E-04 | global batch size: 256 | lm loss: 2.025604E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.598 | TFLOPs: 41.25 | 15: iteration 50730/ 125429 | consumed samples: 12986880 | consumed tokens: 26597130240 | elapsed time per iteration (s): 1.04 | learning rate: 1.382E-04 | global batch size: 256 | lm loss: 2.000186E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.907 | TFLOPs: 40.80 | 15: iteration 50740/ 125429 | consumed samples: 12989440 | consumed tokens: 26602373120 | elapsed time per iteration (s): 1.04 | learning rate: 1.382E-04 | global batch size: 256 | lm loss: 2.017584E+00 | grad 
norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.413 | TFLOPs: 40.56 | 15: iteration 50750/ 125429 | consumed samples: 12992000 | consumed tokens: 26607616000 | elapsed time per iteration (s): 1.04 | learning rate: 1.382E-04 | global batch size: 256 | lm loss: 2.000011E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.305 | TFLOPs: 40.87 | 15: iteration 50760/ 125429 | consumed samples: 12994560 | consumed tokens: 26612858880 | elapsed time per iteration (s): 1.06 | learning rate: 1.382E-04 | global batch size: 256 | lm loss: 2.034510E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.381 | TFLOPs: 39.89 | 15: iteration 50770/ 125429 | consumed samples: 12997120 | consumed tokens: 26618101760 | elapsed time per iteration (s): 1.07 | learning rate: 1.381E-04 | global batch size: 256 | lm loss: 2.038831E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.182 | TFLOPs: 39.69 | 15: iteration 50780/ 125429 | consumed samples: 12999680 | consumed tokens: 26623344640 | elapsed time per iteration (s): 1.06 | learning rate: 1.381E-04 | global batch size: 256 | lm loss: 2.014251E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.484 | TFLOPs: 39.91 | 15: iteration 50790/ 125429 | consumed samples: 13002240 | consumed tokens: 26628587520 | elapsed time per iteration (s): 1.05 | learning rate: 1.381E-04 | global batch size: 256 | lm loss: 2.018986E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.515 | TFLOPs: 40.41 | 15: iteration 50800/ 125429 | consumed samples: 13004800 | consumed tokens: 26633830400 | elapsed time 
per iteration (s): 1.03 | learning rate: 1.381E-04 | global batch size: 256 | lm loss: 2.022177E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.420 | TFLOPs: 41.22 | 15: iteration 50810/ 125429 | consumed samples: 13007360 | consumed tokens: 26639073280 | elapsed time per iteration (s): 1.03 | learning rate: 1.381E-04 | global batch size: 256 | lm loss: 2.030364E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.716 | TFLOPs: 41.10 | 15: iteration 50820/ 125429 | consumed samples: 13009920 | consumed tokens: 26644316160 | elapsed time per iteration (s): 1.04 | learning rate: 1.380E-04 | global batch size: 256 | lm loss: 2.011931E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.032 | TFLOPs: 40.66 | 15: iteration 50830/ 125429 | consumed samples: 13012480 | consumed tokens: 26649559040 | elapsed time per iteration (s): 1.04 | learning rate: 1.380E-04 | global batch size: 256 | lm loss: 2.023653E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.183 | TFLOPs: 40.68 | 15: iteration 50840/ 125429 | consumed samples: 13015040 | consumed tokens: 26654801920 | elapsed time per iteration (s): 1.05 | learning rate: 1.380E-04 | global batch size: 256 | lm loss: 2.025764E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.262 | TFLOPs: 40.37 | 15: iteration 50850/ 125429 | consumed samples: 13017600 | consumed tokens: 26660044800 | elapsed time per iteration (s): 1.05 | learning rate: 1.380E-04 | global batch size: 256 | lm loss: 2.017415E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.848 | TFLOPs: 
40.13 | 15: iteration 50860/ 125429 | consumed samples: 13020160 | consumed tokens: 26665287680 | elapsed time per iteration (s): 1.04 | learning rate: 1.380E-04 | global batch size: 256 | lm loss: 2.024937E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.905 | TFLOPs: 40.80 | 15: iteration 50870/ 125429 | consumed samples: 13022720 | consumed tokens: 26670530560 | elapsed time per iteration (s): 1.04 | learning rate: 1.379E-04 | global batch size: 256 | lm loss: 2.014020E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.888 | TFLOPs: 40.80 | 15: iteration 50880/ 125429 | consumed samples: 13025280 | consumed tokens: 26675773440 | elapsed time per iteration (s): 1.03 | learning rate: 1.379E-04 | global batch size: 256 | lm loss: 2.014104E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.132 | TFLOPs: 41.01 | 15: iteration 50890/ 125429 | consumed samples: 13027840 | consumed tokens: 26681016320 | elapsed time per iteration (s): 1.03 | learning rate: 1.379E-04 | global batch size: 256 | lm loss: 1.987605E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.465 | TFLOPs: 41.23 | 15: iteration 50900/ 125429 | consumed samples: 13030400 | consumed tokens: 26686259200 | elapsed time per iteration (s): 1.04 | learning rate: 1.379E-04 | global batch size: 256 | lm loss: 1.980544E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.612 | TFLOPs: 40.59 | 15: iteration 50910/ 125429 | consumed samples: 13032960 | consumed tokens: 26691502080 | elapsed time per iteration (s): 1.02 | learning rate: 1.378E-04 | global batch size: 256 | lm loss: 2.007938E+00 | grad norm: 0.133 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.040 | TFLOPs: 41.32 | 15: iteration 50920/ 125429 | consumed samples: 13035520 | consumed tokens: 26696744960 | elapsed time per iteration (s): 1.06 | learning rate: 1.378E-04 | global batch size: 256 | lm loss: 2.013544E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.575 | TFLOPs: 40.09 | 15: iteration 50930/ 125429 | consumed samples: 13038080 | consumed tokens: 26701987840 | elapsed time per iteration (s): 1.03 | learning rate: 1.378E-04 | global batch size: 256 | lm loss: 2.033367E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.631 | TFLOPs: 41.25 | 15: iteration 50940/ 125429 | consumed samples: 13040640 | consumed tokens: 26707230720 | elapsed time per iteration (s): 1.12 | learning rate: 1.378E-04 | global batch size: 256 | lm loss: 2.055189E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.408 | TFLOPs: 37.91 | 15: iteration 50950/ 125429 | consumed samples: 13043200 | consumed tokens: 26712473600 | elapsed time per iteration (s): 1.04 | learning rate: 1.378E-04 | global batch size: 256 | lm loss: 2.017913E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.795 | TFLOPs: 40.78 | 15: iteration 50960/ 125429 | consumed samples: 13045760 | consumed tokens: 26717716480 | elapsed time per iteration (s): 1.07 | learning rate: 1.377E-04 | global batch size: 256 | lm loss: 2.007775E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.431 | TFLOPs: 39.57 | 15: iteration 50970/ 125429 | consumed samples: 13048320 | consumed tokens: 26722959360 | elapsed time per iteration (s): 1.05 | 
learning rate: 1.377E-04 | global batch size: 256 | lm loss: 2.041797E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.076 | TFLOPs: 40.34 |
15: iteration 50980/ 125429 | consumed samples: 13050880 | consumed tokens: 26728202240 | elapsed time per iteration (s): 1.07 | learning rate: 1.377E-04 | global batch size: 256 | lm loss: 2.038249E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.102 | TFLOPs: 39.68 |
15: iteration 50990/ 125429 | consumed samples: 13053440 | consumed tokens: 26733445120 | elapsed time per iteration (s): 1.03 | learning rate: 1.377E-04 | global batch size: 256 | lm loss: 1.992428E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.501 | TFLOPs: 41.07 |
15: iteration 51000/ 125429 | consumed samples: 13056000 | consumed tokens: 26738688000 | elapsed time per iteration (s): 1.03 | learning rate: 1.376E-04 | global batch size: 256 | lm loss: 2.006980E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.316 | TFLOPs: 41.04 |
15: -------------------------------------------------------------------------------------------
15: valid loss at iteration 51000 | lm loss value: 1.922098E+00 | lm loss PPL: 6.835286E+00 |
15: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 51000 to checkpoints_1b5
0: [2022-11-26 11:00:37,888] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step51000 is begin to save!
0: [2022-11-26 11:00:37,897] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_01-model_00-model_states.pt...
0: [2022-11-26 11:00:38,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_01-model_00-model_states.pt. 0: [2022-11-26 11:00:38,154] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_03-model_00-model_states.pt... 0: [2022-11-26 11:00:38,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_03-model_00-model_states.pt. 0: [2022-11-26 11:00:38,263] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_04-model_00-model_states.pt... 0: [2022-11-26 11:00:38,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_04-model_00-model_states.pt. 0: [2022-11-26 11:00:38,376] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_05-model_00-model_states.pt... 0: [2022-11-26 11:00:38,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_05-model_00-model_states.pt. 0: [2022-11-26 11:00:38,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_06-model_00-model_states.pt... 0: [2022-11-26 11:00:38,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_06-model_00-model_states.pt. 0: [2022-11-26 11:00:38,589] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_07-model_00-model_states.pt... 0: [2022-11-26 11:00:38,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_07-model_00-model_states.pt. 0: [2022-11-26 11:00:38,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 11:00:38,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_08-model_00-model_states.pt. 0: [2022-11-26 11:00:38,802] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_09-model_00-model_states.pt... 0: [2022-11-26 11:00:38,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_09-model_00-model_states.pt. 0: [2022-11-26 11:00:38,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_10-model_00-model_states.pt... 0: [2022-11-26 11:00:39,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_10-model_00-model_states.pt. 0: [2022-11-26 11:00:39,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_11-model_00-model_states.pt... 0: [2022-11-26 11:00:39,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_11-model_00-model_states.pt. 0: [2022-11-26 11:00:39,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_12-model_00-model_states.pt... 0: [2022-11-26 11:00:39,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_12-model_00-model_states.pt. 0: [2022-11-26 11:00:39,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_13-model_00-model_states.pt... 0: [2022-11-26 11:00:39,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_13-model_00-model_states.pt. 0: [2022-11-26 11:00:39,351] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 11:00:39,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_14-model_00-model_states.pt. 0: [2022-11-26 11:00:39,465] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_15-model_00-model_states.pt... 0: [2022-11-26 11:00:39,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_15-model_00-model_states.pt. 0: [2022-11-26 11:00:39,578] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_16-model_00-model_states.pt... 0: [2022-11-26 11:00:39,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_16-model_00-model_states.pt. 0: [2022-11-26 11:00:39,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_17-model_00-model_states.pt... 0: [2022-11-26 11:00:39,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_17-model_00-model_states.pt. 0: [2022-11-26 11:00:39,802] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_18-model_00-model_states.pt... 0: [2022-11-26 11:00:39,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_18-model_00-model_states.pt. 0: [2022-11-26 11:00:39,914] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_19-model_00-model_states.pt... 0: [2022-11-26 11:00:40,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_19-model_00-model_states.pt. 0: [2022-11-26 11:00:40,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 11:00:40,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_20-model_00-model_states.pt. 0: [2022-11-26 11:00:40,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_21-model_00-model_states.pt... 0: [2022-11-26 11:00:40,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_21-model_00-model_states.pt. 0: [2022-11-26 11:00:40,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_22-model_00-model_states.pt... 0: [2022-11-26 11:00:40,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_22-model_00-model_states.pt. 0: [2022-11-26 11:00:40,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_23-model_00-model_states.pt... 0: [2022-11-26 11:00:40,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_23-model_00-model_states.pt. 0: [2022-11-26 11:00:40,458] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_24-model_00-model_states.pt... 0: [2022-11-26 11:00:40,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_24-model_00-model_states.pt. 0: [2022-11-26 11:00:40,572] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_25-model_00-model_states.pt... 0: [2022-11-26 11:00:40,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_25-model_00-model_states.pt. 0: [2022-11-26 11:00:40,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 11:00:40,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_26-model_00-model_states.pt. 0: [2022-11-26 11:00:40,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_27-model_00-model_states.pt... 0: [2022-11-26 11:00:40,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_27-model_00-model_states.pt. 0: [2022-11-26 11:00:40,899] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_28-model_00-model_states.pt... 0: [2022-11-26 11:00:41,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_28-model_00-model_states.pt. 0: [2022-11-26 11:00:41,008] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_29-model_00-model_states.pt... 0: [2022-11-26 11:00:41,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_29-model_00-model_states.pt. 0: [2022-11-26 11:00:41,117] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_30-model_00-model_states.pt... 0: [2022-11-26 11:00:41,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_30-model_00-model_states.pt. 0: [2022-11-26 11:00:41,228] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/layer_32-model_00-model_states.pt... 0: [2022-11-26 11:00:41,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 11:00:41,233] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step51000/mp_rank_00_model_states.pt
0: [2022-11-26 11:00:41,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/mp_rank_00_model_states.pt...
0: [2022-11-26 11:00:41,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/mp_rank_00_model_states.pt.
0: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
0: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt...
0: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt...
0: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt...
0: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt...
0: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt...
0: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt...
1: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt...
1: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt...
1: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
5: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
9: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
14: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
7: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
3: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
11: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
10: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
6: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
5: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:00:41,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step51000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:00:41,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 11:00:41,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 11:00:41,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 1: [2022-11-26 11:00:41,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:00:41,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 11:00:41,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 12: [2022-11-26 11:00:41,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:00:41,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 11:00:41,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 15: [2022-11-26 11:00:41,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:00:41,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:00:41,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 11:00:41,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
5: [2022-11-26 11:00:41,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:00:41,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 11:00:41,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 15: [2022-11-26 11:00:41,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:00:41,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 11:00:41,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 3: [2022-11-26 11:00:41,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:00:41,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 11:00:41,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 12: [2022-11-26 11:00:41,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:00:41,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 11:00:41,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
6: [2022-11-26 11:00:41,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:00:41,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 11:00:41,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 5: [2022-11-26 11:00:41,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 11:00:41,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 2: [2022-11-26 11:00:41,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:00:41,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 11:00:41,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 3: [2022-11-26 11:00:41,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:00:41,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 11:00:41,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 0: [2022-11-26 11:00:41,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 11:00:41,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 11:00:41,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 1: [2022-11-26 11:00:41,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:00:41,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 11:00:41,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 2: [2022-11-26 11:00:41,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:00:41,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:00:41,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 4: [2022-11-26 11:00:41,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 11:00:41,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 2: [2022-11-26 11:00:41,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 2: [2022-11-26 11:00:41,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 11:00:41,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:00:41,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 11:00:41,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 2: [2022-11-26 11:00:41,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 11:00:41,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 5: [2022-11-26 11:00:41,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:00:41,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 11:00:41,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 1: [2022-11-26 11:00:41,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:00:41,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:00:41,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 11:00:41,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
1: [2022-11-26 11:00:41,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 11:00:41,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 13: [2022-11-26 11:00:41,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:00:41,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:00:41,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 11:00:41,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 0: [2022-11-26 11:00:41,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:00:41,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 11:00:41,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 1: [2022-11-26 11:00:41,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:00:41,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 11:00:41,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 1: [2022-11-26 11:00:41,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 11:00:41,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 10: [2022-11-26 11:00:41,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 15: [2022-11-26 11:00:41,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:00:41,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:00:41,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 11:00:41,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:00:41,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:00:41,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 12: [2022-11-26 11:00:41,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 11:00:41,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 0: [2022-11-26 11:00:41,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:00:41,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 11:00:41,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 11:00:41,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 12: [2022-11-26 11:00:41,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 12: [2022-11-26 11:00:41,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 1: [2022-11-26 11:00:41,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:00:41,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:00:41,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 11:00:41,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 10: [2022-11-26 11:00:41,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 11:00:41,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 11:00:41,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 8: [2022-11-26 11:00:41,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:00:41,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 8: [2022-11-26 11:00:41,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 13: [2022-11-26 11:00:41,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 8: [2022-11-26 11:00:41,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 13: [2022-11-26 11:00:41,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:00:41,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 11:00:41,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 13: [2022-11-26 11:00:41,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 11:00:41,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 11:00:41,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 6: [2022-11-26 11:00:41,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:00:41,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:00:41,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 11:00:41,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 5: [2022-11-26 11:00:41,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 11:00:41,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 2: [2022-11-26 11:00:41,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:00:41,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 11:00:41,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 12: [2022-11-26 11:00:41,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 11:00:41,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 11:00:41,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 2: [2022-11-26 11:00:41,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:00:41,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 11:00:41,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 4: [2022-11-26 11:00:41,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:00:41,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 11:00:41,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 12: [2022-11-26 11:00:41,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:00:41,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:00:41,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 11:00:41,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
10: [2022-11-26 11:00:41,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 11:00:41,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 6: [2022-11-26 11:00:41,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:00:41,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 11:00:41,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 1: [2022-11-26 11:00:41,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 11:00:41,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 1: [2022-11-26 11:00:41,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:00:41,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:00:41,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 11:00:41,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 11:00:41,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 11:00:41,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 9: [2022-11-26 11:00:41,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 0: [2022-11-26 11:00:41,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:00:41,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 11:00:41,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 2: [2022-11-26 11:00:41,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:00:41,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 11:00:41,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 15: [2022-11-26 11:00:41,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 11:00:41,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
15: [2022-11-26 11:00:41,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:00:41,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:00:41,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 10: [2022-11-26 11:00:41,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 15: [2022-11-26 11:00:41,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 10: [2022-11-26 11:00:41,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 15: [2022-11-26 11:00:41,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:00:41,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 11:00:41,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 15: [2022-11-26 11:00:41,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:00:41,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 11:00:41,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
15: [2022-11-26 11:00:41,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:00:41,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 11:00:41,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 0: [2022-11-26 11:00:41,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:00:41,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 11:00:41,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 0: [2022-11-26 11:00:41,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:00:41,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 11:00:41,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 1: [2022-11-26 11:00:41,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 11:00:41,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 3: [2022-11-26 11:00:41,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 11:00:41,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 11:00:41,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:00:41,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 9: [2022-11-26 11:00:41,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:00:41,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:00:41,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 9: [2022-11-26 11:00:41,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:00:41,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 11:00:41,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 11:00:41,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 3: [2022-11-26 11:00:41,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
9: [2022-11-26 11:00:41,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 11:00:41,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 9: [2022-11-26 11:00:41,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 4: [2022-11-26 11:00:41,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:00:41,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 11:00:41,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 3: [2022-11-26 11:00:41,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:00:41,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:00:41,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 11:00:41,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 11:00:41,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 11:00:41,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 11:00:41,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:00:41,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 3: [2022-11-26 11:00:41,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 3: [2022-11-26 11:00:41,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 3: [2022-11-26 11:00:41,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 11:00:41,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 8: [2022-11-26 11:00:41,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:00:41,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 11:00:41,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
10: [2022-11-26 11:00:41,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:00:41,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 11:00:41,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 4: [2022-11-26 11:00:41,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:00:41,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 11:00:41,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 1: [2022-11-26 11:00:41,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:00:41,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 10: [2022-11-26 11:00:41,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:00:41,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 10: [2022-11-26 11:00:41,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 11:00:41,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
10: [2022-11-26 11:00:41,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:00:41,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 11:00:41,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 15: [2022-11-26 11:00:41,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:00:41,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:00:41,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 11:00:41,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 10: [2022-11-26 11:00:41,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:00:41,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 11:00:41,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 5: [2022-11-26 11:00:41,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 11:00:41,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 11:00:41,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 4: [2022-11-26 11:00:41,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:00:41,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 11:00:41,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 5: [2022-11-26 11:00:41,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:00:41,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 11:00:41,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 5: [2022-11-26 11:00:41,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:00:41,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 11:00:41,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 4: [2022-11-26 11:00:41,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 11:00:41,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 11:00:41,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 2: [2022-11-26 11:00:41,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:00:41,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 11:00:41,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 9: [2022-11-26 11:00:41,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:00:41,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 11:00:41,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 6: [2022-11-26 11:00:41,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:00:41,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 11:00:41,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 14: [2022-11-26 11:00:41,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 11:00:41,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:00:41,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 11:00:41,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 14: [2022-11-26 11:00:41,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 11:00:41,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 14: [2022-11-26 11:00:41,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:00:41,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 11:00:41,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 14: [2022-11-26 11:00:41,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:00:41,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 11:00:41,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 11:00:41,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 11:00:41,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 14: [2022-11-26 11:00:41,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 14: [2022-11-26 11:00:41,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:00:41,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 11:00:41,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 14: [2022-11-26 11:00:41,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:00:41,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 11:00:41,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 14: [2022-11-26 11:00:41,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 11:00:41,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 11:00:41,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 15: [2022-11-26 11:00:41,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 11:00:41,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 9: [2022-11-26 11:00:41,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:00:41,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:00:41,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 11:00:41,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 11:00:41,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 9: [2022-11-26 11:00:41,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 6: [2022-11-26 11:00:41,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 11:00:41,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 11:00:41,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 6: [2022-11-26 11:00:41,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:00:41,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 11:00:41,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 5: [2022-11-26 11:00:41,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:00:41,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 11:00:41,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 13: [2022-11-26 11:00:41,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:00:41,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:00:41,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 11:00:41,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
8: [2022-11-26 11:00:41,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 11:00:41,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 8: [2022-11-26 11:00:41,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:00:41,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 11:00:41,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 8: [2022-11-26 11:00:41,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:00:41,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 11:00:41,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 8: [2022-11-26 11:00:41,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:00:41,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 11:00:41,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 0: [2022-11-26 11:00:41,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 11:00:41,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 11:00:41,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 6: [2022-11-26 11:00:41,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:00:41,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 11:00:41,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 6: [2022-11-26 11:00:41,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:00:41,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 11:00:41,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 13: [2022-11-26 11:00:41,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:00:41,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 11:00:41,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 13: [2022-11-26 11:00:41,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 11:00:41,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 11:00:41,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 13: [2022-11-26 11:00:41,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:00:41,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 11:00:41,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 13: [2022-11-26 11:00:41,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:00:41,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 11:00:41,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 1: [2022-11-26 11:00:41,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:00:41,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 11:00:41,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 11: [2022-11-26 11:00:41,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 11:00:41,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:00:41,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:00:41,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:00:41,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:00:41,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:00:41,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 11:00:41,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 11:00:41,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 11:00:41,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 11:00:41,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 11:00:41,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 11:00:41,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 11:00:41,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 11:00:41,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 11: [2022-11-26 11:00:41,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 11: [2022-11-26 11:00:41,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 11: [2022-11-26 11:00:41,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 11: [2022-11-26 11:00:41,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
11: [2022-11-26 11:00:41,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:00:41,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 11: [2022-11-26 11:00:41,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 11: [2022-11-26 11:00:41,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 11:00:41,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 0: [2022-11-26 11:00:41,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 11:00:41,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 7: [2022-11-26 11:00:41,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:00:41,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:00:41,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:00:41,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:00:41,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 11:00:41,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:00:41,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:00:41,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:00:41,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 11:00:41,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 11:00:41,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 11:00:41,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 11:00:41,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 11:00:41,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 11:00:41,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 11:00:41,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
7: [2022-11-26 11:00:41,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 7: [2022-11-26 11:00:41,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step51000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 11:00:41,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 7: [2022-11-26 11:00:41,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 7: [2022-11-26 11:00:41,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 7: [2022-11-26 11:00:41,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 7: [2022-11-26 11:00:41,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 7: [2022-11-26 11:00:41,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step51000 is ready now! 
0: successfully saved checkpoint at iteration 51000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3825.30 15: iteration 51010/ 125429 | consumed samples: 13058560 | consumed tokens: 26743930880 | elapsed time per iteration (s): 1.49 | learning rate: 1.376E-04 | global batch size: 256 | lm loss: 2.014108E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 172.032 | TFLOPs: 28.43 | 15: iteration 51020/ 125429 | consumed samples: 13061120 | consumed tokens: 26749173760 | elapsed time per iteration (s): 1.04 | learning rate: 1.376E-04 | global batch size: 256 | lm loss: 2.050269E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.089 | TFLOPs: 40.50 | 15: iteration 51030/ 125429 | consumed samples: 13063680 | consumed tokens: 26754416640 | elapsed time per iteration (s): 1.07 | learning rate: 1.376E-04 | global batch size: 256 | lm loss: 1.993677E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.334 | TFLOPs: 39.72 | 15: iteration 51040/ 125429 | consumed samples: 13066240 | consumed tokens: 26759659520 | elapsed time per iteration (s): 1.03 | learning rate: 1.376E-04 | global batch size: 256 | lm loss: 2.030118E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.466 | TFLOPs: 41.06 | 15: iteration 51050/ 125429 | consumed samples: 13068800 | consumed tokens: 26764902400 | elapsed time per iteration (s): 1.02 | learning rate: 1.375E-04 | global batch size: 256 | lm loss: 2.022705E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.030 | TFLOPs: 41.32 | 15: iteration 51060/ 125429 | consumed samples: 13071360 | consumed tokens: 26770145280 | elapsed time per iteration (s): 1.04 | 
learning rate: 1.375E-04 | global batch size: 256 | lm loss: 1.989169E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.929 | TFLOPs: 40.81 | 15: iteration 51070/ 125429 | consumed samples: 13073920 | consumed tokens: 26775388160 | elapsed time per iteration (s): 1.04 | learning rate: 1.375E-04 | global batch size: 256 | lm loss: 2.022300E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.727 | TFLOPs: 40.61 | 15: iteration 51080/ 125429 | consumed samples: 13076480 | consumed tokens: 26780631040 | elapsed time per iteration (s): 1.04 | learning rate: 1.375E-04 | global batch size: 256 | lm loss: 2.019012E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.459 | TFLOPs: 40.56 | 15: iteration 51090/ 125429 | consumed samples: 13079040 | consumed tokens: 26785873920 | elapsed time per iteration (s): 1.06 | learning rate: 1.375E-04 | global batch size: 256 | lm loss: 2.022004E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.885 | TFLOPs: 39.81 | 15: iteration 51100/ 125429 | consumed samples: 13081600 | consumed tokens: 26791116800 | elapsed time per iteration (s): 1.07 | learning rate: 1.374E-04 | global batch size: 256 | lm loss: 1.994989E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.264 | TFLOPs: 39.71 | 15: iteration 51110/ 125429 | consumed samples: 13084160 | consumed tokens: 26796359680 | elapsed time per iteration (s): 1.06 | learning rate: 1.374E-04 | global batch size: 256 | lm loss: 2.027062E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.930 | TFLOPs: 39.82 | 15: iteration 51120/ 
125429 | consumed samples: 13086720 | consumed tokens: 26801602560 | elapsed time per iteration (s): 1.03 | learning rate: 1.374E-04 | global batch size: 256 | lm loss: 2.014210E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.899 | TFLOPs: 41.13 | 15: iteration 51130/ 125429 | consumed samples: 13089280 | consumed tokens: 26806845440 | elapsed time per iteration (s): 1.03 | learning rate: 1.374E-04 | global batch size: 256 | lm loss: 2.018074E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.741 | TFLOPs: 41.11 | 15: iteration 51140/ 125429 | consumed samples: 13091840 | consumed tokens: 26812088320 | elapsed time per iteration (s): 1.07 | learning rate: 1.373E-04 | global batch size: 256 | lm loss: 2.020399E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.363 | TFLOPs: 39.56 | 15: iteration 51150/ 125429 | consumed samples: 13094400 | consumed tokens: 26817331200 | elapsed time per iteration (s): 1.05 | learning rate: 1.373E-04 | global batch size: 256 | lm loss: 2.055607E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.228 | TFLOPs: 40.36 | 15: iteration 51160/ 125429 | consumed samples: 13096960 | consumed tokens: 26822574080 | elapsed time per iteration (s): 1.04 | learning rate: 1.373E-04 | global batch size: 256 | lm loss: 2.045529E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.181 | TFLOPs: 40.68 | 15: iteration 51170/ 125429 | consumed samples: 13099520 | consumed tokens: 26827816960 | elapsed time per iteration (s): 1.05 | learning rate: 1.373E-04 | global batch size: 256 | lm loss: 1.996423E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 242.834 | TFLOPs: 40.13 | 15: iteration 51180/ 125429 | consumed samples: 13102080 | consumed tokens: 26833059840 | elapsed time per iteration (s): 1.04 | learning rate: 1.373E-04 | global batch size: 256 | lm loss: 2.003832E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.572 | TFLOPs: 40.75 | 15: iteration 51190/ 125429 | consumed samples: 13104640 | consumed tokens: 26838302720 | elapsed time per iteration (s): 1.04 | learning rate: 1.372E-04 | global batch size: 256 | lm loss: 2.022166E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.004 | TFLOPs: 40.49 | 15: iteration 51200/ 125429 | consumed samples: 13107200 | consumed tokens: 26843545600 | elapsed time per iteration (s): 1.05 | learning rate: 1.372E-04 | global batch size: 256 | lm loss: 1.998725E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.697 | TFLOPs: 40.44 | 15: iteration 51210/ 125429 | consumed samples: 13109760 | consumed tokens: 26848788480 | elapsed time per iteration (s): 1.05 | learning rate: 1.372E-04 | global batch size: 256 | lm loss: 2.006693E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.736 | TFLOPs: 40.44 | 15: iteration 51220/ 125429 | consumed samples: 13112320 | consumed tokens: 26854031360 | elapsed time per iteration (s): 1.03 | learning rate: 1.372E-04 | global batch size: 256 | lm loss: 2.023752E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.698 | TFLOPs: 41.10 | 15: iteration 51230/ 125429 | consumed samples: 13114880 | consumed tokens: 26859274240 | elapsed time per iteration (s): 1.02 | learning rate: 
1.371E-04 | global batch size: 256 | lm loss: 1.982523E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.568 | TFLOPs: 41.57 | 15: iteration 51240/ 125429 | consumed samples: 13117440 | consumed tokens: 26864517120 | elapsed time per iteration (s): 1.06 | learning rate: 1.371E-04 | global batch size: 256 | lm loss: 2.020471E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.443 | TFLOPs: 39.90 | 15: iteration 51250/ 125429 | consumed samples: 13120000 | consumed tokens: 26869760000 | elapsed time per iteration (s): 1.07 | learning rate: 1.371E-04 | global batch size: 256 | lm loss: 2.004679E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.394 | TFLOPs: 39.56 | 15: iteration 51260/ 125429 | consumed samples: 13122560 | consumed tokens: 26875002880 | elapsed time per iteration (s): 1.11 | learning rate: 1.371E-04 | global batch size: 256 | lm loss: 2.004154E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.048 | TFLOPs: 38.18 | 15: iteration 51270/ 125429 | consumed samples: 13125120 | consumed tokens: 26880245760 | elapsed time per iteration (s): 1.05 | learning rate: 1.371E-04 | global batch size: 256 | lm loss: 1.988485E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.569 | TFLOPs: 40.25 | 15: iteration 51280/ 125429 | consumed samples: 13127680 | consumed tokens: 26885488640 | elapsed time per iteration (s): 1.04 | learning rate: 1.370E-04 | global batch size: 256 | lm loss: 2.008551E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.191 | TFLOPs: 40.69 | 15: iteration 51290/ 125429 | 
consumed samples: 13130240 | consumed tokens: 26890731520 | elapsed time per iteration (s): 1.04 | learning rate: 1.370E-04 | global batch size: 256 | lm loss: 2.012642E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.876 | TFLOPs: 40.80 | 15: iteration 51300/ 125429 | consumed samples: 13132800 | consumed tokens: 26895974400 | elapsed time per iteration (s): 1.03 | learning rate: 1.370E-04 | global batch size: 256 | lm loss: 2.050447E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.684 | TFLOPs: 41.26 | 15: iteration 51310/ 125429 | consumed samples: 13135360 | consumed tokens: 26901217280 | elapsed time per iteration (s): 1.04 | learning rate: 1.370E-04 | global batch size: 256 | lm loss: 2.014142E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.452 | TFLOPs: 40.73 | 15: iteration 51320/ 125429 | consumed samples: 13137920 | consumed tokens: 26906460160 | elapsed time per iteration (s): 1.03 | learning rate: 1.370E-04 | global batch size: 256 | lm loss: 2.009180E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.787 | TFLOPs: 41.11 | 15: iteration 51330/ 125429 | consumed samples: 13140480 | consumed tokens: 26911703040 | elapsed time per iteration (s): 1.05 | learning rate: 1.369E-04 | global batch size: 256 | lm loss: 2.048065E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.532 | TFLOPs: 40.41 | 15: iteration 51340/ 125429 | consumed samples: 13143040 | consumed tokens: 26916945920 | elapsed time per iteration (s): 1.04 | learning rate: 1.369E-04 | global batch size: 256 | lm loss: 2.028332E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 246.965 | TFLOPs: 40.81 | 15: iteration 51350/ 125429 | consumed samples: 13145600 | consumed tokens: 26922188800 | elapsed time per iteration (s): 1.04 | learning rate: 1.369E-04 | global batch size: 256 | lm loss: 1.985136E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.245 | TFLOPs: 40.86 | 15: iteration 51360/ 125429 | consumed samples: 13148160 | consumed tokens: 26927431680 | elapsed time per iteration (s): 1.05 | learning rate: 1.369E-04 | global batch size: 256 | lm loss: 2.030118E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.372 | TFLOPs: 40.22 | 15: iteration 51370/ 125429 | consumed samples: 13150720 | consumed tokens: 26932674560 | elapsed time per iteration (s): 1.08 | learning rate: 1.368E-04 | global batch size: 256 | lm loss: 2.017210E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.168 | TFLOPs: 39.03 | 15: iteration 51380/ 125429 | consumed samples: 13153280 | consumed tokens: 26937917440 | elapsed time per iteration (s): 1.03 | learning rate: 1.368E-04 | global batch size: 256 | lm loss: 1.999545E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.730 | TFLOPs: 40.94 | 15: iteration 51390/ 125429 | consumed samples: 13155840 | consumed tokens: 26943160320 | elapsed time per iteration (s): 1.05 | learning rate: 1.368E-04 | global batch size: 256 | lm loss: 2.031809E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.214 | TFLOPs: 40.36 | 15: iteration 51400/ 125429 | consumed samples: 13158400 | consumed tokens: 26948403200 | elapsed time per iteration (s): 1.06 | learning rate: 1.368E-04 | global batch 
size: 256 | lm loss: 1.997499E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.675 | TFLOPs: 39.94 | 15: iteration 51410/ 125429 | consumed samples: 13160960 | consumed tokens: 26953646080 | elapsed time per iteration (s): 1.07 | learning rate: 1.368E-04 | global batch size: 256 | lm loss: 2.007900E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.004 | TFLOPs: 39.50 | 15: iteration 51420/ 125429 | consumed samples: 13163520 | consumed tokens: 26958888960 | elapsed time per iteration (s): 1.03 | learning rate: 1.367E-04 | global batch size: 256 | lm loss: 2.015005E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.621 | TFLOPs: 40.92 | 15: iteration 51430/ 125429 | consumed samples: 13166080 | consumed tokens: 26964131840 | elapsed time per iteration (s): 1.05 | learning rate: 1.367E-04 | global batch size: 256 | lm loss: 2.031886E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.417 | TFLOPs: 40.23 | 15: iteration 51440/ 125429 | consumed samples: 13168640 | consumed tokens: 26969374720 | elapsed time per iteration (s): 1.05 | learning rate: 1.367E-04 | global batch size: 256 | lm loss: 2.004964E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.717 | TFLOPs: 40.44 | 15: iteration 51450/ 125429 | consumed samples: 13171200 | consumed tokens: 26974617600 | elapsed time per iteration (s): 1.05 | learning rate: 1.367E-04 | global batch size: 256 | lm loss: 2.018211E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.686 | TFLOPs: 40.44 | 15: iteration 51460/ 125429 | consumed samples: 13173760 | 
consumed tokens: 26979860480 | elapsed time per iteration (s): 1.06 | learning rate: 1.366E-04 | global batch size: 256 | lm loss: 1.975400E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.208 | TFLOPs: 40.03 | 15: iteration 51470/ 125429 | consumed samples: 13176320 | consumed tokens: 26985103360 | elapsed time per iteration (s): 1.10 | learning rate: 1.366E-04 | global batch size: 256 | lm loss: 2.013226E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.431 | TFLOPs: 38.58 | 15: iteration 51480/ 125429 | consumed samples: 13178880 | consumed tokens: 26990346240 | elapsed time per iteration (s): 1.07 | learning rate: 1.366E-04 | global batch size: 256 | lm loss: 1.997258E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.403 | TFLOPs: 39.40 | 15: iteration 51490/ 125429 | consumed samples: 13181440 | consumed tokens: 26995589120 | elapsed time per iteration (s): 1.06 | learning rate: 1.366E-04 | global batch size: 256 | lm loss: 1.989616E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.036 | TFLOPs: 40.00 | 15: iteration 51500/ 125429 | consumed samples: 13184000 | consumed tokens: 27000832000 | elapsed time per iteration (s): 1.05 | learning rate: 1.366E-04 | global batch size: 256 | lm loss: 2.012127E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.896 | TFLOPs: 40.31 | 15: iteration 51510/ 125429 | consumed samples: 13186560 | consumed tokens: 27006074880 | elapsed time per iteration (s): 1.14 | learning rate: 1.365E-04 | global batch size: 256 | lm loss: 1.996990E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 223.988 | TFLOPs: 37.02 | 15: iteration 51520/ 125429 | consumed samples: 13189120 | consumed tokens: 27011317760 | elapsed time per iteration (s): 1.05 | learning rate: 1.365E-04 | global batch size: 256 | lm loss: 2.003159E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.203 | TFLOPs: 40.36 | 15: iteration 51530/ 125429 | consumed samples: 13191680 | consumed tokens: 27016560640 | elapsed time per iteration (s): 1.03 | learning rate: 1.365E-04 | global batch size: 256 | lm loss: 2.002935E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.660 | TFLOPs: 41.26 | 15: iteration 51540/ 125429 | consumed samples: 13194240 | consumed tokens: 27021803520 | elapsed time per iteration (s): 1.04 | learning rate: 1.365E-04 | global batch size: 256 | lm loss: 2.019133E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.090 | TFLOPs: 40.83 | 15: iteration 51550/ 125429 | consumed samples: 13196800 | consumed tokens: 27027046400 | elapsed time per iteration (s): 1.05 | learning rate: 1.365E-04 | global batch size: 256 | lm loss: 2.012208E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.020 | TFLOPs: 40.16 | 15: iteration 51560/ 125429 | consumed samples: 13199360 | consumed tokens: 27032289280 | elapsed time per iteration (s): 1.04 | learning rate: 1.364E-04 | global batch size: 256 | lm loss: 2.031199E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.824 | TFLOPs: 40.79 | 15: iteration 51570/ 125429 | consumed samples: 13201920 | consumed tokens: 27037532160 | elapsed time per iteration (s): 1.06 | learning rate: 1.364E-04 | global batch size: 256 | lm loss: 
2.020681E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.581 | TFLOPs: 40.09 | 15: iteration 51580/ 125429 | consumed samples: 13204480 | consumed tokens: 27042775040 | elapsed time per iteration (s): 1.04 | learning rate: 1.364E-04 | global batch size: 256 | lm loss: 2.005176E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.842 | TFLOPs: 40.79 | 15: iteration 51590/ 125429 | consumed samples: 13207040 | consumed tokens: 27048017920 | elapsed time per iteration (s): 1.04 | learning rate: 1.364E-04 | global batch size: 256 | lm loss: 1.995372E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.502 | TFLOPs: 40.74 | 15: iteration 51600/ 125429 | consumed samples: 13209600 | consumed tokens: 27053260800 | elapsed time per iteration (s): 1.04 | learning rate: 1.363E-04 | global batch size: 256 | lm loss: 1.997529E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.183 | TFLOPs: 40.68 | 15: iteration 51610/ 125429 | consumed samples: 13212160 | consumed tokens: 27058503680 | elapsed time per iteration (s): 1.05 | learning rate: 1.363E-04 | global batch size: 256 | lm loss: 2.030347E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.701 | TFLOPs: 40.44 | 15: iteration 51620/ 125429 | consumed samples: 13214720 | consumed tokens: 27063746560 | elapsed time per iteration (s): 1.05 | learning rate: 1.363E-04 | global batch size: 256 | lm loss: 2.019484E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.712 | TFLOPs: 40.44 | 15: iteration 51630/ 125429 | consumed samples: 13217280 | consumed tokens: 
27068989440 | elapsed time per iteration (s): 1.03 | learning rate: 1.363E-04 | global batch size: 256 | lm loss: 2.006108E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.698 | TFLOPs: 41.10 | 15: iteration 51640/ 125429 | consumed samples: 13219840 | consumed tokens: 27074232320 | elapsed time per iteration (s): 1.04 | learning rate: 1.363E-04 | global batch size: 256 | lm loss: 2.006943E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.790 | TFLOPs: 40.78 | 15: iteration 51650/ 125429 | consumed samples: 13222400 | consumed tokens: 27079475200 | elapsed time per iteration (s): 1.03 | learning rate: 1.362E-04 | global batch size: 256 | lm loss: 2.029142E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.850 | TFLOPs: 41.12 | 15: iteration 51660/ 125429 | consumed samples: 13224960 | consumed tokens: 27084718080 | elapsed time per iteration (s): 1.03 | learning rate: 1.362E-04 | global batch size: 256 | lm loss: 1.998736E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.625 | TFLOPs: 41.25 | 15: iteration 51670/ 125429 | consumed samples: 13227520 | consumed tokens: 27089960960 | elapsed time per iteration (s): 1.04 | learning rate: 1.362E-04 | global batch size: 256 | lm loss: 2.002676E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.035 | TFLOPs: 40.82 | 15: iteration 51680/ 125429 | consumed samples: 13230080 | consumed tokens: 27095203840 | elapsed time per iteration (s): 1.07 | learning rate: 1.362E-04 | global batch size: 256 | lm loss: 1.969548E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 238.950 | TFLOPs: 39.49 | 15: iteration 51690/ 125429 | consumed samples: 13232640 | consumed tokens: 27100446720 | elapsed time per iteration (s): 1.05 | learning rate: 1.361E-04 | global batch size: 256 | lm loss: 2.023298E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.787 | TFLOPs: 40.45 | 15: iteration 51700/ 125429 | consumed samples: 13235200 | consumed tokens: 27105689600 | elapsed time per iteration (s): 1.05 | learning rate: 1.361E-04 | global batch size: 256 | lm loss: 1.989489E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.332 | TFLOPs: 40.38 | 15: iteration 51710/ 125429 | consumed samples: 13237760 | consumed tokens: 27110932480 | elapsed time per iteration (s): 1.03 | learning rate: 1.361E-04 | global batch size: 256 | lm loss: 1.998645E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.589 | TFLOPs: 40.92 | 15: iteration 51720/ 125429 | consumed samples: 13240320 | consumed tokens: 27116175360 | elapsed time per iteration (s): 1.03 | learning rate: 1.361E-04 | global batch size: 256 | lm loss: 2.012079E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.793 | TFLOPs: 40.95 | 15: iteration 51730/ 125429 | consumed samples: 13242880 | consumed tokens: 27121418240 | elapsed time per iteration (s): 1.06 | learning rate: 1.361E-04 | global batch size: 256 | lm loss: 2.013679E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.869 | TFLOPs: 39.97 | 15: iteration 51740/ 125429 | consumed samples: 13245440 | consumed tokens: 27126661120 | elapsed time per iteration (s): 1.03 | learning rate: 1.360E-04 | global batch size: 256 | lm loss: 1.984356E+00 | grad 
norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.391 | TFLOPs: 41.05 | 15: iteration 51750/ 125429 | consumed samples: 13248000 | consumed tokens: 27131904000 | elapsed time per iteration (s): 1.08 | learning rate: 1.360E-04 | global batch size: 256 | lm loss: 1.994120E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.503 | TFLOPs: 39.08 | 15: iteration 51760/ 125429 | consumed samples: 13250560 | consumed tokens: 27137146880 | elapsed time per iteration (s): 1.05 | learning rate: 1.360E-04 | global batch size: 256 | lm loss: 1.997581E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.436 | TFLOPs: 40.39 | 15: iteration 51770/ 125429 | consumed samples: 13253120 | consumed tokens: 27142389760 | elapsed time per iteration (s): 1.03 | learning rate: 1.360E-04 | global batch size: 256 | lm loss: 2.023384E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.971 | TFLOPs: 41.14 | 15: iteration 51780/ 125429 | consumed samples: 13255680 | consumed tokens: 27147632640 | elapsed time per iteration (s): 1.02 | learning rate: 1.360E-04 | global batch size: 256 | lm loss: 2.023288E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.295 | TFLOPs: 41.53 | 15: iteration 51790/ 125429 | consumed samples: 13258240 | consumed tokens: 27152875520 | elapsed time per iteration (s): 1.05 | learning rate: 1.359E-04 | global batch size: 256 | lm loss: 2.018508E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.830 | TFLOPs: 40.13 | 15: iteration 51800/ 125429 | consumed samples: 13260800 | consumed tokens: 27158118400 | elapsed time 
per iteration (s): 1.03 | learning rate: 1.359E-04 | global batch size: 256 | lm loss: 2.002443E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.184 | TFLOPs: 41.18 | 15: iteration 51810/ 125429 | consumed samples: 13263360 | consumed tokens: 27163361280 | elapsed time per iteration (s): 1.03 | learning rate: 1.359E-04 | global batch size: 256 | lm loss: 2.016970E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.364 | TFLOPs: 41.21 | 15: iteration 51820/ 125429 | consumed samples: 13265920 | consumed tokens: 27168604160 | elapsed time per iteration (s): 1.05 | learning rate: 1.359E-04 | global batch size: 256 | lm loss: 2.030082E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.797 | TFLOPs: 40.45 | 15: iteration 51830/ 125429 | consumed samples: 13268480 | consumed tokens: 27173847040 | elapsed time per iteration (s): 1.04 | learning rate: 1.358E-04 | global batch size: 256 | lm loss: 2.004303E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.957 | TFLOPs: 40.81 | 15: iteration 51840/ 125429 | consumed samples: 13271040 | consumed tokens: 27179089920 | elapsed time per iteration (s): 1.05 | learning rate: 1.358E-04 | global batch size: 256 | lm loss: 2.007933E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.139 | TFLOPs: 40.35 | 15: iteration 51850/ 125429 | consumed samples: 13273600 | consumed tokens: 27184332800 | elapsed time per iteration (s): 1.03 | learning rate: 1.358E-04 | global batch size: 256 | lm loss: 1.999926E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.378 | TFLOPs: 
41.21 | 15: iteration 51860/ 125429 | consumed samples: 13276160 | consumed tokens: 27189575680 | elapsed time per iteration (s): 1.03 | learning rate: 1.358E-04 | global batch size: 256 | lm loss: 1.997950E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.387 | TFLOPs: 41.21 | 15: iteration 51870/ 125429 | consumed samples: 13278720 | consumed tokens: 27194818560 | elapsed time per iteration (s): 1.02 | learning rate: 1.358E-04 | global batch size: 256 | lm loss: 2.017077E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.025 | TFLOPs: 41.32 | 15: iteration 51880/ 125429 | consumed samples: 13281280 | consumed tokens: 27200061440 | elapsed time per iteration (s): 1.03 | learning rate: 1.357E-04 | global batch size: 256 | lm loss: 1.998045E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.626 | TFLOPs: 40.92 | 15: iteration 51890/ 125429 | consumed samples: 13283840 | consumed tokens: 27205304320 | elapsed time per iteration (s): 1.06 | learning rate: 1.357E-04 | global batch size: 256 | lm loss: 2.003564E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.580 | TFLOPs: 39.76 | 15: iteration 51900/ 125429 | consumed samples: 13286400 | consumed tokens: 27210547200 | elapsed time per iteration (s): 1.04 | learning rate: 1.357E-04 | global batch size: 256 | lm loss: 1.997734E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.119 | TFLOPs: 40.51 | 15: iteration 51910/ 125429 | consumed samples: 13288960 | consumed tokens: 27215790080 | elapsed time per iteration (s): 1.03 | learning rate: 1.357E-04 | global batch size: 256 | lm loss: 1.990635E+00 | grad norm: 0.128 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.490 | TFLOPs: 41.06 | 15: iteration 51920/ 125429 | consumed samples: 13291520 | consumed tokens: 27221032960 | elapsed time per iteration (s): 1.06 | learning rate: 1.356E-04 | global batch size: 256 | lm loss: 1.990302E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.416 | TFLOPs: 39.90 | 15: iteration 51930/ 125429 | consumed samples: 13294080 | consumed tokens: 27226275840 | elapsed time per iteration (s): 1.02 | learning rate: 1.356E-04 | global batch size: 256 | lm loss: 1.993055E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.714 | TFLOPs: 41.43 | 15: iteration 51940/ 125429 | consumed samples: 13296640 | consumed tokens: 27231518720 | elapsed time per iteration (s): 1.02 | learning rate: 1.356E-04 | global batch size: 256 | lm loss: 2.007200E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.643 | TFLOPs: 41.42 | 15: iteration 51950/ 125429 | consumed samples: 13299200 | consumed tokens: 27236761600 | elapsed time per iteration (s): 1.06 | learning rate: 1.356E-04 | global batch size: 256 | lm loss: 2.007953E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.879 | TFLOPs: 39.97 | 15: iteration 51960/ 125429 | consumed samples: 13301760 | consumed tokens: 27242004480 | elapsed time per iteration (s): 1.03 | learning rate: 1.356E-04 | global batch size: 256 | lm loss: 2.011664E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.045 | TFLOPs: 40.99 | 15: iteration 51970/ 125429 | consumed samples: 13304320 | consumed tokens: 27247247360 | elapsed time per iteration (s): 1.05 | 
learning rate: 1.355E-04 | global batch size: 256 | lm loss: 2.013151E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.531 | TFLOPs: 40.25 | 15: iteration 51980/ 125429 | consumed samples: 13306880 | consumed tokens: 27252490240 | elapsed time per iteration (s): 1.05 | learning rate: 1.355E-04 | global batch size: 256 | lm loss: 1.993593E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.759 | TFLOPs: 40.28 | 15: iteration 51990/ 125429 | consumed samples: 13309440 | consumed tokens: 27257733120 | elapsed time per iteration (s): 1.02 | learning rate: 1.355E-04 | global batch size: 256 | lm loss: 2.009283E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.763 | TFLOPs: 41.28 | 0: [2022-11-26 11:18:07,613] [INFO] [logging.py:68:log_dist] [Rank 0] step=52000, skipped=0, lr=[0.00013547182636253088, 0.00013547182636253088, 0.00013547182636253088], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 52000/ 125429 | consumed samples: 13312000 | consumed tokens: 27262976000 | elapsed time per iteration (s): 1.04 | learning rate: 1.355E-04 | global batch size: 256 | lm loss: 2.017097E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.040 | TFLOPs: 40.49 | 0: steps: 52000 loss: 1.9379 iter time (s): 1.044 samples/sec: 245.255 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 52000 | lm loss value: 1.982164E+00 | lm loss PPL: 7.258430E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 52000 to checkpoints_1b5 0: [2022-11-26 11:18:07,990] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step52000 is begin to save! 0: [2022-11-26 11:18:07,999] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_01-model_00-model_states.pt... 0: [2022-11-26 11:18:08,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_01-model_00-model_states.pt. 0: [2022-11-26 11:18:08,225] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_03-model_00-model_states.pt... 0: [2022-11-26 11:18:08,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_03-model_00-model_states.pt. 0: [2022-11-26 11:18:08,326] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_04-model_00-model_states.pt... 0: [2022-11-26 11:18:08,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_04-model_00-model_states.pt. 0: [2022-11-26 11:18:08,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_05-model_00-model_states.pt... 0: [2022-11-26 11:18:08,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_05-model_00-model_states.pt. 0: [2022-11-26 11:18:08,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_06-model_00-model_states.pt... 0: [2022-11-26 11:18:08,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_06-model_00-model_states.pt. 0: [2022-11-26 11:18:08,632] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_07-model_00-model_states.pt... 0: [2022-11-26 11:18:08,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 11:18:08,739] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_08-model_00-model_states.pt... 0: [2022-11-26 11:18:08,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_08-model_00-model_states.pt. 0: [2022-11-26 11:18:08,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_09-model_00-model_states.pt... 0: [2022-11-26 11:18:08,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_09-model_00-model_states.pt. 0: [2022-11-26 11:18:08,954] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_10-model_00-model_states.pt... 0: [2022-11-26 11:18:09,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_10-model_00-model_states.pt. 0: [2022-11-26 11:18:09,062] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_11-model_00-model_states.pt... 0: [2022-11-26 11:18:09,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_11-model_00-model_states.pt. 0: [2022-11-26 11:18:09,173] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_12-model_00-model_states.pt... 0: [2022-11-26 11:18:09,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_12-model_00-model_states.pt. 0: [2022-11-26 11:18:09,280] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_13-model_00-model_states.pt... 0: [2022-11-26 11:18:09,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 11:18:09,387] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_14-model_00-model_states.pt... 0: [2022-11-26 11:18:09,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_14-model_00-model_states.pt. 0: [2022-11-26 11:18:09,495] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_15-model_00-model_states.pt... 0: [2022-11-26 11:18:09,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_15-model_00-model_states.pt. 0: [2022-11-26 11:18:09,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_16-model_00-model_states.pt... 0: [2022-11-26 11:18:09,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_16-model_00-model_states.pt. 0: [2022-11-26 11:18:09,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_17-model_00-model_states.pt... 0: [2022-11-26 11:18:09,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_17-model_00-model_states.pt. 0: [2022-11-26 11:18:09,801] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_18-model_00-model_states.pt... 0: [2022-11-26 11:18:09,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_18-model_00-model_states.pt. 0: [2022-11-26 11:18:09,904] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_19-model_00-model_states.pt... 0: [2022-11-26 11:18:10,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 11:18:10,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_20-model_00-model_states.pt... 0: [2022-11-26 11:18:10,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_20-model_00-model_states.pt. 0: [2022-11-26 11:18:10,121] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_21-model_00-model_states.pt... 0: [2022-11-26 11:18:10,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_21-model_00-model_states.pt. 0: [2022-11-26 11:18:10,228] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_22-model_00-model_states.pt... 0: [2022-11-26 11:18:10,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_22-model_00-model_states.pt. 0: [2022-11-26 11:18:10,328] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_23-model_00-model_states.pt... 0: [2022-11-26 11:18:10,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_23-model_00-model_states.pt. 0: [2022-11-26 11:18:10,432] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_24-model_00-model_states.pt... 0: [2022-11-26 11:18:10,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_24-model_00-model_states.pt. 0: [2022-11-26 11:18:10,532] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_25-model_00-model_states.pt... 0: [2022-11-26 11:18:10,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 11:18:10,640] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_26-model_00-model_states.pt... 0: [2022-11-26 11:18:10,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_26-model_00-model_states.pt. 0: [2022-11-26 11:18:10,748] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_27-model_00-model_states.pt... 0: [2022-11-26 11:18:10,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_27-model_00-model_states.pt. 0: [2022-11-26 11:18:10,855] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_28-model_00-model_states.pt... 0: [2022-11-26 11:18:10,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_28-model_00-model_states.pt. 0: [2022-11-26 11:18:10,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_29-model_00-model_states.pt... 0: [2022-11-26 11:18:11,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_29-model_00-model_states.pt. 0: [2022-11-26 11:18:11,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_30-model_00-model_states.pt... 0: [2022-11-26 11:18:11,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_30-model_00-model_states.pt. 0: [2022-11-26 11:18:11,166] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/layer_32-model_00-model_states.pt... 0: [2022-11-26 11:18:11,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 11:18:11,171] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step52000/mp_rank_00_model_states.pt 0: [2022-11-26 11:18:11,171] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/mp_rank_00_model_states.pt... 0: [2022-11-26 11:18:11,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/mp_rank_00_model_states.pt. 0: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
15: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
4: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
8: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
10: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
15: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
13: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
11: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
5: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
3: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:18:11,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step52000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:18:11,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 11:18:11,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 11:18:11,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 6: [2022-11-26 11:18:11,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:18:11,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 11:18:11,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 10: [2022-11-26 11:18:11,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:18:11,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 11:18:11,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 1: [2022-11-26 11:18:11,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:18:11,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 11:18:11,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 8: [2022-11-26 11:18:11,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
7: [2022-11-26 11:18:11,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:18:11,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 15: [2022-11-26 11:18:11,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:18:11,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 15: [2022-11-26 11:18:11,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 11:18:11,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 15: [2022-11-26 11:18:11,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:18:11,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 11:18:11,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 15: [2022-11-26 11:18:11,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 8: [2022-11-26 11:18:11,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 11:18:11,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 15: [2022-11-26 11:18:11,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 8: [2022-11-26 11:18:11,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 1: [2022-11-26 11:18:11,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:18:11,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:18:11,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 11:18:11,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 11:18:11,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 1: [2022-11-26 11:18:11,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 8: [2022-11-26 11:18:11,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:18:11,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 11:18:11,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
7: [2022-11-26 11:18:11,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:18:11,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 1: [2022-11-26 11:18:11,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:18:11,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 1: [2022-11-26 11:18:11,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 11:18:11,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 0: [2022-11-26 11:18:11,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:18:11,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 11:18:11,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 0: [2022-11-26 11:18:11,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:18:11,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 11:18:11,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
7: [2022-11-26 11:18:11,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:18:11,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 11:18:11,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 15: [2022-11-26 11:18:11,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:18:11,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:18:11,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:18:11,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:18:11,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
9: [2022-11-26 11:18:11,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 4: [2022-11-26 11:18:11,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 6: [2022-11-26 11:18:11,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 9: [2022-11-26 11:18:11,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 11:18:11,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 4: [2022-11-26 11:18:11,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 6: [2022-11-26 11:18:11,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 4: [2022-11-26 11:18:11,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:18:11,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:18:11,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 6: [2022-11-26 11:18:11,385] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 4: [2022-11-26 11:18:11,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
6: [2022-11-26 11:18:11,385] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 9: [2022-11-26 11:18:11,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 4: [2022-11-26 11:18:11,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:18:11,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 11:18:11,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 11: [2022-11-26 11:18:11,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:18:11,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:18:11,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 11:18:11,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 1: [2022-11-26 11:18:11,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:18:11,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 7: [2022-11-26 11:18:11,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
1: [2022-11-26 11:18:11,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 7: [2022-11-26 11:18:11,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 11:18:11,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 7: [2022-11-26 11:18:11,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:18:11,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 11:18:11,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 14: [2022-11-26 11:18:11,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:18:11,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:18:11,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 11:18:11,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 11:18:11,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 11:18:11,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 11:18:11,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 14: [2022-11-26 11:18:11,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 14: [2022-11-26 11:18:11,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 8: [2022-11-26 11:18:11,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:18:11,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:18:11,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 11:18:11,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 12: [2022-11-26 11:18:11,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 11:18:11,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 2: [2022-11-26 11:18:11,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:18:11,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 2: [2022-11-26 11:18:11,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 11:18:11,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 2: [2022-11-26 11:18:11,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:18:11,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 9: [2022-11-26 11:18:11,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:18:11,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 9: [2022-11-26 11:18:11,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 11:18:11,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 3: [2022-11-26 11:18:11,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 11:18:11,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:18:11,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:18:11,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 11:18:11,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 11:18:11,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 2: [2022-11-26 11:18:11,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:18:11,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 3: [2022-11-26 11:18:11,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 3: [2022-11-26 11:18:11,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 2: [2022-11-26 11:18:11,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 11:18:11,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 3: [2022-11-26 11:18:11,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 11:18:11,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 8: [2022-11-26 11:18:11,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 3: [2022-11-26 11:18:11,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 8: [2022-11-26 11:18:11,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 9: [2022-11-26 11:18:11,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:18:11,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:18:11,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:18:11,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 11:18:11,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 11:18:11,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 11:18:11,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 9: [2022-11-26 11:18:11,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
9: [2022-11-26 11:18:11,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 4: [2022-11-26 11:18:11,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:18:11,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 11:18:11,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 15: [2022-11-26 11:18:11,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 11:18:11,386] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 15: [2022-11-26 11:18:11,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:18:11,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 11:18:11,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 15: [2022-11-26 11:18:11,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:18:11,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 11:18:11,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
11: [2022-11-26 11:18:11,383] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 11:18:11,383] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 11: [2022-11-26 11:18:11,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:18:11,388] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 11:18:11,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 11: [2022-11-26 11:18:11,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:18:11,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:18:11,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 11:18:11,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 7: [2022-11-26 11:18:11,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 11:18:11,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 15: [2022-11-26 11:18:11,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 11:18:11,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 11:18:11,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 10: [2022-11-26 11:18:11,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:18:11,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 11:18:11,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 1: [2022-11-26 11:18:11,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:18:11,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 11:18:11,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 3: [2022-11-26 11:18:11,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:18:11,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 11:18:11,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 8: [2022-11-26 11:18:11,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 11:18:11,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 11:18:11,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 2: [2022-11-26 11:18:11,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:18:11,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 11:18:11,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 6: [2022-11-26 11:18:11,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:18:11,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 11:18:11,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 6: [2022-11-26 11:18:11,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:18:11,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 11:18:11,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 11:18:11,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 11:18:11,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 6: [2022-11-26 11:18:11,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 12: [2022-11-26 11:18:11,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:18:11,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 11:18:11,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 10: [2022-11-26 11:18:11,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:18:11,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 11:18:11,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 12: [2022-11-26 11:18:11,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 11:18:11,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 11:18:11,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 12: [2022-11-26 11:18:11,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:18:11,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 11:18:11,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 8: [2022-11-26 11:18:11,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:18:11,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 11:18:11,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 10: [2022-11-26 11:18:11,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:18:11,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 12: [2022-11-26 11:18:11,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:18:11,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
12: [2022-11-26 11:18:11,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 11:18:11,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 10: [2022-11-26 11:18:11,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:18:11,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 11:18:11,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 10: [2022-11-26 11:18:11,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:18:11,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 11:18:11,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 8: [2022-11-26 11:18:11,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:18:11,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 11:18:11,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 0: [2022-11-26 11:18:11,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 11:18:11,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:18:11,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 11:18:11,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 11:18:11,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 0: [2022-11-26 11:18:11,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 11: [2022-11-26 11:18:11,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:18:11,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 11:18:11,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 11: [2022-11-26 11:18:11,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:18:11,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 11:18:11,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 11: [2022-11-26 11:18:11,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 11:18:11,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 11:18:11,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 0: [2022-11-26 11:18:11,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:18:11,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:18:11,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:18:11,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 11:18:11,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 8: [2022-11-26 11:18:11,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:18:11,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 0: [2022-11-26 11:18:11,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 8: [2022-11-26 11:18:11,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 11:18:11,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
15: [2022-11-26 11:18:11,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:18:11,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 11:18:11,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 7: [2022-11-26 11:18:11,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:18:11,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 11: [2022-11-26 11:18:11,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:18:11,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 4: [2022-11-26 11:18:11,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:18:11,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:18:11,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 2: [2022-11-26 11:18:11,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 4: [2022-11-26 11:18:11,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
2: [2022-11-26 11:18:11,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 14: [2022-11-26 11:18:11,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:18:11,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 11:18:11,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 14: [2022-11-26 11:18:11,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:18:11,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 11:18:11,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 14: [2022-11-26 11:18:11,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:18:11,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 11:18:11,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 14: [2022-11-26 11:18:11,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 11:18:11,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 11:18:11,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 14: [2022-11-26 11:18:11,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:18:11,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 11:18:11,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 0: [2022-11-26 11:18:11,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:18:11,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 11:18:11,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 9: [2022-11-26 11:18:11,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:18:11,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 11:18:11,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
11: [2022-11-26 11:18:11,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 11:18:11,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 12: [2022-11-26 11:18:11,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:18:11,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 11:18:11,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 2: [2022-11-26 11:18:11,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:18:11,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 11:18:11,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 6: [2022-11-26 11:18:11,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:18:11,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 11:18:11,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 12: [2022-11-26 11:18:11,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 11:18:11,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:18:11,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 11:18:11,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 11:18:11,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 12: [2022-11-26 11:18:11,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 3: [2022-11-26 11:18:11,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:18:11,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:18:11,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 11:18:11,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 11:18:11,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 3: [2022-11-26 11:18:11,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 3: [2022-11-26 11:18:11,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 11:18:11,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 11:18:11,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 1: [2022-11-26 11:18:11,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:18:11,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 11:18:11,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 1: [2022-11-26 11:18:11,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:18:11,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 11:18:11,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 10: [2022-11-26 11:18:11,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:18:11,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 11:18:11,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 2: [2022-11-26 11:18:11,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 11:18:11,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 11:18:11,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 2: [2022-11-26 11:18:11,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:18:11,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 11:18:11,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 11: [2022-11-26 11:18:11,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:18:11,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 11:18:11,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 4: [2022-11-26 11:18:11,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:18:11,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 11:18:11,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 4: [2022-11-26 11:18:11,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 11:18:11,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 11:18:11,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 9: [2022-11-26 11:18:11,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:18:11,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 11:18:11,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 6: [2022-11-26 11:18:11,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:18:11,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 11:18:11,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 5: [2022-11-26 11:18:11,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:18:11,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 11:18:11,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 5: [2022-11-26 11:18:11,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 11:18:11,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:18:11,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:18:11,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:18:11,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 11:18:11,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 11:18:11,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:18:11,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 11:18:11,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:18:11,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 11:18:11,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 5: [2022-11-26 11:18:11,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:18:11,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
5: [2022-11-26 11:18:11,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 5: [2022-11-26 11:18:11,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 11:18:11,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 5: [2022-11-26 11:18:11,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 11:18:11,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 11:18:11,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 5: [2022-11-26 11:18:11,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 5: [2022-11-26 11:18:11,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 4: [2022-11-26 11:18:11,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:18:11,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 11:18:11,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 13: [2022-11-26 11:18:11,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:18:11,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 11:18:11,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 11:18:11,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 11:18:11,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:18:11,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 13: [2022-11-26 11:18:11,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 13: [2022-11-26 11:18:11,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 11:18:11,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 13: [2022-11-26 11:18:11,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:18:11,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 11:18:11,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:18:11,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:18:11,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 11:18:11,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 13: [2022-11-26 11:18:11,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 11:18:11,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 11:18:11,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 11:18:11,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 13: [2022-11-26 11:18:11,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 13: [2022-11-26 11:18:11,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 13: [2022-11-26 11:18:11,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:18:11,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 11:18:11,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 0: [2022-11-26 11:18:11,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step52000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 11:18:11,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step52000 is ready now! 
0: successfully saved checkpoint at iteration 52000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3625.04 15: iteration 52010/ 125429 | consumed samples: 13314560 | consumed tokens: 27268218880 | elapsed time per iteration (s): 1.43 | learning rate: 1.354E-04 | global batch size: 256 | lm loss: 1.989047E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 179.435 | TFLOPs: 29.65 | 15: iteration 52020/ 125429 | consumed samples: 13317120 | consumed tokens: 27273461760 | elapsed time per iteration (s): 1.07 | learning rate: 1.354E-04 | global batch size: 256 | lm loss: 2.007180E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.073 | TFLOPs: 39.67 | 15: iteration 52030/ 125429 | consumed samples: 13319680 | consumed tokens: 27278704640 | elapsed time per iteration (s): 1.04 | learning rate: 1.354E-04 | global batch size: 256 | lm loss: 1.986224E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.369 | TFLOPs: 40.55 | 15: iteration 52040/ 125429 | consumed samples: 13322240 | consumed tokens: 27283947520 | elapsed time per iteration (s): 1.03 | learning rate: 1.354E-04 | global batch size: 256 | lm loss: 1.983901E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.629 | TFLOPs: 41.09 | 15: iteration 52050/ 125429 | consumed samples: 13324800 | consumed tokens: 27289190400 | elapsed time per iteration (s): 1.08 | learning rate: 1.354E-04 | global batch size: 256 | lm loss: 1.986714E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.661 | TFLOPs: 39.28 | 15: iteration 52060/ 125429 | consumed samples: 13327360 | consumed tokens: 27294433280 | elapsed time per iteration (s): 1.05 | 
learning rate: 1.353E-04 | global batch size: 256 | lm loss: 2.003878E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.410 | TFLOPs: 40.39 | 15: iteration 52070/ 125429 | consumed samples: 13329920 | consumed tokens: 27299676160 | elapsed time per iteration (s): 1.02 | learning rate: 1.353E-04 | global batch size: 256 | lm loss: 2.030054E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.927 | TFLOPs: 41.47 | 15: iteration 52080/ 125429 | consumed samples: 13332480 | consumed tokens: 27304919040 | elapsed time per iteration (s): 1.06 | learning rate: 1.353E-04 | global batch size: 256 | lm loss: 1.979981E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.972 | TFLOPs: 39.82 | 15: iteration 52090/ 125429 | consumed samples: 13335040 | consumed tokens: 27310161920 | elapsed time per iteration (s): 1.06 | learning rate: 1.353E-04 | global batch size: 256 | lm loss: 2.001109E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.840 | TFLOPs: 39.80 | 15: iteration 52100/ 125429 | consumed samples: 13337600 | consumed tokens: 27315404800 | elapsed time per iteration (s): 1.05 | learning rate: 1.353E-04 | global batch size: 256 | lm loss: 1.978970E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.530 | TFLOPs: 40.41 | 15: iteration 52110/ 125429 | consumed samples: 13340160 | consumed tokens: 27320647680 | elapsed time per iteration (s): 1.03 | learning rate: 1.352E-04 | global batch size: 256 | lm loss: 2.034870E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.583 | TFLOPs: 41.25 | 15: iteration 52120/ 
125429 | consumed samples: 13342720 | consumed tokens: 27325890560 | elapsed time per iteration (s): 2.54 | learning rate: 1.352E-04 | global batch size: 256 | lm loss: 1.996640E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 100.777 | TFLOPs: 16.65 | 15: iteration 52130/ 125429 | consumed samples: 13345280 | consumed tokens: 27331133440 | elapsed time per iteration (s): 1.05 | learning rate: 1.352E-04 | global batch size: 256 | lm loss: 2.017865E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.061 | TFLOPs: 40.17 | 15: iteration 52140/ 125429 | consumed samples: 13347840 | consumed tokens: 27336376320 | elapsed time per iteration (s): 1.03 | learning rate: 1.352E-04 | global batch size: 256 | lm loss: 2.020905E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.527 | TFLOPs: 40.91 | 15: iteration 52150/ 125429 | consumed samples: 13350400 | consumed tokens: 27341619200 | elapsed time per iteration (s): 1.03 | learning rate: 1.351E-04 | global batch size: 256 | lm loss: 2.034237E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.978 | TFLOPs: 41.15 | 15: iteration 52160/ 125429 | consumed samples: 13352960 | consumed tokens: 27346862080 | elapsed time per iteration (s): 1.05 | learning rate: 1.351E-04 | global batch size: 256 | lm loss: 2.016331E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.749 | TFLOPs: 40.45 | 15: iteration 52170/ 125429 | consumed samples: 13355520 | consumed tokens: 27352104960 | elapsed time per iteration (s): 1.43 | learning rate: 1.351E-04 | global batch size: 256 | lm loss: 2.026547E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 179.405 | TFLOPs: 29.65 | 15: iteration 52180/ 125429 | consumed samples: 13358080 | consumed tokens: 27357347840 | elapsed time per iteration (s): 1.42 | learning rate: 1.351E-04 | global batch size: 256 | lm loss: 2.022163E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 180.491 | TFLOPs: 29.83 | 15: iteration 52190/ 125429 | consumed samples: 13360640 | consumed tokens: 27362590720 | elapsed time per iteration (s): 1.02 | learning rate: 1.351E-04 | global batch size: 256 | lm loss: 2.015658E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.642 | TFLOPs: 41.59 | 15: iteration 52200/ 125429 | consumed samples: 13363200 | consumed tokens: 27367833600 | elapsed time per iteration (s): 1.04 | learning rate: 1.350E-04 | global batch size: 256 | lm loss: 1.997658E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.230 | TFLOPs: 40.86 | 15: iteration 52210/ 125429 | consumed samples: 13365760 | consumed tokens: 27373076480 | elapsed time per iteration (s): 2.71 | learning rate: 1.350E-04 | global batch size: 256 | lm loss: 2.044825E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 94.585 | TFLOPs: 15.63 | 15: iteration 52220/ 125429 | consumed samples: 13368320 | consumed tokens: 27378319360 | elapsed time per iteration (s): 1.04 | learning rate: 1.350E-04 | global batch size: 256 | lm loss: 2.018562E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.314 | TFLOPs: 40.71 | 15: iteration 52230/ 125429 | consumed samples: 13370880 | consumed tokens: 27383562240 | elapsed time per iteration (s): 1.05 | learning rate: 1.350E-04 
| global batch size: 256 | lm loss: 2.008428E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.899 | TFLOPs: 40.14 | 15: iteration 52240/ 125429 | consumed samples: 13373440 | consumed tokens: 27388805120 | elapsed time per iteration (s): 1.03 | learning rate: 1.349E-04 | global batch size: 256 | lm loss: 1.974340E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.077 | TFLOPs: 41.16 | 15: iteration 52250/ 125429 | consumed samples: 13376000 | consumed tokens: 27394048000 | elapsed time per iteration (s): 1.05 | learning rate: 1.349E-04 | global batch size: 256 | lm loss: 2.008769E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.167 | TFLOPs: 40.35 | 15: iteration 52260/ 125429 | consumed samples: 13378560 | consumed tokens: 27399290880 | elapsed time per iteration (s): 1.05 | learning rate: 1.349E-04 | global batch size: 256 | lm loss: 2.004653E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.994 | TFLOPs: 40.16 | 15: iteration 52270/ 125429 | consumed samples: 13381120 | consumed tokens: 27404533760 | elapsed time per iteration (s): 1.06 | learning rate: 1.349E-04 | global batch size: 256 | lm loss: 1.999847E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.685 | TFLOPs: 39.77 | 15: iteration 52280/ 125429 | consumed samples: 13383680 | consumed tokens: 27409776640 | elapsed time per iteration (s): 1.14 | learning rate: 1.349E-04 | global batch size: 256 | lm loss: 2.043108E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 223.708 | TFLOPs: 36.97 | 15: iteration 52290/ 125429 | consumed samples: 
13386240 | consumed tokens: 27415019520 | elapsed time per iteration (s): 1.15 | learning rate: 1.348E-04 | global batch size: 256 | lm loss: 2.024463E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.280 | TFLOPs: 36.73 | 15: iteration 52300/ 125429 | consumed samples: 13388800 | consumed tokens: 27420262400 | elapsed time per iteration (s): 1.04 | learning rate: 1.348E-04 | global batch size: 256 | lm loss: 2.022229E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.715 | TFLOPs: 40.61 | 15: iteration 52310/ 125429 | consumed samples: 13391360 | consumed tokens: 27425505280 | elapsed time per iteration (s): 1.04 | learning rate: 1.348E-04 | global batch size: 256 | lm loss: 2.009370E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.090 | TFLOPs: 40.50 | 15: iteration 52320/ 125429 | consumed samples: 13393920 | consumed tokens: 27430748160 | elapsed time per iteration (s): 1.05 | learning rate: 1.348E-04 | global batch size: 256 | lm loss: 2.005077E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.110 | TFLOPs: 40.18 | 15: iteration 52330/ 125429 | consumed samples: 13396480 | consumed tokens: 27435991040 | elapsed time per iteration (s): 1.04 | learning rate: 1.348E-04 | global batch size: 256 | lm loss: 1.998621E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.797 | TFLOPs: 40.79 | 15: iteration 52340/ 125429 | consumed samples: 13399040 | consumed tokens: 27441233920 | elapsed time per iteration (s): 1.03 | learning rate: 1.347E-04 | global batch size: 256 | lm loss: 2.011981E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 248.037 | TFLOPs: 40.99 | 15: iteration 52350/ 125429 | consumed samples: 13401600 | consumed tokens: 27446476800 | elapsed time per iteration (s): 1.06 | learning rate: 1.347E-04 | global batch size: 256 | lm loss: 1.996352E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.074 | TFLOPs: 39.84 | 15: iteration 52360/ 125429 | consumed samples: 13404160 | consumed tokens: 27451719680 | elapsed time per iteration (s): 1.14 | learning rate: 1.347E-04 | global batch size: 256 | lm loss: 1.985635E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.527 | TFLOPs: 37.27 | 15: iteration 52370/ 125429 | consumed samples: 13406720 | consumed tokens: 27456962560 | elapsed time per iteration (s): 1.05 | learning rate: 1.347E-04 | global batch size: 256 | lm loss: 2.026097E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.908 | TFLOPs: 40.14 | 15: iteration 52380/ 125429 | consumed samples: 13409280 | consumed tokens: 27462205440 | elapsed time per iteration (s): 1.08 | learning rate: 1.346E-04 | global batch size: 256 | lm loss: 2.004681E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.311 | TFLOPs: 39.22 | 15: iteration 52390/ 125429 | consumed samples: 13411840 | consumed tokens: 27467448320 | elapsed time per iteration (s): 1.03 | learning rate: 1.346E-04 | global batch size: 256 | lm loss: 1.996862E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.713 | TFLOPs: 40.94 | 15: iteration 52400/ 125429 | consumed samples: 13414400 | consumed tokens: 27472691200 | elapsed time per iteration (s): 1.04 | learning rate: 1.346E-04 | global batch size: 256 | 
lm loss: 2.002107E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.130 | TFLOPs: 40.51 | 15: iteration 52410/ 125429 | consumed samples: 13416960 | consumed tokens: 27477934080 | elapsed time per iteration (s): 1.08 | learning rate: 1.346E-04 | global batch size: 256 | lm loss: 2.029597E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.784 | TFLOPs: 39.30 | 15: iteration 52420/ 125429 | consumed samples: 13419520 | consumed tokens: 27483176960 | elapsed time per iteration (s): 1.03 | learning rate: 1.346E-04 | global batch size: 256 | lm loss: 2.026544E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.997 | TFLOPs: 40.98 | 15: iteration 52430/ 125429 | consumed samples: 13422080 | consumed tokens: 27488419840 | elapsed time per iteration (s): 1.08 | learning rate: 1.345E-04 | global batch size: 256 | lm loss: 2.002190E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.333 | TFLOPs: 39.22 | 15: iteration 52440/ 125429 | consumed samples: 13424640 | consumed tokens: 27493662720 | elapsed time per iteration (s): 1.44 | learning rate: 1.345E-04 | global batch size: 256 | lm loss: 2.005769E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.349 | TFLOPs: 29.31 | 15: iteration 52450/ 125429 | consumed samples: 13427200 | consumed tokens: 27498905600 | elapsed time per iteration (s): 1.04 | learning rate: 1.345E-04 | global batch size: 256 | lm loss: 1.987751E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.417 | TFLOPs: 40.72 | 15: iteration 52460/ 125429 | consumed samples: 13429760 | consumed 
tokens: 27504148480 | elapsed time per iteration (s): 1.14 | learning rate: 1.345E-04 | global batch size: 256 | lm loss: 2.009664E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 224.162 | TFLOPs: 37.04 | 15: iteration 52470/ 125429 | consumed samples: 13432320 | consumed tokens: 27509391360 | elapsed time per iteration (s): 1.05 | learning rate: 1.344E-04 | global batch size: 256 | lm loss: 2.032641E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.647 | TFLOPs: 40.43 | 15: iteration 52480/ 125429 | consumed samples: 13434880 | consumed tokens: 27514634240 | elapsed time per iteration (s): 1.05 | learning rate: 1.344E-04 | global batch size: 256 | lm loss: 1.993554E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.273 | TFLOPs: 40.37 | 15: iteration 52490/ 125429 | consumed samples: 13437440 | consumed tokens: 27519877120 | elapsed time per iteration (s): 1.12 | learning rate: 1.344E-04 | global batch size: 256 | lm loss: 2.008432E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.742 | TFLOPs: 37.64 | 15: iteration 52500/ 125429 | consumed samples: 13440000 | consumed tokens: 27525120000 | elapsed time per iteration (s): 1.25 | learning rate: 1.344E-04 | global batch size: 256 | lm loss: 1.985564E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 205.088 | TFLOPs: 33.89 | 15: iteration 52510/ 125429 | consumed samples: 13442560 | consumed tokens: 27530362880 | elapsed time per iteration (s): 1.04 | learning rate: 1.344E-04 | global batch size: 256 | lm loss: 1.985876E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 247.152 | TFLOPs: 40.84 | 15: iteration 52520/ 125429 | consumed samples: 13445120 | consumed tokens: 27535605760 | elapsed time per iteration (s): 1.04 | learning rate: 1.343E-04 | global batch size: 256 | lm loss: 1.981607E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.005 | TFLOPs: 40.65 | 15: iteration 52530/ 125429 | consumed samples: 13447680 | consumed tokens: 27540848640 | elapsed time per iteration (s): 1.05 | learning rate: 1.343E-04 | global batch size: 256 | lm loss: 1.980822E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.869 | TFLOPs: 40.30 | 15: iteration 52540/ 125429 | consumed samples: 13450240 | consumed tokens: 27546091520 | elapsed time per iteration (s): 1.03 | learning rate: 1.343E-04 | global batch size: 256 | lm loss: 2.020534E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.915 | TFLOPs: 40.97 | 15: iteration 52550/ 125429 | consumed samples: 13452800 | consumed tokens: 27551334400 | elapsed time per iteration (s): 1.04 | learning rate: 1.343E-04 | global batch size: 256 | lm loss: 2.008862E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.291 | TFLOPs: 40.87 | 15: iteration 52560/ 125429 | consumed samples: 13455360 | consumed tokens: 27556577280 | elapsed time per iteration (s): 1.09 | learning rate: 1.342E-04 | global batch size: 256 | lm loss: 2.014087E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.849 | TFLOPs: 38.98 | 15: iteration 52570/ 125429 | consumed samples: 13457920 | consumed tokens: 27561820160 | elapsed time per iteration (s): 1.09 | learning rate: 1.342E-04 | global batch size: 256 | lm loss: 2.007317E+00 | 
grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.105 | TFLOPs: 38.69 | 15: iteration 52580/ 125429 | consumed samples: 13460480 | consumed tokens: 27567063040 | elapsed time per iteration (s): 1.07 | learning rate: 1.342E-04 | global batch size: 256 | lm loss: 1.998550E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.942 | TFLOPs: 39.65 | 15: iteration 52590/ 125429 | consumed samples: 13463040 | consumed tokens: 27572305920 | elapsed time per iteration (s): 1.03 | learning rate: 1.342E-04 | global batch size: 256 | lm loss: 2.007098E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.107 | TFLOPs: 41.17 | 15: iteration 52600/ 125429 | consumed samples: 13465600 | consumed tokens: 27577548800 | elapsed time per iteration (s): 1.07 | learning rate: 1.342E-04 | global batch size: 256 | lm loss: 2.012803E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.015 | TFLOPs: 39.66 | 15: iteration 52610/ 125429 | consumed samples: 13468160 | consumed tokens: 27582791680 | elapsed time per iteration (s): 1.03 | learning rate: 1.341E-04 | global batch size: 256 | lm loss: 2.004024E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.969 | TFLOPs: 41.14 | 15: iteration 52620/ 125429 | consumed samples: 13470720 | consumed tokens: 27588034560 | elapsed time per iteration (s): 1.04 | learning rate: 1.341E-04 | global batch size: 256 | lm loss: 1.979321E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.400 | TFLOPs: 40.72 | 15: iteration 52630/ 125429 | consumed samples: 13473280 | consumed tokens: 27593277440 | elapsed 
time per iteration (s): 1.08 | learning rate: 1.341E-04 | global batch size: 256 | lm loss: 2.007589E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.381 | TFLOPs: 39.23 | 15: iteration 52640/ 125429 | consumed samples: 13475840 | consumed tokens: 27598520320 | elapsed time per iteration (s): 1.04 | learning rate: 1.341E-04 | global batch size: 256 | lm loss: 2.024520E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.257 | TFLOPs: 40.70 | 15: iteration 52650/ 125429 | consumed samples: 13478400 | consumed tokens: 27603763200 | elapsed time per iteration (s): 1.06 | learning rate: 1.340E-04 | global batch size: 256 | lm loss: 2.015588E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.274 | TFLOPs: 40.04 | 15: iteration 52660/ 125429 | consumed samples: 13480960 | consumed tokens: 27609006080 | elapsed time per iteration (s): 1.03 | learning rate: 1.340E-04 | global batch size: 256 | lm loss: 1.991391E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.955 | TFLOPs: 41.14 | 15: iteration 52670/ 125429 | consumed samples: 13483520 | consumed tokens: 27614248960 | elapsed time per iteration (s): 1.08 | learning rate: 1.340E-04 | global batch size: 256 | lm loss: 2.003608E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.887 | TFLOPs: 39.15 | 15: iteration 52680/ 125429 | consumed samples: 13486080 | consumed tokens: 27619491840 | elapsed time per iteration (s): 1.04 | learning rate: 1.340E-04 | global batch size: 256 | lm loss: 2.015002E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.559 | TFLOPs: 
40.58 | 15: iteration 52690/ 125429 | consumed samples: 13488640 | consumed tokens: 27624734720 | elapsed time per iteration (s): 1.07 | learning rate: 1.340E-04 | global batch size: 256 | lm loss: 2.009993E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.889 | TFLOPs: 39.48 | 15: iteration 52700/ 125429 | consumed samples: 13491200 | consumed tokens: 27629977600 | elapsed time per iteration (s): 1.03 | learning rate: 1.339E-04 | global batch size: 256 | lm loss: 1.998167E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.542 | TFLOPs: 41.24 | 15: iteration 52710/ 125429 | consumed samples: 13493760 | consumed tokens: 27635220480 | elapsed time per iteration (s): 1.06 | learning rate: 1.339E-04 | global batch size: 256 | lm loss: 1.997087E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.727 | TFLOPs: 39.78 | 15: iteration 52720/ 125429 | consumed samples: 13496320 | consumed tokens: 27640463360 | elapsed time per iteration (s): 1.09 | learning rate: 1.339E-04 | global batch size: 256 | lm loss: 2.037872E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.106 | TFLOPs: 38.69 | 15: iteration 52730/ 125429 | consumed samples: 13498880 | consumed tokens: 27645706240 | elapsed time per iteration (s): 1.02 | learning rate: 1.339E-04 | global batch size: 256 | lm loss: 2.019756E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.940 | TFLOPs: 41.30 | 15: iteration 52740/ 125429 | consumed samples: 13501440 | consumed tokens: 27650949120 | elapsed time per iteration (s): 1.08 | learning rate: 1.339E-04 | global batch size: 256 | lm loss: 2.002931E+00 | grad norm: 0.142 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.913 | TFLOPs: 39.15 | 15: iteration 52750/ 125429 | consumed samples: 13504000 | consumed tokens: 27656192000 | elapsed time per iteration (s): 1.06 | learning rate: 1.338E-04 | global batch size: 256 | lm loss: 2.019683E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.669 | TFLOPs: 39.77 | 15: iteration 52760/ 125429 | consumed samples: 13506560 | consumed tokens: 27661434880 | elapsed time per iteration (s): 1.05 | learning rate: 1.338E-04 | global batch size: 256 | lm loss: 2.032836E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.453 | TFLOPs: 40.23 | 15: iteration 52770/ 125429 | consumed samples: 13509120 | consumed tokens: 27666677760 | elapsed time per iteration (s): 1.06 | learning rate: 1.338E-04 | global batch size: 256 | lm loss: 2.016212E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.268 | TFLOPs: 40.04 | 15: iteration 52780/ 125429 | consumed samples: 13511680 | consumed tokens: 27671920640 | elapsed time per iteration (s): 1.05 | learning rate: 1.338E-04 | global batch size: 256 | lm loss: 2.020450E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.777 | TFLOPs: 40.29 | 15: iteration 52790/ 125429 | consumed samples: 13514240 | consumed tokens: 27677163520 | elapsed time per iteration (s): 1.06 | learning rate: 1.337E-04 | global batch size: 256 | lm loss: 1.993774E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.536 | TFLOPs: 39.92 | 15: iteration 52800/ 125429 | consumed samples: 13516800 | consumed tokens: 27682406400 | elapsed time per iteration (s): 1.03 | 
learning rate: 1.337E-04 | global batch size: 256 | lm loss: 1.976686E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.364 | TFLOPs: 40.88 | 15: iteration 52810/ 125429 | consumed samples: 13519360 | consumed tokens: 27687649280 | elapsed time per iteration (s): 1.05 | learning rate: 1.337E-04 | global batch size: 256 | lm loss: 2.003624E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.305 | TFLOPs: 40.37 | 15: iteration 52820/ 125429 | consumed samples: 13521920 | consumed tokens: 27692892160 | elapsed time per iteration (s): 1.06 | learning rate: 1.337E-04 | global batch size: 256 | lm loss: 2.016082E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.423 | TFLOPs: 40.06 | 15: iteration 52830/ 125429 | consumed samples: 13524480 | consumed tokens: 27698135040 | elapsed time per iteration (s): 1.03 | learning rate: 1.337E-04 | global batch size: 256 | lm loss: 2.012459E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.667 | TFLOPs: 40.93 | 15: iteration 52840/ 125429 | consumed samples: 13527040 | consumed tokens: 27703377920 | elapsed time per iteration (s): 1.05 | learning rate: 1.336E-04 | global batch size: 256 | lm loss: 1.994146E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.242 | TFLOPs: 40.36 | 15: iteration 52850/ 125429 | consumed samples: 13529600 | consumed tokens: 27708620800 | elapsed time per iteration (s): 1.06 | learning rate: 1.336E-04 | global batch size: 256 | lm loss: 2.009203E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.807 | TFLOPs: 39.80 | 15: iteration 52860/ 
125429 | consumed samples: 13532160 | consumed tokens: 27713863680 | elapsed time per iteration (s): 1.05 | learning rate: 1.336E-04 | global batch size: 256 | lm loss: 1.994959E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.893 | TFLOPs: 40.31 | 15: iteration 52870/ 125429 | consumed samples: 13534720 | consumed tokens: 27719106560 | elapsed time per iteration (s): 1.04 | learning rate: 1.336E-04 | global batch size: 256 | lm loss: 2.044281E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.002 | TFLOPs: 40.65 | 15: iteration 52880/ 125429 | consumed samples: 13537280 | consumed tokens: 27724349440 | elapsed time per iteration (s): 1.07 | learning rate: 1.335E-04 | global batch size: 256 | lm loss: 2.003813E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.739 | TFLOPs: 39.45 | 15: iteration 52890/ 125429 | consumed samples: 13539840 | consumed tokens: 27729592320 | elapsed time per iteration (s): 1.06 | learning rate: 1.335E-04 | global batch size: 256 | lm loss: 2.019641E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.180 | TFLOPs: 40.02 | 15: iteration 52900/ 125429 | consumed samples: 13542400 | consumed tokens: 27734835200 | elapsed time per iteration (s): 1.11 | learning rate: 1.335E-04 | global batch size: 256 | lm loss: 1.998481E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.583 | TFLOPs: 38.27 | 15: iteration 52910/ 125429 | consumed samples: 13544960 | consumed tokens: 27740078080 | elapsed time per iteration (s): 1.07 | learning rate: 1.335E-04 | global batch size: 256 | lm loss: 1.994487E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 240.123 | TFLOPs: 39.68 | 15: iteration 52920/ 125429 | consumed samples: 13547520 | consumed tokens: 27745320960 | elapsed time per iteration (s): 1.06 | learning rate: 1.335E-04 | global batch size: 256 | lm loss: 1.974384E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.896 | TFLOPs: 39.81 | 15: iteration 52930/ 125429 | consumed samples: 13550080 | consumed tokens: 27750563840 | elapsed time per iteration (s): 1.06 | learning rate: 1.334E-04 | global batch size: 256 | lm loss: 2.002498E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.592 | TFLOPs: 39.92 | 15: iteration 52940/ 125429 | consumed samples: 13552640 | consumed tokens: 27755806720 | elapsed time per iteration (s): 1.03 | learning rate: 1.334E-04 | global batch size: 256 | lm loss: 1.988657E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.648 | TFLOPs: 41.09 | 15: iteration 52950/ 125429 | consumed samples: 13555200 | consumed tokens: 27761049600 | elapsed time per iteration (s): 1.05 | learning rate: 1.334E-04 | global batch size: 256 | lm loss: 1.949637E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.846 | TFLOPs: 40.30 | 15: iteration 52960/ 125429 | consumed samples: 13557760 | consumed tokens: 27766292480 | elapsed time per iteration (s): 1.07 | learning rate: 1.334E-04 | global batch size: 256 | lm loss: 2.012422E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.221 | TFLOPs: 39.53 | 15: iteration 52970/ 125429 | consumed samples: 13560320 | consumed tokens: 27771535360 | elapsed time per iteration (s): 1.04 | learning rate: 
1.333E-04 | global batch size: 256 | lm loss: 2.012981E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.367 | TFLOPs: 40.55 | 15: iteration 52980/ 125429 | consumed samples: 13562880 | consumed tokens: 27776778240 | elapsed time per iteration (s): 1.08 | learning rate: 1.333E-04 | global batch size: 256 | lm loss: 2.016447E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.040 | TFLOPs: 39.17 | 15: iteration 52990/ 125429 | consumed samples: 13565440 | consumed tokens: 27782021120 | elapsed time per iteration (s): 1.04 | learning rate: 1.333E-04 | global batch size: 256 | lm loss: 1.996748E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.997 | TFLOPs: 40.82 | 15: iteration 53000/ 125429 | consumed samples: 13568000 | consumed tokens: 27787264000 | elapsed time per iteration (s): 1.05 | learning rate: 1.333E-04 | global batch size: 256 | lm loss: 2.008182E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.767 | TFLOPs: 40.28 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 53000 | lm loss value: 2.047271E+00 | lm loss PPL: 7.746732E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 53000 to checkpoints_1b5 0: [2022-11-26 11:36:32,166] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step53000 is begin to save! 0: [2022-11-26 11:36:32,174] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 11:36:32,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_01-model_00-model_states.pt. 0: [2022-11-26 11:36:32,434] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_03-model_00-model_states.pt... 0: [2022-11-26 11:36:32,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_03-model_00-model_states.pt. 0: [2022-11-26 11:36:32,539] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_04-model_00-model_states.pt... 0: [2022-11-26 11:36:32,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_04-model_00-model_states.pt. 0: [2022-11-26 11:36:32,653] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_05-model_00-model_states.pt... 0: [2022-11-26 11:36:32,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_05-model_00-model_states.pt. 0: [2022-11-26 11:36:32,765] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_06-model_00-model_states.pt... 0: [2022-11-26 11:36:32,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_06-model_00-model_states.pt. 0: [2022-11-26 11:36:32,876] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_07-model_00-model_states.pt... 0: [2022-11-26 11:36:32,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_07-model_00-model_states.pt. 0: [2022-11-26 11:36:32,990] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 11:36:33,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_08-model_00-model_states.pt. 0: [2022-11-26 11:36:33,101] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_09-model_00-model_states.pt... 0: [2022-11-26 11:36:33,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_09-model_00-model_states.pt. 0: [2022-11-26 11:36:33,219] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_10-model_00-model_states.pt... 0: [2022-11-26 11:36:33,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_10-model_00-model_states.pt. 0: [2022-11-26 11:36:33,326] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_11-model_00-model_states.pt... 0: [2022-11-26 11:36:33,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_11-model_00-model_states.pt. 0: [2022-11-26 11:36:33,441] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_12-model_00-model_states.pt... 0: [2022-11-26 11:36:33,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_12-model_00-model_states.pt. 0: [2022-11-26 11:36:33,555] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_13-model_00-model_states.pt... 0: [2022-11-26 11:36:33,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_13-model_00-model_states.pt. 0: [2022-11-26 11:36:33,665] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 11:36:33,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_14-model_00-model_states.pt. 0: [2022-11-26 11:36:33,773] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_15-model_00-model_states.pt... 0: [2022-11-26 11:36:33,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_15-model_00-model_states.pt. 0: [2022-11-26 11:36:33,891] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_16-model_00-model_states.pt... 0: [2022-11-26 11:36:33,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_16-model_00-model_states.pt. 0: [2022-11-26 11:36:33,999] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_17-model_00-model_states.pt... 0: [2022-11-26 11:36:34,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_17-model_00-model_states.pt. 0: [2022-11-26 11:36:34,110] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_18-model_00-model_states.pt... 0: [2022-11-26 11:36:34,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_18-model_00-model_states.pt. 0: [2022-11-26 11:36:34,214] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_19-model_00-model_states.pt... 0: [2022-11-26 11:36:34,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_19-model_00-model_states.pt. 0: [2022-11-26 11:36:34,327] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 11:36:34,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_20-model_00-model_states.pt. 0: [2022-11-26 11:36:34,432] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_21-model_00-model_states.pt... 0: [2022-11-26 11:36:34,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_21-model_00-model_states.pt. 0: [2022-11-26 11:36:34,544] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_22-model_00-model_states.pt... 0: [2022-11-26 11:36:34,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_22-model_00-model_states.pt. 0: [2022-11-26 11:36:34,647] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_23-model_00-model_states.pt... 0: [2022-11-26 11:36:34,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_23-model_00-model_states.pt. 0: [2022-11-26 11:36:34,758] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_24-model_00-model_states.pt... 0: [2022-11-26 11:36:34,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_24-model_00-model_states.pt. 0: [2022-11-26 11:36:34,864] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_25-model_00-model_states.pt... 0: [2022-11-26 11:36:34,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_25-model_00-model_states.pt. 0: [2022-11-26 11:36:34,977] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 11:36:35,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_26-model_00-model_states.pt. 0: [2022-11-26 11:36:35,085] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_27-model_00-model_states.pt... 0: [2022-11-26 11:36:35,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_27-model_00-model_states.pt. 0: [2022-11-26 11:36:35,190] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_28-model_00-model_states.pt... 0: [2022-11-26 11:36:35,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_28-model_00-model_states.pt. 0: [2022-11-26 11:36:35,302] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_29-model_00-model_states.pt... 0: [2022-11-26 11:36:35,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_29-model_00-model_states.pt. 0: [2022-11-26 11:36:35,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_30-model_00-model_states.pt... 0: [2022-11-26 11:36:35,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_30-model_00-model_states.pt. 0: [2022-11-26 11:36:35,526] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/layer_32-model_00-model_states.pt... 0: [2022-11-26 11:36:35,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 11:36:35,532] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step53000/mp_rank_00_model_states.pt 0: [2022-11-26 11:36:35,532] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/mp_rank_00_model_states.pt... 0: [2022-11-26 11:36:35,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/mp_rank_00_model_states.pt. 0: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
15: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
13: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
8: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
1: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
5: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
14: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
2: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
12: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 1: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
6: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 14: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 7: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 11: [2022-11-26 11:36:35,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step53000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 0: [2022-11-26 11:36:35,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
0: [2022-11-26 11:36:35,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:36:35,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 11:36:35,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 7: [2022-11-26 11:36:35,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:36:35,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 0: [2022-11-26 11:36:35,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:36:35,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 7: [2022-11-26 11:36:35,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 0: [2022-11-26 11:36:35,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 6: [2022-11-26 11:36:35,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:36:35,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 11:36:35,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
3: [2022-11-26 11:36:35,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:36:35,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 11:36:35,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 7: [2022-11-26 11:36:35,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:36:35,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 11:36:35,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 7: [2022-11-26 11:36:35,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:36:35,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 11:36:35,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 6: [2022-11-26 11:36:35,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:36:35,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 11:36:35,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
8: [2022-11-26 11:36:35,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:36:35,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 11:36:35,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 9: [2022-11-26 11:36:35,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:36:35,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 2: [2022-11-26 11:36:35,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:36:35,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 2: [2022-11-26 11:36:35,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 11:36:35,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 0: [2022-11-26 11:36:35,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:36:35,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 11:36:35,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
11: [2022-11-26 11:36:35,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:36:35,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:36:35,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 11:36:35,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 0: [2022-11-26 11:36:35,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:36:35,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 11:36:35,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 6: [2022-11-26 11:36:35,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:36:35,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 11:36:35,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 8: [2022-11-26 11:36:35,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 11:36:35,750] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 11:36:35,750] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 12: [2022-11-26 11:36:35,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:36:35,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 11:36:35,751] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 5: [2022-11-26 11:36:35,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:36:35,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 11:36:35,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 7: [2022-11-26 11:36:35,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:36:35,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
7: [2022-11-26 11:36:35,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 2: [2022-11-26 11:36:35,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 7: [2022-11-26 11:36:35,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 2: [2022-11-26 11:36:35,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 12: [2022-11-26 11:36:35,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:36:35,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:36:35,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 11:36:35,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 2: [2022-11-26 11:36:35,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 11:36:35,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 2: [2022-11-26 11:36:35,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 11:36:35,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 11:36:35,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 6: [2022-11-26 11:36:35,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:36:35,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 8: [2022-11-26 11:36:35,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:36:35,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 6: [2022-11-26 11:36:35,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 8: [2022-11-26 11:36:35,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 5: [2022-11-26 11:36:35,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:36:35,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 11:36:35,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 12: [2022-11-26 11:36:35,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 11:36:35,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 11:36:35,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 6: [2022-11-26 11:36:35,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:36:35,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 11:36:35,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 3: [2022-11-26 11:36:35,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:36:35,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 11:36:35,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 8: [2022-11-26 11:36:35,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:36:35,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 11:36:35,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 3: [2022-11-26 11:36:35,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 11:36:35,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 12: [2022-11-26 11:36:35,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:36:35,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 12: [2022-11-26 11:36:35,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 11:36:35,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 2: [2022-11-26 11:36:35,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:36:35,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 11:36:35,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 0: [2022-11-26 11:36:35,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:36:35,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 11:36:35,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 5: [2022-11-26 11:36:35,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 11:36:35,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:36:35,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:36:35,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 11:36:35,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 11:36:35,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 11:36:35,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 5: [2022-11-26 11:36:35,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 5: [2022-11-26 11:36:35,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 5: [2022-11-26 11:36:35,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:36:35,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 11:36:35,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 6: [2022-11-26 11:36:35,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 11:36:35,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 11:36:35,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:36:35,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 6: [2022-11-26 11:36:35,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 11:36:35,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 0: [2022-11-26 11:36:35,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 11:36:35,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 5: [2022-11-26 11:36:35,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:36:35,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 11:36:35,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 12: [2022-11-26 11:36:35,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 11:36:35,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 11:36:35,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 5: [2022-11-26 11:36:35,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 11:36:35,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 11:36:35,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 9: [2022-11-26 11:36:35,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:36:35,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 11:36:35,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 9: [2022-11-26 11:36:35,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:36:35,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 11:36:35,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 9: [2022-11-26 11:36:35,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 11:36:35,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 11:36:35,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 2: [2022-11-26 11:36:35,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:36:35,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 11:36:35,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 3: [2022-11-26 11:36:35,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:36:35,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 11:36:35,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 12: [2022-11-26 11:36:35,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:36:35,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 11:36:35,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 3: [2022-11-26 11:36:35,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 11:36:35,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:36:35,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 11:36:35,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 11:36:35,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 3: [2022-11-26 11:36:35,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 11: [2022-11-26 11:36:35,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 11:36:35,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 11: [2022-11-26 11:36:35,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:36:35,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 11:36:35,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 11: [2022-11-26 11:36:35,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 11:36:35,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 11:36:35,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 11: [2022-11-26 11:36:35,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:36:35,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 11:36:35,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 11: [2022-11-26 11:36:35,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:36:35,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 11:36:35,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 11: [2022-11-26 11:36:35,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:36:35,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 11:36:35,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 11: [2022-11-26 11:36:35,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 11:36:35,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 11:36:35,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 11: [2022-11-26 11:36:35,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 11:36:35,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 11:36:35,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 10: [2022-11-26 11:36:35,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:36:35,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:36:35,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:36:35,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 8: [2022-11-26 11:36:35,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
10: [2022-11-26 11:36:35,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 11:36:35,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 8: [2022-11-26 11:36:35,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 10: [2022-11-26 11:36:35,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 8: [2022-11-26 11:36:35,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 10: [2022-11-26 11:36:35,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 10: [2022-11-26 11:36:35,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 8: [2022-11-26 11:36:35,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:36:35,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 11:36:35,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 8: [2022-11-26 11:36:35,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-26 11:36:35,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 11:36:35,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 0: [2022-11-26 11:36:35,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 11:36:35,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 11:36:35,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 9: [2022-11-26 11:36:35,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:36:35,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 11:36:35,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 7: [2022-11-26 11:36:35,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:36:35,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 11:36:35,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 11:36:35,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 11:36:35,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 7: [2022-11-26 11:36:35,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 7: [2022-11-26 11:36:35,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:36:35,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 11:36:35,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 7: [2022-11-26 11:36:35,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 11:36:35,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 11:36:35,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 2: [2022-11-26 11:36:35,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 11:36:35,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 11:36:35,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 12: [2022-11-26 11:36:35,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 11:36:35,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 11:36:35,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 2: [2022-11-26 11:36:35,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 11:36:35,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 11:36:35,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 6: [2022-11-26 11:36:35,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 11:36:35,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 11:36:35,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 10: [2022-11-26 11:36:35,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 11:36:35,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:36:35,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 11:36:35,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 11:36:35,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 10: [2022-11-26 11:36:35,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 3: [2022-11-26 11:36:35,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:36:35,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 11:36:35,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 11:36:35,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 11:36:35,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 3: [2022-11-26 11:36:35,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 12: [2022-11-26 11:36:35,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 11:36:35,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 11:36:35,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 8: [2022-11-26 11:36:35,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 11:36:35,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 11:36:35,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 10: [2022-11-26 11:36:35,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:36:35,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:36:35,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 11:36:35,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 11:36:35,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 11:36:35,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 11:36:35,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
10: [2022-11-26 11:36:35,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 10: [2022-11-26 11:36:35,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 9: [2022-11-26 11:36:35,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:36:35,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 11:36:35,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 9: [2022-11-26 11:36:35,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:36:35,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 11:36:35,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 9: [2022-11-26 11:36:35,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 11:36:35,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 11:36:35,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 4: [2022-11-26 11:36:35,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 11:36:35,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:36:35,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:36:35,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:36:35,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:36:35,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 11:36:35,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 11:36:35,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 11:36:35,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 11:36:35,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 11:36:35,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 11:36:35,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 11:36:35,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 4: [2022-11-26 11:36:35,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 4: [2022-11-26 11:36:35,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 4: [2022-11-26 11:36:35,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 11:36:35,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 4: [2022-11-26 11:36:35,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 4: [2022-11-26 11:36:35,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 11:36:35,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 4: [2022-11-26 11:36:35,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 4: [2022-11-26 11:36:35,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 11:36:35,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 11:36:35,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 14: [2022-11-26 11:36:35,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:36:35,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 11:36:35,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 14: [2022-11-26 11:36:35,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:36:35,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:36:35,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 11:36:35,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 11:36:35,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 14: [2022-11-26 11:36:35,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 14: [2022-11-26 11:36:35,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 11:36:35,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:36:35,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:36:35,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 11:36:35,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:36:35,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 14: [2022-11-26 11:36:35,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 11:36:35,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 11:36:35,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 11:36:35,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 14: [2022-11-26 11:36:35,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 11:36:35,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 14: [2022-11-26 11:36:35,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
14: [2022-11-26 11:36:35,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 11:36:35,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 15: [2022-11-26 11:36:35,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:36:35,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:36:35,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:36:35,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:36:35,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:36:35,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 11:36:35,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 11:36:35,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 11:36:35,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 11:36:35,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 11:36:35,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 11:36:35,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 11:36:35,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 15: [2022-11-26 11:36:35,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:36:35,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 11:36:35,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 15: [2022-11-26 11:36:35,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 15: [2022-11-26 11:36:35,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
15: [2022-11-26 11:36:35,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 15: [2022-11-26 11:36:35,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 15: [2022-11-26 11:36:35,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 11:36:35,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 11:36:35,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 15: [2022-11-26 11:36:35,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 13: [2022-11-26 11:36:36,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:36:36,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:36:36,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 11:36:36,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 11:36:36,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 11:36:36,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 11:36:36,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 13: [2022-11-26 11:36:36,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 13: [2022-11-26 11:36:36,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 13: [2022-11-26 11:36:36,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:36:36,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 11:36:36,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 13: [2022-11-26 11:36:36,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:36:36,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 11:36:36,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 11:36:36,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 11:36:36,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 13: [2022-11-26 11:36:36,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 13: [2022-11-26 11:36:36,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:36:36,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 11:36:36,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 11:36:36,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 11:36:36,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 13: [2022-11-26 11:36:36,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 1: [2022-11-26 11:36:36,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:36:36,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 11:36:36,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:36:36,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 11:36:36,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 11:36:36,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:36:36,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 1: [2022-11-26 11:36:36,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 1: [2022-11-26 11:36:36,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 11:36:36,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 11:36:36,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 1: [2022-11-26 11:36:36,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 1: [2022-11-26 11:36:36,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:36:36,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 11:36:36,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:36:36,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 11:36:36,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 11:36:36,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 1: [2022-11-26 11:36:36,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 1: [2022-11-26 11:36:36,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 11:36:36,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 11:36:36,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 1: [2022-11-26 11:36:36,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step53000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 11:36:36,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step53000 is ready now! 
0: successfully saved checkpoint at iteration 53000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3911.66 15: iteration 53010/ 125429 | consumed samples: 13570560 | consumed tokens: 27792506880 | elapsed time per iteration (s): 1.46 | learning rate: 1.333E-04 | global batch size: 256 | lm loss: 2.008418E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 175.882 | TFLOPs: 29.07 | 15: iteration 53020/ 125429 | consumed samples: 13573120 | consumed tokens: 27797749760 | elapsed time per iteration (s): 1.04 | learning rate: 1.332E-04 | global batch size: 256 | lm loss: 2.021519E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.068 | TFLOPs: 40.66 | 15: iteration 53030/ 125429 | consumed samples: 13575680 | consumed tokens: 27802992640 | elapsed time per iteration (s): 1.03 | learning rate: 1.332E-04 | global batch size: 256 | lm loss: 2.009321E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.225 | TFLOPs: 41.02 | 15: iteration 53040/ 125429 | consumed samples: 13578240 | consumed tokens: 27808235520 | elapsed time per iteration (s): 1.05 | learning rate: 1.332E-04 | global batch size: 256 | lm loss: 2.036835E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.341 | TFLOPs: 40.21 | 15: iteration 53050/ 125429 | consumed samples: 13580800 | consumed tokens: 27813478400 | elapsed time per iteration (s): 1.04 | learning rate: 1.332E-04 | global batch size: 256 | lm loss: 1.999989E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.944 | TFLOPs: 40.64 | 15: iteration 53060/ 125429 | consumed samples: 13583360 | consumed tokens: 27818721280 | elapsed time per iteration (s): 1.06 | 
learning rate: 1.331E-04 | global batch size: 256 | lm loss: 2.015891E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.364 | TFLOPs: 40.05 | 15: iteration 53070/ 125429 | consumed samples: 13585920 | consumed tokens: 27823964160 | elapsed time per iteration (s): 1.04 | learning rate: 1.331E-04 | global batch size: 256 | lm loss: 2.024146E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.259 | TFLOPs: 40.70 | 15: iteration 53080/ 125429 | consumed samples: 13588480 | consumed tokens: 27829207040 | elapsed time per iteration (s): 1.05 | learning rate: 1.331E-04 | global batch size: 256 | lm loss: 1.996259E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.885 | TFLOPs: 40.14 | 15: iteration 53090/ 125429 | consumed samples: 13591040 | consumed tokens: 27834449920 | elapsed time per iteration (s): 1.03 | learning rate: 1.331E-04 | global batch size: 256 | lm loss: 2.030980E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.680 | TFLOPs: 40.93 | 15: iteration 53100/ 125429 | consumed samples: 13593600 | consumed tokens: 27839692800 | elapsed time per iteration (s): 1.08 | learning rate: 1.331E-04 | global batch size: 256 | lm loss: 2.005602E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.583 | TFLOPs: 39.26 | 15: iteration 53110/ 125429 | consumed samples: 13596160 | consumed tokens: 27844935680 | elapsed time per iteration (s): 1.07 | learning rate: 1.330E-04 | global batch size: 256 | lm loss: 2.017710E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.848 | TFLOPs: 39.64 | 15: iteration 53120/ 
125429 | consumed samples: 13598720 | consumed tokens: 27850178560 | elapsed time per iteration (s): 1.05 | learning rate: 1.330E-04 | global batch size: 256 | lm loss: 2.002615E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.409 | TFLOPs: 40.23 | 15: iteration 53130/ 125429 | consumed samples: 13601280 | consumed tokens: 27855421440 | elapsed time per iteration (s): 1.04 | learning rate: 1.330E-04 | global batch size: 256 | lm loss: 1.993041E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.030 | TFLOPs: 40.82 | 15: iteration 53140/ 125429 | consumed samples: 13603840 | consumed tokens: 27860664320 | elapsed time per iteration (s): 1.05 | learning rate: 1.330E-04 | global batch size: 256 | lm loss: 2.036253E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.578 | TFLOPs: 40.42 | 15: iteration 53150/ 125429 | consumed samples: 13606400 | consumed tokens: 27865907200 | elapsed time per iteration (s): 1.12 | learning rate: 1.329E-04 | global batch size: 256 | lm loss: 2.016414E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.503 | TFLOPs: 37.93 | 15: iteration 53160/ 125429 | consumed samples: 13608960 | consumed tokens: 27871150080 | elapsed time per iteration (s): 1.10 | learning rate: 1.329E-04 | global batch size: 256 | lm loss: 2.003613E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.712 | TFLOPs: 38.62 | 15: iteration 53170/ 125429 | consumed samples: 13611520 | consumed tokens: 27876392960 | elapsed time per iteration (s): 1.07 | learning rate: 1.329E-04 | global batch size: 256 | lm loss: 2.006545E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 240.307 | TFLOPs: 39.71 | 15: iteration 53180/ 125429 | consumed samples: 13614080 | consumed tokens: 27881635840 | elapsed time per iteration (s): 1.05 | learning rate: 1.329E-04 | global batch size: 256 | lm loss: 2.018684E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.099 | TFLOPs: 40.17 | 15: iteration 53190/ 125429 | consumed samples: 13616640 | consumed tokens: 27886878720 | elapsed time per iteration (s): 1.08 | learning rate: 1.329E-04 | global batch size: 256 | lm loss: 2.012576E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.054 | TFLOPs: 39.01 | 15: iteration 53200/ 125429 | consumed samples: 13619200 | consumed tokens: 27892121600 | elapsed time per iteration (s): 1.04 | learning rate: 1.328E-04 | global batch size: 256 | lm loss: 2.042054E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.125 | TFLOPs: 40.51 | 15: iteration 53210/ 125429 | consumed samples: 13621760 | consumed tokens: 27897364480 | elapsed time per iteration (s): 44.79 | learning rate: 1.328E-04 | global batch size: 256 | lm loss: 2.005106E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 5.715 | TFLOPs: 0.94 | 15: iteration 53220/ 125429 | consumed samples: 13624320 | consumed tokens: 27902607360 | elapsed time per iteration (s): 17.87 | learning rate: 1.328E-04 | global batch size: 256 | lm loss: 1.983426E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 14.325 | TFLOPs: 2.37 | 15: iteration 53230/ 125429 | consumed samples: 13626880 | consumed tokens: 27907850240 | elapsed time per iteration (s): 1.04 | learning rate: 1.328E-04 | 
global batch size: 256 | lm loss: 2.024813E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.117 | TFLOPs: 40.67 | 15: iteration 53240/ 125429 | consumed samples: 13629440 | consumed tokens: 27913093120 | elapsed time per iteration (s): 1.04 | learning rate: 1.328E-04 | global batch size: 256 | lm loss: 2.000425E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.212 | TFLOPs: 40.85 | 15: iteration 53250/ 125429 | consumed samples: 13632000 | consumed tokens: 27918336000 | elapsed time per iteration (s): 1.05 | learning rate: 1.327E-04 | global batch size: 256 | lm loss: 2.007473E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.691 | TFLOPs: 40.11 | 15: iteration 53260/ 125429 | consumed samples: 13634560 | consumed tokens: 27923578880 | elapsed time per iteration (s): 1.03 | learning rate: 1.327E-04 | global batch size: 256 | lm loss: 1.988755E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.500 | TFLOPs: 41.23 | 15: iteration 53270/ 125429 | consumed samples: 13637120 | consumed tokens: 27928821760 | elapsed time per iteration (s): 1.08 | learning rate: 1.327E-04 | global batch size: 256 | lm loss: 1.996314E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.155 | TFLOPs: 39.03 | 15: iteration 53280/ 125429 | consumed samples: 13639680 | consumed tokens: 27934064640 | elapsed time per iteration (s): 1.08 | learning rate: 1.327E-04 | global batch size: 256 | lm loss: 2.026133E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.606 | TFLOPs: 39.27 | 15: iteration 53290/ 125429 | consumed samples: 
13642240 | consumed tokens: 27939307520 | elapsed time per iteration (s): 1.03 | learning rate: 1.326E-04 | global batch size: 256 | lm loss: 2.015920E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.027 | TFLOPs: 41.15 | 15: iteration 53300/ 125429 | consumed samples: 13644800 | consumed tokens: 27944550400 | elapsed time per iteration (s): 1.07 | learning rate: 1.326E-04 | global batch size: 256 | lm loss: 2.001909E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.334 | TFLOPs: 39.72 | 15: iteration 53310/ 125429 | consumed samples: 13647360 | consumed tokens: 27949793280 | elapsed time per iteration (s): 1.05 | learning rate: 1.326E-04 | global batch size: 256 | lm loss: 2.008980E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.681 | TFLOPs: 40.44 | 15: iteration 53320/ 125429 | consumed samples: 13649920 | consumed tokens: 27955036160 | elapsed time per iteration (s): 1.04 | learning rate: 1.326E-04 | global batch size: 256 | lm loss: 1.995908E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.698 | TFLOPs: 40.60 | 15: iteration 53330/ 125429 | consumed samples: 13652480 | consumed tokens: 27960279040 | elapsed time per iteration (s): 1.03 | learning rate: 1.326E-04 | global batch size: 256 | lm loss: 2.033487E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.477 | TFLOPs: 40.90 | 15: iteration 53340/ 125429 | consumed samples: 13655040 | consumed tokens: 27965521920 | elapsed time per iteration (s): 4.65 | learning rate: 1.325E-04 | global batch size: 256 | lm loss: 2.042088E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 55.080 | TFLOPs: 9.10 | 15: iteration 53350/ 125429 | consumed samples: 13657600 | consumed tokens: 27970764800 | elapsed time per iteration (s): 1.06 | learning rate: 1.325E-04 | global batch size: 256 | lm loss: 2.014488E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.622 | TFLOPs: 40.10 | 15: iteration 53360/ 125429 | consumed samples: 13660160 | consumed tokens: 27976007680 | elapsed time per iteration (s): 1.05 | learning rate: 1.325E-04 | global batch size: 256 | lm loss: 1.997633E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.874 | TFLOPs: 40.30 | 15: iteration 53370/ 125429 | consumed samples: 13662720 | consumed tokens: 27981250560 | elapsed time per iteration (s): 1.07 | learning rate: 1.325E-04 | global batch size: 256 | lm loss: 2.021233E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.889 | TFLOPs: 39.64 | 15: iteration 53380/ 125429 | consumed samples: 13665280 | consumed tokens: 27986493440 | elapsed time per iteration (s): 1.07 | learning rate: 1.324E-04 | global batch size: 256 | lm loss: 2.020391E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.202 | TFLOPs: 39.70 | 15: iteration 53390/ 125429 | consumed samples: 13667840 | consumed tokens: 27991736320 | elapsed time per iteration (s): 1.05 | learning rate: 1.324E-04 | global batch size: 256 | lm loss: 1.985290E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.998 | TFLOPs: 40.16 | 15: iteration 53400/ 125429 | consumed samples: 13670400 | consumed tokens: 27996979200 | elapsed time per iteration (s): 1.05 | learning rate: 1.324E-04 | global batch size: 256 | lm 
loss: 2.005025E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.477 | TFLOPs: 40.24 | 15: iteration 53410/ 125429 | consumed samples: 13672960 | consumed tokens: 28002222080 | elapsed time per iteration (s): 1.05 | learning rate: 1.324E-04 | global batch size: 256 | lm loss: 2.024303E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.454 | TFLOPs: 40.23 | 15: iteration 53420/ 125429 | consumed samples: 13675520 | consumed tokens: 28007464960 | elapsed time per iteration (s): 1.07 | learning rate: 1.324E-04 | global batch size: 256 | lm loss: 2.021589E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.266 | TFLOPs: 39.54 | 15: iteration 53430/ 125429 | consumed samples: 13678080 | consumed tokens: 28012707840 | elapsed time per iteration (s): 1.04 | learning rate: 1.323E-04 | global batch size: 256 | lm loss: 1.995873E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.253 | TFLOPs: 40.86 | 15: iteration 53440/ 125429 | consumed samples: 13680640 | consumed tokens: 28017950720 | elapsed time per iteration (s): 1.03 | learning rate: 1.323E-04 | global batch size: 256 | lm loss: 2.015265E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.171 | TFLOPs: 41.01 | 15: iteration 53450/ 125429 | consumed samples: 13683200 | consumed tokens: 28023193600 | elapsed time per iteration (s): 1.05 | learning rate: 1.323E-04 | global batch size: 256 | lm loss: 2.015690E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.135 | TFLOPs: 40.35 | 15: iteration 53460/ 125429 | consumed samples: 13685760 | consumed tokens: 
28028436480 | elapsed time per iteration (s): 1.04 | learning rate: 1.323E-04 | global batch size: 256 | lm loss: 1.989869E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.895 | TFLOPs: 40.80 | 15: iteration 53470/ 125429 | consumed samples: 13688320 | consumed tokens: 28033679360 | elapsed time per iteration (s): 1.04 | learning rate: 1.322E-04 | global batch size: 256 | lm loss: 2.016356E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.058 | TFLOPs: 40.50 | 15: iteration 53480/ 125429 | consumed samples: 13690880 | consumed tokens: 28038922240 | elapsed time per iteration (s): 1.04 | learning rate: 1.322E-04 | global batch size: 256 | lm loss: 2.009314E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.422 | TFLOPs: 40.56 | 15: iteration 53490/ 125429 | consumed samples: 13693440 | consumed tokens: 28044165120 | elapsed time per iteration (s): 1.03 | learning rate: 1.322E-04 | global batch size: 256 | lm loss: 2.021993E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.175 | TFLOPs: 41.18 | 15: iteration 53500/ 125429 | consumed samples: 13696000 | consumed tokens: 28049408000 | elapsed time per iteration (s): 1.07 | learning rate: 1.322E-04 | global batch size: 256 | lm loss: 2.002422E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.517 | TFLOPs: 39.58 | 15: iteration 53510/ 125429 | consumed samples: 13698560 | consumed tokens: 28054650880 | elapsed time per iteration (s): 1.06 | learning rate: 1.322E-04 | global batch size: 256 | lm loss: 2.008187E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 240.459 | TFLOPs: 39.74 | 15: iteration 53520/ 125429 | consumed samples: 13701120 | consumed tokens: 28059893760 | elapsed time per iteration (s): 1.05 | learning rate: 1.321E-04 | global batch size: 256 | lm loss: 2.015568E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.016 | TFLOPs: 40.33 | 15: iteration 53530/ 125429 | consumed samples: 13703680 | consumed tokens: 28065136640 | elapsed time per iteration (s): 1.03 | learning rate: 1.321E-04 | global batch size: 256 | lm loss: 2.020368E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.993 | TFLOPs: 40.98 | 15: iteration 53540/ 125429 | consumed samples: 13706240 | consumed tokens: 28070379520 | elapsed time per iteration (s): 1.05 | learning rate: 1.321E-04 | global batch size: 256 | lm loss: 2.037988E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.856 | TFLOPs: 40.46 | 15: iteration 53550/ 125429 | consumed samples: 13708800 | consumed tokens: 28075622400 | elapsed time per iteration (s): 1.05 | learning rate: 1.321E-04 | global batch size: 256 | lm loss: 2.025999E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.293 | TFLOPs: 40.21 | 15: iteration 53560/ 125429 | consumed samples: 13711360 | consumed tokens: 28080865280 | elapsed time per iteration (s): 1.04 | learning rate: 1.320E-04 | global batch size: 256 | lm loss: 1.995932E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.988 | TFLOPs: 40.82 | 15: iteration 53570/ 125429 | consumed samples: 13713920 | consumed tokens: 28086108160 | elapsed time per iteration (s): 1.03 | learning rate: 1.320E-04 | global batch size: 256 | lm loss: 1.985478E+00 | grad 
norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.558 | TFLOPs: 40.91 | 15: iteration 53580/ 125429 | consumed samples: 13716480 | consumed tokens: 28091351040 | elapsed time per iteration (s): 1.03 | learning rate: 1.320E-04 | global batch size: 256 | lm loss: 2.008753E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.840 | TFLOPs: 41.12 | 15: iteration 53590/ 125429 | consumed samples: 13719040 | consumed tokens: 28096593920 | elapsed time per iteration (s): 1.09 | learning rate: 1.320E-04 | global batch size: 256 | lm loss: 1.981150E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.558 | TFLOPs: 38.93 | 15: iteration 53600/ 125429 | consumed samples: 13721600 | consumed tokens: 28101836800 | elapsed time per iteration (s): 1.05 | learning rate: 1.320E-04 | global batch size: 256 | lm loss: 2.019168E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.042 | TFLOPs: 40.16 | 15: iteration 53610/ 125429 | consumed samples: 13724160 | consumed tokens: 28107079680 | elapsed time per iteration (s): 1.04 | learning rate: 1.319E-04 | global batch size: 256 | lm loss: 2.011280E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.755 | TFLOPs: 40.61 | 15: iteration 53620/ 125429 | consumed samples: 13726720 | consumed tokens: 28112322560 | elapsed time per iteration (s): 1.05 | learning rate: 1.319E-04 | global batch size: 256 | lm loss: 2.001195E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.152 | TFLOPs: 40.18 | 15: iteration 53630/ 125429 | consumed samples: 13729280 | consumed tokens: 28117565440 | elapsed time 
per iteration (s): 1.03 | learning rate: 1.319E-04 | global batch size: 256 | lm loss: 1.992396E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.233 | TFLOPs: 41.02 | 15: iteration 53640/ 125429 | consumed samples: 13731840 | consumed tokens: 28122808320 | elapsed time per iteration (s): 1.03 | learning rate: 1.319E-04 | global batch size: 256 | lm loss: 2.005303E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.227 | TFLOPs: 41.19 | 15: iteration 53650/ 125429 | consumed samples: 13734400 | consumed tokens: 28128051200 | elapsed time per iteration (s): 1.06 | learning rate: 1.318E-04 | global batch size: 256 | lm loss: 2.026526E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.358 | TFLOPs: 40.05 | 15: iteration 53660/ 125429 | consumed samples: 13736960 | consumed tokens: 28133294080 | elapsed time per iteration (s): 1.07 | learning rate: 1.318E-04 | global batch size: 256 | lm loss: 2.019895E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.397 | TFLOPs: 39.56 | 15: iteration 53670/ 125429 | consumed samples: 13739520 | consumed tokens: 28138536960 | elapsed time per iteration (s): 1.04 | learning rate: 1.318E-04 | global batch size: 256 | lm loss: 2.031153E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.769 | TFLOPs: 40.62 | 15: iteration 53680/ 125429 | consumed samples: 13742080 | consumed tokens: 28143779840 | elapsed time per iteration (s): 1.07 | learning rate: 1.318E-04 | global batch size: 256 | lm loss: 2.003282E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.516 | TFLOPs: 
39.42 | 15: iteration 53690/ 125429 | consumed samples: 13744640 | consumed tokens: 28149022720 | elapsed time per iteration (s): 1.12 | learning rate: 1.318E-04 | global batch size: 256 | lm loss: 2.027371E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.862 | TFLOPs: 37.66 | 15: iteration 53700/ 125429 | consumed samples: 13747200 | consumed tokens: 28154265600 | elapsed time per iteration (s): 1.04 | learning rate: 1.317E-04 | global batch size: 256 | lm loss: 2.037066E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.307 | TFLOPs: 40.70 | 15: iteration 53710/ 125429 | consumed samples: 13749760 | consumed tokens: 28159508480 | elapsed time per iteration (s): 1.03 | learning rate: 1.317E-04 | global batch size: 256 | lm loss: 2.042470E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.558 | TFLOPs: 40.91 | 15: iteration 53720/ 125429 | consumed samples: 13752320 | consumed tokens: 28164751360 | elapsed time per iteration (s): 1.06 | learning rate: 1.317E-04 | global batch size: 256 | lm loss: 2.001259E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.975 | TFLOPs: 39.99 | 15: iteration 53730/ 125429 | consumed samples: 13754880 | consumed tokens: 28169994240 | elapsed time per iteration (s): 1.03 | learning rate: 1.317E-04 | global batch size: 256 | lm loss: 2.016272E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.281 | TFLOPs: 41.03 | 15: iteration 53740/ 125429 | consumed samples: 13757440 | consumed tokens: 28175237120 | elapsed time per iteration (s): 1.05 | learning rate: 1.316E-04 | global batch size: 256 | lm loss: 2.014165E+00 | grad norm: 0.132 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.938 | TFLOPs: 40.15 | 15: iteration 53750/ 125429 | consumed samples: 13760000 | consumed tokens: 28180480000 | elapsed time per iteration (s): 1.06 | learning rate: 1.316E-04 | global batch size: 256 | lm loss: 2.008216E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.499 | TFLOPs: 39.91 | 15: iteration 53760/ 125429 | consumed samples: 13762560 | consumed tokens: 28185722880 | elapsed time per iteration (s): 1.05 | learning rate: 1.316E-04 | global batch size: 256 | lm loss: 1.995013E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.008 | TFLOPs: 40.32 | 15: iteration 53770/ 125429 | consumed samples: 13765120 | consumed tokens: 28190965760 | elapsed time per iteration (s): 1.03 | learning rate: 1.316E-04 | global batch size: 256 | lm loss: 1.979017E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.148 | TFLOPs: 41.17 | 15: iteration 53780/ 125429 | consumed samples: 13767680 | consumed tokens: 28196208640 | elapsed time per iteration (s): 1.04 | learning rate: 1.316E-04 | global batch size: 256 | lm loss: 2.006468E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.118 | TFLOPs: 40.67 | 15: iteration 53790/ 125429 | consumed samples: 13770240 | consumed tokens: 28201451520 | elapsed time per iteration (s): 1.04 | learning rate: 1.315E-04 | global batch size: 256 | lm loss: 1.984533E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.928 | TFLOPs: 40.64 | 15: iteration 53800/ 125429 | consumed samples: 13772800 | consumed tokens: 28206694400 | elapsed time per iteration (s): 1.05 | 
learning rate: 1.315E-04 | global batch size: 256 | lm loss: 2.006776E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.892 | TFLOPs: 40.31 | 15: iteration 53810/ 125429 | consumed samples: 13775360 | consumed tokens: 28211937280 | elapsed time per iteration (s): 1.05 | learning rate: 1.315E-04 | global batch size: 256 | lm loss: 1.993239E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.706 | TFLOPs: 40.44 | 15: iteration 53820/ 125429 | consumed samples: 13777920 | consumed tokens: 28217180160 | elapsed time per iteration (s): 1.05 | learning rate: 1.315E-04 | global batch size: 256 | lm loss: 2.016720E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.771 | TFLOPs: 40.28 | 15: iteration 53830/ 125429 | consumed samples: 13780480 | consumed tokens: 28222423040 | elapsed time per iteration (s): 1.03 | learning rate: 1.314E-04 | global batch size: 256 | lm loss: 2.002556E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.527 | TFLOPs: 41.07 | 15: iteration 53840/ 125429 | consumed samples: 13783040 | consumed tokens: 28227665920 | elapsed time per iteration (s): 1.08 | learning rate: 1.314E-04 | global batch size: 256 | lm loss: 1.998317E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.819 | TFLOPs: 39.14 | 15: iteration 53850/ 125429 | consumed samples: 13785600 | consumed tokens: 28232908800 | elapsed time per iteration (s): 1.07 | learning rate: 1.314E-04 | global batch size: 256 | lm loss: 1.974868E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.432 | TFLOPs: 39.57 | 15: iteration 53860/ 
125429 | consumed samples: 13788160 | consumed tokens: 28238151680 | elapsed time per iteration (s): 1.09 | learning rate: 1.314E-04 | global batch size: 256 | lm loss: 2.005255E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.336 | TFLOPs: 38.89 | 15: iteration 53870/ 125429 | consumed samples: 13790720 | consumed tokens: 28243394560 | elapsed time per iteration (s): 1.10 | learning rate: 1.314E-04 | global batch size: 256 | lm loss: 1.971096E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.741 | TFLOPs: 38.30 | 15: iteration 53880/ 125429 | consumed samples: 13793280 | consumed tokens: 28248637440 | elapsed time per iteration (s): 1.07 | learning rate: 1.313E-04 | global batch size: 256 | lm loss: 2.006896E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.505 | TFLOPs: 39.41 | 15: iteration 53890/ 125429 | consumed samples: 13795840 | consumed tokens: 28253880320 | elapsed time per iteration (s): 1.06 | learning rate: 1.313E-04 | global batch size: 256 | lm loss: 2.013402E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.690 | TFLOPs: 39.94 | 15: iteration 53900/ 125429 | consumed samples: 13798400 | consumed tokens: 28259123200 | elapsed time per iteration (s): 1.06 | learning rate: 1.313E-04 | global batch size: 256 | lm loss: 2.006940E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.638 | TFLOPs: 39.93 | 15: iteration 53910/ 125429 | consumed samples: 13800960 | consumed tokens: 28264366080 | elapsed time per iteration (s): 1.06 | learning rate: 1.313E-04 | global batch size: 256 | lm loss: 2.010227E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 240.719 | TFLOPs: 39.78 | 15: iteration 53920/ 125429 | consumed samples: 13803520 | consumed tokens: 28269608960 | elapsed time per iteration (s): 1.04 | learning rate: 1.313E-04 | global batch size: 256 | lm loss: 1.997022E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.167 | TFLOPs: 40.68 | 15: iteration 53930/ 125429 | consumed samples: 13806080 | consumed tokens: 28274851840 | elapsed time per iteration (s): 1.29 | learning rate: 1.312E-04 | global batch size: 256 | lm loss: 1.992789E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 197.764 | TFLOPs: 32.68 | 15: iteration 53940/ 125429 | consumed samples: 13808640 | consumed tokens: 28280094720 | elapsed time per iteration (s): 1.03 | learning rate: 1.312E-04 | global batch size: 256 | lm loss: 1.994184E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.827 | TFLOPs: 41.12 | 15: iteration 53950/ 125429 | consumed samples: 13811200 | consumed tokens: 28285337600 | elapsed time per iteration (s): 1.04 | learning rate: 1.312E-04 | global batch size: 256 | lm loss: 1.969619E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.242 | TFLOPs: 40.86 | 15: iteration 53960/ 125429 | consumed samples: 13813760 | consumed tokens: 28290580480 | elapsed time per iteration (s): 1.04 | learning rate: 1.312E-04 | global batch size: 256 | lm loss: 2.024325E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.767 | TFLOPs: 40.61 | 15: iteration 53970/ 125429 | consumed samples: 13816320 | consumed tokens: 28295823360 | elapsed time per iteration (s): 1.02 | learning rate: 
1.311E-04 | global batch size: 256 | lm loss: 2.025357E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.292 | TFLOPs: 41.36 | 15: iteration 53980/ 125429 | consumed samples: 13818880 | consumed tokens: 28301066240 | elapsed time per iteration (s): 1.06 | learning rate: 1.311E-04 | global batch size: 256 | lm loss: 2.013163E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.826 | TFLOPs: 39.96 | 15: iteration 53990/ 125429 | consumed samples: 13821440 | consumed tokens: 28306309120 | elapsed time per iteration (s): 1.18 | learning rate: 1.311E-04 | global batch size: 256 | lm loss: 2.012445E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.149 | TFLOPs: 35.72 | 0: [2022-11-26 12:04:53,163] [INFO] [logging.py:68:log_dist] [Rank 0] step=54000, skipped=0, lr=[0.0001310733245107119, 0.0001310733245107119, 0.0001310733245107119], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 54000/ 125429 | consumed samples: 13824000 | consumed tokens: 28311552000 | elapsed time per iteration (s): 1.04 | learning rate: 1.311E-04 | global batch size: 256 | lm loss: 2.007328E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.313 | TFLOPs: 40.87 | 0: steps: 54000 loss: 1.9957 iter time (s): 1.395 samples/sec: 183.457 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 54000 | lm loss value: 1.973321E+00 | lm loss PPL: 7.194531E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 54000 to checkpoints_1b5 0: [2022-11-26 12:04:53,542] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step54000 is 
begin to save! 0: [2022-11-26 12:04:53,551] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_01-model_00-model_states.pt... 0: [2022-11-26 12:04:53,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_01-model_00-model_states.pt. 0: [2022-11-26 12:04:53,848] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_03-model_00-model_states.pt... 0: [2022-11-26 12:04:53,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_03-model_00-model_states.pt. 0: [2022-11-26 12:04:53,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_04-model_00-model_states.pt... 0: [2022-11-26 12:04:54,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_04-model_00-model_states.pt. 0: [2022-11-26 12:04:54,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_05-model_00-model_states.pt... 0: [2022-11-26 12:04:54,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_05-model_00-model_states.pt. 0: [2022-11-26 12:04:54,195] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_06-model_00-model_states.pt... 0: [2022-11-26 12:04:54,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_06-model_00-model_states.pt. 0: [2022-11-26 12:04:54,304] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_07-model_00-model_states.pt... 0: [2022-11-26 12:04:54,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 12:04:54,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_08-model_00-model_states.pt... 0: [2022-11-26 12:04:54,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_08-model_00-model_states.pt. 0: [2022-11-26 12:04:54,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_09-model_00-model_states.pt... 0: [2022-11-26 12:04:54,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_09-model_00-model_states.pt. 0: [2022-11-26 12:04:54,637] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_10-model_00-model_states.pt... 0: [2022-11-26 12:04:54,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_10-model_00-model_states.pt. 0: [2022-11-26 12:04:54,745] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_11-model_00-model_states.pt... 0: [2022-11-26 12:04:54,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_11-model_00-model_states.pt. 0: [2022-11-26 12:04:54,858] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_12-model_00-model_states.pt... 0: [2022-11-26 12:04:54,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_12-model_00-model_states.pt. 0: [2022-11-26 12:04:54,971] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_13-model_00-model_states.pt... 0: [2022-11-26 12:04:55,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 12:04:55,083] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_14-model_00-model_states.pt... 0: [2022-11-26 12:04:55,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_14-model_00-model_states.pt. 0: [2022-11-26 12:04:55,192] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_15-model_00-model_states.pt... 0: [2022-11-26 12:04:55,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_15-model_00-model_states.pt. 0: [2022-11-26 12:04:55,301] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_16-model_00-model_states.pt... 0: [2022-11-26 12:04:55,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_16-model_00-model_states.pt. 0: [2022-11-26 12:04:55,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_17-model_00-model_states.pt... 0: [2022-11-26 12:04:55,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_17-model_00-model_states.pt. 0: [2022-11-26 12:04:55,524] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_18-model_00-model_states.pt... 0: [2022-11-26 12:04:55,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_18-model_00-model_states.pt. 0: [2022-11-26 12:04:55,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_19-model_00-model_states.pt... 0: [2022-11-26 12:04:55,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 12:04:55,740] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_20-model_00-model_states.pt... 0: [2022-11-26 12:04:55,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_20-model_00-model_states.pt. 0: [2022-11-26 12:04:55,844] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_21-model_00-model_states.pt... 0: [2022-11-26 12:04:55,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_21-model_00-model_states.pt. 0: [2022-11-26 12:04:55,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_22-model_00-model_states.pt... 0: [2022-11-26 12:04:56,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_22-model_00-model_states.pt. 0: [2022-11-26 12:04:56,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_23-model_00-model_states.pt... 0: [2022-11-26 12:04:56,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_23-model_00-model_states.pt. 0: [2022-11-26 12:04:56,180] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_24-model_00-model_states.pt... 0: [2022-11-26 12:04:56,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_24-model_00-model_states.pt. 0: [2022-11-26 12:04:56,284] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_25-model_00-model_states.pt... 0: [2022-11-26 12:04:56,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 12:04:56,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_26-model_00-model_states.pt... 0: [2022-11-26 12:04:56,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_26-model_00-model_states.pt. 0: [2022-11-26 12:04:56,509] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_27-model_00-model_states.pt... 0: [2022-11-26 12:04:56,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_27-model_00-model_states.pt. 0: [2022-11-26 12:04:56,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_28-model_00-model_states.pt... 0: [2022-11-26 12:04:56,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_28-model_00-model_states.pt. 0: [2022-11-26 12:04:56,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_29-model_00-model_states.pt... 0: [2022-11-26 12:04:56,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_29-model_00-model_states.pt. 0: [2022-11-26 12:04:56,835] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_30-model_00-model_states.pt... 0: [2022-11-26 12:04:56,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_30-model_00-model_states.pt. 0: [2022-11-26 12:04:56,939] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/layer_32-model_00-model_states.pt... 0: [2022-11-26 12:04:56,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 12:04:56,945] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step54000/mp_rank_00_model_states.pt 0: [2022-11-26 12:04:56,945] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/mp_rank_00_model_states.pt... 0: [2022-11-26 12:04:56,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/mp_rank_00_model_states.pt. 0: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
14: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
8: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
5: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
6: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
12: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
4: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:04:56,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step54000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:04:57,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 12:04:57,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 12:04:57,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 7: [2022-11-26 12:04:57,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:04:57,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 12:04:57,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 0: [2022-11-26 12:04:57,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:04:57,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 12:04:57,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 8: [2022-11-26 12:04:57,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:04:57,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 12:04:57,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 2: [2022-11-26 12:04:57,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 12:04:57,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 12:04:57,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 11: [2022-11-26 12:04:57,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:04:57,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:04:57,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:04:57,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 12:04:57,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 12:04:57,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 0: [2022-11-26 12:04:57,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 0: [2022-11-26 12:04:57,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:04:57,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 12:04:57,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
7: [2022-11-26 12:04:57,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:04:57,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 12:04:57,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 7: [2022-11-26 12:04:57,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:04:57,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 0: [2022-11-26 12:04:57,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:04:57,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 0: [2022-11-26 12:04:57,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 12:04:57,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 14: [2022-11-26 12:04:57,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:04:57,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 12:04:57,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 12:04:57,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:04:57,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:04:57,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 14: [2022-11-26 12:04:57,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 12:04:57,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 12:04:57,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 12:04:57,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 14: [2022-11-26 12:04:57,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 14: [2022-11-26 12:04:57,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 2: [2022-11-26 12:04:57,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 12:04:57,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 12:04:57,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 7: [2022-11-26 12:04:57,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:04:57,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:04:57,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 12:04:57,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 12:04:57,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 7: [2022-11-26 12:04:57,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 9: [2022-11-26 12:04:57,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:04:57,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:04:57,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 12:04:57,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 12:04:57,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 12:04:57,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 12:04:57,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 9: [2022-11-26 12:04:57,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 9: [2022-11-26 12:04:57,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 7: [2022-11-26 12:04:57,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:04:57,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 12:04:57,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 14: [2022-11-26 12:04:57,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:04:57,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 12:04:57,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 12:04:57,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 10: [2022-11-26 12:04:57,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:04:57,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 12:04:57,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 10: [2022-11-26 12:04:57,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:04:57,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:04:57,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:04:57,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 12:04:57,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 12:04:57,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 12:04:57,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
10: [2022-11-26 12:04:57,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 10: [2022-11-26 12:04:57,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 10: [2022-11-26 12:04:57,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:04:57,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 12:04:57,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 9: [2022-11-26 12:04:57,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:04:57,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 12:04:57,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 9: [2022-11-26 12:04:57,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:04:57,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 12:04:57,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
14: [2022-11-26 12:04:57,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 12:04:57,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:04:57,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 14: [2022-11-26 12:04:57,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 11: [2022-11-26 12:04:57,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 14: [2022-11-26 12:04:57,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 11: [2022-11-26 12:04:57,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 11: [2022-11-26 12:04:57,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:04:57,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 12:04:57,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 11: [2022-11-26 12:04:57,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 12:04:57,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 9: [2022-11-26 12:04:57,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:04:57,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 14: [2022-11-26 12:04:57,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:04:57,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:04:57,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 9: [2022-11-26 12:04:57,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:04:57,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 11: [2022-11-26 12:04:57,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 9: [2022-11-26 12:04:57,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 11: [2022-11-26 12:04:57,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
9: [2022-11-26 12:04:57,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 11: [2022-11-26 12:04:57,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 9: [2022-11-26 12:04:57,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 14: [2022-11-26 12:04:57,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 11: [2022-11-26 12:04:57,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 9: [2022-11-26 12:04:57,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 8: [2022-11-26 12:04:57,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:04:57,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:04:57,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 12:04:57,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 12:04:57,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 12:04:57,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 12:04:57,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 8: [2022-11-26 12:04:57,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 8: [2022-11-26 12:04:57,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 8: [2022-11-26 12:04:57,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:04:57,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 12:04:57,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 8: [2022-11-26 12:04:57,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:04:57,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 12:04:57,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
0: [2022-11-26 12:04:57,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:04:57,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 12:04:57,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 11: [2022-11-26 12:04:57,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:04:57,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:04:57,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:04:57,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 12:04:57,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 12:04:57,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 2: [2022-11-26 12:04:57,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 9: [2022-11-26 12:04:57,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 12:04:57,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 12:04:57,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 14: [2022-11-26 12:04:57,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:04:57,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 12:04:57,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 8: [2022-11-26 12:04:57,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:04:57,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:04:57,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 12:04:57,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 12:04:57,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 8: [2022-11-26 12:04:57,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
11: [2022-11-26 12:04:57,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 12:04:57,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 11: [2022-11-26 12:04:57,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:04:57,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 12:04:57,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 11: [2022-11-26 12:04:57,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:04:57,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 12:04:57,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 10: [2022-11-26 12:04:57,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:04:57,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 12:04:57,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 10: [2022-11-26 12:04:57,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 12:04:57,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 12:04:57,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 10: [2022-11-26 12:04:57,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:04:57,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 12:04:57,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 0: [2022-11-26 12:04:57,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:04:57,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 12:04:57,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 0: [2022-11-26 12:04:57,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:04:57,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:04:57,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 12:04:57,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
2: [2022-11-26 12:04:57,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:04:57,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 12:04:57,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 2: [2022-11-26 12:04:57,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:04:57,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 12:04:57,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 12: [2022-11-26 12:04:57,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:04:57,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 12:04:57,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 2: [2022-11-26 12:04:57,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:04:57,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 12:04:57,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
12: [2022-11-26 12:04:57,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:04:57,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 12:04:57,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 12: [2022-11-26 12:04:57,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:04:57,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 12:04:57,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 12: [2022-11-26 12:04:57,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:04:57,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 12:04:57,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 12: [2022-11-26 12:04:57,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:04:57,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 12:04:57,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
13: [2022-11-26 12:04:57,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:04:57,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 12:04:57,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 13: [2022-11-26 12:04:57,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:04:57,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 12:04:57,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 13: [2022-11-26 12:04:57,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:04:57,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 12:04:57,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 13: [2022-11-26 12:04:57,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:04:57,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 12:04:57,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
13: [2022-11-26 12:04:57,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:04:57,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 12:04:57,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 13: [2022-11-26 12:04:57,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:04:57,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 12:04:57,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 13: [2022-11-26 12:04:57,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:04:57,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 12:04:57,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 13: [2022-11-26 12:04:57,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:04:57,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 12:04:57,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
12: [2022-11-26 12:04:57,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:04:57,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 12:04:57,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 12: [2022-11-26 12:04:57,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:04:57,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 12:04:57,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 12: [2022-11-26 12:04:57,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:04:57,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 12:04:57,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 15: [2022-11-26 12:04:57,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:04:57,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 12:04:57,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
15: [2022-11-26 12:04:57,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:04:57,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 12:04:57,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 15: [2022-11-26 12:04:57,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:04:57,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 12:04:57,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 15: [2022-11-26 12:04:57,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:04:57,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 12:04:57,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 15: [2022-11-26 12:04:57,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:04:57,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 12:04:57,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
15: [2022-11-26 12:04:57,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:04:57,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:04:57,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:04:57,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 12:04:57,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 12:04:57,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 15: [2022-11-26 12:04:57,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 15: [2022-11-26 12:04:57,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 12:04:57,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 3: [2022-11-26 12:04:57,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:04:57,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:04:57,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 12:04:57,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 12:04:57,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 12:04:57,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 3: [2022-11-26 12:04:57,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 3: [2022-11-26 12:04:57,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 12:04:57,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 3: [2022-11-26 12:04:57,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:04:57,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 12:04:57,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 3: [2022-11-26 12:04:57,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:04:57,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 12:04:57,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
3: [2022-11-26 12:04:57,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:04:57,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 12:04:57,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 3: [2022-11-26 12:04:57,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:04:57,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 12:04:57,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 3: [2022-11-26 12:04:57,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:04:57,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 12:04:57,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 1: [2022-11-26 12:04:57,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:04:57,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 12:04:57,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 12:04:57,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:04:57,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 1: [2022-11-26 12:04:57,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 12:04:57,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 12:04:57,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 1: [2022-11-26 12:04:57,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 1: [2022-11-26 12:04:57,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:04:57,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 12:04:57,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 1: [2022-11-26 12:04:57,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:04:57,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 12:04:57,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
1: [2022-11-26 12:04:57,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:04:57,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 12:04:57,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 1: [2022-11-26 12:04:57,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:04:57,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 12:04:57,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 1: [2022-11-26 12:04:57,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:04:57,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 12:04:57,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 4: [2022-11-26 12:04:57,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:04:57,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 12:04:57,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 12:04:57,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 12:04:57,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:04:57,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:04:57,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:04:57,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:04:57,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 12:04:57,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 4: [2022-11-26 12:04:57,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:04:57,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 4: [2022-11-26 12:04:57,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 12:04:57,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 12:04:57,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 12:04:57,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 4: [2022-11-26 12:04:57,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 12:04:57,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 12:04:57,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 12:04:57,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 4: [2022-11-26 12:04:57,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 4: [2022-11-26 12:04:57,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 4: [2022-11-26 12:04:57,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 4: [2022-11-26 12:04:57,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 5: [2022-11-26 12:04:57,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 12:04:57,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:04:57,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:04:57,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:04:57,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 12:04:57,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:04:57,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:04:57,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:04:57,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 12:04:57,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
5: [2022-11-26 12:04:57,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 12:04:57,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 12:04:57,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:04:57,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 12:04:57,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 12:04:57,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 5: [2022-11-26 12:04:57,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 12:04:57,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 5: [2022-11-26 12:04:57,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 5: [2022-11-26 12:04:57,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 12:04:57,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 5: [2022-11-26 12:04:57,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
5: [2022-11-26 12:04:57,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 5: [2022-11-26 12:04:57,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 0: [2022-11-26 12:04:57,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 12:04:57,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 6: [2022-11-26 12:04:57,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:04:57,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 12:04:57,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:04:57,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:04:57,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:04:57,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:04:57,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:04:57,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
6: [2022-11-26 12:04:57,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:04:57,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 12:04:57,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 12:04:57,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 12:04:57,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 6: [2022-11-26 12:04:57,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 12:04:57,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 12:04:57,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:04:57,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 12:04:57,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 6: [2022-11-26 12:04:57,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 6: [2022-11-26 12:04:57,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now! 
6: [2022-11-26 12:04:57,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
6: [2022-11-26 12:04:57,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
6: [2022-11-26 12:04:57,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step54000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt
6: [2022-11-26 12:04:57,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step54000 is ready now!
0: successfully saved checkpoint at iteration 54000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3929.14
15: iteration 54010/ 125429 | consumed samples: 13826560 | consumed tokens: 28316794880 | elapsed time per iteration (s): 1.49 | learning rate: 1.311E-04 | global batch size: 256 | lm loss: 1.995210E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 171.359 | TFLOPs: 28.32 |
15: iteration 54020/ 125429 | consumed samples: 13829120 | consumed tokens: 28322037760 | elapsed time per iteration (s): 1.06 | learning rate: 1.310E-04 | global batch size: 256 | lm loss: 1.973698E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.756 | TFLOPs: 39.95 |
15: iteration 54030/ 125429 | consumed samples: 13831680 | consumed tokens: 28327280640 | elapsed time per iteration (s): 1.05 | learning rate: 1.310E-04 | global batch size: 256 | lm loss: 1.972032E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.966 | TFLOPs: 40.32 |
15: iteration 54040/ 125429 | consumed samples: 13834240 | consumed tokens: 28332523520 | elapsed time per iteration (s): 1.06 | learning rate: 1.310E-04 | global batch size: 256 | lm loss: 2.011706E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.244 | TFLOPs: 40.03 |
15: iteration 54050/ 125429 | consumed samples: 13836800 | consumed tokens: 28337766400 | elapsed time per iteration (s): 1.03 | learning rate: 1.310E-04 | global batch size: 256 | lm loss: 1.992288E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.872 | TFLOPs: 41.13 |
15: iteration 54060/ 125429 | consumed samples: 13839360 | consumed tokens: 28343009280 | elapsed time per iteration (s): 1.04 | learning rate: 1.309E-04 | global batch size: 256 | lm loss: 1.994643E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.203 | TFLOPs: 40.85 |
15: iteration 54070/ 125429 | consumed samples: 13841920 | consumed tokens: 28348252160 | elapsed time per iteration (s): 1.04 | learning rate: 1.309E-04 | global batch size: 256 | lm loss: 1.973419E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.259 | TFLOPs: 40.86 |
15: iteration 54080/ 125429 | consumed samples: 13844480 | consumed tokens: 28353495040 | elapsed time per iteration (s): 1.07 | learning rate: 1.309E-04 | global batch size: 256 | lm loss: 2.010538E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.246 | TFLOPs: 39.54 |
15: iteration 54090/ 125429 | consumed samples: 13847040 | consumed tokens: 28358737920 | elapsed time per iteration (s): 1.06 | learning rate: 1.309E-04 | global batch size: 256 | lm loss: 2.033520E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.578 | TFLOPs: 40.09 |
15: iteration 54100/ 125429 | consumed samples: 13849600 | consumed tokens: 28363980800 | elapsed time per iteration (s): 1.05 | learning rate: 1.309E-04 | global batch
size: 256 | lm loss: 2.010893E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.194 | TFLOPs: 40.19 | 15: iteration 54110/ 125429 | consumed samples: 13852160 | consumed tokens: 28369223680 | elapsed time per iteration (s): 1.06 | learning rate: 1.308E-04 | global batch size: 256 | lm loss: 1.972138E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.216 | TFLOPs: 39.86 | 15: iteration 54120/ 125429 | consumed samples: 13854720 | consumed tokens: 28374466560 | elapsed time per iteration (s): 1.06 | learning rate: 1.308E-04 | global batch size: 256 | lm loss: 2.007577E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.451 | TFLOPs: 40.07 | 15: iteration 54130/ 125429 | consumed samples: 13857280 | consumed tokens: 28379709440 | elapsed time per iteration (s): 1.06 | learning rate: 1.308E-04 | global batch size: 256 | lm loss: 2.033363E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.466 | TFLOPs: 40.07 | 15: iteration 54140/ 125429 | consumed samples: 13859840 | consumed tokens: 28384952320 | elapsed time per iteration (s): 1.06 | learning rate: 1.308E-04 | global batch size: 256 | lm loss: 2.009956E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.608 | TFLOPs: 39.93 | 15: iteration 54150/ 125429 | consumed samples: 13862400 | consumed tokens: 28390195200 | elapsed time per iteration (s): 1.06 | learning rate: 1.307E-04 | global batch size: 256 | lm loss: 2.007158E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.749 | TFLOPs: 39.95 | 15: iteration 54160/ 125429 | consumed samples: 13864960 | 
consumed tokens: 28395438080 | elapsed time per iteration (s): 1.06 | learning rate: 1.307E-04 | global batch size: 256 | lm loss: 1.997855E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.578 | TFLOPs: 39.92 | 15: iteration 54170/ 125429 | consumed samples: 13867520 | consumed tokens: 28400680960 | elapsed time per iteration (s): 1.07 | learning rate: 1.307E-04 | global batch size: 256 | lm loss: 2.016235E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.349 | TFLOPs: 39.72 | 15: iteration 54180/ 125429 | consumed samples: 13870080 | consumed tokens: 28405923840 | elapsed time per iteration (s): 1.02 | learning rate: 1.307E-04 | global batch size: 256 | lm loss: 1.991652E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.088 | TFLOPs: 41.33 | 15: iteration 54190/ 125429 | consumed samples: 13872640 | consumed tokens: 28411166720 | elapsed time per iteration (s): 1.05 | learning rate: 1.307E-04 | global batch size: 256 | lm loss: 2.017773E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.312 | TFLOPs: 40.37 | 15: iteration 54200/ 125429 | consumed samples: 13875200 | consumed tokens: 28416409600 | elapsed time per iteration (s): 1.05 | learning rate: 1.306E-04 | global batch size: 256 | lm loss: 2.004084E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.448 | TFLOPs: 40.23 | 15: iteration 54210/ 125429 | consumed samples: 13877760 | consumed tokens: 28421652480 | elapsed time per iteration (s): 1.05 | learning rate: 1.306E-04 | global batch size: 256 | lm loss: 1.994922E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 243.582 | TFLOPs: 40.25 | 15: iteration 54220/ 125429 | consumed samples: 13880320 | consumed tokens: 28426895360 | elapsed time per iteration (s): 1.04 | learning rate: 1.306E-04 | global batch size: 256 | lm loss: 1.988982E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.046 | TFLOPs: 40.83 | 15: iteration 54230/ 125429 | consumed samples: 13882880 | consumed tokens: 28432138240 | elapsed time per iteration (s): 1.05 | learning rate: 1.306E-04 | global batch size: 256 | lm loss: 1.999570E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.756 | TFLOPs: 40.28 | 15: iteration 54240/ 125429 | consumed samples: 13885440 | consumed tokens: 28437381120 | elapsed time per iteration (s): 1.03 | learning rate: 1.305E-04 | global batch size: 256 | lm loss: 1.980432E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.183 | TFLOPs: 41.18 | 15: iteration 54250/ 125429 | consumed samples: 13888000 | consumed tokens: 28442624000 | elapsed time per iteration (s): 1.03 | learning rate: 1.305E-04 | global batch size: 256 | lm loss: 2.013410E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.247 | TFLOPs: 41.02 | 15: iteration 54260/ 125429 | consumed samples: 13890560 | consumed tokens: 28447866880 | elapsed time per iteration (s): 1.04 | learning rate: 1.305E-04 | global batch size: 256 | lm loss: 2.029507E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.155 | TFLOPs: 40.84 | 15: iteration 54270/ 125429 | consumed samples: 13893120 | consumed tokens: 28453109760 | elapsed time per iteration (s): 1.04 | learning rate: 1.305E-04 | global batch size: 256 | lm loss: 
2.016142E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.982 | TFLOPs: 40.65 | 15: iteration 54280/ 125429 | consumed samples: 13895680 | consumed tokens: 28458352640 | elapsed time per iteration (s): 1.03 | learning rate: 1.305E-04 | global batch size: 256 | lm loss: 2.050677E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.443 | TFLOPs: 41.06 | 15: iteration 54290/ 125429 | consumed samples: 13898240 | consumed tokens: 28463595520 | elapsed time per iteration (s): 1.03 | learning rate: 1.304E-04 | global batch size: 256 | lm loss: 2.036339E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.312 | TFLOPs: 41.04 | 15: iteration 54300/ 125429 | consumed samples: 13900800 | consumed tokens: 28468838400 | elapsed time per iteration (s): 1.04 | learning rate: 1.304E-04 | global batch size: 256 | lm loss: 2.014383E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.211 | TFLOPs: 40.85 | 15: iteration 54310/ 125429 | consumed samples: 13903360 | consumed tokens: 28474081280 | elapsed time per iteration (s): 1.02 | learning rate: 1.304E-04 | global batch size: 256 | lm loss: 2.026395E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.792 | TFLOPs: 41.28 | 15: iteration 54320/ 125429 | consumed samples: 13905920 | consumed tokens: 28479324160 | elapsed time per iteration (s): 1.03 | learning rate: 1.304E-04 | global batch size: 256 | lm loss: 1.977318E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.873 | TFLOPs: 41.13 | 15: iteration 54330/ 125429 | consumed samples: 13908480 | consumed tokens: 
28484567040 | elapsed time per iteration (s): 1.06 | learning rate: 1.303E-04 | global batch size: 256 | lm loss: 1.995664E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.315 | TFLOPs: 40.04 | 15: iteration 54340/ 125429 | consumed samples: 13911040 | consumed tokens: 28489809920 | elapsed time per iteration (s): 1.04 | learning rate: 1.303E-04 | global batch size: 256 | lm loss: 1.998430E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.230 | TFLOPs: 40.86 | 15: iteration 54350/ 125429 | consumed samples: 13913600 | consumed tokens: 28495052800 | elapsed time per iteration (s): 1.02 | learning rate: 1.303E-04 | global batch size: 256 | lm loss: 2.019976E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.255 | TFLOPs: 41.36 | 15: iteration 54360/ 125429 | consumed samples: 13916160 | consumed tokens: 28500295680 | elapsed time per iteration (s): 1.04 | learning rate: 1.303E-04 | global batch size: 256 | lm loss: 2.030976E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.383 | TFLOPs: 40.72 | 15: iteration 54370/ 125429 | consumed samples: 13918720 | consumed tokens: 28505538560 | elapsed time per iteration (s): 1.05 | learning rate: 1.303E-04 | global batch size: 256 | lm loss: 2.016358E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.374 | TFLOPs: 40.22 | 15: iteration 54380/ 125429 | consumed samples: 13921280 | consumed tokens: 28510781440 | elapsed time per iteration (s): 1.05 | learning rate: 1.302E-04 | global batch size: 256 | lm loss: 1.984502E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 244.875 | TFLOPs: 40.47 | 15: iteration 54390/ 125429 | consumed samples: 13923840 | consumed tokens: 28516024320 | elapsed time per iteration (s): 1.07 | learning rate: 1.302E-04 | global batch size: 256 | lm loss: 2.007777E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.567 | TFLOPs: 39.59 | 15: iteration 54400/ 125429 | consumed samples: 13926400 | consumed tokens: 28521267200 | elapsed time per iteration (s): 1.08 | learning rate: 1.302E-04 | global batch size: 256 | lm loss: 2.012388E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.724 | TFLOPs: 39.12 | 15: iteration 54410/ 125429 | consumed samples: 13928960 | consumed tokens: 28526510080 | elapsed time per iteration (s): 1.03 | learning rate: 1.302E-04 | global batch size: 256 | lm loss: 1.970018E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.661 | TFLOPs: 41.09 | 15: iteration 54420/ 125429 | consumed samples: 13931520 | consumed tokens: 28531752960 | elapsed time per iteration (s): 1.03 | learning rate: 1.301E-04 | global batch size: 256 | lm loss: 1.985202E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.507 | TFLOPs: 41.23 | 15: iteration 54430/ 125429 | consumed samples: 13934080 | consumed tokens: 28536995840 | elapsed time per iteration (s): 1.05 | learning rate: 1.301E-04 | global batch size: 256 | lm loss: 2.000585E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.926 | TFLOPs: 40.15 | 15: iteration 54440/ 125429 | consumed samples: 13936640 | consumed tokens: 28542238720 | elapsed time per iteration (s): 1.04 | learning rate: 1.301E-04 | global batch size: 256 | lm loss: 2.029654E+00 | grad 
norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.840 | TFLOPs: 40.63 | 15: iteration 54450/ 125429 | consumed samples: 13939200 | consumed tokens: 28547481600 | elapsed time per iteration (s): 1.02 | learning rate: 1.301E-04 | global batch size: 256 | lm loss: 2.025979E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.958 | TFLOPs: 41.31 | 15: iteration 54460/ 125429 | consumed samples: 13941760 | consumed tokens: 28552724480 | elapsed time per iteration (s): 1.05 | learning rate: 1.301E-04 | global batch size: 256 | lm loss: 1.996481E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.968 | TFLOPs: 40.15 | 15: iteration 54470/ 125429 | consumed samples: 13944320 | consumed tokens: 28557967360 | elapsed time per iteration (s): 1.06 | learning rate: 1.300E-04 | global batch size: 256 | lm loss: 2.028954E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.466 | TFLOPs: 40.07 | 15: iteration 54480/ 125429 | consumed samples: 13946880 | consumed tokens: 28563210240 | elapsed time per iteration (s): 1.03 | learning rate: 1.300E-04 | global batch size: 256 | lm loss: 1.990191E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.903 | TFLOPs: 40.97 | 15: iteration 54490/ 125429 | consumed samples: 13949440 | consumed tokens: 28568453120 | elapsed time per iteration (s): 1.03 | learning rate: 1.300E-04 | global batch size: 256 | lm loss: 2.014252E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.926 | TFLOPs: 41.14 | 15: iteration 54500/ 125429 | consumed samples: 13952000 | consumed tokens: 28573696000 | elapsed time 
per iteration (s): 1.04 | learning rate: 1.300E-04 | global batch size: 256 | lm loss: 1.989742E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.119 | TFLOPs: 40.84 | 15: iteration 54510/ 125429 | consumed samples: 13954560 | consumed tokens: 28578938880 | elapsed time per iteration (s): 1.05 | learning rate: 1.299E-04 | global batch size: 256 | lm loss: 2.011623E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.435 | TFLOPs: 40.23 | 15: iteration 54520/ 125429 | consumed samples: 13957120 | consumed tokens: 28584181760 | elapsed time per iteration (s): 1.04 | learning rate: 1.299E-04 | global batch size: 256 | lm loss: 1.994408E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.311 | TFLOPs: 40.54 | 15: iteration 54530/ 125429 | consumed samples: 13959680 | consumed tokens: 28589424640 | elapsed time per iteration (s): 1.04 | learning rate: 1.299E-04 | global batch size: 256 | lm loss: 2.003994E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.243 | TFLOPs: 40.86 | 15: iteration 54540/ 125429 | consumed samples: 13962240 | consumed tokens: 28594667520 | elapsed time per iteration (s): 1.03 | learning rate: 1.299E-04 | global batch size: 256 | lm loss: 1.981714E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.971 | TFLOPs: 40.98 | 15: iteration 54550/ 125429 | consumed samples: 13964800 | consumed tokens: 28599910400 | elapsed time per iteration (s): 1.06 | learning rate: 1.299E-04 | global batch size: 256 | lm loss: 1.992336E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.603 | TFLOPs: 
39.93 | 15: iteration 54560/ 125429 | consumed samples: 13967360 | consumed tokens: 28605153280 | elapsed time per iteration (s): 1.02 | learning rate: 1.298E-04 | global batch size: 256 | lm loss: 2.006207E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.006 | TFLOPs: 41.32 | 15: iteration 54570/ 125429 | consumed samples: 13969920 | consumed tokens: 28610396160 | elapsed time per iteration (s): 1.06 | learning rate: 1.298E-04 | global batch size: 256 | lm loss: 1.985166E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.417 | TFLOPs: 39.90 | 15: iteration 54580/ 125429 | consumed samples: 13972480 | consumed tokens: 28615639040 | elapsed time per iteration (s): 1.04 | learning rate: 1.298E-04 | global batch size: 256 | lm loss: 1.986658E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.040 | TFLOPs: 40.66 | 15: iteration 54590/ 125429 | consumed samples: 13975040 | consumed tokens: 28620881920 | elapsed time per iteration (s): 1.08 | learning rate: 1.298E-04 | global batch size: 256 | lm loss: 2.007388E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.755 | TFLOPs: 39.29 | 15: iteration 54600/ 125429 | consumed samples: 13977600 | consumed tokens: 28626124800 | elapsed time per iteration (s): 2.74 | learning rate: 1.297E-04 | global batch size: 256 | lm loss: 2.028029E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 93.450 | TFLOPs: 15.44 | 15: iteration 54610/ 125429 | consumed samples: 13980160 | consumed tokens: 28631367680 | elapsed time per iteration (s): 1.05 | learning rate: 1.297E-04 | global batch size: 256 | lm loss: 1.992096E+00 | grad norm: 0.122 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.484 | TFLOPs: 40.24 | 15: iteration 54620/ 125429 | consumed samples: 13982720 | consumed tokens: 28636610560 | elapsed time per iteration (s): 1.05 | learning rate: 1.297E-04 | global batch size: 256 | lm loss: 1.999022E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.310 | TFLOPs: 40.21 | 15: iteration 54630/ 125429 | consumed samples: 13985280 | consumed tokens: 28641853440 | elapsed time per iteration (s): 1.05 | learning rate: 1.297E-04 | global batch size: 256 | lm loss: 2.000931E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.401 | TFLOPs: 40.39 | 15: iteration 54640/ 125429 | consumed samples: 13987840 | consumed tokens: 28647096320 | elapsed time per iteration (s): 1.09 | learning rate: 1.297E-04 | global batch size: 256 | lm loss: 2.010567E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.823 | TFLOPs: 38.64 | 15: iteration 54650/ 125429 | consumed samples: 13990400 | consumed tokens: 28652339200 | elapsed time per iteration (s): 1.04 | learning rate: 1.296E-04 | global batch size: 256 | lm loss: 1.974742E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.005 | TFLOPs: 40.65 | 15: iteration 54660/ 125429 | consumed samples: 13992960 | consumed tokens: 28657582080 | elapsed time per iteration (s): 1.03 | learning rate: 1.296E-04 | global batch size: 256 | lm loss: 1.980073E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.487 | TFLOPs: 41.06 | 15: iteration 54670/ 125429 | consumed samples: 13995520 | consumed tokens: 28662824960 | elapsed time per iteration (s): 1.04 | 
learning rate: 1.296E-04 | global batch size: 256 | lm loss: 2.002198E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.826 | TFLOPs: 40.62 | 15: iteration 54680/ 125429 | consumed samples: 13998080 | consumed tokens: 28668067840 | elapsed time per iteration (s): 1.03 | learning rate: 1.296E-04 | global batch size: 256 | lm loss: 1.994743E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.378 | TFLOPs: 41.21 | 15: iteration 54690/ 125429 | consumed samples: 14000640 | consumed tokens: 28673310720 | elapsed time per iteration (s): 1.05 | learning rate: 1.295E-04 | global batch size: 256 | lm loss: 1.992878E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.882 | TFLOPs: 40.14 | 15: iteration 54700/ 125429 | consumed samples: 14003200 | consumed tokens: 28678553600 | elapsed time per iteration (s): 1.04 | learning rate: 1.295E-04 | global batch size: 256 | lm loss: 2.034963E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.583 | TFLOPs: 40.58 | 15: iteration 54710/ 125429 | consumed samples: 14005760 | consumed tokens: 28683796480 | elapsed time per iteration (s): 1.05 | learning rate: 1.295E-04 | global batch size: 256 | lm loss: 2.002967E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.791 | TFLOPs: 40.45 | 15: iteration 54720/ 125429 | consumed samples: 14008320 | consumed tokens: 28689039360 | elapsed time per iteration (s): 1.03 | learning rate: 1.295E-04 | global batch size: 256 | lm loss: 2.000892E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.154 | TFLOPs: 41.01 | 15: iteration 54730/ 
125429 | consumed samples: 14010880 | consumed tokens: 28694282240 | elapsed time per iteration (s): 1.08 | learning rate: 1.295E-04 | global batch size: 256 | lm loss: 2.010283E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.925 | TFLOPs: 39.32 | 15: iteration 54740/ 125429 | consumed samples: 14013440 | consumed tokens: 28699525120 | elapsed time per iteration (s): 1.05 | learning rate: 1.294E-04 | global batch size: 256 | lm loss: 2.015007E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.136 | TFLOPs: 40.35 | 15: iteration 54750/ 125429 | consumed samples: 14016000 | consumed tokens: 28704768000 | elapsed time per iteration (s): 1.06 | learning rate: 1.294E-04 | global batch size: 256 | lm loss: 2.008095E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.371 | TFLOPs: 40.05 | 15: iteration 54760/ 125429 | consumed samples: 14018560 | consumed tokens: 28710010880 | elapsed time per iteration (s): 1.07 | learning rate: 1.294E-04 | global batch size: 256 | lm loss: 1.967501E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.054 | TFLOPs: 39.51 | 15: iteration 54770/ 125429 | consumed samples: 14021120 | consumed tokens: 28715253760 | elapsed time per iteration (s): 1.03 | learning rate: 1.294E-04 | global batch size: 256 | lm loss: 2.029998E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.530 | TFLOPs: 41.24 | 15: iteration 54780/ 125429 | consumed samples: 14023680 | consumed tokens: 28720496640 | elapsed time per iteration (s): 1.05 | learning rate: 1.293E-04 | global batch size: 256 | lm loss: 2.013829E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 243.273 | TFLOPs: 40.20 | 15: iteration 54790/ 125429 | consumed samples: 14026240 | consumed tokens: 28725739520 | elapsed time per iteration (s): 1.05 | learning rate: 1.293E-04 | global batch size: 256 | lm loss: 1.997198E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.680 | TFLOPs: 40.10 | 15: iteration 54800/ 125429 | consumed samples: 14028800 | consumed tokens: 28730982400 | elapsed time per iteration (s): 1.07 | learning rate: 1.293E-04 | global batch size: 256 | lm loss: 2.015379E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.530 | TFLOPs: 39.58 | 15: iteration 54810/ 125429 | consumed samples: 14031360 | consumed tokens: 28736225280 | elapsed time per iteration (s): 1.04 | learning rate: 1.293E-04 | global batch size: 256 | lm loss: 2.022409E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.965 | TFLOPs: 40.81 | 15: iteration 54820/ 125429 | consumed samples: 14033920 | consumed tokens: 28741468160 | elapsed time per iteration (s): 1.07 | learning rate: 1.293E-04 | global batch size: 256 | lm loss: 1.983385E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.293 | TFLOPs: 39.54 | 15: iteration 54830/ 125429 | consumed samples: 14036480 | consumed tokens: 28746711040 | elapsed time per iteration (s): 1.07 | learning rate: 1.292E-04 | global batch size: 256 | lm loss: 1.987232E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.744 | TFLOPs: 39.62 | 15: iteration 54840/ 125429 | consumed samples: 14039040 | consumed tokens: 28751953920 | elapsed time per iteration (s): 1.03 | learning rate: 
1.292E-04 | global batch size: 256 | lm loss: 2.021269E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.786 | TFLOPs: 40.95 | 15: iteration 54850/ 125429 | consumed samples: 14041600 | consumed tokens: 28757196800 | elapsed time per iteration (s): 1.04 | learning rate: 1.292E-04 | global batch size: 256 | lm loss: 1.990454E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.187 | TFLOPs: 40.68 | 15: iteration 54860/ 125429 | consumed samples: 14044160 | consumed tokens: 28762439680 | elapsed time per iteration (s): 1.05 | learning rate: 1.292E-04 | global batch size: 256 | lm loss: 2.010542E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.861 | TFLOPs: 40.13 | 15: iteration 54870/ 125429 | consumed samples: 14046720 | consumed tokens: 28767682560 | elapsed time per iteration (s): 1.04 | learning rate: 1.291E-04 | global batch size: 256 | lm loss: 1.998638E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.145 | TFLOPs: 40.84 | 15: iteration 54880/ 125429 | consumed samples: 14049280 | consumed tokens: 28772925440 | elapsed time per iteration (s): 1.05 | learning rate: 1.291E-04 | global batch size: 256 | lm loss: 1.996156E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.829 | TFLOPs: 40.29 | 15: iteration 54890/ 125429 | consumed samples: 14051840 | consumed tokens: 28778168320 | elapsed time per iteration (s): 1.05 | learning rate: 1.291E-04 | global batch size: 256 | lm loss: 1.992350E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.982 | TFLOPs: 40.15 | 15: iteration 54900/ 125429 | 
consumed samples: 14054400 | consumed tokens: 28783411200 | elapsed time per iteration (s): 1.03 | learning rate: 1.291E-04 | global batch size: 256 | lm loss: 1.988760E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.722 | TFLOPs: 40.94 | 15: iteration 54910/ 125429 | consumed samples: 14056960 | consumed tokens: 28788654080 | elapsed time per iteration (s): 1.04 | learning rate: 1.291E-04 | global batch size: 256 | lm loss: 2.001228E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.385 | TFLOPs: 40.72 | 15: iteration 54920/ 125429 | consumed samples: 14059520 | consumed tokens: 28793896960 | elapsed time per iteration (s): 1.05 | learning rate: 1.290E-04 | global batch size: 256 | lm loss: 2.029743E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.297 | TFLOPs: 40.37 | 15: iteration 54930/ 125429 | consumed samples: 14062080 | consumed tokens: 28799139840 | elapsed time per iteration (s): 1.04 | learning rate: 1.290E-04 | global batch size: 256 | lm loss: 2.033541E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.115 | TFLOPs: 40.84 | 15: iteration 54940/ 125429 | consumed samples: 14064640 | consumed tokens: 28804382720 | elapsed time per iteration (s): 1.06 | learning rate: 1.290E-04 | global batch size: 256 | lm loss: 1.982303E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.892 | TFLOPs: 39.97 | 15: iteration 54950/ 125429 | consumed samples: 14067200 | consumed tokens: 28809625600 | elapsed time per iteration (s): 1.05 | learning rate: 1.290E-04 | global batch size: 256 | lm loss: 1.998177E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 243.774 | TFLOPs: 40.29 | 15: iteration 54960/ 125429 | consumed samples: 14069760 | consumed tokens: 28814868480 | elapsed time per iteration (s): 1.03 | learning rate: 1.289E-04 | global batch size: 256 | lm loss: 2.029045E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.311 | TFLOPs: 41.20 | 15: iteration 54970/ 125429 | consumed samples: 14072320 | consumed tokens: 28820111360 | elapsed time per iteration (s): 1.03 | learning rate: 1.289E-04 | global batch size: 256 | lm loss: 2.020179E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.875 | TFLOPs: 40.96 | 15: iteration 54980/ 125429 | consumed samples: 14074880 | consumed tokens: 28825354240 | elapsed time per iteration (s): 1.03 | learning rate: 1.289E-04 | global batch size: 256 | lm loss: 1.999850E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.268 | TFLOPs: 41.03 | 15: iteration 54990/ 125429 | consumed samples: 14077440 | consumed tokens: 28830597120 | elapsed time per iteration (s): 1.05 | learning rate: 1.289E-04 | global batch size: 256 | lm loss: 2.024710E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.733 | TFLOPs: 40.11 | 15: iteration 55000/ 125429 | consumed samples: 14080000 | consumed tokens: 28835840000 | elapsed time per iteration (s): 1.05 | learning rate: 1.289E-04 | global batch size: 256 | lm loss: 1.982943E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.423 | TFLOPs: 40.23 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 55000 | lm loss value: 1.933530E+00 | lm 
loss PPL: 6.913871E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 55000 to checkpoints_1b5 0: [2022-11-26 12:22:41,122] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step55000 is begin to save! 0: [2022-11-26 12:22:41,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_01-model_00-model_states.pt... 0: [2022-11-26 12:22:41,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_01-model_00-model_states.pt. 0: [2022-11-26 12:22:41,385] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_03-model_00-model_states.pt... 0: [2022-11-26 12:22:41,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_03-model_00-model_states.pt. 0: [2022-11-26 12:22:41,494] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_04-model_00-model_states.pt... 0: [2022-11-26 12:22:41,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_04-model_00-model_states.pt. 0: [2022-11-26 12:22:41,604] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_05-model_00-model_states.pt... 0: [2022-11-26 12:22:41,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_05-model_00-model_states.pt. 0: [2022-11-26 12:22:41,712] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_06-model_00-model_states.pt... 0: [2022-11-26 12:22:41,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_06-model_00-model_states.pt. 
0: [2022-11-26 12:22:41,817] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_07-model_00-model_states.pt... 0: [2022-11-26 12:22:41,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_07-model_00-model_states.pt. 0: [2022-11-26 12:22:41,928] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_08-model_00-model_states.pt... 0: [2022-11-26 12:22:42,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_08-model_00-model_states.pt. 0: [2022-11-26 12:22:42,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_09-model_00-model_states.pt... 0: [2022-11-26 12:22:42,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_09-model_00-model_states.pt. 0: [2022-11-26 12:22:42,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_10-model_00-model_states.pt... 0: [2022-11-26 12:22:42,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_10-model_00-model_states.pt. 0: [2022-11-26 12:22:42,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_11-model_00-model_states.pt... 0: [2022-11-26 12:22:42,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_11-model_00-model_states.pt. 0: [2022-11-26 12:22:42,343] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_12-model_00-model_states.pt... 0: [2022-11-26 12:22:42,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_12-model_00-model_states.pt. 
0: [2022-11-26 12:22:42,450] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_13-model_00-model_states.pt... 0: [2022-11-26 12:22:42,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_13-model_00-model_states.pt. 0: [2022-11-26 12:22:42,551] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_14-model_00-model_states.pt... 0: [2022-11-26 12:22:42,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_14-model_00-model_states.pt. 0: [2022-11-26 12:22:42,659] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_15-model_00-model_states.pt... 0: [2022-11-26 12:22:42,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_15-model_00-model_states.pt. 0: [2022-11-26 12:22:42,768] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_16-model_00-model_states.pt... 0: [2022-11-26 12:22:42,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_16-model_00-model_states.pt. 0: [2022-11-26 12:22:42,875] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_17-model_00-model_states.pt... 0: [2022-11-26 12:22:42,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_17-model_00-model_states.pt. 0: [2022-11-26 12:22:42,981] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_18-model_00-model_states.pt... 0: [2022-11-26 12:22:43,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_18-model_00-model_states.pt. 
0: [2022-11-26 12:22:43,089] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_19-model_00-model_states.pt... 0: [2022-11-26 12:22:43,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_19-model_00-model_states.pt. 0: [2022-11-26 12:22:43,205] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_20-model_00-model_states.pt... 0: [2022-11-26 12:22:43,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_20-model_00-model_states.pt. 0: [2022-11-26 12:22:43,315] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_21-model_00-model_states.pt... 0: [2022-11-26 12:22:43,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_21-model_00-model_states.pt. 0: [2022-11-26 12:22:43,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_22-model_00-model_states.pt... 0: [2022-11-26 12:22:43,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_22-model_00-model_states.pt. 0: [2022-11-26 12:22:43,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_23-model_00-model_states.pt... 0: [2022-11-26 12:22:43,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_23-model_00-model_states.pt. 0: [2022-11-26 12:22:43,647] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_24-model_00-model_states.pt... 0: [2022-11-26 12:22:43,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_24-model_00-model_states.pt. 
0: [2022-11-26 12:22:43,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_25-model_00-model_states.pt... 0: [2022-11-26 12:22:43,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_25-model_00-model_states.pt. 0: [2022-11-26 12:22:43,864] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_26-model_00-model_states.pt... 0: [2022-11-26 12:22:43,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_26-model_00-model_states.pt. 0: [2022-11-26 12:22:43,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_27-model_00-model_states.pt... 0: [2022-11-26 12:22:44,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_27-model_00-model_states.pt. 0: [2022-11-26 12:22:44,074] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_28-model_00-model_states.pt... 0: [2022-11-26 12:22:44,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_28-model_00-model_states.pt. 0: [2022-11-26 12:22:44,180] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_29-model_00-model_states.pt... 0: [2022-11-26 12:22:44,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_29-model_00-model_states.pt. 0: [2022-11-26 12:22:44,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_30-model_00-model_states.pt... 0: [2022-11-26 12:22:44,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 12:22:44,386] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/layer_32-model_00-model_states.pt... 0: [2022-11-26 12:22:44,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/layer_32-model_00-model_states.pt. 0: [2022-11-26 12:22:44,392] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step55000/mp_rank_00_model_states.pt 0: [2022-11-26 12:22:44,392] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/mp_rank_00_model_states.pt... 0: [2022-11-26 12:22:44,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/mp_rank_00_model_states.pt. 0: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
4: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
1: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
15: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
6: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
2: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
11: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
10: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
13: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:22:44,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step55000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
8: [2022-11-26 12:22:44,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:22:44,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 12:22:44,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 12: [2022-11-26 12:22:44,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:22:44,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 12:22:44,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 15: [2022-11-26 12:22:44,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:22:44,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:22:44,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 12:22:44,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 6: [2022-11-26 12:22:44,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 12:22:44,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 12:22:44,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 8: [2022-11-26 12:22:44,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:22:44,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 12:22:44,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 12: [2022-11-26 12:22:44,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:22:44,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 12:22:44,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 1: [2022-11-26 12:22:44,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:22:44,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:22:44,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 12:22:44,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
4: [2022-11-26 12:22:44,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:22:44,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 12: [2022-11-26 12:22:44,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:22:44,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 4: [2022-11-26 12:22:44,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:22:44,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 4: [2022-11-26 12:22:44,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 12:22:44,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 12: [2022-11-26 12:22:44,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 4: [2022-11-26 12:22:44,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:22:44,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 12:22:44,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
12: [2022-11-26 12:22:44,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:22:44,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 12:22:44,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 6: [2022-11-26 12:22:44,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:22:44,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 12:22:44,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 5: [2022-11-26 12:22:44,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:22:44,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:22:44,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 12:22:44,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 12:22:44,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 5: [2022-11-26 12:22:44,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
14: [2022-11-26 12:22:44,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:22:44,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:22:44,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 12:22:44,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 12:22:44,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 14: [2022-11-26 12:22:44,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 8: [2022-11-26 12:22:44,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:22:44,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 12:22:44,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 0: [2022-11-26 12:22:44,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:22:44,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 12:22:44,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
0: [2022-11-26 12:22:44,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:22:44,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 12:22:44,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 5: [2022-11-26 12:22:44,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:22:44,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 12:22:44,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 5: [2022-11-26 12:22:44,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:22:44,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:22:44,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 0: [2022-11-26 12:22:44,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 5: [2022-11-26 12:22:44,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 0: [2022-11-26 12:22:44,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
14: [2022-11-26 12:22:44,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:22:44,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:22:44,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 12:22:44,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 12:22:44,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 14: [2022-11-26 12:22:44,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 12: [2022-11-26 12:22:44,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:22:44,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 12:22:44,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:22:44,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 12: [2022-11-26 12:22:44,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 12:22:44,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
14: [2022-11-26 12:22:44,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:22:44,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 12:22:44,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 15: [2022-11-26 12:22:44,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 12:22:44,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 15: [2022-11-26 12:22:44,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:22:44,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 12:22:44,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 15: [2022-11-26 12:22:44,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:22:44,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 12:22:44,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
1: [2022-11-26 12:22:44,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 12:22:44,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 1: [2022-11-26 12:22:44,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:22:44,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 12:22:44,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 1: [2022-11-26 12:22:44,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:22:44,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 12:22:44,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 5: [2022-11-26 12:22:44,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:22:44,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 1: [2022-11-26 12:22:44,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:22:44,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
1: [2022-11-26 12:22:44,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:22:44,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 12:22:44,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 12:22:44,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 1: [2022-11-26 12:22:44,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 15: [2022-11-26 12:22:44,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:22:44,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:22:44,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 12:22:44,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 15: [2022-11-26 12:22:44,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 12:22:44,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 5: [2022-11-26 12:22:44,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 12:22:44,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 12:22:44,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 8: [2022-11-26 12:22:44,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:22:44,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 12:22:44,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 8: [2022-11-26 12:22:44,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:22:44,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 12:22:44,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 5: [2022-11-26 12:22:44,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:22:44,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 12:22:44,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 14: [2022-11-26 12:22:44,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 12:22:44,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 12:22:44,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 8: [2022-11-26 12:22:44,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:22:44,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 12:22:44,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 15: [2022-11-26 12:22:44,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:22:44,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 12:22:44,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:22:44,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 15: [2022-11-26 12:22:44,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 12:22:44,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 6: [2022-11-26 12:22:44,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 12:22:44,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 12:22:44,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 6: [2022-11-26 12:22:44,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:22:44,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:22:44,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:22:44,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:22:44,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 12:22:44,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 12:22:44,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 12:22:44,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 12:22:44,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 6: [2022-11-26 12:22:44,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
6: [2022-11-26 12:22:44,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 6: [2022-11-26 12:22:44,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 8: [2022-11-26 12:22:44,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:22:44,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 12:22:44,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 10: [2022-11-26 12:22:44,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:22:44,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:22:44,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:22:44,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:22:44,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 12:22:44,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 12:22:44,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 12:22:44,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 12:22:44,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 12:22:44,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 12:22:44,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 10: [2022-11-26 12:22:44,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 10: [2022-11-26 12:22:44,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 10: [2022-11-26 12:22:44,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 10: [2022-11-26 12:22:44,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 12: [2022-11-26 12:22:44,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 12:22:44,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 12:22:44,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 12: [2022-11-26 12:22:44,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:22:44,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 0: [2022-11-26 12:22:44,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:22:44,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 0: [2022-11-26 12:22:44,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:22:44,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:22:44,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 12:22:44,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:22:44,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 12:22:44,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
0: [2022-11-26 12:22:44,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 3: [2022-11-26 12:22:44,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:22:44,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 3: [2022-11-26 12:22:44,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 10: [2022-11-26 12:22:44,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:22:44,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 10: [2022-11-26 12:22:44,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 3: [2022-11-26 12:22:44,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 10: [2022-11-26 12:22:44,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 3: [2022-11-26 12:22:44,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:22:44,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:22:44,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 12:22:44,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:22:44,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:22:44,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 12:22:44,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 12:22:44,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 12:22:44,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 12:22:44,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 12:22:44,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 3: [2022-11-26 12:22:44,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 3: [2022-11-26 12:22:44,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 3: [2022-11-26 12:22:44,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 3: [2022-11-26 12:22:44,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
1: [2022-11-26 12:22:44,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:22:44,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 12:22:44,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 14: [2022-11-26 12:22:44,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:22:44,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 12:22:44,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 5: [2022-11-26 12:22:44,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:22:44,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 12:22:44,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 8: [2022-11-26 12:22:44,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:22:44,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 12:22:44,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
15: [2022-11-26 12:22:44,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:22:44,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:22:44,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 12:22:44,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 15: [2022-11-26 12:22:44,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 12:22:44,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 9: [2022-11-26 12:22:44,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:22:44,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:22:44,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:22:44,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 12:22:44,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 12:22:44,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 12:22:44,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 12:22:44,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 9: [2022-11-26 12:22:44,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 12:22:44,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 9: [2022-11-26 12:22:44,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 9: [2022-11-26 12:22:44,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 9: [2022-11-26 12:22:44,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:22:44,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 12:22:44,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 9: [2022-11-26 12:22:44,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 12:22:44,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 12:22:44,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 9: [2022-11-26 12:22:44,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:22:44,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 12:22:44,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 1: [2022-11-26 12:22:44,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:22:44,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:22:44,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:22:44,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:22:44,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:22:44,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 12:22:44,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 12:22:44,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 12:22:44,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 12:22:44,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 12:22:44,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 12:22:44,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 7: [2022-11-26 12:22:44,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 7: [2022-11-26 12:22:44,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 7: [2022-11-26 12:22:44,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 7: [2022-11-26 12:22:44,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 9: [2022-11-26 12:22:44,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 12:22:44,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 12:22:44,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 1: [2022-11-26 12:22:44,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 12:22:44,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 1: [2022-11-26 12:22:44,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:22:44,646] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 12:22:44,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 7: [2022-11-26 12:22:44,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:22:44,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 12:22:44,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 7: [2022-11-26 12:22:44,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 12:22:44,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 12:22:44,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 7: [2022-11-26 12:22:44,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:22:44,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 12:22:44,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 15: [2022-11-26 12:22:44,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:22:44,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 12:22:44,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 10: [2022-11-26 12:22:44,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:22:44,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-26 12:22:44,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 12:22:44,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 12:22:44,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 10: [2022-11-26 12:22:44,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 4: [2022-11-26 12:22:44,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:22:44,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 12:22:44,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 4: [2022-11-26 12:22:44,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:22:44,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 12:22:44,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 4: [2022-11-26 12:22:44,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 12:22:44,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 12:22:44,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 4: [2022-11-26 12:22:44,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:22:44,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 12:22:44,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 4: [2022-11-26 12:22:44,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:22:44,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 12:22:44,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 13: [2022-11-26 12:22:44,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:22:44,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:22:44,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 12:22:44,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 12:22:44,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 12:22:44,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 13: [2022-11-26 12:22:44,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 12:22:44,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 13: [2022-11-26 12:22:44,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 0: [2022-11-26 12:22:44,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:22:44,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 12:22:44,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 13: [2022-11-26 12:22:44,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:22:44,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 12:22:44,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
13: [2022-11-26 12:22:44,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:22:44,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:22:44,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 12:22:44,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 12:22:44,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 13: [2022-11-26 12:22:44,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 13: [2022-11-26 12:22:44,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:22:44,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 12:22:44,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 13: [2022-11-26 12:22:44,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:22:44,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 12:22:44,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
2: [2022-11-26 12:22:44,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:22:44,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:22:44,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:22:44,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:22:44,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:22:44,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 12:22:44,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 12:22:44,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 12:22:44,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:22:44,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 12:22:44,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 12:22:44,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 12:22:44,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 2: [2022-11-26 12:22:44,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 2: [2022-11-26 12:22:44,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 2: [2022-11-26 12:22:44,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 2: [2022-11-26 12:22:44,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 12:22:44,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 12:22:44,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 2: [2022-11-26 12:22:44,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 2: [2022-11-26 12:22:44,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 2: [2022-11-26 12:22:44,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 12:22:44,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 12:22:44,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 0: [2022-11-26 12:22:44,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 12:22:44,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 11: [2022-11-26 12:22:44,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:22:44,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:22:44,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:22:44,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 12:22:44,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 12:22:44,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 12:22:44,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 11: [2022-11-26 12:22:44,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 
11: [2022-11-26 12:22:44,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 11: [2022-11-26 12:22:44,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:22:44,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 12:22:44,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 11: [2022-11-26 12:22:44,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:22:44,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 12:22:44,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:22:44,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:22:44,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now! 11: [2022-11-26 12:22:44,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 12:22:44,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 12:22:44,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 12:22:44,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now!
11: [2022-11-26 12:22:44,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now!
11: [2022-11-26 12:22:44,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step55000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt
11: [2022-11-26 12:22:44,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step55000 is ready now!
0: successfully saved checkpoint at iteration 55000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3770.74
15: iteration 55010/ 125429 | consumed samples: 14082560 | consumed tokens: 28841082880 | elapsed time per iteration (s): 1.49 | learning rate: 1.288E-04 | global batch size: 256 | lm loss: 2.019730E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 172.094 | TFLOPs: 28.44 |
15: iteration 55020/ 125429 | consumed samples: 14085120 | consumed tokens: 28846325760 | elapsed time per iteration (s): 1.05 | learning rate: 1.288E-04 | global batch size: 256 | lm loss: 1.962306E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.159 | TFLOPs: 40.35 |
15: iteration 55030/ 125429 | consumed samples: 14087680 | consumed tokens: 28851568640 | elapsed time per iteration (s): 1.07 | learning rate: 1.288E-04 | global batch size: 256 | lm loss: 2.009971E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.508 | TFLOPs: 39.58 |
15: iteration 55040/ 125429 | consumed samples: 14090240 | consumed tokens: 28856811520 | elapsed time per iteration (s): 1.06 | learning rate: 1.288E-04 | global batch size: 256 | lm loss: 1.993914E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.359 | TFLOPs: 40.05 |
15: iteration 55050/ 125429 | consumed samples: 14092800 | consumed tokens: 28862054400 | elapsed time per iteration (s): 1.05 | learning rate: 1.287E-04 | global batch size: 256 | lm loss: 2.001731E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.131 | TFLOPs: 40.18 |
15: iteration 55060/ 125429 | consumed samples: 14095360 | consumed tokens: 28867297280 | elapsed time per iteration (s): 1.04 | learning rate: 1.287E-04 | global batch size: 256 | lm loss: 2.003296E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.784 | TFLOPs: 40.62 |
15: iteration 55070/ 125429 | consumed samples: 14097920 | consumed tokens: 28872540160 | elapsed time per iteration (s): 1.03 | learning rate: 1.287E-04 | global batch size: 256 | lm loss: 2.021185E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.192 | TFLOPs: 41.18 |
15: iteration 55080/ 125429 | consumed samples: 14100480 | consumed tokens: 28877783040 | elapsed time per iteration (s): 1.08 | learning rate: 1.287E-04 | global batch size: 256 | lm loss: 1.996923E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.579 | TFLOPs: 39.10 |
15: iteration 55090/ 125429 | consumed samples: 14103040 | consumed tokens: 28883025920 | elapsed time per iteration (s): 1.07 | learning rate: 1.287E-04 | global batch size: 256 | lm loss: 2.032895E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.316 | TFLOPs: 39.71 |
15: iteration 55100/ 125429 | consumed samples: 14105600 | consumed tokens: 28888268800 | elapsed time per iteration (s): 1.03 | learning rate: 1.286E-04 | global batch size: 256 | lm loss: 2.005695E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.398 | TFLOPs: 40.88 |
15: iteration 55110/ 125429 | consumed samples: 14108160 | consumed tokens: 28893511680 | elapsed time per iteration (s): 1.03 | learning rate: 1.286E-04 | global batch size: 256 | lm loss: 2.045755E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.289 | TFLOPs: 41.03 |
15: iteration 55120/ 125429 | consumed samples: 14110720 | consumed tokens: 28898754560 | elapsed time per iteration (s): 1.04 | learning rate: 1.286E-04 | global batch size: 256 | lm loss: 1.992043E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.933 | TFLOPs: 40.64 |
15: iteration 55130/ 125429 | consumed samples: 14113280 | consumed tokens: 28903997440 | elapsed time per iteration (s): 1.05 | learning rate: 1.286E-04 | global batch size: 256 | lm loss: 1.985593E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.088 | TFLOPs: 40.34 |
15: iteration 55140/ 125429 | consumed samples: 14115840 | consumed tokens: 28909240320 | elapsed time per iteration (s): 1.04 | learning rate: 1.285E-04 | global batch size: 256 | lm loss: 1.983171E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.555 | TFLOPs: 40.58 |
15: iteration 55150/ 125429 | consumed samples: 14118400 | consumed tokens: 28914483200 | elapsed time per iteration (s): 1.04 | learning rate: 1.285E-04 | global batch size: 256 | lm loss: 2.000229E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.022 | TFLOPs: 40.82 |
15: iteration 55160/ 125429 | consumed samples: 14120960 | consumed tokens: 28919726080 | elapsed time per iteration (s): 1.04 | learning rate: 1.285E-04 | global batch size: 256 | lm loss: 2.003608E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.079 | TFLOPs: 40.83 |
15: iteration 55170/ 125429 | consumed samples: 14123520 | consumed tokens: 28924968960 | elapsed time per iteration (s): 1.03 | learning rate: 1.285E-04 | global batch size: 256 | lm loss: 2.007370E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.303 | TFLOPs: 41.20 |
15: iteration 55180/ 125429 | consumed samples: 14126080 | consumed tokens: 28930211840 | elapsed time per iteration (s): 1.08 | learning rate: 1.285E-04 | global batch size: 256 | lm loss: 2.016022E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.864 | TFLOPs: 39.14 |
15: iteration 55190/ 125429 | consumed samples: 14128640 | consumed tokens: 28935454720 | elapsed time per iteration (s): 1.06 | learning rate: 1.284E-04 | global batch size: 256 | lm loss: 2.002907E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.046 | TFLOPs: 40.00 |
15: iteration 55200/ 125429 | consumed samples: 14131200 | consumed tokens: 28940697600 | elapsed time per iteration (s): 1.03 | learning rate: 1.284E-04 | global batch size: 256 | lm loss: 2.032802E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.762 | TFLOPs: 41.11 |
15: iteration 55210/ 125429 | consumed samples: 14133760 | consumed tokens: 28945940480 | elapsed time per iteration (s): 1.04 | learning rate: 1.284E-04 | global batch size: 256 | lm loss: 2.023973E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 
0 | samples per second: 247.114 | TFLOPs: 40.84 | 15: iteration 55220/ 125429 | consumed samples: 14136320 | consumed tokens: 28951183360 | elapsed time per iteration (s): 1.05 | learning rate: 1.284E-04 | global batch size: 256 | lm loss: 2.027628E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.830 | TFLOPs: 40.13 | 15: iteration 55230/ 125429 | consumed samples: 14138880 | consumed tokens: 28956426240 | elapsed time per iteration (s): 1.14 | learning rate: 1.283E-04 | global batch size: 256 | lm loss: 2.037527E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 224.707 | TFLOPs: 37.13 | 15: iteration 55240/ 125429 | consumed samples: 14141440 | consumed tokens: 28961669120 | elapsed time per iteration (s): 1.05 | learning rate: 1.283E-04 | global batch size: 256 | lm loss: 1.989437E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.701 | TFLOPs: 40.44 | 15: iteration 55250/ 125429 | consumed samples: 14144000 | consumed tokens: 28966912000 | elapsed time per iteration (s): 1.09 | learning rate: 1.283E-04 | global batch size: 256 | lm loss: 1.979279E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.914 | TFLOPs: 38.99 | 15: iteration 55260/ 125429 | consumed samples: 14146560 | consumed tokens: 28972154880 | elapsed time per iteration (s): 1.04 | learning rate: 1.283E-04 | global batch size: 256 | lm loss: 2.001721E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.076 | TFLOPs: 40.50 | 15: iteration 55270/ 125429 | consumed samples: 14149120 | consumed tokens: 28977397760 | elapsed time per iteration (s): 1.05 | learning rate: 1.283E-04 | global batch size: 256 | lm loss: 
1.996650E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.876 | TFLOPs: 40.14 | 15: iteration 55280/ 125429 | consumed samples: 14151680 | consumed tokens: 28982640640 | elapsed time per iteration (s): 1.04 | learning rate: 1.282E-04 | global batch size: 256 | lm loss: 2.006111E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.674 | TFLOPs: 40.60 | 15: iteration 55290/ 125429 | consumed samples: 14154240 | consumed tokens: 28987883520 | elapsed time per iteration (s): 1.05 | learning rate: 1.282E-04 | global batch size: 256 | lm loss: 2.016955E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.698 | TFLOPs: 40.27 | 15: iteration 55300/ 125429 | consumed samples: 14156800 | consumed tokens: 28993126400 | elapsed time per iteration (s): 1.06 | learning rate: 1.282E-04 | global batch size: 256 | lm loss: 2.022952E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.914 | TFLOPs: 39.81 | 15: iteration 55310/ 125429 | consumed samples: 14159360 | consumed tokens: 28998369280 | elapsed time per iteration (s): 1.18 | learning rate: 1.282E-04 | global batch size: 256 | lm loss: 1.975384E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.463 | TFLOPs: 35.94 | 15: iteration 55320/ 125429 | consumed samples: 14161920 | consumed tokens: 29003612160 | elapsed time per iteration (s): 1.07 | learning rate: 1.281E-04 | global batch size: 256 | lm loss: 2.033371E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.664 | TFLOPs: 39.44 | 15: iteration 55330/ 125429 | consumed samples: 14164480 | consumed tokens: 
29008855040 | elapsed time per iteration (s): 1.06 | learning rate: 1.281E-04 | global batch size: 256 | lm loss: 1.975327E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.140 | TFLOPs: 39.85 | 15: iteration 55340/ 125429 | consumed samples: 14167040 | consumed tokens: 29014097920 | elapsed time per iteration (s): 1.27 | learning rate: 1.281E-04 | global batch size: 256 | lm loss: 2.024517E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 202.233 | TFLOPs: 33.42 | 15: iteration 55350/ 125429 | consumed samples: 14169600 | consumed tokens: 29019340800 | elapsed time per iteration (s): 1.14 | learning rate: 1.281E-04 | global batch size: 256 | lm loss: 1.979014E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 224.999 | TFLOPs: 37.18 | 15: iteration 55360/ 125429 | consumed samples: 14172160 | consumed tokens: 29024583680 | elapsed time per iteration (s): 1.07 | learning rate: 1.281E-04 | global batch size: 256 | lm loss: 2.001153E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.307 | TFLOPs: 39.71 | 15: iteration 55370/ 125429 | consumed samples: 14174720 | consumed tokens: 29029826560 | elapsed time per iteration (s): 1.03 | learning rate: 1.280E-04 | global batch size: 256 | lm loss: 1.992809E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.485 | TFLOPs: 41.23 | 15: iteration 55380/ 125429 | consumed samples: 14177280 | consumed tokens: 29035069440 | elapsed time per iteration (s): 1.06 | learning rate: 1.280E-04 | global batch size: 256 | lm loss: 1.982373E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 241.252 | TFLOPs: 39.87 | 15: iteration 55390/ 125429 | consumed samples: 14179840 | consumed tokens: 29040312320 | elapsed time per iteration (s): 1.04 | learning rate: 1.280E-04 | global batch size: 256 | lm loss: 2.012221E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.266 | TFLOPs: 40.53 | 15: iteration 55400/ 125429 | consumed samples: 14182400 | consumed tokens: 29045555200 | elapsed time per iteration (s): 1.03 | learning rate: 1.280E-04 | global batch size: 256 | lm loss: 2.002973E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.147 | TFLOPs: 41.17 | 15: iteration 55410/ 125429 | consumed samples: 14184960 | consumed tokens: 29050798080 | elapsed time per iteration (s): 1.03 | learning rate: 1.279E-04 | global batch size: 256 | lm loss: 1.989247E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.741 | TFLOPs: 40.94 | 15: iteration 55420/ 125429 | consumed samples: 14187520 | consumed tokens: 29056040960 | elapsed time per iteration (s): 1.15 | learning rate: 1.279E-04 | global batch size: 256 | lm loss: 2.027198E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.905 | TFLOPs: 36.67 | 15: iteration 55430/ 125429 | consumed samples: 14190080 | consumed tokens: 29061283840 | elapsed time per iteration (s): 1.03 | learning rate: 1.279E-04 | global batch size: 256 | lm loss: 1.996808E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.734 | TFLOPs: 41.11 | 15: iteration 55440/ 125429 | consumed samples: 14192640 | consumed tokens: 29066526720 | elapsed time per iteration (s): 1.04 | learning rate: 1.279E-04 | global batch size: 256 | lm loss: 2.018248E+00 | grad 
norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.766 | TFLOPs: 40.61 | 15: iteration 55450/ 125429 | consumed samples: 14195200 | consumed tokens: 29071769600 | elapsed time per iteration (s): 1.15 | learning rate: 1.279E-04 | global batch size: 256 | lm loss: 2.015760E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.699 | TFLOPs: 36.80 | 15: iteration 55460/ 125429 | consumed samples: 14197760 | consumed tokens: 29077012480 | elapsed time per iteration (s): 1.04 | learning rate: 1.278E-04 | global batch size: 256 | lm loss: 2.023518E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.830 | TFLOPs: 40.79 | 15: iteration 55470/ 125429 | consumed samples: 14200320 | consumed tokens: 29082255360 | elapsed time per iteration (s): 1.02 | learning rate: 1.278E-04 | global batch size: 256 | lm loss: 2.001875E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.999 | TFLOPs: 41.48 | 15: iteration 55480/ 125429 | consumed samples: 14202880 | consumed tokens: 29087498240 | elapsed time per iteration (s): 1.05 | learning rate: 1.278E-04 | global batch size: 256 | lm loss: 2.014531E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.778 | TFLOPs: 40.29 | 15: iteration 55490/ 125429 | consumed samples: 14205440 | consumed tokens: 29092741120 | elapsed time per iteration (s): 1.04 | learning rate: 1.278E-04 | global batch size: 256 | lm loss: 1.997249E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.869 | TFLOPs: 40.80 | 15: iteration 55500/ 125429 | consumed samples: 14208000 | consumed tokens: 29097984000 | elapsed time 
per iteration (s): 1.36 | learning rate: 1.277E-04 | global batch size: 256 | lm loss: 1.959161E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 188.389 | TFLOPs: 31.13 | 15: iteration 55510/ 125429 | consumed samples: 14210560 | consumed tokens: 29103226880 | elapsed time per iteration (s): 1.09 | learning rate: 1.277E-04 | global batch size: 256 | lm loss: 1.999680E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.374 | TFLOPs: 38.90 | 15: iteration 55520/ 125429 | consumed samples: 14213120 | consumed tokens: 29108469760 | elapsed time per iteration (s): 1.18 | learning rate: 1.277E-04 | global batch size: 256 | lm loss: 2.019937E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.654 | TFLOPs: 35.97 | 15: iteration 55530/ 125429 | consumed samples: 14215680 | consumed tokens: 29113712640 | elapsed time per iteration (s): 1.07 | learning rate: 1.277E-04 | global batch size: 256 | lm loss: 1.991017E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.630 | TFLOPs: 39.60 | 15: iteration 55540/ 125429 | consumed samples: 14218240 | consumed tokens: 29118955520 | elapsed time per iteration (s): 1.10 | learning rate: 1.276E-04 | global batch size: 256 | lm loss: 2.002025E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.407 | TFLOPs: 38.57 | 15: iteration 55550/ 125429 | consumed samples: 14220800 | consumed tokens: 29124198400 | elapsed time per iteration (s): 1.10 | learning rate: 1.276E-04 | global batch size: 256 | lm loss: 1.985685E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.761 | TFLOPs: 
38.30 | 15: iteration 55560/ 125429 | consumed samples: 14223360 | consumed tokens: 29129441280 | elapsed time per iteration (s): 1.19 | learning rate: 1.276E-04 | global batch size: 256 | lm loss: 2.010192E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.405 | TFLOPs: 35.60 | 15: iteration 55570/ 125429 | consumed samples: 14225920 | consumed tokens: 29134684160 | elapsed time per iteration (s): 1.05 | learning rate: 1.276E-04 | global batch size: 256 | lm loss: 2.007622E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.757 | TFLOPs: 40.12 | 15: iteration 55580/ 125429 | consumed samples: 14228480 | consumed tokens: 29139927040 | elapsed time per iteration (s): 1.06 | learning rate: 1.276E-04 | global batch size: 256 | lm loss: 1.979011E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.442 | TFLOPs: 40.07 | 15: iteration 55590/ 125429 | consumed samples: 14231040 | consumed tokens: 29145169920 | elapsed time per iteration (s): 1.06 | learning rate: 1.275E-04 | global batch size: 256 | lm loss: 2.017668E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.835 | TFLOPs: 39.80 | 15: iteration 55600/ 125429 | consumed samples: 14233600 | consumed tokens: 29150412800 | elapsed time per iteration (s): 1.04 | learning rate: 1.275E-04 | global batch size: 256 | lm loss: 2.004766E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.395 | TFLOPs: 40.72 | 15: iteration 55610/ 125429 | consumed samples: 14236160 | consumed tokens: 29155655680 | elapsed time per iteration (s): 1.05 | learning rate: 1.275E-04 | global batch size: 256 | lm loss: 2.030888E+00 | grad norm: 0.135 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.874 | TFLOPs: 40.47 | 15: iteration 55620/ 125429 | consumed samples: 14238720 | consumed tokens: 29160898560 | elapsed time per iteration (s): 1.10 | learning rate: 1.275E-04 | global batch size: 256 | lm loss: 2.011171E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.010 | TFLOPs: 38.34 | 15: iteration 55630/ 125429 | consumed samples: 14241280 | consumed tokens: 29166141440 | elapsed time per iteration (s): 1.09 | learning rate: 1.274E-04 | global batch size: 256 | lm loss: 2.000528E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.289 | TFLOPs: 38.72 | 15: iteration 55640/ 125429 | consumed samples: 14243840 | consumed tokens: 29171384320 | elapsed time per iteration (s): 1.07 | learning rate: 1.274E-04 | global batch size: 256 | lm loss: 1.978305E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.191 | TFLOPs: 39.69 | 15: iteration 55650/ 125429 | consumed samples: 14246400 | consumed tokens: 29176627200 | elapsed time per iteration (s): 1.04 | learning rate: 1.274E-04 | global batch size: 256 | lm loss: 2.035085E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.282 | TFLOPs: 40.53 | 15: iteration 55660/ 125429 | consumed samples: 14248960 | consumed tokens: 29181870080 | elapsed time per iteration (s): 1.05 | learning rate: 1.274E-04 | global batch size: 256 | lm loss: 2.001560E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.812 | TFLOPs: 40.46 | 15: iteration 55670/ 125429 | consumed samples: 14251520 | consumed tokens: 29187112960 | elapsed time per iteration (s): 1.04 | 
learning rate: 1.274E-04 | global batch size: 256 | lm loss: 2.007608E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.550 | TFLOPs: 40.58 | 15: iteration 55680/ 125429 | consumed samples: 14254080 | consumed tokens: 29192355840 | elapsed time per iteration (s): 1.07 | learning rate: 1.273E-04 | global batch size: 256 | lm loss: 1.993199E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.489 | TFLOPs: 39.58 | 15: iteration 55690/ 125429 | consumed samples: 14256640 | consumed tokens: 29197598720 | elapsed time per iteration (s): 1.05 | learning rate: 1.273E-04 | global batch size: 256 | lm loss: 1.996919E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.458 | TFLOPs: 40.40 | 15: iteration 55700/ 125429 | consumed samples: 14259200 | consumed tokens: 29202841600 | elapsed time per iteration (s): 1.10 | learning rate: 1.273E-04 | global batch size: 256 | lm loss: 1.995869E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.041 | TFLOPs: 38.35 | 15: iteration 55710/ 125429 | consumed samples: 14261760 | consumed tokens: 29208084480 | elapsed time per iteration (s): 1.03 | learning rate: 1.273E-04 | global batch size: 256 | lm loss: 1.983243E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.620 | TFLOPs: 41.09 | 15: iteration 55720/ 125429 | consumed samples: 14264320 | consumed tokens: 29213327360 | elapsed time per iteration (s): 1.04 | learning rate: 1.272E-04 | global batch size: 256 | lm loss: 1.987263E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.923 | TFLOPs: 40.64 | 15: iteration 55730/ 
125429 | consumed samples: 14266880 | consumed tokens: 29218570240 | elapsed time per iteration (s): 1.05 | learning rate: 1.272E-04 | global batch size: 256 | lm loss: 1.974714E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.613 | TFLOPs: 40.42 | 15: iteration 55740/ 125429 | consumed samples: 14269440 | consumed tokens: 29223813120 | elapsed time per iteration (s): 1.10 | learning rate: 1.272E-04 | global batch size: 256 | lm loss: 2.021511E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.438 | TFLOPs: 38.58 | 15: iteration 55750/ 125429 | consumed samples: 14272000 | consumed tokens: 29229056000 | elapsed time per iteration (s): 1.04 | learning rate: 1.272E-04 | global batch size: 256 | lm loss: 2.031933E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.738 | TFLOPs: 40.78 | 15: iteration 55760/ 125429 | consumed samples: 14274560 | consumed tokens: 29234298880 | elapsed time per iteration (s): 1.07 | learning rate: 1.272E-04 | global batch size: 256 | lm loss: 1.991566E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.991 | TFLOPs: 39.66 | 15: iteration 55770/ 125429 | consumed samples: 14277120 | consumed tokens: 29239541760 | elapsed time per iteration (s): 1.05 | learning rate: 1.271E-04 | global batch size: 256 | lm loss: 1.991736E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.834 | TFLOPs: 40.30 | 15: iteration 55780/ 125429 | consumed samples: 14279680 | consumed tokens: 29244784640 | elapsed time per iteration (s): 1.08 | learning rate: 1.271E-04 | global batch size: 256 | lm loss: 2.036292E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 236.158 | TFLOPs: 39.03 | 15: iteration 55790/ 125429 | consumed samples: 14282240 | consumed tokens: 29250027520 | elapsed time per iteration (s): 1.06 | learning rate: 1.271E-04 | global batch size: 256 | lm loss: 2.008829E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.646 | TFLOPs: 39.93 | 15: iteration 55800/ 125429 | consumed samples: 14284800 | consumed tokens: 29255270400 | elapsed time per iteration (s): 1.05 | learning rate: 1.271E-04 | global batch size: 256 | lm loss: 2.012968E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.416 | TFLOPs: 40.23 | 15: iteration 55810/ 125429 | consumed samples: 14287360 | consumed tokens: 29260513280 | elapsed time per iteration (s): 1.07 | learning rate: 1.270E-04 | global batch size: 256 | lm loss: 2.002198E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.733 | TFLOPs: 39.62 | 15: iteration 55820/ 125429 | consumed samples: 14289920 | consumed tokens: 29265756160 | elapsed time per iteration (s): 1.03 | learning rate: 1.270E-04 | global batch size: 256 | lm loss: 2.018505E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.519 | TFLOPs: 40.90 | 15: iteration 55830/ 125429 | consumed samples: 14292480 | consumed tokens: 29270999040 | elapsed time per iteration (s): 1.05 | learning rate: 1.270E-04 | global batch size: 256 | lm loss: 2.013823E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.565 | TFLOPs: 40.25 | 15: iteration 55840/ 125429 | consumed samples: 14295040 | consumed tokens: 29276241920 | elapsed time per iteration (s): 1.03 | learning rate: 
1.270E-04 | global batch size: 256 | lm loss: 2.004922E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.922 | TFLOPs: 40.97 | 15: iteration 55850/ 125429 | consumed samples: 14297600 | consumed tokens: 29281484800 | elapsed time per iteration (s): 1.04 | learning rate: 1.270E-04 | global batch size: 256 | lm loss: 2.026138E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.408 | TFLOPs: 40.56 | 15: iteration 55860/ 125429 | consumed samples: 14300160 | consumed tokens: 29286727680 | elapsed time per iteration (s): 1.02 | learning rate: 1.269E-04 | global batch size: 256 | lm loss: 1.991232E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.347 | TFLOPs: 41.37 | 15: iteration 55870/ 125429 | consumed samples: 14302720 | consumed tokens: 29291970560 | elapsed time per iteration (s): 1.09 | learning rate: 1.269E-04 | global batch size: 256 | lm loss: 1.989210E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.696 | TFLOPs: 38.79 | 15: iteration 55880/ 125429 | consumed samples: 14305280 | consumed tokens: 29297213440 | elapsed time per iteration (s): 1.11 | learning rate: 1.269E-04 | global batch size: 256 | lm loss: 2.006807E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.771 | TFLOPs: 37.97 | 15: iteration 55890/ 125429 | consumed samples: 14307840 | consumed tokens: 29302456320 | elapsed time per iteration (s): 1.06 | learning rate: 1.269E-04 | global batch size: 256 | lm loss: 2.004451E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.607 | TFLOPs: 39.93 | 15: iteration 55900/ 125429 | 
consumed samples: 14310400 | consumed tokens: 29307699200 | elapsed time per iteration (s): 1.05 | learning rate: 1.268E-04 | global batch size: 256 | lm loss: 2.035213E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.373 | TFLOPs: 40.22 | 15: iteration 55910/ 125429 | consumed samples: 14312960 | consumed tokens: 29312942080 | elapsed time per iteration (s): 1.03 | learning rate: 1.268E-04 | global batch size: 256 | lm loss: 1.976255E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.651 | TFLOPs: 41.09 | 15: iteration 55920/ 125429 | consumed samples: 14315520 | consumed tokens: 29318184960 | elapsed time per iteration (s): 1.05 | learning rate: 1.268E-04 | global batch size: 256 | lm loss: 1.983348E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.151 | TFLOPs: 40.18 | 15: iteration 55930/ 125429 | consumed samples: 14318080 | consumed tokens: 29323427840 | elapsed time per iteration (s): 1.05 | learning rate: 1.268E-04 | global batch size: 256 | lm loss: 2.005169E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.965 | TFLOPs: 40.32 | 15: iteration 55940/ 125429 | consumed samples: 14320640 | consumed tokens: 29328670720 | elapsed time per iteration (s): 1.04 | learning rate: 1.268E-04 | global batch size: 256 | lm loss: 1.985077E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.337 | TFLOPs: 40.54 | 15: iteration 55950/ 125429 | consumed samples: 14323200 | consumed tokens: 29333913600 | elapsed time per iteration (s): 1.04 | learning rate: 1.267E-04 | global batch size: 256 | lm loss: 1.979247E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 245.864 | TFLOPs: 40.63 | 15: iteration 55960/ 125429 | consumed samples: 14325760 | consumed tokens: 29339156480 | elapsed time per iteration (s): 1.98 | learning rate: 1.267E-04 | global batch size: 256 | lm loss: 2.005295E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 129.330 | TFLOPs: 21.37 | 15: iteration 55970/ 125429 | consumed samples: 14328320 | consumed tokens: 29344399360 | elapsed time per iteration (s): 1.04 | learning rate: 1.267E-04 | global batch size: 256 | lm loss: 2.005486E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.938 | TFLOPs: 40.81 | 15: iteration 55980/ 125429 | consumed samples: 14330880 | consumed tokens: 29349642240 | elapsed time per iteration (s): 1.04 | learning rate: 1.267E-04 | global batch size: 256 | lm loss: 2.022377E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.069 | TFLOPs: 40.83 | 15: iteration 55990/ 125429 | consumed samples: 14333440 | consumed tokens: 29354885120 | elapsed time per iteration (s): 1.02 | learning rate: 1.266E-04 | global batch size: 256 | lm loss: 1.998662E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.902 | TFLOPs: 41.30 | 0: [2022-11-26 12:40:39,861] [INFO] [logging.py:68:log_dist] [Rank 0] step=56000, skipped=0, lr=[0.00012662087986906457, 0.00012662087986906457, 0.00012662087986906457], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 56000/ 125429 | consumed samples: 14336000 | consumed tokens: 29360128000 | elapsed time per iteration (s): 1.03 | learning rate: 1.266E-04 | global batch size: 256 | lm loss: 1.985278E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 247.771 | TFLOPs: 40.95 | 0: steps: 56000 loss: 1.9606 iter time (s): 1.067 samples/sec: 240.012 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 56000 | lm loss value: 1.949695E+00 | lm loss PPL: 7.026542E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 56000 to checkpoints_1b5 0: [2022-11-26 12:40:40,239] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step56000 is begin to save! 0: [2022-11-26 12:40:40,248] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_01-model_00-model_states.pt... 0: [2022-11-26 12:40:40,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_01-model_00-model_states.pt. 0: [2022-11-26 12:40:40,504] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_03-model_00-model_states.pt... 0: [2022-11-26 12:40:40,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_03-model_00-model_states.pt. 0: [2022-11-26 12:40:40,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_04-model_00-model_states.pt... 0: [2022-11-26 12:40:40,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_04-model_00-model_states.pt. 0: [2022-11-26 12:40:40,733] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_05-model_00-model_states.pt... 0: [2022-11-26 12:40:40,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_05-model_00-model_states.pt. 0: [2022-11-26 12:40:40,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_06-model_00-model_states.pt... 
0: [2022-11-26 12:40:40,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_06-model_00-model_states.pt. 0: [2022-11-26 12:40:40,959] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_07-model_00-model_states.pt... 0: [2022-11-26 12:40:41,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_07-model_00-model_states.pt. 0: [2022-11-26 12:40:41,077] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_08-model_00-model_states.pt... 0: [2022-11-26 12:40:41,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_08-model_00-model_states.pt. 0: [2022-11-26 12:40:41,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_09-model_00-model_states.pt... 0: [2022-11-26 12:40:41,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_09-model_00-model_states.pt. 0: [2022-11-26 12:40:41,309] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_10-model_00-model_states.pt... 0: [2022-11-26 12:40:41,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_10-model_00-model_states.pt. 0: [2022-11-26 12:40:41,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_11-model_00-model_states.pt... 0: [2022-11-26 12:40:41,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_11-model_00-model_states.pt. 0: [2022-11-26 12:40:41,540] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_12-model_00-model_states.pt... 
0: [2022-11-26 12:40:41,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_12-model_00-model_states.pt. 0: [2022-11-26 12:40:41,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_13-model_00-model_states.pt... 0: [2022-11-26 12:40:41,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_13-model_00-model_states.pt. 0: [2022-11-26 12:40:41,771] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_14-model_00-model_states.pt... 0: [2022-11-26 12:40:41,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_14-model_00-model_states.pt. 0: [2022-11-26 12:40:41,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_15-model_00-model_states.pt... 0: [2022-11-26 12:40:41,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_15-model_00-model_states.pt. 0: [2022-11-26 12:40:41,992] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_16-model_00-model_states.pt... 0: [2022-11-26 12:40:42,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_16-model_00-model_states.pt. 0: [2022-11-26 12:40:42,100] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_17-model_00-model_states.pt... 0: [2022-11-26 12:40:42,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_17-model_00-model_states.pt. 0: [2022-11-26 12:40:42,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_18-model_00-model_states.pt... 
0: [2022-11-26 12:40:42,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_18-model_00-model_states.pt. 0: [2022-11-26 12:40:42,326] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_19-model_00-model_states.pt... 0: [2022-11-26 12:40:42,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_19-model_00-model_states.pt. 0: [2022-11-26 12:40:42,430] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_20-model_00-model_states.pt... 0: [2022-11-26 12:40:42,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_20-model_00-model_states.pt. 0: [2022-11-26 12:40:42,538] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_21-model_00-model_states.pt... 0: [2022-11-26 12:40:42,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_21-model_00-model_states.pt. 0: [2022-11-26 12:40:42,639] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_22-model_00-model_states.pt... 0: [2022-11-26 12:40:42,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_22-model_00-model_states.pt. 0: [2022-11-26 12:40:42,743] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_23-model_00-model_states.pt... 0: [2022-11-26 12:40:42,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_23-model_00-model_states.pt. 0: [2022-11-26 12:40:42,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_24-model_00-model_states.pt... 
0: [2022-11-26 12:40:42,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_24-model_00-model_states.pt. 0: [2022-11-26 12:40:42,961] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_25-model_00-model_states.pt... 0: [2022-11-26 12:40:43,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_25-model_00-model_states.pt. 0: [2022-11-26 12:40:43,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_26-model_00-model_states.pt... 0: [2022-11-26 12:40:43,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_26-model_00-model_states.pt. 0: [2022-11-26 12:40:43,173] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_27-model_00-model_states.pt... 0: [2022-11-26 12:40:43,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_27-model_00-model_states.pt. 0: [2022-11-26 12:40:43,284] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_28-model_00-model_states.pt... 0: [2022-11-26 12:40:43,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_28-model_00-model_states.pt. 0: [2022-11-26 12:40:43,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_29-model_00-model_states.pt... 0: [2022-11-26 12:40:43,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_29-model_00-model_states.pt. 0: [2022-11-26 12:40:43,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_30-model_00-model_states.pt... 
0: [2022-11-26 12:40:43,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_30-model_00-model_states.pt. 0: [2022-11-26 12:40:43,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/layer_32-model_00-model_states.pt... 0: [2022-11-26 12:40:43,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/layer_32-model_00-model_states.pt. 0: [2022-11-26 12:40:43,611] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step56000/mp_rank_00_model_states.pt 0: [2022-11-26 12:40:43,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/mp_rank_00_model_states.pt... 0: [2022-11-26 12:40:43,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/mp_rank_00_model_states.pt. 0: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
12: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
1: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
5: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
9: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
6: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
13: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
8: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
14: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
8: [2022-11-26 12:40:43,655] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step56000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:40:43,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:40:43,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:40:43,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 12:40:43,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 11: [2022-11-26 12:40:43,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:40:43,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 12:40:43,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 5: [2022-11-26 12:40:43,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:40:43,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 12:40:43,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 2: [2022-11-26 12:40:43,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 12:40:43,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 12:40:43,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:40:43,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:40:43,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 1: [2022-11-26 12:40:43,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 2: [2022-11-26 12:40:43,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 1: [2022-11-26 12:40:43,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 2: [2022-11-26 12:40:43,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 12:40:43,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 2: [2022-11-26 12:40:43,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 7: [2022-11-26 12:40:43,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 12:40:43,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 12:40:43,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 7: [2022-11-26 12:40:43,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:40:43,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:40:43,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 1: [2022-11-26 12:40:43,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:40:43,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 12:40:43,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 7: [2022-11-26 12:40:43,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 5: [2022-11-26 12:40:43,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:40:43,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 12:40:43,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
7: [2022-11-26 12:40:43,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:40:43,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 12:40:43,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 7: [2022-11-26 12:40:43,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:40:43,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 12:40:43,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 7: [2022-11-26 12:40:43,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:40:43,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 12:40:43,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 14: [2022-11-26 12:40:43,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:40:43,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 12:40:43,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
14: [2022-11-26 12:40:43,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:40:43,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 12:40:43,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 14: [2022-11-26 12:40:43,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:40:43,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 12:40:43,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 14: [2022-11-26 12:40:43,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:40:43,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 12:40:43,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 14: [2022-11-26 12:40:43,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:40:43,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 12:40:43,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 12:40:43,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 12:40:43,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 14: [2022-11-26 12:40:43,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 12: [2022-11-26 12:40:43,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:40:43,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 12:40:43,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 12: [2022-11-26 12:40:43,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:40:43,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 12:40:43,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 12: [2022-11-26 12:40:43,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:40:43,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 12:40:43,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 12:40:43,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 12: [2022-11-26 12:40:43,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 12:40:43,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 8: [2022-11-26 12:40:43,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:40:43,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 12:40:43,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 8: [2022-11-26 12:40:43,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:40:43,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 12:40:43,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 8: [2022-11-26 12:40:43,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 12:40:43,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 12:40:43,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 12: [2022-11-26 12:40:43,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:40:43,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:40:43,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:40:43,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:40:43,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 12:40:43,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 12:40:43,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 12:40:43,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 12:40:43,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 12: [2022-11-26 12:40:43,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
12: [2022-11-26 12:40:43,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 12: [2022-11-26 12:40:43,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 8: [2022-11-26 12:40:43,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:40:43,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 12:40:43,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 2: [2022-11-26 12:40:43,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:40:43,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:40:43,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 12:40:43,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 12:40:43,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 2: [2022-11-26 12:40:43,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 7: [2022-11-26 12:40:43,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 12:40:43,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:40:43,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 12:40:43,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 12:40:43,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 7: [2022-11-26 12:40:43,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 9: [2022-11-26 12:40:43,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:40:43,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 12:40:43,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 9: [2022-11-26 12:40:43,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:40:43,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 12:40:43,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 12:40:43,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 12:40:43,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 9: [2022-11-26 12:40:43,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 11: [2022-11-26 12:40:43,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:40:43,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:40:43,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 12:40:43,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 5: [2022-11-26 12:40:43,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:40:43,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 12:40:43,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 5: [2022-11-26 12:40:43,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 12:40:43,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 12:40:43,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 2: [2022-11-26 12:40:43,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:40:43,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 12:40:43,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 6: [2022-11-26 12:40:43,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:40:43,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:40:43,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 12:40:43,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 12:40:43,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 6: [2022-11-26 12:40:43,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 6: [2022-11-26 12:40:43,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 12:40:43,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 12:40:43,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 6: [2022-11-26 12:40:43,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:40:43,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 12:40:43,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 0: [2022-11-26 12:40:43,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:40:43,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:40:43,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 5: [2022-11-26 12:40:43,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 0: [2022-11-26 12:40:43,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 5: [2022-11-26 12:40:43,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 0: [2022-11-26 12:40:43,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 12:40:43,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 12:40:43,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 0: [2022-11-26 12:40:43,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:40:43,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 12:40:43,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 0: [2022-11-26 12:40:43,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:40:43,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 12:40:43,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 15: [2022-11-26 12:40:43,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:40:43,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:40:43,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:40:43,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-26 12:40:43,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:40:43,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:40:43,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:40:43,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 12:40:43,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 6: [2022-11-26 12:40:43,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:40:43,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:40:43,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:40:43,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 12:40:43,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 12:40:43,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 12:40:43,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
6: [2022-11-26 12:40:43,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 6: [2022-11-26 12:40:43,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 8: [2022-11-26 12:40:43,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:40:43,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 12:40:43,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 8: [2022-11-26 12:40:43,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:40:43,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 12:40:43,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 8: [2022-11-26 12:40:43,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:40:43,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 12:40:43,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 8: [2022-11-26 12:40:43,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 12:40:43,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 12:40:43,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 9: [2022-11-26 12:40:43,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:40:43,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 12:40:43,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 1: [2022-11-26 12:40:43,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 12:40:43,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 9: [2022-11-26 12:40:43,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:40:43,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:40:43,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 12:40:43,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 1: [2022-11-26 12:40:43,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 12:40:43,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 9: [2022-11-26 12:40:43,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 1: [2022-11-26 12:40:43,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 1: [2022-11-26 12:40:43,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:40:43,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 12:40:43,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 9: [2022-11-26 12:40:43,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 1: [2022-11-26 12:40:43,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:40:43,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 12:40:43,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 1: [2022-11-26 12:40:43,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 12:40:43,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 12:40:43,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 9: [2022-11-26 12:40:43,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:40:43,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 12:40:43,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 9: [2022-11-26 12:40:43,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:40:43,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 12:40:43,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 2: [2022-11-26 12:40:43,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:40:43,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 12:40:43,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 2: [2022-11-26 12:40:43,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 12:40:43,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 12:40:43,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 3: [2022-11-26 12:40:43,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:40:43,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:40:43,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 12:40:43,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 12:40:43,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 3: [2022-11-26 12:40:43,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 3: [2022-11-26 12:40:43,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:40:43,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 12:40:43,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 3: [2022-11-26 12:40:43,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 12:40:43,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 12:40:43,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 5: [2022-11-26 12:40:43,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:40:43,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 12:40:43,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 5: [2022-11-26 12:40:43,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:40:43,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 12:40:43,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 15: [2022-11-26 12:40:43,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:40:43,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 12:40:43,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 12:40:43,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 12:40:43,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 12:40:43,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 12:40:43,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 12:40:43,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 12:40:43,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 12:40:43,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 12:40:43,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 15: [2022-11-26 12:40:43,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 15: [2022-11-26 12:40:43,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 15: [2022-11-26 12:40:43,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
15: [2022-11-26 12:40:43,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 15: [2022-11-26 12:40:43,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 15: [2022-11-26 12:40:43,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 15: [2022-11-26 12:40:43,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 11: [2022-11-26 12:40:43,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 12:40:43,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 11: [2022-11-26 12:40:43,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:40:43,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 12:40:43,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 11: [2022-11-26 12:40:43,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:40:43,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 1: [2022-11-26 12:40:43,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:40:43,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
11: [2022-11-26 12:40:43,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:40:43,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 12:40:43,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 11: [2022-11-26 12:40:43,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:40:43,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 12:40:43,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 11: [2022-11-26 12:40:43,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:40:43,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:40:43,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 12:40:43,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 12:40:43,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 11: [2022-11-26 12:40:43,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
0: [2022-11-26 12:40:43,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:40:43,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 12:40:43,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 1: [2022-11-26 12:40:43,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 12:40:43,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 0: [2022-11-26 12:40:43,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:40:43,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:40:43,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 12:40:43,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 0: [2022-11-26 12:40:43,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:40:43,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 12:40:43,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
3: [2022-11-26 12:40:43,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:40:43,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 12:40:43,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 3: [2022-11-26 12:40:43,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:40:43,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 12:40:43,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 3: [2022-11-26 12:40:43,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:40:43,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 12:40:43,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 14: [2022-11-26 12:40:43,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:40:43,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 12:40:43,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
14: [2022-11-26 12:40:43,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:40:43,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 12:40:43,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 3: [2022-11-26 12:40:43,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:40:43,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 12:40:43,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 0: [2022-11-26 12:40:44,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 12:40:44,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 13: [2022-11-26 12:40:44,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:40:44,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:40:44,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 12:40:44,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 12:40:44,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 12:40:44,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 13: [2022-11-26 12:40:44,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 13: [2022-11-26 12:40:44,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 12:40:44,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:40:44,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 13: [2022-11-26 12:40:44,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 12:40:44,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 13: [2022-11-26 12:40:44,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:40:44,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:40:44,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:40:44,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 12:40:44,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 12:40:44,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 12:40:44,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 12:40:44,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 12:40:44,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 13: [2022-11-26 12:40:44,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 13: [2022-11-26 12:40:44,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 13: [2022-11-26 12:40:44,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 10: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:40:44,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 12:40:44,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 10: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 10: [2022-11-26 12:40:44,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 12:40:44,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 12:40:44,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 10: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 10: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 10: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:40:44,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 12:40:44,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 4: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 12:40:44,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 12:40:44,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 12:40:44,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 12:40:44,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 10: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:40:44,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 12:40:44,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 10: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 10: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 4: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
10: [2022-11-26 12:40:44,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 4: [2022-11-26 12:40:44,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 4: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 4: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 4: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 4: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 10: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 4: [2022-11-26 12:40:44,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 4: [2022-11-26 12:40:44,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:40:44,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step56000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 12:40:44,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step56000 is ready now! 
0: successfully saved checkpoint at iteration 56000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3921.42 15: iteration 56010/ 125429 | consumed samples: 14338560 | consumed tokens: 29365370880 | elapsed time per iteration (s): 1.50 | learning rate: 1.266E-04 | global batch size: 256 | lm loss: 2.029479E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 170.755 | TFLOPs: 28.22 | 15: iteration 56020/ 125429 | consumed samples: 14341120 | consumed tokens: 29370613760 | elapsed time per iteration (s): 1.05 | learning rate: 1.266E-04 | global batch size: 256 | lm loss: 1.981033E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.395 | TFLOPs: 40.39 | 15: iteration 56030/ 125429 | consumed samples: 14343680 | consumed tokens: 29375856640 | elapsed time per iteration (s): 1.06 | learning rate: 1.266E-04 | global batch size: 256 | lm loss: 2.043799E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.359 | TFLOPs: 39.89 | 15: iteration 56040/ 125429 | consumed samples: 14346240 | consumed tokens: 29381099520 | elapsed time per iteration (s): 1.04 | learning rate: 1.265E-04 | global batch size: 256 | lm loss: 2.033879E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.982 | TFLOPs: 40.49 | 15: iteration 56050/ 125429 | consumed samples: 14348800 | consumed tokens: 29386342400 | elapsed time per iteration (s): 1.03 | learning rate: 1.265E-04 | global batch size: 256 | lm loss: 1.960117E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.871 | TFLOPs: 40.96 | 15: iteration 56060/ 125429 | consumed samples: 14351360 | consumed tokens: 29391585280 | elapsed time per iteration (s): 1.03 | 
learning rate: 1.265E-04 | global batch size: 256 | lm loss: 1.965754E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.093 | TFLOPs: 41.00 | 15: iteration 56070/ 125429 | consumed samples: 14353920 | consumed tokens: 29396828160 | elapsed time per iteration (s): 1.04 | learning rate: 1.265E-04 | global batch size: 256 | lm loss: 2.002613E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.741 | TFLOPs: 40.78 | 15: iteration 56080/ 125429 | consumed samples: 14356480 | consumed tokens: 29402071040 | elapsed time per iteration (s): 1.05 | learning rate: 1.264E-04 | global batch size: 256 | lm loss: 2.018481E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.856 | TFLOPs: 40.13 | 15: iteration 56090/ 125429 | consumed samples: 14359040 | consumed tokens: 29407313920 | elapsed time per iteration (s): 1.05 | learning rate: 1.264E-04 | global batch size: 256 | lm loss: 1.990697E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.837 | TFLOPs: 40.13 | 15: iteration 56100/ 125429 | consumed samples: 14361600 | consumed tokens: 29412556800 | elapsed time per iteration (s): 1.02 | learning rate: 1.264E-04 | global batch size: 256 | lm loss: 2.021124E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.789 | TFLOPs: 41.28 | 15: iteration 56110/ 125429 | consumed samples: 14364160 | consumed tokens: 29417799680 | elapsed time per iteration (s): 1.03 | learning rate: 1.264E-04 | global batch size: 256 | lm loss: 2.008061E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.080 | TFLOPs: 41.16 | 15: iteration 56120/ 
125429 | consumed samples: 14366720 | consumed tokens: 29423042560 | elapsed time per iteration (s): 1.03 | learning rate: 1.264E-04 | global batch size: 256 | lm loss: 2.004316E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.048 | TFLOPs: 40.99 | 15: iteration 56130/ 125429 | consumed samples: 14369280 | consumed tokens: 29428285440 | elapsed time per iteration (s): 1.10 | learning rate: 1.263E-04 | global batch size: 256 | lm loss: 2.003094E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.817 | TFLOPs: 38.47 | 15: iteration 56140/ 125429 | consumed samples: 14371840 | consumed tokens: 29433528320 | elapsed time per iteration (s): 1.05 | learning rate: 1.263E-04 | global batch size: 256 | lm loss: 2.013848E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.797 | TFLOPs: 40.29 | 15: iteration 56150/ 125429 | consumed samples: 14374400 | consumed tokens: 29438771200 | elapsed time per iteration (s): 1.04 | learning rate: 1.263E-04 | global batch size: 256 | lm loss: 2.013069E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.000 | TFLOPs: 40.82 | 15: iteration 56160/ 125429 | consumed samples: 14376960 | consumed tokens: 29444014080 | elapsed time per iteration (s): 1.08 | learning rate: 1.263E-04 | global batch size: 256 | lm loss: 2.014140E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.738 | TFLOPs: 39.29 | 15: iteration 56170/ 125429 | consumed samples: 14379520 | consumed tokens: 29449256960 | elapsed time per iteration (s): 1.03 | learning rate: 1.262E-04 | global batch size: 256 | lm loss: 2.016378E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 248.297 | TFLOPs: 41.03 | 15: iteration 56180/ 125429 | consumed samples: 14382080 | consumed tokens: 29454499840 | elapsed time per iteration (s): 1.07 | learning rate: 1.262E-04 | global batch size: 256 | lm loss: 1.987066E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.829 | TFLOPs: 39.63 | 15: iteration 56190/ 125429 | consumed samples: 14384640 | consumed tokens: 29459742720 | elapsed time per iteration (s): 1.04 | learning rate: 1.262E-04 | global batch size: 256 | lm loss: 1.965399E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.320 | TFLOPs: 40.87 | 15: iteration 56200/ 125429 | consumed samples: 14387200 | consumed tokens: 29464985600 | elapsed time per iteration (s): 1.05 | learning rate: 1.262E-04 | global batch size: 256 | lm loss: 1.992068E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.334 | TFLOPs: 40.38 | 15: iteration 56210/ 125429 | consumed samples: 14389760 | consumed tokens: 29470228480 | elapsed time per iteration (s): 1.02 | learning rate: 1.262E-04 | global batch size: 256 | lm loss: 1.986182E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.288 | TFLOPs: 41.53 | 15: iteration 56220/ 125429 | consumed samples: 14392320 | consumed tokens: 29475471360 | elapsed time per iteration (s): 1.02 | learning rate: 1.261E-04 | global batch size: 256 | lm loss: 1.990282E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.895 | TFLOPs: 41.30 | 15: iteration 56230/ 125429 | consumed samples: 14394880 | consumed tokens: 29480714240 | elapsed time per iteration (s): 1.04 | learning rate: 
1.261E-04 | global batch size: 256 | lm loss: 2.025565E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.581 | TFLOPs: 40.58 | 15: iteration 56240/ 125429 | consumed samples: 14397440 | consumed tokens: 29485957120 | elapsed time per iteration (s): 1.04 | learning rate: 1.261E-04 | global batch size: 256 | lm loss: 1.988198E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.793 | TFLOPs: 40.78 | 15: iteration 56250/ 125429 | consumed samples: 14400000 | consumed tokens: 29491200000 | elapsed time per iteration (s): 1.03 | learning rate: 1.261E-04 | global batch size: 256 | lm loss: 1.998762E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.112 | TFLOPs: 41.17 | 15: iteration 56260/ 125429 | consumed samples: 14402560 | consumed tokens: 29496442880 | elapsed time per iteration (s): 1.03 | learning rate: 1.260E-04 | global batch size: 256 | lm loss: 1.997642E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.763 | TFLOPs: 41.11 | 15: iteration 56270/ 125429 | consumed samples: 14405120 | consumed tokens: 29501685760 | elapsed time per iteration (s): 1.08 | learning rate: 1.260E-04 | global batch size: 256 | lm loss: 1.984621E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.048 | TFLOPs: 39.01 | 15: iteration 56280/ 125429 | consumed samples: 14407680 | consumed tokens: 29506928640 | elapsed time per iteration (s): 1.04 | learning rate: 1.260E-04 | global batch size: 256 | lm loss: 2.012601E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.126 | TFLOPs: 40.67 | 15: iteration 56290/ 125429 | 
consumed samples: 14410240 | consumed tokens: 29512171520 | elapsed time per iteration (s): 1.02 | learning rate: 1.260E-04 | global batch size: 256 | lm loss: 2.020687E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.305 | TFLOPs: 41.36 | 15: iteration 56300/ 125429 | consumed samples: 14412800 | consumed tokens: 29517414400 | elapsed time per iteration (s): 1.03 | learning rate: 1.259E-04 | global batch size: 256 | lm loss: 1.995252E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.569 | TFLOPs: 41.24 | 15: iteration 56310/ 125429 | consumed samples: 14415360 | consumed tokens: 29522657280 | elapsed time per iteration (s): 1.05 | learning rate: 1.259E-04 | global batch size: 256 | lm loss: 2.008226E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.375 | TFLOPs: 40.22 | 15: iteration 56320/ 125429 | consumed samples: 14417920 | consumed tokens: 29527900160 | elapsed time per iteration (s): 1.02 | learning rate: 1.259E-04 | global batch size: 256 | lm loss: 2.000697E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.547 | TFLOPs: 41.57 | 15: iteration 56330/ 125429 | consumed samples: 14420480 | consumed tokens: 29533143040 | elapsed time per iteration (s): 1.03 | learning rate: 1.259E-04 | global batch size: 256 | lm loss: 1.969727E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.441 | TFLOPs: 40.89 | 15: iteration 56340/ 125429 | consumed samples: 14423040 | consumed tokens: 29538385920 | elapsed time per iteration (s): 1.03 | learning rate: 1.259E-04 | global batch size: 256 | lm loss: 2.015292E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 247.471 | TFLOPs: 40.90 | 15: iteration 56350/ 125429 | consumed samples: 14425600 | consumed tokens: 29543628800 | elapsed time per iteration (s): 1.04 | learning rate: 1.258E-04 | global batch size: 256 | lm loss: 2.011034E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.400 | TFLOPs: 40.55 | 15: iteration 56360/ 125429 | consumed samples: 14428160 | consumed tokens: 29548871680 | elapsed time per iteration (s): 1.05 | learning rate: 1.258E-04 | global batch size: 256 | lm loss: 2.012258E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.025 | TFLOPs: 40.16 | 15: iteration 56370/ 125429 | consumed samples: 14430720 | consumed tokens: 29554114560 | elapsed time per iteration (s): 1.05 | learning rate: 1.258E-04 | global batch size: 256 | lm loss: 2.018089E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.782 | TFLOPs: 40.45 | 15: iteration 56380/ 125429 | consumed samples: 14433280 | consumed tokens: 29559357440 | elapsed time per iteration (s): 1.04 | learning rate: 1.258E-04 | global batch size: 256 | lm loss: 1.978226E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.294 | TFLOPs: 40.54 | 15: iteration 56390/ 125429 | consumed samples: 14435840 | consumed tokens: 29564600320 | elapsed time per iteration (s): 1.03 | learning rate: 1.257E-04 | global batch size: 256 | lm loss: 2.006085E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.774 | TFLOPs: 40.95 | 15: iteration 56400/ 125429 | consumed samples: 14438400 | consumed tokens: 29569843200 | elapsed time per iteration (s): 1.05 | learning rate: 1.257E-04 | global batch 
size: 256 | lm loss: 1.989991E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.443 | TFLOPs: 40.23 | 15: iteration 56410/ 125429 | consumed samples: 14440960 | consumed tokens: 29575086080 | elapsed time per iteration (s): 1.07 | learning rate: 1.257E-04 | global batch size: 256 | lm loss: 2.009028E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.213 | TFLOPs: 39.70 | 15: iteration 56420/ 125429 | consumed samples: 14443520 | consumed tokens: 29580328960 | elapsed time per iteration (s): 1.04 | learning rate: 1.257E-04 | global batch size: 256 | lm loss: 2.009995E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.683 | TFLOPs: 40.77 | 15: iteration 56430/ 125429 | consumed samples: 14446080 | consumed tokens: 29585571840 | elapsed time per iteration (s): 1.03 | learning rate: 1.257E-04 | global batch size: 256 | lm loss: 2.000884E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.883 | TFLOPs: 41.13 | 15: iteration 56440/ 125429 | consumed samples: 14448640 | consumed tokens: 29590814720 | elapsed time per iteration (s): 1.05 | learning rate: 1.256E-04 | global batch size: 256 | lm loss: 1.987600E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.919 | TFLOPs: 40.14 | 15: iteration 56450/ 125429 | consumed samples: 14451200 | consumed tokens: 29596057600 | elapsed time per iteration (s): 1.06 | learning rate: 1.256E-04 | global batch size: 256 | lm loss: 1.977444E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.353 | TFLOPs: 39.89 | 15: iteration 56460/ 125429 | consumed samples: 14453760 | 
consumed tokens: 29601300480 | elapsed time per iteration (s): 1.02 | learning rate: 1.256E-04 | global batch size: 256 | lm loss: 1.984050E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.239 | TFLOPs: 41.52 | 15: iteration 56470/ 125429 | consumed samples: 14456320 | consumed tokens: 29606543360 | elapsed time per iteration (s): 1.03 | learning rate: 1.256E-04 | global batch size: 256 | lm loss: 2.001649E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.441 | TFLOPs: 41.06 | 15: iteration 56480/ 125429 | consumed samples: 14458880 | consumed tokens: 29611786240 | elapsed time per iteration (s): 1.03 | learning rate: 1.255E-04 | global batch size: 256 | lm loss: 1.991259E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.915 | TFLOPs: 40.97 | 15: iteration 56490/ 125429 | consumed samples: 14461440 | consumed tokens: 29617029120 | elapsed time per iteration (s): 1.06 | learning rate: 1.255E-04 | global batch size: 256 | lm loss: 1.996188E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.342 | TFLOPs: 39.88 | 15: iteration 56500/ 125429 | consumed samples: 14464000 | consumed tokens: 29622272000 | elapsed time per iteration (s): 1.06 | learning rate: 1.255E-04 | global batch size: 256 | lm loss: 2.008057E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.747 | TFLOPs: 39.79 | 15: iteration 56510/ 125429 | consumed samples: 14466560 | consumed tokens: 29627514880 | elapsed time per iteration (s): 1.04 | learning rate: 1.255E-04 | global batch size: 256 | lm loss: 2.006986E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 246.942 | TFLOPs: 40.81 | 15: iteration 56520/ 125429 | consumed samples: 14469120 | consumed tokens: 29632757760 | elapsed time per iteration (s): 1.03 | learning rate: 1.255E-04 | global batch size: 256 | lm loss: 2.008219E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.740 | TFLOPs: 41.27 | 15: iteration 56530/ 125429 | consumed samples: 14471680 | consumed tokens: 29638000640 | elapsed time per iteration (s): 1.04 | learning rate: 1.254E-04 | global batch size: 256 | lm loss: 1.989606E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.025 | TFLOPs: 40.49 | 15: iteration 56540/ 125429 | consumed samples: 14474240 | consumed tokens: 29643243520 | elapsed time per iteration (s): 1.04 | learning rate: 1.254E-04 | global batch size: 256 | lm loss: 2.014456E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.827 | TFLOPs: 40.79 | 15: iteration 56550/ 125429 | consumed samples: 14476800 | consumed tokens: 29648486400 | elapsed time per iteration (s): 1.05 | learning rate: 1.254E-04 | global batch size: 256 | lm loss: 2.001337E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.612 | TFLOPs: 40.42 | 15: iteration 56560/ 125429 | consumed samples: 14479360 | consumed tokens: 29653729280 | elapsed time per iteration (s): 1.05 | learning rate: 1.254E-04 | global batch size: 256 | lm loss: 2.001216E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.346 | TFLOPs: 40.21 | 15: iteration 56570/ 125429 | consumed samples: 14481920 | consumed tokens: 29658972160 | elapsed time per iteration (s): 1.03 | learning rate: 1.253E-04 | global batch size: 256 | lm loss: 
2.011907E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.302 | TFLOPs: 41.20 | 15: iteration 56580/ 125429 | consumed samples: 14484480 | consumed tokens: 29664215040 | elapsed time per iteration (s): 1.03 | learning rate: 1.253E-04 | global batch size: 256 | lm loss: 2.006793E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.641 | TFLOPs: 41.26 | 15: iteration 56590/ 125429 | consumed samples: 14487040 | consumed tokens: 29669457920 | elapsed time per iteration (s): 1.06 | learning rate: 1.253E-04 | global batch size: 256 | lm loss: 2.025523E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.987 | TFLOPs: 39.82 | 15: iteration 56600/ 125429 | consumed samples: 14489600 | consumed tokens: 29674700800 | elapsed time per iteration (s): 1.05 | learning rate: 1.253E-04 | global batch size: 256 | lm loss: 1.990311E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.416 | TFLOPs: 40.23 | 15: iteration 56610/ 125429 | consumed samples: 14492160 | consumed tokens: 29679943680 | elapsed time per iteration (s): 1.03 | learning rate: 1.253E-04 | global batch size: 256 | lm loss: 1.975993E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.893 | TFLOPs: 40.97 | 15: iteration 56620/ 125429 | consumed samples: 14494720 | consumed tokens: 29685186560 | elapsed time per iteration (s): 1.06 | learning rate: 1.252E-04 | global batch size: 256 | lm loss: 2.011501E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.128 | TFLOPs: 39.85 | 15: iteration 56630/ 125429 | consumed samples: 14497280 | consumed tokens: 
29690429440 | elapsed time per iteration (s): 1.03 | learning rate: 1.252E-04 | global batch size: 256 | lm loss: 1.999144E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.512 | TFLOPs: 41.23 | 15: iteration 56640/ 125429 | consumed samples: 14499840 | consumed tokens: 29695672320 | elapsed time per iteration (s): 1.05 | learning rate: 1.252E-04 | global batch size: 256 | lm loss: 1.994127E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.845 | TFLOPs: 40.46 | 15: iteration 56650/ 125429 | consumed samples: 14502400 | consumed tokens: 29700915200 | elapsed time per iteration (s): 1.07 | learning rate: 1.252E-04 | global batch size: 256 | lm loss: 1.998002E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.066 | TFLOPs: 39.51 | 15: iteration 56660/ 125429 | consumed samples: 14504960 | consumed tokens: 29706158080 | elapsed time per iteration (s): 1.06 | learning rate: 1.251E-04 | global batch size: 256 | lm loss: 2.026394E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.852 | TFLOPs: 39.97 | 15: iteration 56670/ 125429 | consumed samples: 14507520 | consumed tokens: 29711400960 | elapsed time per iteration (s): 1.06 | learning rate: 1.251E-04 | global batch size: 256 | lm loss: 2.005355E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.450 | TFLOPs: 40.07 | 15: iteration 56680/ 125429 | consumed samples: 14510080 | consumed tokens: 29716643840 | elapsed time per iteration (s): 1.04 | learning rate: 1.251E-04 | global batch size: 256 | lm loss: 1.988784E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 245.623 | TFLOPs: 40.59 | 15: iteration 56690/ 125429 | consumed samples: 14512640 | consumed tokens: 29721886720 | elapsed time per iteration (s): 1.06 | learning rate: 1.251E-04 | global batch size: 256 | lm loss: 2.003590E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.858 | TFLOPs: 39.80 | 15: iteration 56700/ 125429 | consumed samples: 14515200 | consumed tokens: 29727129600 | elapsed time per iteration (s): 1.03 | learning rate: 1.251E-04 | global batch size: 256 | lm loss: 2.005098E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.540 | TFLOPs: 40.91 | 15: iteration 56710/ 125429 | consumed samples: 14517760 | consumed tokens: 29732372480 | elapsed time per iteration (s): 1.05 | learning rate: 1.250E-04 | global batch size: 256 | lm loss: 1.983684E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.661 | TFLOPs: 40.43 | 15: iteration 56720/ 125429 | consumed samples: 14520320 | consumed tokens: 29737615360 | elapsed time per iteration (s): 1.05 | learning rate: 1.250E-04 | global batch size: 256 | lm loss: 2.007913E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.191 | TFLOPs: 40.19 | 15: iteration 56730/ 125429 | consumed samples: 14522880 | consumed tokens: 29742858240 | elapsed time per iteration (s): 1.05 | learning rate: 1.250E-04 | global batch size: 256 | lm loss: 1.994964E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.203 | TFLOPs: 40.19 | 15: iteration 56740/ 125429 | consumed samples: 14525440 | consumed tokens: 29748101120 | elapsed time per iteration (s): 1.04 | learning rate: 1.250E-04 | global batch size: 256 | lm loss: 2.001513E+00 | grad 
norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.900 | TFLOPs: 40.64 | 15: iteration 56750/ 125429 | consumed samples: 14528000 | consumed tokens: 29753344000 | elapsed time per iteration (s): 1.04 | learning rate: 1.249E-04 | global batch size: 256 | lm loss: 1.963419E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.149 | TFLOPs: 40.68 | 15: iteration 56760/ 125429 | consumed samples: 14530560 | consumed tokens: 29758586880 | elapsed time per iteration (s): 1.06 | learning rate: 1.249E-04 | global batch size: 256 | lm loss: 1.988700E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.268 | TFLOPs: 39.87 | 15: iteration 56770/ 125429 | consumed samples: 14533120 | consumed tokens: 29763829760 | elapsed time per iteration (s): 1.02 | learning rate: 1.249E-04 | global batch size: 256 | lm loss: 2.002442E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.951 | TFLOPs: 41.47 | 15: iteration 56780/ 125429 | consumed samples: 14535680 | consumed tokens: 29769072640 | elapsed time per iteration (s): 1.04 | learning rate: 1.249E-04 | global batch size: 256 | lm loss: 1.998407E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.570 | TFLOPs: 40.75 | 15: iteration 56790/ 125429 | consumed samples: 14538240 | consumed tokens: 29774315520 | elapsed time per iteration (s): 1.05 | learning rate: 1.248E-04 | global batch size: 256 | lm loss: 1.989369E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.750 | TFLOPs: 40.28 | 15: iteration 56800/ 125429 | consumed samples: 14540800 | consumed tokens: 29779558400 | elapsed time 
per iteration (s): 1.08 | learning rate: 1.248E-04 | global batch size: 256 | lm loss: 2.025025E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.942 | TFLOPs: 39.16 | 15: iteration 56810/ 125429 | consumed samples: 14543360 | consumed tokens: 29784801280 | elapsed time per iteration (s): 1.03 | learning rate: 1.248E-04 | global batch size: 256 | lm loss: 1.981028E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.435 | TFLOPs: 41.06 | 15: iteration 56820/ 125429 | consumed samples: 14545920 | consumed tokens: 29790044160 | elapsed time per iteration (s): 1.06 | learning rate: 1.248E-04 | global batch size: 256 | lm loss: 1.962055E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.310 | TFLOPs: 39.88 | 15: iteration 56830/ 125429 | consumed samples: 14548480 | consumed tokens: 29795287040 | elapsed time per iteration (s): 1.06 | learning rate: 1.248E-04 | global batch size: 256 | lm loss: 1.991565E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.436 | TFLOPs: 40.06 | 15: iteration 56840/ 125429 | consumed samples: 14551040 | consumed tokens: 29800529920 | elapsed time per iteration (s): 1.05 | learning rate: 1.247E-04 | global batch size: 256 | lm loss: 1.994999E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.603 | TFLOPs: 40.42 | 15: iteration 56850/ 125429 | consumed samples: 14553600 | consumed tokens: 29805772800 | elapsed time per iteration (s): 1.13 | learning rate: 1.247E-04 | global batch size: 256 | lm loss: 1.967234E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.234 | TFLOPs: 
37.55 | 15: iteration 56860/ 125429 | consumed samples: 14556160 | consumed tokens: 29811015680 | elapsed time per iteration (s): 1.09 | learning rate: 1.247E-04 | global batch size: 256 | lm loss: 2.031211E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.672 | TFLOPs: 38.95 | 15: iteration 56870/ 125429 | consumed samples: 14558720 | consumed tokens: 29816258560 | elapsed time per iteration (s): 1.05 | learning rate: 1.247E-04 | global batch size: 256 | lm loss: 1.993430E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.174 | TFLOPs: 40.19 | 15: iteration 56880/ 125429 | consumed samples: 14561280 | consumed tokens: 29821501440 | elapsed time per iteration (s): 1.04 | learning rate: 1.246E-04 | global batch size: 256 | lm loss: 1.977234E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.275 | TFLOPs: 40.86 | 15: iteration 56890/ 125429 | consumed samples: 14563840 | consumed tokens: 29826744320 | elapsed time per iteration (s): 1.03 | learning rate: 1.246E-04 | global batch size: 256 | lm loss: 1.990178E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.161 | TFLOPs: 41.01 | 15: iteration 56900/ 125429 | consumed samples: 14566400 | consumed tokens: 29831987200 | elapsed time per iteration (s): 1.08 | learning rate: 1.246E-04 | global batch size: 256 | lm loss: 1.976443E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.169 | TFLOPs: 39.19 | 15: iteration 56910/ 125429 | consumed samples: 14568960 | consumed tokens: 29837230080 | elapsed time per iteration (s): 1.05 | learning rate: 1.246E-04 | global batch size: 256 | lm loss: 2.022467E+00 | grad norm: 0.137 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.872 | TFLOPs: 40.47 | 15: iteration 56920/ 125429 | consumed samples: 14571520 | consumed tokens: 29842472960 | elapsed time per iteration (s): 1.06 | learning rate: 1.246E-04 | global batch size: 256 | lm loss: 2.008311E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.434 | TFLOPs: 39.90 | 15: iteration 56930/ 125429 | consumed samples: 14574080 | consumed tokens: 29847715840 | elapsed time per iteration (s): 1.10 | learning rate: 1.245E-04 | global batch size: 256 | lm loss: 1.993024E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.677 | TFLOPs: 38.62 | 15: iteration 56940/ 125429 | consumed samples: 14576640 | consumed tokens: 29852958720 | elapsed time per iteration (s): 1.03 | learning rate: 1.245E-04 | global batch size: 256 | lm loss: 2.008817E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.142 | TFLOPs: 41.01 | 15: iteration 56950/ 125429 | consumed samples: 14579200 | consumed tokens: 29858201600 | elapsed time per iteration (s): 1.04 | learning rate: 1.245E-04 | global batch size: 256 | lm loss: 1.990938E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.732 | TFLOPs: 40.61 | 15: iteration 56960/ 125429 | consumed samples: 14581760 | consumed tokens: 29863444480 | elapsed time per iteration (s): 1.05 | learning rate: 1.245E-04 | global batch size: 256 | lm loss: 1.972486E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.434 | TFLOPs: 40.23 | 15: iteration 56970/ 125429 | consumed samples: 14584320 | consumed tokens: 29868687360 | elapsed time per iteration (s): 1.09 | 
learning rate: 1.244E-04 | global batch size: 256 | lm loss: 2.000184E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.700 | TFLOPs: 38.79 | 15: iteration 56980/ 125429 | consumed samples: 14586880 | consumed tokens: 29873930240 | elapsed time per iteration (s): 1.05 | learning rate: 1.244E-04 | global batch size: 256 | lm loss: 1.996931E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.218 | TFLOPs: 40.19 | 15: iteration 56990/ 125429 | consumed samples: 14589440 | consumed tokens: 29879173120 | elapsed time per iteration (s): 1.06 | learning rate: 1.244E-04 | global batch size: 256 | lm loss: 1.984983E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.762 | TFLOPs: 39.95 | 15: iteration 57000/ 125429 | consumed samples: 14592000 | consumed tokens: 29884416000 | elapsed time per iteration (s): 1.09 | learning rate: 1.244E-04 | global batch size: 256 | lm loss: 2.001612E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.222 | TFLOPs: 38.87 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 57000 | lm loss value: 1.998608E+00 | lm loss PPL: 7.378777E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 57000 to checkpoints_1b5 0: [2022-11-26 12:58:12,366] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step57000 is begin to save! 0: [2022-11-26 12:58:12,374] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 12:58:12,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_01-model_00-model_states.pt. 0: [2022-11-26 12:58:12,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_03-model_00-model_states.pt... 0: [2022-11-26 12:58:12,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_03-model_00-model_states.pt. 0: [2022-11-26 12:58:12,737] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_04-model_00-model_states.pt... 0: [2022-11-26 12:58:12,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_04-model_00-model_states.pt. 0: [2022-11-26 12:58:12,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_05-model_00-model_states.pt... 0: [2022-11-26 12:58:12,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_05-model_00-model_states.pt. 0: [2022-11-26 12:58:12,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_06-model_00-model_states.pt... 0: [2022-11-26 12:58:13,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_06-model_00-model_states.pt. 0: [2022-11-26 12:58:13,052] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_07-model_00-model_states.pt... 0: [2022-11-26 12:58:13,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_07-model_00-model_states.pt. 0: [2022-11-26 12:58:13,154] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 12:58:13,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_08-model_00-model_states.pt. 0: [2022-11-26 12:58:13,262] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_09-model_00-model_states.pt... 0: [2022-11-26 12:58:13,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_09-model_00-model_states.pt. 0: [2022-11-26 12:58:13,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_10-model_00-model_states.pt... 0: [2022-11-26 12:58:13,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_10-model_00-model_states.pt. 0: [2022-11-26 12:58:13,469] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_11-model_00-model_states.pt... 0: [2022-11-26 12:58:13,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_11-model_00-model_states.pt. 0: [2022-11-26 12:58:13,573] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_12-model_00-model_states.pt... 0: [2022-11-26 12:58:13,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_12-model_00-model_states.pt. 0: [2022-11-26 12:58:13,674] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_13-model_00-model_states.pt... 0: [2022-11-26 12:58:13,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_13-model_00-model_states.pt. 0: [2022-11-26 12:58:13,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 12:58:13,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_14-model_00-model_states.pt. 0: [2022-11-26 12:58:13,881] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_15-model_00-model_states.pt... 0: [2022-11-26 12:58:13,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_15-model_00-model_states.pt. 0: [2022-11-26 12:58:13,983] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_16-model_00-model_states.pt... 0: [2022-11-26 12:58:14,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_16-model_00-model_states.pt. 0: [2022-11-26 12:58:14,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_17-model_00-model_states.pt... 0: [2022-11-26 12:58:14,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_17-model_00-model_states.pt. 0: [2022-11-26 12:58:14,191] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_18-model_00-model_states.pt... 0: [2022-11-26 12:58:14,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_18-model_00-model_states.pt. 0: [2022-11-26 12:58:14,295] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_19-model_00-model_states.pt... 0: [2022-11-26 12:58:14,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_19-model_00-model_states.pt. 0: [2022-11-26 12:58:14,402] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 12:58:14,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_20-model_00-model_states.pt. 0: [2022-11-26 12:58:14,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_21-model_00-model_states.pt... 0: [2022-11-26 12:58:14,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_21-model_00-model_states.pt. 0: [2022-11-26 12:58:14,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_22-model_00-model_states.pt... 0: [2022-11-26 12:58:14,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_22-model_00-model_states.pt. 0: [2022-11-26 12:58:14,717] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_23-model_00-model_states.pt... 0: [2022-11-26 12:58:14,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_23-model_00-model_states.pt. 0: [2022-11-26 12:58:14,823] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_24-model_00-model_states.pt... 0: [2022-11-26 12:58:14,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_24-model_00-model_states.pt. 0: [2022-11-26 12:58:14,928] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_25-model_00-model_states.pt... 0: [2022-11-26 12:58:15,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_25-model_00-model_states.pt. 0: [2022-11-26 12:58:15,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 12:58:15,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_26-model_00-model_states.pt. 0: [2022-11-26 12:58:15,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_27-model_00-model_states.pt... 0: [2022-11-26 12:58:15,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_27-model_00-model_states.pt. 0: [2022-11-26 12:58:15,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_28-model_00-model_states.pt... 0: [2022-11-26 12:58:15,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_28-model_00-model_states.pt. 0: [2022-11-26 12:58:15,343] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_29-model_00-model_states.pt... 0: [2022-11-26 12:58:15,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_29-model_00-model_states.pt. 0: [2022-11-26 12:58:15,441] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_30-model_00-model_states.pt... 0: [2022-11-26 12:58:15,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_30-model_00-model_states.pt. 0: [2022-11-26 12:58:15,544] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/layer_32-model_00-model_states.pt... 0: [2022-11-26 12:58:15,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 12:58:15,551] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step57000/mp_rank_00_model_states.pt 0: [2022-11-26 12:58:15,551] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/mp_rank_00_model_states.pt... 0: [2022-11-26 12:58:15,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/mp_rank_00_model_states.pt. 0: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
5: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
9: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
4: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
11: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 0: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
1: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
6: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
0: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 3: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 11: [2022-11-26 12:58:15,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step57000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 9: [2022-11-26 12:58:15,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 12:58:15,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 12:58:15,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 12: [2022-11-26 12:58:15,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:58:15,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 12:58:15,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 7: [2022-11-26 12:58:15,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:58:15,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 12:58:15,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 11: [2022-11-26 12:58:15,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:58:15,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:58:15,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 12:58:15,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
7: [2022-11-26 12:58:15,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:58:15,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 12:58:15,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 2: [2022-11-26 12:58:15,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:58:15,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 12:58:15,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 12: [2022-11-26 12:58:15,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:58:15,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 12:58:15,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 6: [2022-11-26 12:58:15,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:58:15,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 12:58:15,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
2: [2022-11-26 12:58:15,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:58:15,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 12:58:15,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 9: [2022-11-26 12:58:15,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:58:15,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 12:58:15,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 0: [2022-11-26 12:58:15,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:58:15,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:58:15,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 12:58:15,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 12: [2022-11-26 12:58:15,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 12:58:15,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 12:58:15,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 7: [2022-11-26 12:58:15,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:58:15,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 12:58:15,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 5: [2022-11-26 12:58:15,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:58:15,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 12:58:15,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 10: [2022-11-26 12:58:15,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:58:15,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 12:58:15,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 6: [2022-11-26 12:58:15,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
2: [2022-11-26 12:58:15,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:58:15,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 2: [2022-11-26 12:58:15,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 6: [2022-11-26 12:58:15,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 2: [2022-11-26 12:58:15,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 2: [2022-11-26 12:58:15,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:58:15,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 12:58:15,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 0: [2022-11-26 12:58:15,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:58:15,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 12:58:15,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 0: [2022-11-26 12:58:15,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 12:58:15,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 12:58:15,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 4: [2022-11-26 12:58:15,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:58:15,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:58:15,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 12:58:15,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 9: [2022-11-26 12:58:15,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:58:15,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 4: [2022-11-26 12:58:15,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 9: [2022-11-26 12:58:15,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 12:58:15,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 5: [2022-11-26 12:58:15,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 12:58:15,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 12:58:15,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 6: [2022-11-26 12:58:15,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:58:15,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 12:58:15,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 15: [2022-11-26 12:58:15,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:58:15,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:58:15,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 12:58:15,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:58:15,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 15: [2022-11-26 12:58:15,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 7: [2022-11-26 12:58:15,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
9: [2022-11-26 12:58:15,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:58:15,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 12:58:15,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 15: [2022-11-26 12:58:15,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 12:58:15,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 5: [2022-11-26 12:58:15,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:58:15,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 12:58:15,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 7: [2022-11-26 12:58:15,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 12:58:15,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 12:58:15,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 12: [2022-11-26 12:58:15,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
7: [2022-11-26 12:58:15,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:58:15,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 12:58:15,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 7: [2022-11-26 12:58:15,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 12:58:15,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 10: [2022-11-26 12:58:15,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:58:15,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:58:15,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 12:58:15,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 12:58:15,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 10: [2022-11-26 12:58:15,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 10: [2022-11-26 12:58:15,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 12:58:15,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 12:58:15,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 5: [2022-11-26 12:58:15,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:58:15,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 12:58:15,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 15: [2022-11-26 12:58:15,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:58:15,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 12:58:15,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 0: [2022-11-26 12:58:15,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:58:15,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 15: [2022-11-26 12:58:15,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:58:15,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
15: [2022-11-26 12:58:15,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 12:58:15,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 12: [2022-11-26 12:58:15,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:58:15,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 12:58:15,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 2: [2022-11-26 12:58:15,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:58:15,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 12:58:15,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 5: [2022-11-26 12:58:15,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:58:15,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:58:15,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 12:58:15,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
10: [2022-11-26 12:58:15,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:58:15,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 12:58:15,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 4: [2022-11-26 12:58:15,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:58:15,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:58:15,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 12:58:15,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 12:58:15,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 4: [2022-11-26 12:58:15,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 10: [2022-11-26 12:58:15,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:58:15,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 12:58:15,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
2: [2022-11-26 12:58:15,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:58:15,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 12:58:15,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 4: [2022-11-26 12:58:15,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:58:15,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:58:15,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 12:58:15,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 12:58:15,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 4: [2022-11-26 12:58:15,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 5: [2022-11-26 12:58:15,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 12:58:15,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 14: [2022-11-26 12:58:15,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 12:58:15,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:58:15,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:58:15,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:58:15,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:58:15,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 12:58:15,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 12:58:15,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 12:58:15,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 12:58:15,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 12:58:15,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 14: [2022-11-26 12:58:15,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 14: [2022-11-26 12:58:15,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
14: [2022-11-26 12:58:15,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 14: [2022-11-26 12:58:15,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 12: [2022-11-26 12:58:15,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:58:15,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 12:58:15,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 4: [2022-11-26 12:58:15,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:58:15,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 12:58:15,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 6: [2022-11-26 12:58:15,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:58:15,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 12:58:15,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 1: [2022-11-26 12:58:15,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 12:58:15,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:58:15,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:58:15,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:58:15,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:58:15,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 14: [2022-11-26 12:58:15,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 9: [2022-11-26 12:58:15,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 14: [2022-11-26 12:58:15,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 2: [2022-11-26 12:58:15,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:58:15,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 12:58:15,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 1: [2022-11-26 12:58:15,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 12:58:15,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:58:15,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:58:15,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 12:58:15,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 12:58:15,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 12:58:15,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 1: [2022-11-26 12:58:15,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 12:58:15,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 12:58:15,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 1: [2022-11-26 12:58:15,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 12:58:15,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 1: [2022-11-26 12:58:15,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
1: [2022-11-26 12:58:15,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 1: [2022-11-26 12:58:15,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 2: [2022-11-26 12:58:15,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:58:15,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 2: [2022-11-26 12:58:15,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 15: [2022-11-26 12:58:15,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 12:58:15,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 2: [2022-11-26 12:58:15,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 15: [2022-11-26 12:58:15,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:58:15,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 12:58:15,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 12: [2022-11-26 12:58:15,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 12:58:15,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 12:58:15,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 13: [2022-11-26 12:58:15,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:58:15,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:58:15,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:58:15,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 8: [2022-11-26 12:58:15,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 13: [2022-11-26 12:58:15,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 8: [2022-11-26 12:58:15,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 13: [2022-11-26 12:58:15,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:58:15,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
15: [2022-11-26 12:58:15,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 13: [2022-11-26 12:58:15,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 8: [2022-11-26 12:58:15,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 15: [2022-11-26 12:58:15,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 13: [2022-11-26 12:58:15,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 8: [2022-11-26 12:58:15,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 13: [2022-11-26 12:58:15,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:58:15,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:58:15,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 8: [2022-11-26 12:58:15,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 13: [2022-11-26 12:58:15,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 8: [2022-11-26 12:58:15,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
13: [2022-11-26 12:58:15,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:58:15,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:58:15,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 8: [2022-11-26 12:58:15,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 13: [2022-11-26 12:58:15,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 8: [2022-11-26 12:58:15,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 13: [2022-11-26 12:58:15,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:58:15,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:58:15,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 8: [2022-11-26 12:58:15,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 13: [2022-11-26 12:58:15,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:58:15,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
13: [2022-11-26 12:58:15,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 8: [2022-11-26 12:58:15,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:58:15,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 8: [2022-11-26 12:58:15,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 13: [2022-11-26 12:58:15,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 8: [2022-11-26 12:58:15,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 13: [2022-11-26 12:58:15,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:58:15,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 12:58:15,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 6: [2022-11-26 12:58:15,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:58:15,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 12:58:15,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 12:58:15,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 6: [2022-11-26 12:58:15,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 12:58:15,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 11: [2022-11-26 12:58:15,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 12:58:15,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 11: [2022-11-26 12:58:15,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:58:15,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 12:58:15,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 11: [2022-11-26 12:58:15,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:58:15,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 12:58:15,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 11: [2022-11-26 12:58:15,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-26 12:58:15,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 12:58:15,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 11: [2022-11-26 12:58:15,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:58:15,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 12:58:15,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 11: [2022-11-26 12:58:15,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:58:15,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 12:58:15,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 11: [2022-11-26 12:58:15,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 12:58:15,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 12:58:15,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 11: [2022-11-26 12:58:15,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 12:58:15,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 12:58:15,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 10: [2022-11-26 12:58:15,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 12:58:15,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 12:58:15,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 0: [2022-11-26 12:58:15,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:58:15,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 12:58:15,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 1: [2022-11-26 12:58:15,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 12:58:15,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 12:58:15,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 10: [2022-11-26 12:58:15,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 12:58:15,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 12:58:15,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 14: [2022-11-26 12:58:15,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 12:58:15,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 12:58:15,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 12: [2022-11-26 12:58:15,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 12:58:15,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 12:58:15,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 15: [2022-11-26 12:58:15,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 12:58:15,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 12:58:15,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 6: [2022-11-26 12:58:15,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 12:58:15,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 12:58:15,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 3: [2022-11-26 12:58:15,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:58:15,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:58:15,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:58:15,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:58:15,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:58:15,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:58:15,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 12:58:15,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 12:58:15,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 12:58:15,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 12:58:15,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 12:58:15,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 12:58:15,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 12:58:15,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 12:58:15,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 3: [2022-11-26 12:58:15,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 3: [2022-11-26 12:58:15,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 3: [2022-11-26 12:58:15,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 3: [2022-11-26 12:58:15,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
3: [2022-11-26 12:58:15,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 3: [2022-11-26 12:58:15,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 3: [2022-11-26 12:58:15,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 12:58:15,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 12:58:15,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 9: [2022-11-26 12:58:15,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:58:15,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:58:15,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 12:58:15,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 12:58:15,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 12:58:15,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 12:58:15,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 
9: [2022-11-26 12:58:15,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 9: [2022-11-26 12:58:15,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 13: [2022-11-26 12:58:15,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:58:15,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 13: [2022-11-26 12:58:15,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 8: [2022-11-26 12:58:15,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 13: [2022-11-26 12:58:15,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 8: [2022-11-26 12:58:15,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 8: [2022-11-26 12:58:15,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 12:58:15,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 12:58:15,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 1: [2022-11-26 12:58:15,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 12:58:15,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 12:58:15,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 5: [2022-11-26 12:58:15,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:58:15,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 12:58:15,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 0: [2022-11-26 12:58:15,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:58:15,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 12:58:15,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 12:58:15,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 12:58:15,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 0: [2022-11-26 12:58:15,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 7: [2022-11-26 12:58:15,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 12:58:15,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 12:58:15,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 0: [2022-11-26 12:58:15,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 12:58:15,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 6: [2022-11-26 12:58:16,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 12:58:16,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 12:58:16,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 4: [2022-11-26 12:58:16,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 12:58:16,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 12:58:16,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 14: [2022-11-26 12:58:16,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 12:58:16,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 12:58:16,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 5: [2022-11-26 12:58:16,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 12:58:16,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step57000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 12:58:16,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step57000 is ready now! 0: successfully saved checkpoint at iteration 57000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3767.80 15: iteration 57010/ 125429 | consumed samples: 14594560 | consumed tokens: 29889658880 | elapsed time per iteration (s): 1.44 | learning rate: 1.244E-04 | global batch size: 256 | lm loss: 1.986617E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.290 | TFLOPs: 29.30 | 15: iteration 57020/ 125429 | consumed samples: 14597120 | consumed tokens: 29894901760 | elapsed time per iteration (s): 1.06 | learning rate: 1.243E-04 | global batch size: 256 | lm loss: 1.999233E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.289 | TFLOPs: 39.87 | 15: iteration 57030/ 125429 | consumed samples: 14599680 | consumed tokens: 29900144640 | elapsed time per iteration (s): 1.04 | learning rate: 1.243E-04 | global batch size: 256 | lm loss: 2.013432E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.040 | TFLOPs: 40.83 | 15: iteration 57040/ 125429 | consumed 
samples: 14602240 | consumed tokens: 29905387520 | elapsed time per iteration (s): 1.03 | learning rate: 1.243E-04 | global batch size: 256 | lm loss: 1.985770E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.571 | TFLOPs: 41.24 | 15: iteration 57050/ 125429 | consumed samples: 14604800 | consumed tokens: 29910630400 | elapsed time per iteration (s): 1.05 | learning rate: 1.243E-04 | global batch size: 256 | lm loss: 1.977102E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.726 | TFLOPs: 40.44 | 15: iteration 57060/ 125429 | consumed samples: 14607360 | consumed tokens: 29915873280 | elapsed time per iteration (s): 1.06 | learning rate: 1.242E-04 | global batch size: 256 | lm loss: 1.965745E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.586 | TFLOPs: 40.09 | 15: iteration 57070/ 125429 | consumed samples: 14609920 | consumed tokens: 29921116160 | elapsed time per iteration (s): 1.04 | learning rate: 1.242E-04 | global batch size: 256 | lm loss: 2.014328E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.467 | TFLOPs: 40.57 | 15: iteration 57080/ 125429 | consumed samples: 14612480 | consumed tokens: 29926359040 | elapsed time per iteration (s): 1.06 | learning rate: 1.242E-04 | global batch size: 256 | lm loss: 1.978279E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.205 | TFLOPs: 39.86 | 15: iteration 57090/ 125429 | consumed samples: 14615040 | consumed tokens: 29931601920 | elapsed time per iteration (s): 1.08 | learning rate: 1.242E-04 | global batch size: 256 | lm loss: 2.014034E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | samples per second: 238.081 | TFLOPs: 39.34 | 15: iteration 57100/ 125429 | consumed samples: 14617600 | consumed tokens: 29936844800 | elapsed time per iteration (s): 1.04 | learning rate: 1.242E-04 | global batch size: 256 | lm loss: 1.991074E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.871 | TFLOPs: 40.80 | 15: iteration 57110/ 125429 | consumed samples: 14620160 | consumed tokens: 29942087680 | elapsed time per iteration (s): 1.03 | learning rate: 1.241E-04 | global batch size: 256 | lm loss: 1.984379E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.883 | TFLOPs: 41.13 | 15: iteration 57120/ 125429 | consumed samples: 14622720 | consumed tokens: 29947330560 | elapsed time per iteration (s): 1.06 | learning rate: 1.241E-04 | global batch size: 256 | lm loss: 2.006691E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.428 | TFLOPs: 39.73 | 15: iteration 57130/ 125429 | consumed samples: 14625280 | consumed tokens: 29952573440 | elapsed time per iteration (s): 1.04 | learning rate: 1.241E-04 | global batch size: 256 | lm loss: 1.985894E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.422 | TFLOPs: 40.56 | 15: iteration 57140/ 125429 | consumed samples: 14627840 | consumed tokens: 29957816320 | elapsed time per iteration (s): 1.08 | learning rate: 1.241E-04 | global batch size: 256 | lm loss: 1.989499E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.870 | TFLOPs: 39.31 | 15: iteration 57150/ 125429 | consumed samples: 14630400 | consumed tokens: 29963059200 | elapsed time per iteration (s): 1.05 | learning rate: 1.240E-04 | global batch size: 
256 | lm loss: 1.974612E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.835 | TFLOPs: 40.13 | 15: iteration 57160/ 125429 | consumed samples: 14632960 | consumed tokens: 29968302080 | elapsed time per iteration (s): 1.06 | learning rate: 1.240E-04 | global batch size: 256 | lm loss: 2.013770E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.857 | TFLOPs: 39.97 | 15: iteration 57170/ 125429 | consumed samples: 14635520 | consumed tokens: 29973544960 | elapsed time per iteration (s): 1.02 | learning rate: 1.240E-04 | global batch size: 256 | lm loss: 2.000587E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.497 | TFLOPs: 41.56 | 15: iteration 57180/ 125429 | consumed samples: 14638080 | consumed tokens: 29978787840 | elapsed time per iteration (s): 1.05 | learning rate: 1.240E-04 | global batch size: 256 | lm loss: 2.011238E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.765 | TFLOPs: 40.28 | 15: iteration 57190/ 125429 | consumed samples: 14640640 | consumed tokens: 29984030720 | elapsed time per iteration (s): 1.13 | learning rate: 1.240E-04 | global batch size: 256 | lm loss: 1.984282E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.513 | TFLOPs: 37.60 | 15: iteration 57200/ 125429 | consumed samples: 14643200 | consumed tokens: 29989273600 | elapsed time per iteration (s): 1.04 | learning rate: 1.239E-04 | global batch size: 256 | lm loss: 1.999388E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.525 | TFLOPs: 40.57 | 15: iteration 57210/ 125429 | consumed samples: 14645760 | consumed 
tokens: 29994516480 | elapsed time per iteration (s): 1.04 | learning rate: 1.239E-04 | global batch size: 256 | lm loss: 2.015111E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.514 | TFLOPs: 40.57 | 15: iteration 57220/ 125429 | consumed samples: 14648320 | consumed tokens: 29999759360 | elapsed time per iteration (s): 1.05 | learning rate: 1.239E-04 | global batch size: 256 | lm loss: 1.965126E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.289 | TFLOPs: 40.37 | 15: iteration 57230/ 125429 | consumed samples: 14650880 | consumed tokens: 30005002240 | elapsed time per iteration (s): 1.03 | learning rate: 1.239E-04 | global batch size: 256 | lm loss: 2.001628E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.501 | TFLOPs: 41.23 | 15: iteration 57240/ 125429 | consumed samples: 14653440 | consumed tokens: 30010245120 | elapsed time per iteration (s): 1.03 | learning rate: 1.238E-04 | global batch size: 256 | lm loss: 2.040126E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.346 | TFLOPs: 40.88 | 15: iteration 57250/ 125429 | consumed samples: 14656000 | consumed tokens: 30015488000 | elapsed time per iteration (s): 1.08 | learning rate: 1.238E-04 | global batch size: 256 | lm loss: 2.016767E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.640 | TFLOPs: 39.11 | 15: iteration 57260/ 125429 | consumed samples: 14658560 | consumed tokens: 30020730880 | elapsed time per iteration (s): 1.03 | learning rate: 1.238E-04 | global batch size: 256 | lm loss: 2.023483E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 247.804 | TFLOPs: 40.95 | 15: iteration 57270/ 125429 | consumed samples: 14661120 | consumed tokens: 30025973760 | elapsed time per iteration (s): 1.04 | learning rate: 1.238E-04 | global batch size: 256 | lm loss: 1.963642E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.563 | TFLOPs: 40.75 | 15: iteration 57280/ 125429 | consumed samples: 14663680 | consumed tokens: 30031216640 | elapsed time per iteration (s): 1.02 | learning rate: 1.237E-04 | global batch size: 256 | lm loss: 1.952566E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.612 | TFLOPs: 41.42 | 15: iteration 57290/ 125429 | consumed samples: 14666240 | consumed tokens: 30036459520 | elapsed time per iteration (s): 1.04 | learning rate: 1.237E-04 | global batch size: 256 | lm loss: 2.000939E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.599 | TFLOPs: 40.75 | 15: iteration 57300/ 125429 | consumed samples: 14668800 | consumed tokens: 30041702400 | elapsed time per iteration (s): 1.04 | learning rate: 1.237E-04 | global batch size: 256 | lm loss: 1.988553E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.952 | TFLOPs: 40.65 | 15: iteration 57310/ 125429 | consumed samples: 14671360 | consumed tokens: 30046945280 | elapsed time per iteration (s): 1.02 | learning rate: 1.237E-04 | global batch size: 256 | lm loss: 1.999169E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.419 | TFLOPs: 41.38 | 15: iteration 57320/ 125429 | consumed samples: 14673920 | consumed tokens: 30052188160 | elapsed time per iteration (s): 1.19 | learning rate: 1.237E-04 | global batch size: 256 | lm loss: 2.013849E+00 | 
grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.122 | TFLOPs: 35.55 | 15: iteration 57330/ 125429 | consumed samples: 14676480 | consumed tokens: 30057431040 | elapsed time per iteration (s): 1.04 | learning rate: 1.236E-04 | global batch size: 256 | lm loss: 1.965083E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.925 | TFLOPs: 40.64 | 15: iteration 57340/ 125429 | consumed samples: 14679040 | consumed tokens: 30062673920 | elapsed time per iteration (s): 1.05 | learning rate: 1.236E-04 | global batch size: 256 | lm loss: 1.994272E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.176 | TFLOPs: 40.35 | 15: iteration 57350/ 125429 | consumed samples: 14681600 | consumed tokens: 30067916800 | elapsed time per iteration (s): 1.02 | learning rate: 1.236E-04 | global batch size: 256 | lm loss: 1.977913E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.480 | TFLOPs: 41.39 | 15: iteration 57360/ 125429 | consumed samples: 14684160 | consumed tokens: 30073159680 | elapsed time per iteration (s): 1.09 | learning rate: 1.236E-04 | global batch size: 256 | lm loss: 1.983278E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.641 | TFLOPs: 38.78 | 15: iteration 57370/ 125429 | consumed samples: 14686720 | consumed tokens: 30078402560 | elapsed time per iteration (s): 1.05 | learning rate: 1.235E-04 | global batch size: 256 | lm loss: 1.986782E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.510 | TFLOPs: 40.41 | 15: iteration 57380/ 125429 | consumed samples: 14689280 | consumed tokens: 30083645440 | elapsed 
time per iteration (s): 1.10 | learning rate: 1.235E-04 | global batch size: 256 | lm loss: 2.031195E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.343 | TFLOPs: 38.56 | 15: iteration 57390/ 125429 | consumed samples: 14691840 | consumed tokens: 30088888320 | elapsed time per iteration (s): 1.08 | learning rate: 1.235E-04 | global batch size: 256 | lm loss: 1.985435E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.763 | TFLOPs: 39.13 | 15: iteration 57400/ 125429 | consumed samples: 14694400 | consumed tokens: 30094131200 | elapsed time per iteration (s): 1.04 | learning rate: 1.235E-04 | global batch size: 256 | lm loss: 2.009919E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.073 | TFLOPs: 40.67 | 15: iteration 57410/ 125429 | consumed samples: 14696960 | consumed tokens: 30099374080 | elapsed time per iteration (s): 1.03 | learning rate: 1.235E-04 | global batch size: 256 | lm loss: 2.023205E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.862 | TFLOPs: 40.96 | 15: iteration 57420/ 125429 | consumed samples: 14699520 | consumed tokens: 30104616960 | elapsed time per iteration (s): 1.03 | learning rate: 1.234E-04 | global batch size: 256 | lm loss: 2.003986E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.457 | TFLOPs: 41.22 | 15: iteration 57430/ 125429 | consumed samples: 14702080 | consumed tokens: 30109859840 | elapsed time per iteration (s): 1.05 | learning rate: 1.234E-04 | global batch size: 256 | lm loss: 1.997248E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.929 | TFLOPs: 
40.15 | 15: iteration 57440/ 125429 | consumed samples: 14704640 | consumed tokens: 30115102720 | elapsed time per iteration (s): 1.06 | learning rate: 1.234E-04 | global batch size: 256 | lm loss: 1.964233E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.779 | TFLOPs: 39.79 | 15: iteration 57450/ 125429 | consumed samples: 14707200 | consumed tokens: 30120345600 | elapsed time per iteration (s): 1.05 | learning rate: 1.234E-04 | global batch size: 256 | lm loss: 1.994057E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.815 | TFLOPs: 40.46 | 15: iteration 57460/ 125429 | consumed samples: 14709760 | consumed tokens: 30125588480 | elapsed time per iteration (s): 1.05 | learning rate: 1.233E-04 | global batch size: 256 | lm loss: 2.001367E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.710 | TFLOPs: 40.27 | 15: iteration 57470/ 125429 | consumed samples: 14712320 | consumed tokens: 30130831360 | elapsed time per iteration (s): 1.04 | learning rate: 1.233E-04 | global batch size: 256 | lm loss: 1.983690E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.056 | TFLOPs: 40.83 | 15: iteration 57480/ 125429 | consumed samples: 14714880 | consumed tokens: 30136074240 | elapsed time per iteration (s): 1.09 | learning rate: 1.233E-04 | global batch size: 256 | lm loss: 1.998975E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.469 | TFLOPs: 38.91 | 15: iteration 57490/ 125429 | consumed samples: 14717440 | consumed tokens: 30141317120 | elapsed time per iteration (s): 1.03 | learning rate: 1.233E-04 | global batch size: 256 | lm loss: 1.999140E+00 | grad norm: 0.129 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.090 | TFLOPs: 41.16 | 15: iteration 57500/ 125429 | consumed samples: 14720000 | consumed tokens: 30146560000 | elapsed time per iteration (s): 1.10 | learning rate: 1.233E-04 | global batch size: 256 | lm loss: 2.027529E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.678 | TFLOPs: 38.62 | 15: iteration 57510/ 125429 | consumed samples: 14722560 | consumed tokens: 30151802880 | elapsed time per iteration (s): 1.03 | learning rate: 1.232E-04 | global batch size: 256 | lm loss: 1.995696E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.737 | TFLOPs: 41.11 | 15: iteration 57520/ 125429 | consumed samples: 14725120 | consumed tokens: 30157045760 | elapsed time per iteration (s): 1.04 | learning rate: 1.232E-04 | global batch size: 256 | lm loss: 2.001707E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.692 | TFLOPs: 40.77 | 15: iteration 57530/ 125429 | consumed samples: 14727680 | consumed tokens: 30162288640 | elapsed time per iteration (s): 1.06 | learning rate: 1.232E-04 | global batch size: 256 | lm loss: 2.020486E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.413 | TFLOPs: 40.06 | 15: iteration 57540/ 125429 | consumed samples: 14730240 | consumed tokens: 30167531520 | elapsed time per iteration (s): 1.07 | learning rate: 1.232E-04 | global batch size: 256 | lm loss: 1.998898E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.881 | TFLOPs: 39.64 | 15: iteration 57550/ 125429 | consumed samples: 14732800 | consumed tokens: 30172774400 | elapsed time per iteration (s): 1.09 | 
learning rate: 1.231E-04 | global batch size: 256 | lm loss: 2.011808E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.424 | TFLOPs: 38.74 | 15: iteration 57560/ 125429 | consumed samples: 14735360 | consumed tokens: 30178017280 | elapsed time per iteration (s): 1.03 | learning rate: 1.231E-04 | global batch size: 256 | lm loss: 1.964383E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.248 | TFLOPs: 41.19 | 15: iteration 57570/ 125429 | consumed samples: 14737920 | consumed tokens: 30183260160 | elapsed time per iteration (s): 1.07 | learning rate: 1.231E-04 | global batch size: 256 | lm loss: 2.004119E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.222 | TFLOPs: 39.37 | 15: iteration 57580/ 125429 | consumed samples: 14740480 | consumed tokens: 30188503040 | elapsed time per iteration (s): 1.04 | learning rate: 1.231E-04 | global batch size: 256 | lm loss: 2.001463E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.663 | TFLOPs: 40.60 | 15: iteration 57590/ 125429 | consumed samples: 14743040 | consumed tokens: 30193745920 | elapsed time per iteration (s): 1.02 | learning rate: 1.231E-04 | global batch size: 256 | lm loss: 1.995803E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.478 | TFLOPs: 41.39 | 15: iteration 57600/ 125429 | consumed samples: 14745600 | consumed tokens: 30198988800 | elapsed time per iteration (s): 1.05 | learning rate: 1.230E-04 | global batch size: 256 | lm loss: 1.992259E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.863 | TFLOPs: 40.30 | 15: iteration 57610/ 
125429 | consumed samples: 14748160 | consumed tokens: 30204231680 | elapsed time per iteration (s): 1.04 | learning rate: 1.230E-04 | global batch size: 256 | lm loss: 1.977164E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.173 | TFLOPs: 40.68 | 15: iteration 57620/ 125429 | consumed samples: 14750720 | consumed tokens: 30209474560 | elapsed time per iteration (s): 1.06 | learning rate: 1.230E-04 | global batch size: 256 | lm loss: 1.998600E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.379 | TFLOPs: 39.72 | 15: iteration 57630/ 125429 | consumed samples: 14753280 | consumed tokens: 30214717440 | elapsed time per iteration (s): 1.04 | learning rate: 1.230E-04 | global batch size: 256 | lm loss: 1.974501E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.619 | TFLOPs: 40.59 | 15: iteration 57640/ 125429 | consumed samples: 14755840 | consumed tokens: 30219960320 | elapsed time per iteration (s): 1.06 | learning rate: 1.229E-04 | global batch size: 256 | lm loss: 2.004425E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.489 | TFLOPs: 39.74 | 15: iteration 57650/ 125429 | consumed samples: 14758400 | consumed tokens: 30225203200 | elapsed time per iteration (s): 1.03 | learning rate: 1.229E-04 | global batch size: 256 | lm loss: 1.974516E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.289 | TFLOPs: 41.20 | 15: iteration 57660/ 125429 | consumed samples: 14760960 | consumed tokens: 30230446080 | elapsed time per iteration (s): 1.03 | learning rate: 1.229E-04 | global batch size: 256 | lm loss: 2.004997E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 247.748 | TFLOPs: 40.94 | 15: iteration 57670/ 125429 | consumed samples: 14763520 | consumed tokens: 30235688960 | elapsed time per iteration (s): 1.05 | learning rate: 1.229E-04 | global batch size: 256 | lm loss: 1.985968E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.451 | TFLOPs: 40.40 | 15: iteration 57680/ 125429 | consumed samples: 14766080 | consumed tokens: 30240931840 | elapsed time per iteration (s): 1.08 | learning rate: 1.228E-04 | global batch size: 256 | lm loss: 1.988688E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.281 | TFLOPs: 39.21 | 15: iteration 57690/ 125429 | consumed samples: 14768640 | consumed tokens: 30246174720 | elapsed time per iteration (s): 1.08 | learning rate: 1.228E-04 | global batch size: 256 | lm loss: 1.991422E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.122 | TFLOPs: 39.35 | 15: iteration 57700/ 125429 | consumed samples: 14771200 | consumed tokens: 30251417600 | elapsed time per iteration (s): 1.05 | learning rate: 1.228E-04 | global batch size: 256 | lm loss: 2.045549E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.992 | TFLOPs: 40.16 | 15: iteration 57710/ 125429 | consumed samples: 14773760 | consumed tokens: 30256660480 | elapsed time per iteration (s): 1.08 | learning rate: 1.228E-04 | global batch size: 256 | lm loss: 1.998829E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.962 | TFLOPs: 39.16 | 15: iteration 57720/ 125429 | consumed samples: 14776320 | consumed tokens: 30261903360 | elapsed time per iteration (s): 1.03 | learning rate: 
1.228E-04 | global batch size: 256 | lm loss: 1.991695E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.063 | TFLOPs: 40.99 | 15: iteration 57730/ 125429 | consumed samples: 14778880 | consumed tokens: 30267146240 | elapsed time per iteration (s): 1.03 | learning rate: 1.227E-04 | global batch size: 256 | lm loss: 1.970390E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.990 | TFLOPs: 40.98 | 15: iteration 57740/ 125429 | consumed samples: 14781440 | consumed tokens: 30272389120 | elapsed time per iteration (s): 1.04 | learning rate: 1.227E-04 | global batch size: 256 | lm loss: 2.000879E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.215 | TFLOPs: 40.69 | 15: iteration 57750/ 125429 | consumed samples: 14784000 | consumed tokens: 30277632000 | elapsed time per iteration (s): 1.05 | learning rate: 1.227E-04 | global batch size: 256 | lm loss: 2.004867E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.289 | TFLOPs: 40.37 | 15: iteration 57760/ 125429 | consumed samples: 14786560 | consumed tokens: 30282874880 | elapsed time per iteration (s): 1.04 | learning rate: 1.227E-04 | global batch size: 256 | lm loss: 2.006158E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.147 | TFLOPs: 40.51 | 15: iteration 57770/ 125429 | consumed samples: 14789120 | consumed tokens: 30288117760 | elapsed time per iteration (s): 1.05 | learning rate: 1.226E-04 | global batch size: 256 | lm loss: 1.984387E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.550 | TFLOPs: 40.25 | 15: iteration 57780/ 125429 | 
consumed samples: 14791680 | consumed tokens: 30293360640 | elapsed time per iteration (s): 1.04 | learning rate: 1.226E-04 | global batch size: 256 | lm loss: 1.981360E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.588 | TFLOPs: 40.59 | 15: iteration 57790/ 125429 | consumed samples: 14794240 | consumed tokens: 30298603520 | elapsed time per iteration (s): 1.04 | learning rate: 1.226E-04 | global batch size: 256 | lm loss: 1.995924E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.376 | TFLOPs: 40.72 | 15: iteration 57800/ 125429 | consumed samples: 14796800 | consumed tokens: 30303846400 | elapsed time per iteration (s): 1.08 | learning rate: 1.226E-04 | global batch size: 256 | lm loss: 2.008829E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.117 | TFLOPs: 39.02 | 15: iteration 57810/ 125429 | consumed samples: 14799360 | consumed tokens: 30309089280 | elapsed time per iteration (s): 1.05 | learning rate: 1.226E-04 | global batch size: 256 | lm loss: 1.982007E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.836 | TFLOPs: 40.13 | 15: iteration 57820/ 125429 | consumed samples: 14801920 | consumed tokens: 30314332160 | elapsed time per iteration (s): 1.10 | learning rate: 1.225E-04 | global batch size: 256 | lm loss: 2.001781E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.673 | TFLOPs: 38.45 | 15: iteration 57830/ 125429 | consumed samples: 14804480 | consumed tokens: 30319575040 | elapsed time per iteration (s): 1.07 | learning rate: 1.225E-04 | global batch size: 256 | lm loss: 1.991814E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 239.709 | TFLOPs: 39.61 | 15: iteration 57840/ 125429 | consumed samples: 14807040 | consumed tokens: 30324817920 | elapsed time per iteration (s): 1.18 | learning rate: 1.225E-04 | global batch size: 256 | lm loss: 1.956600E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.654 | TFLOPs: 35.97 | 15: iteration 57850/ 125429 | consumed samples: 14809600 | consumed tokens: 30330060800 | elapsed time per iteration (s): 1.10 | learning rate: 1.225E-04 | global batch size: 256 | lm loss: 1.984448E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.340 | TFLOPs: 38.56 | 15: iteration 57860/ 125429 | consumed samples: 14812160 | consumed tokens: 30335303680 | elapsed time per iteration (s): 1.08 | learning rate: 1.224E-04 | global batch size: 256 | lm loss: 2.006941E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.025 | TFLOPs: 39.00 | 15: iteration 57870/ 125429 | consumed samples: 14814720 | consumed tokens: 30340546560 | elapsed time per iteration (s): 1.06 | learning rate: 1.224E-04 | global batch size: 256 | lm loss: 2.003129E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.797 | TFLOPs: 39.79 | 15: iteration 57880/ 125429 | consumed samples: 14817280 | consumed tokens: 30345789440 | elapsed time per iteration (s): 1.06 | learning rate: 1.224E-04 | global batch size: 256 | lm loss: 1.982773E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.049 | TFLOPs: 40.00 | 15: iteration 57890/ 125429 | consumed samples: 14819840 | consumed tokens: 30351032320 | elapsed time per iteration (s): 1.04 | learning rate: 1.224E-04 | global batch 
size: 256 | lm loss: 1.982521E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.747 | TFLOPs: 40.61 | 15: iteration 57900/ 125429 | consumed samples: 14822400 | consumed tokens: 30356275200 | elapsed time per iteration (s): 1.07 | learning rate: 1.224E-04 | global batch size: 256 | lm loss: 1.986839E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.207 | TFLOPs: 39.53 | 15: iteration 57910/ 125429 | consumed samples: 14824960 | consumed tokens: 30361518080 | elapsed time per iteration (s): 1.05 | learning rate: 1.223E-04 | global batch size: 256 | lm loss: 1.996248E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.529 | TFLOPs: 40.41 | 15: iteration 57920/ 125429 | consumed samples: 14827520 | consumed tokens: 30366760960 | elapsed time per iteration (s): 1.04 | learning rate: 1.223E-04 | global batch size: 256 | lm loss: 1.993408E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.833 | TFLOPs: 40.79 | 15: iteration 57930/ 125429 | consumed samples: 14830080 | consumed tokens: 30372003840 | elapsed time per iteration (s): 1.03 | learning rate: 1.223E-04 | global batch size: 256 | lm loss: 2.006543E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.289 | TFLOPs: 41.03 | 15: iteration 57940/ 125429 | consumed samples: 14832640 | consumed tokens: 30377246720 | elapsed time per iteration (s): 1.03 | learning rate: 1.223E-04 | global batch size: 256 | lm loss: 1.994843E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.310 | TFLOPs: 41.20 | 15: iteration 57950/ 125429 | consumed samples: 14835200 | 
consumed tokens: 30382489600 | elapsed time per iteration (s): 1.08 | learning rate: 1.222E-04 | global batch size: 256 | lm loss: 1.985640E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.894 | TFLOPs: 39.15 | 15: iteration 57960/ 125429 | consumed samples: 14837760 | consumed tokens: 30387732480 | elapsed time per iteration (s): 1.04 | learning rate: 1.222E-04 | global batch size: 256 | lm loss: 1.993904E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.899 | TFLOPs: 40.64 | 15: iteration 57970/ 125429 | consumed samples: 14840320 | consumed tokens: 30392975360 | elapsed time per iteration (s): 1.06 | learning rate: 1.222E-04 | global batch size: 256 | lm loss: 1.996403E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.811 | TFLOPs: 39.80 | 15: iteration 57980/ 125429 | consumed samples: 14842880 | consumed tokens: 30398218240 | elapsed time per iteration (s): 1.04 | learning rate: 1.222E-04 | global batch size: 256 | lm loss: 1.966929E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.062 | TFLOPs: 40.50 | 15: iteration 57990/ 125429 | consumed samples: 14845440 | consumed tokens: 30403461120 | elapsed time per iteration (s): 1.04 | learning rate: 1.221E-04 | global batch size: 256 | lm loss: 2.007906E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.292 | TFLOPs: 40.70 | 0: [2022-11-26 13:15:50,324] [INFO] [logging.py:68:log_dist] [Rank 0] step=58000, skipped=0, lr=[0.00012212588965616287, 0.00012212588965616287, 0.00012212588965616287], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 58000/ 125429 | consumed samples: 14848000 | consumed tokens: 
30408704000 | elapsed time per iteration (s): 1.04 | learning rate: 1.221E-04 | global batch size: 256 | lm loss: 2.019375E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.599 | TFLOPs: 40.59 | 0: steps: 58000 loss: 1.9850 iter time (s): 1.048 samples/sec: 244.178 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 58000 | lm loss value: 1.943876E+00 | lm loss PPL: 6.985778E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 58000 to checkpoints_1b5 0: [2022-11-26 13:15:50,682] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step58000 is begin to save! 0: [2022-11-26 13:15:50,689] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_01-model_00-model_states.pt... 0: [2022-11-26 13:15:51,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_01-model_00-model_states.pt. 0: [2022-11-26 13:15:51,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_03-model_00-model_states.pt... 0: [2022-11-26 13:15:51,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_03-model_00-model_states.pt. 0: [2022-11-26 13:15:51,143] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_04-model_00-model_states.pt... 0: [2022-11-26 13:15:51,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_04-model_00-model_states.pt. 0: [2022-11-26 13:15:51,259] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_05-model_00-model_states.pt... 
0: [2022-11-26 13:15:51,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_05-model_00-model_states.pt. 0: [2022-11-26 13:15:51,375] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_06-model_00-model_states.pt... 0: [2022-11-26 13:15:51,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_06-model_00-model_states.pt. 0: [2022-11-26 13:15:51,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_07-model_00-model_states.pt... 0: [2022-11-26 13:15:51,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_07-model_00-model_states.pt. 0: [2022-11-26 13:15:51,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_08-model_00-model_states.pt... 0: [2022-11-26 13:15:51,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_08-model_00-model_states.pt. 0: [2022-11-26 13:15:51,712] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_09-model_00-model_states.pt... 0: [2022-11-26 13:15:51,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_09-model_00-model_states.pt. 0: [2022-11-26 13:15:51,819] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_10-model_00-model_states.pt... 0: [2022-11-26 13:15:51,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_10-model_00-model_states.pt. 0: [2022-11-26 13:15:51,922] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_11-model_00-model_states.pt... 
0: [2022-11-26 13:15:52,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_11-model_00-model_states.pt. 0: [2022-11-26 13:15:52,028] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_12-model_00-model_states.pt... 0: [2022-11-26 13:15:52,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_12-model_00-model_states.pt. 0: [2022-11-26 13:15:52,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_13-model_00-model_states.pt... 0: [2022-11-26 13:15:52,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_13-model_00-model_states.pt. 0: [2022-11-26 13:15:52,241] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_14-model_00-model_states.pt... 0: [2022-11-26 13:15:52,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_14-model_00-model_states.pt. 0: [2022-11-26 13:15:52,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_15-model_00-model_states.pt... 0: [2022-11-26 13:15:52,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_15-model_00-model_states.pt. 0: [2022-11-26 13:15:52,453] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_16-model_00-model_states.pt... 0: [2022-11-26 13:15:52,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_16-model_00-model_states.pt. 0: [2022-11-26 13:15:52,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_17-model_00-model_states.pt... 
0: [2022-11-26 13:15:52,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_17-model_00-model_states.pt. 0: [2022-11-26 13:15:52,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_18-model_00-model_states.pt... 0: [2022-11-26 13:15:52,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_18-model_00-model_states.pt. 0: [2022-11-26 13:15:52,776] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_19-model_00-model_states.pt... 0: [2022-11-26 13:15:52,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_19-model_00-model_states.pt. 0: [2022-11-26 13:15:52,880] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_20-model_00-model_states.pt... 0: [2022-11-26 13:15:52,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_20-model_00-model_states.pt. 0: [2022-11-26 13:15:52,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_21-model_00-model_states.pt... 0: [2022-11-26 13:15:53,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_21-model_00-model_states.pt. 0: [2022-11-26 13:15:53,093] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_22-model_00-model_states.pt... 0: [2022-11-26 13:15:53,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_22-model_00-model_states.pt. 0: [2022-11-26 13:15:53,202] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_23-model_00-model_states.pt... 
0: [2022-11-26 13:15:53,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_23-model_00-model_states.pt. 0: [2022-11-26 13:15:53,307] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_24-model_00-model_states.pt... 0: [2022-11-26 13:15:53,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_24-model_00-model_states.pt. 0: [2022-11-26 13:15:53,419] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_25-model_00-model_states.pt... 0: [2022-11-26 13:15:53,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_25-model_00-model_states.pt. 0: [2022-11-26 13:15:53,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_26-model_00-model_states.pt... 0: [2022-11-26 13:15:53,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_26-model_00-model_states.pt. 0: [2022-11-26 13:15:53,625] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_27-model_00-model_states.pt... 0: [2022-11-26 13:15:53,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_27-model_00-model_states.pt. 0: [2022-11-26 13:15:53,727] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_28-model_00-model_states.pt... 0: [2022-11-26 13:15:53,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_28-model_00-model_states.pt. 0: [2022-11-26 13:15:53,832] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_29-model_00-model_states.pt... 
0: [2022-11-26 13:15:53,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_29-model_00-model_states.pt. 0: [2022-11-26 13:15:53,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_30-model_00-model_states.pt... 0: [2022-11-26 13:15:54,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_30-model_00-model_states.pt. 0: [2022-11-26 13:15:54,040] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/layer_32-model_00-model_states.pt... 0: [2022-11-26 13:15:54,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/layer_32-model_00-model_states.pt. 0: [2022-11-26 13:15:54,045] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step58000/mp_rank_00_model_states.pt 0: [2022-11-26 13:15:54,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/mp_rank_00_model_states.pt... 0: [2022-11-26 13:15:54,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/mp_rank_00_model_states.pt. 0: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
1: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
10: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
14: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
7: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
6: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
2: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
8: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
5: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
13: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 
9: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:15:54,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step58000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:15:54,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:15:54,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 13:15:54,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 0: [2022-11-26 13:15:54,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:15:54,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 13:15:54,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 12: [2022-11-26 13:15:54,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:15:54,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 10: [2022-11-26 13:15:54,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
12: [2022-11-26 13:15:54,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 10: [2022-11-26 13:15:54,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 13:15:54,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 0: [2022-11-26 13:15:54,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:15:54,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 13:15:54,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 0: [2022-11-26 13:15:54,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:15:54,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 13:15:54,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 12: [2022-11-26 13:15:54,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:15:54,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 13:15:54,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
10: [2022-11-26 13:15:54,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:15:54,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 13:15:54,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 10: [2022-11-26 13:15:54,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:15:54,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 13:15:54,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 7: [2022-11-26 13:15:54,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:15:54,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 13:15:54,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 6: [2022-11-26 13:15:54,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:15:54,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 13:15:54,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
2: [2022-11-26 13:15:54,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:15:54,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:15:54,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 13:15:54,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 13:15:54,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 2: [2022-11-26 13:15:54,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 15: [2022-11-26 13:15:54,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:15:54,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 13:15:54,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 15: [2022-11-26 13:15:54,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:15:54,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 13:15:54,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
1: [2022-11-26 13:15:54,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:15:54,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:15:54,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 13:15:54,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 7: [2022-11-26 13:15:54,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:15:54,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:15:54,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 13:15:54,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 7: [2022-11-26 13:15:54,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 13:15:54,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 7: [2022-11-26 13:15:54,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 13:15:54,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 10: [2022-11-26 13:15:54,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:15:54,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 10: [2022-11-26 13:15:54,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 13:15:54,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 0: [2022-11-26 13:15:54,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:15:54,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 13:15:54,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 10: [2022-11-26 13:15:54,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:15:54,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 10: [2022-11-26 13:15:54,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 1: [2022-11-26 13:15:54,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 10: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
15: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:15:54,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:15:54,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:15:54,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 10: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 10: [2022-11-26 13:15:54,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 0: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
0: [2022-11-26 13:15:54,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 10: [2022-11-26 13:15:54,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 0: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 13: [2022-11-26 13:15:54,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 13:15:54,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 10: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 13: [2022-11-26 13:15:54,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:15:54,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:15:54,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 13:15:54,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 13:15:54,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 13: [2022-11-26 13:15:54,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
1: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:15:54,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 13:15:54,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 8: [2022-11-26 13:15:54,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:15:54,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 1: [2022-11-26 13:15:54,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 8: [2022-11-26 13:15:54,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 1: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 15: [2022-11-26 13:15:54,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:15:54,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:15:54,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 13:15:54,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
3: [2022-11-26 13:15:54,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:15:54,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:15:54,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 10: [2022-11-26 13:15:54,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 3: [2022-11-26 13:15:54,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 1: [2022-11-26 13:15:54,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:15:54,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 2: [2022-11-26 13:15:54,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:15:54,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 13:15:54,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 14: [2022-11-26 13:15:54,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 13:15:54,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 13:15:54,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 14: [2022-11-26 13:15:54,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:15:54,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 13:15:54,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 14: [2022-11-26 13:15:54,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:15:54,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 13:15:54,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 14: [2022-11-26 13:15:54,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:15:54,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:15:54,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 13:15:54,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 13:15:54,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 13:15:54,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 13:15:54,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 14: [2022-11-26 13:15:54,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 14: [2022-11-26 13:15:54,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 5: [2022-11-26 13:15:54,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:15:54,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:15:54,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 13:15:54,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 13:15:54,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 13:15:54,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 13:15:54,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 5: [2022-11-26 13:15:54,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 5: [2022-11-26 13:15:54,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 5: [2022-11-26 13:15:54,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:15:54,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 13:15:54,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 5: [2022-11-26 13:15:54,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:15:54,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 13:15:54,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
1: [2022-11-26 13:15:54,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 13:15:54,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 1: [2022-11-26 13:15:54,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:15:54,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:15:54,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:15:54,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:15:54,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 13:15:54,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 13:15:54,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 13:15:54,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 6: [2022-11-26 13:15:54,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 6: [2022-11-26 13:15:54,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
15: [2022-11-26 13:15:54,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 13:15:54,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 1: [2022-11-26 13:15:54,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 15: [2022-11-26 13:15:54,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:15:54,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 15: [2022-11-26 13:15:54,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 13:15:54,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 15: [2022-11-26 13:15:54,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:15:54,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:15:54,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 12: [2022-11-26 13:15:54,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 13:15:54,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 13:15:54,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 4: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:15:54,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 4: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:15:54,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 13:15:54,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 4: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 4: [2022-11-26 13:15:54,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 13:15:54,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 13:15:54,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 4: [2022-11-26 13:15:54,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:15:54,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 13:15:54,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 15: [2022-11-26 13:15:54,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 13:15:54,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 12: [2022-11-26 13:15:54,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:15:54,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 13:15:54,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:15:54,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:15:54,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 13:15:54,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 12: [2022-11-26 13:15:54,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 15: [2022-11-26 13:15:54,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:15:54,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 13:15:54,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 12: [2022-11-26 13:15:54,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 13:15:54,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:15:54,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 12: [2022-11-26 13:15:54,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 12: [2022-11-26 13:15:54,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 15: [2022-11-26 13:15:54,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
12: [2022-11-26 13:15:54,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 13:15:54,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 1: [2022-11-26 13:15:54,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 3: [2022-11-26 13:15:54,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:15:54,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:15:54,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 2: [2022-11-26 13:15:54,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 3: [2022-11-26 13:15:54,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 2: [2022-11-26 13:15:54,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 11: [2022-11-26 13:15:54,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:15:54,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:15:54,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 13:15:54,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:15:54,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:15:54,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:15:54,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 13:15:54,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 5: [2022-11-26 13:15:54,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:15:54,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 13:15:54,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 5: [2022-11-26 13:15:54,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:15:54,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 3: [2022-11-26 13:15:54,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:15:54,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
3: [2022-11-26 13:15:54,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 13:15:54,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 5: [2022-11-26 13:15:54,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:15:54,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 13:15:54,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 1: [2022-11-26 13:15:54,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 13:15:54,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 4: [2022-11-26 13:15:54,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:15:54,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:15:54,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 13:15:54,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 13:15:54,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
4: [2022-11-26 13:15:54,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 14: [2022-11-26 13:15:54,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:15:54,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 13:15:54,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 0: [2022-11-26 13:15:54,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:15:54,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:15:54,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:15:54,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:15:54,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 13:15:54,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 13:15:54,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 13:15:54,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 13:15:54,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 13:15:54,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 3: [2022-11-26 13:15:54,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 3: [2022-11-26 13:15:54,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 3: [2022-11-26 13:15:54,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 8: [2022-11-26 13:15:54,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 13:15:54,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 13:15:54,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 13: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 13: [2022-11-26 13:15:54,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:15:54,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 13:15:54,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 13: [2022-11-26 13:15:54,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:15:54,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:15:54,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 13:15:54,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 13:15:54,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
13: [2022-11-26 13:15:54,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 8: [2022-11-26 13:15:54,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 13:15:54,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 8: [2022-11-26 13:15:54,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:15:54,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:15:54,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 11: [2022-11-26 13:15:54,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:15:54,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 11: [2022-11-26 13:15:54,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 8: [2022-11-26 13:15:54,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
11: [2022-11-26 13:15:54,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 13:15:54,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 13:15:54,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 13:15:54,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 8: [2022-11-26 13:15:54,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 11: [2022-11-26 13:15:54,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 8: [2022-11-26 13:15:54,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:15:54,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 8: [2022-11-26 13:15:54,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 11: [2022-11-26 13:15:54,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 8: [2022-11-26 13:15:54,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 11: [2022-11-26 13:15:54,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 11: [2022-11-26 13:15:54,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
8: [2022-11-26 13:15:54,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:15:54,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:15:54,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 11: [2022-11-26 13:15:54,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 8: [2022-11-26 13:15:54,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 11: [2022-11-26 13:15:54,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 11: [2022-11-26 13:15:54,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:15:54,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 13:15:54,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 11: [2022-11-26 13:15:54,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:15:54,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 13:15:54,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
8: [2022-11-26 13:15:54,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:15:54,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 13:15:54,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 1: [2022-11-26 13:15:54,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:15:54,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 13:15:54,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 4: [2022-11-26 13:15:54,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:15:54,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 13:15:54,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 6: [2022-11-26 13:15:54,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:15:54,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 13:15:54,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
6: [2022-11-26 13:15:54,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:15:54,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 13:15:54,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 6: [2022-11-26 13:15:54,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:15:54,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 13:15:54,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 7: [2022-11-26 13:15:54,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:15:54,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 13:15:54,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 0: [2022-11-26 13:15:54,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:15:54,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 13:15:54,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
15: [2022-11-26 13:15:54,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:15:54,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 13:15:54,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 2: [2022-11-26 13:15:54,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:15:54,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 13:15:54,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 2: [2022-11-26 13:15:54,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:15:54,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 13:15:54,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 7: [2022-11-26 13:15:54,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:15:54,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:15:54,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 13:15:54,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 13:15:54,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 13:15:54,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 7: [2022-11-26 13:15:54,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 13:15:54,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 7: [2022-11-26 13:15:54,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 2: [2022-11-26 13:15:54,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:15:54,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 13:15:54,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 6: [2022-11-26 13:15:54,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:15:54,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 13:15:54,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
2: [2022-11-26 13:15:54,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:15:54,384] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 13:15:54,384] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 0: [2022-11-26 13:15:54,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 13:15:54,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 9: [2022-11-26 13:15:54,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:15:54,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:15:54,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:15:54,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:15:54,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:15:54,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:15:54,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 13:15:54,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:15:54,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 13:15:54,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 13:15:54,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 13:15:54,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 13:15:54,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 13:15:54,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 13:15:54,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 9: [2022-11-26 13:15:54,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 13:15:54,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step58000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 13:15:54,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 
9: [2022-11-26 13:15:54,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 9: [2022-11-26 13:15:54,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 9: [2022-11-26 13:15:54,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 9: [2022-11-26 13:15:54,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 9: [2022-11-26 13:15:54,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 9: [2022-11-26 13:15:54,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step58000 is ready now! 0: successfully saved checkpoint at iteration 58000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3764.73 15: iteration 58010/ 125429 | consumed samples: 14850560 | consumed tokens: 30413946880 | elapsed time per iteration (s): 1.55 | learning rate: 1.221E-04 | global batch size: 256 | lm loss: 1.968144E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 164.797 | TFLOPs: 27.23 | 15: iteration 58020/ 125429 | consumed samples: 14853120 | consumed tokens: 30419189760 | elapsed time per iteration (s): 1.06 | learning rate: 1.221E-04 | global batch size: 256 | lm loss: 1.950889E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.757 | TFLOPs: 39.79 | 15: iteration 58030/ 125429 | consumed samples: 14855680 | consumed tokens: 30424432640 | elapsed time per iteration (s): 1.02 | learning rate: 1.221E-04 | global batch size: 256 | lm loss: 1.976061E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.836 | TFLOPs: 41.29 | 15: iteration 58040/ 125429 | consumed samples: 14858240 | consumed tokens: 
30429675520 | elapsed time per iteration (s): 1.05 | learning rate: 1.220E-04 | global batch size: 256 | lm loss: 1.998405E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.872 | TFLOPs: 40.14 | 15: iteration 58050/ 125429 | consumed samples: 14860800 | consumed tokens: 30434918400 | elapsed time per iteration (s): 1.08 | learning rate: 1.220E-04 | global batch size: 256 | lm loss: 2.015657E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.183 | TFLOPs: 39.20 | 15: iteration 58060/ 125429 | consumed samples: 14863360 | consumed tokens: 30440161280 | elapsed time per iteration (s): 1.07 | learning rate: 1.220E-04 | global batch size: 256 | lm loss: 1.997940E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.101 | TFLOPs: 39.68 | 15: iteration 58070/ 125429 | consumed samples: 14865920 | consumed tokens: 30445404160 | elapsed time per iteration (s): 1.06 | learning rate: 1.220E-04 | global batch size: 256 | lm loss: 1.968028E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.025 | TFLOPs: 40.00 | 15: iteration 58080/ 125429 | consumed samples: 14868480 | consumed tokens: 30450647040 | elapsed time per iteration (s): 1.08 | learning rate: 1.219E-04 | global batch size: 256 | lm loss: 2.002999E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.509 | TFLOPs: 39.08 | 15: iteration 58090/ 125429 | consumed samples: 14871040 | consumed tokens: 30455889920 | elapsed time per iteration (s): 1.12 | learning rate: 1.219E-04 | global batch size: 256 | lm loss: 2.016962E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 229.114 | TFLOPs: 37.86 | 15: iteration 58100/ 125429 | consumed samples: 14873600 | consumed tokens: 30461132800 | elapsed time per iteration (s): 1.04 | learning rate: 1.219E-04 | global batch size: 256 | lm loss: 2.009079E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.263 | TFLOPs: 40.70 | 15: iteration 58110/ 125429 | consumed samples: 14876160 | consumed tokens: 30466375680 | elapsed time per iteration (s): 1.05 | learning rate: 1.219E-04 | global batch size: 256 | lm loss: 1.968434E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.806 | TFLOPs: 40.13 | 15: iteration 58120/ 125429 | consumed samples: 14878720 | consumed tokens: 30471618560 | elapsed time per iteration (s): 1.03 | learning rate: 1.219E-04 | global batch size: 256 | lm loss: 1.989911E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.960 | TFLOPs: 40.98 | 15: iteration 58130/ 125429 | consumed samples: 14881280 | consumed tokens: 30476861440 | elapsed time per iteration (s): 1.06 | learning rate: 1.218E-04 | global batch size: 256 | lm loss: 1.972545E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.545 | TFLOPs: 39.92 | 15: iteration 58140/ 125429 | consumed samples: 14883840 | consumed tokens: 30482104320 | elapsed time per iteration (s): 1.05 | learning rate: 1.218E-04 | global batch size: 256 | lm loss: 2.014482E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.675 | TFLOPs: 40.10 | 15: iteration 58150/ 125429 | consumed samples: 14886400 | consumed tokens: 30487347200 | elapsed time per iteration (s): 1.05 | learning rate: 1.218E-04 | global batch size: 256 | lm loss: 1.963405E+00 | grad 
norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.821 | TFLOPs: 40.29 | 15: iteration 58160/ 125429 | consumed samples: 14888960 | consumed tokens: 30492590080 | elapsed time per iteration (s): 1.08 | learning rate: 1.218E-04 | global batch size: 256 | lm loss: 2.005758E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.056 | TFLOPs: 39.18 | 15: iteration 58170/ 125429 | consumed samples: 14891520 | consumed tokens: 30497832960 | elapsed time per iteration (s): 1.05 | learning rate: 1.217E-04 | global batch size: 256 | lm loss: 1.978139E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.451 | TFLOPs: 40.23 | 15: iteration 58180/ 125429 | consumed samples: 14894080 | consumed tokens: 30503075840 | elapsed time per iteration (s): 1.04 | learning rate: 1.217E-04 | global batch size: 256 | lm loss: 1.960168E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.610 | TFLOPs: 40.75 | 15: iteration 58190/ 125429 | consumed samples: 14896640 | consumed tokens: 30508318720 | elapsed time per iteration (s): 1.10 | learning rate: 1.217E-04 | global batch size: 256 | lm loss: 2.031734E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.621 | TFLOPs: 38.44 | 15: iteration 58200/ 125429 | consumed samples: 14899200 | consumed tokens: 30513561600 | elapsed time per iteration (s): 1.09 | learning rate: 1.217E-04 | global batch size: 256 | lm loss: 2.004583E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.941 | TFLOPs: 38.99 | 15: iteration 58210/ 125429 | consumed samples: 14901760 | consumed tokens: 30518804480 | elapsed time 
per iteration (s): 1.05 | learning rate: 1.217E-04 | global batch size: 256 | lm loss: 1.982100E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.905 | TFLOPs: 40.31 | 15: iteration 58220/ 125429 | consumed samples: 14904320 | consumed tokens: 30524047360 | elapsed time per iteration (s): 1.07 | learning rate: 1.216E-04 | global batch size: 256 | lm loss: 1.985151E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.316 | TFLOPs: 39.55 | 15: iteration 58230/ 125429 | consumed samples: 14906880 | consumed tokens: 30529290240 | elapsed time per iteration (s): 1.06 | learning rate: 1.216E-04 | global batch size: 256 | lm loss: 2.013938E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.456 | TFLOPs: 39.74 | 15: iteration 58240/ 125429 | consumed samples: 14909440 | consumed tokens: 30534533120 | elapsed time per iteration (s): 1.08 | learning rate: 1.216E-04 | global batch size: 256 | lm loss: 1.993597E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.206 | TFLOPs: 39.03 | 15: iteration 58250/ 125429 | consumed samples: 14912000 | consumed tokens: 30539776000 | elapsed time per iteration (s): 1.03 | learning rate: 1.216E-04 | global batch size: 256 | lm loss: 2.010889E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.356 | TFLOPs: 40.88 | 15: iteration 58260/ 125429 | consumed samples: 14914560 | consumed tokens: 30545018880 | elapsed time per iteration (s): 1.07 | learning rate: 1.215E-04 | global batch size: 256 | lm loss: 1.996214E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.774 | TFLOPs: 
39.46 | 15: iteration 58270/ 125429 | consumed samples: 14917120 | consumed tokens: 30550261760 | elapsed time per iteration (s): 1.08 | learning rate: 1.215E-04 | global batch size: 256 | lm loss: 1.945677E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.983 | TFLOPs: 39.16 | 15: iteration 58280/ 125429 | consumed samples: 14919680 | consumed tokens: 30555504640 | elapsed time per iteration (s): 1.03 | learning rate: 1.215E-04 | global batch size: 256 | lm loss: 1.950234E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.355 | TFLOPs: 41.04 | 15: iteration 58290/ 125429 | consumed samples: 14922240 | consumed tokens: 30560747520 | elapsed time per iteration (s): 1.04 | learning rate: 1.215E-04 | global batch size: 256 | lm loss: 2.020744E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.374 | TFLOPs: 40.55 | 15: iteration 58300/ 125429 | consumed samples: 14924800 | consumed tokens: 30565990400 | elapsed time per iteration (s): 1.04 | learning rate: 1.214E-04 | global batch size: 256 | lm loss: 1.970025E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.323 | TFLOPs: 40.54 | 15: iteration 58310/ 125429 | consumed samples: 14927360 | consumed tokens: 30571233280 | elapsed time per iteration (s): 1.04 | learning rate: 1.214E-04 | global batch size: 256 | lm loss: 1.992038E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.222 | TFLOPs: 40.86 | 15: iteration 58320/ 125429 | consumed samples: 14929920 | consumed tokens: 30576476160 | elapsed time per iteration (s): 1.05 | learning rate: 1.214E-04 | global batch size: 256 | lm loss: 2.002250E+00 | grad norm: 0.142 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.549 | TFLOPs: 40.41 | 15: iteration 58330/ 125429 | consumed samples: 14932480 | consumed tokens: 30581719040 | elapsed time per iteration (s): 1.08 | learning rate: 1.214E-04 | global batch size: 256 | lm loss: 1.972601E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.142 | TFLOPs: 39.19 | 15: iteration 58340/ 125429 | consumed samples: 14935040 | consumed tokens: 30586961920 | elapsed time per iteration (s): 1.04 | learning rate: 1.214E-04 | global batch size: 256 | lm loss: 1.961559E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.573 | TFLOPs: 40.75 | 15: iteration 58350/ 125429 | consumed samples: 14937600 | consumed tokens: 30592204800 | elapsed time per iteration (s): 1.04 | learning rate: 1.213E-04 | global batch size: 256 | lm loss: 2.002211E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.224 | TFLOPs: 40.86 | 15: iteration 58360/ 125429 | consumed samples: 14940160 | consumed tokens: 30597447680 | elapsed time per iteration (s): 1.05 | learning rate: 1.213E-04 | global batch size: 256 | lm loss: 2.007923E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.928 | TFLOPs: 40.15 | 15: iteration 58370/ 125429 | consumed samples: 14942720 | consumed tokens: 30602690560 | elapsed time per iteration (s): 1.07 | learning rate: 1.213E-04 | global batch size: 256 | lm loss: 2.019121E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.653 | TFLOPs: 39.44 | 15: iteration 58380/ 125429 | consumed samples: 14945280 | consumed tokens: 30607933440 | elapsed time per iteration (s): 1.11 | 
learning rate: 1.213E-04 | global batch size: 256 | lm loss: 1.966273E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.169 | TFLOPs: 38.04 | 15: iteration 58390/ 125429 | consumed samples: 14947840 | consumed tokens: 30613176320 | elapsed time per iteration (s): 1.05 | learning rate: 1.212E-04 | global batch size: 256 | lm loss: 1.992898E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.753 | TFLOPs: 40.45 | 15: iteration 58400/ 125429 | consumed samples: 14950400 | consumed tokens: 30618419200 | elapsed time per iteration (s): 1.04 | learning rate: 1.212E-04 | global batch size: 256 | lm loss: 1.972371E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.186 | TFLOPs: 40.68 | 15: iteration 58410/ 125429 | consumed samples: 14952960 | consumed tokens: 30623662080 | elapsed time per iteration (s): 1.03 | learning rate: 1.212E-04 | global batch size: 256 | lm loss: 2.003646E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.438 | TFLOPs: 40.89 | 15: iteration 58420/ 125429 | consumed samples: 14955520 | consumed tokens: 30628904960 | elapsed time per iteration (s): 1.05 | learning rate: 1.212E-04 | global batch size: 256 | lm loss: 2.018514E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.010 | TFLOPs: 40.16 | 15: iteration 58430/ 125429 | consumed samples: 14958080 | consumed tokens: 30634147840 | elapsed time per iteration (s): 1.06 | learning rate: 1.212E-04 | global batch size: 256 | lm loss: 1.972799E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.670 | TFLOPs: 39.94 | 15: iteration 58440/ 
125429 | consumed samples: 14960640 | consumed tokens: 30639390720 | elapsed time per iteration (s): 1.02 | learning rate: 1.211E-04 | global batch size: 256 | lm loss: 1.978618E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.936 | TFLOPs: 41.47 | 15: iteration 58450/ 125429 | consumed samples: 14963200 | consumed tokens: 30644633600 | elapsed time per iteration (s): 1.07 | learning rate: 1.211E-04 | global batch size: 256 | lm loss: 1.996062E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.715 | TFLOPs: 39.45 | 15: iteration 58460/ 125429 | consumed samples: 14965760 | consumed tokens: 30649876480 | elapsed time per iteration (s): 1.04 | learning rate: 1.211E-04 | global batch size: 256 | lm loss: 1.994453E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.191 | TFLOPs: 40.68 | 15: iteration 58470/ 125429 | consumed samples: 14968320 | consumed tokens: 30655119360 | elapsed time per iteration (s): 1.12 | learning rate: 1.211E-04 | global batch size: 256 | lm loss: 1.998998E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.281 | TFLOPs: 37.73 | 15: iteration 58480/ 125429 | consumed samples: 14970880 | consumed tokens: 30660362240 | elapsed time per iteration (s): 1.03 | learning rate: 1.210E-04 | global batch size: 256 | lm loss: 1.975923E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.170 | TFLOPs: 41.18 | 15: iteration 58490/ 125429 | consumed samples: 14973440 | consumed tokens: 30665605120 | elapsed time per iteration (s): 1.07 | learning rate: 1.210E-04 | global batch size: 256 | lm loss: 1.989281E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 238.393 | TFLOPs: 39.40 | 15: iteration 58500/ 125429 | consumed samples: 14976000 | consumed tokens: 30670848000 | elapsed time per iteration (s): 1.03 | learning rate: 1.210E-04 | global batch size: 256 | lm loss: 1.988267E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.346 | TFLOPs: 41.04 | 15: iteration 58510/ 125429 | consumed samples: 14978560 | consumed tokens: 30676090880 | elapsed time per iteration (s): 1.05 | learning rate: 1.210E-04 | global batch size: 256 | lm loss: 1.987442E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.906 | TFLOPs: 40.14 | 15: iteration 58520/ 125429 | consumed samples: 14981120 | consumed tokens: 30681333760 | elapsed time per iteration (s): 1.04 | learning rate: 1.210E-04 | global batch size: 256 | lm loss: 2.007619E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.051 | TFLOPs: 40.66 | 15: iteration 58530/ 125429 | consumed samples: 14983680 | consumed tokens: 30686576640 | elapsed time per iteration (s): 1.03 | learning rate: 1.209E-04 | global batch size: 256 | lm loss: 2.003325E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.504 | TFLOPs: 41.07 | 15: iteration 58540/ 125429 | consumed samples: 14986240 | consumed tokens: 30691819520 | elapsed time per iteration (s): 1.05 | learning rate: 1.209E-04 | global batch size: 256 | lm loss: 1.996356E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.477 | TFLOPs: 40.24 | 15: iteration 58550/ 125429 | consumed samples: 14988800 | consumed tokens: 30697062400 | elapsed time per iteration (s): 1.03 | learning rate: 
1.209E-04 | global batch size: 256 | lm loss: 2.000214E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.698 | TFLOPs: 41.26 | 15: iteration 58560/ 125429 | consumed samples: 14991360 | consumed tokens: 30702305280 | elapsed time per iteration (s): 1.08 | learning rate: 1.209E-04 | global batch size: 256 | lm loss: 1.996325E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.210 | TFLOPs: 39.20 | 15: iteration 58570/ 125429 | consumed samples: 14993920 | consumed tokens: 30707548160 | elapsed time per iteration (s): 1.05 | learning rate: 1.208E-04 | global batch size: 256 | lm loss: 1.974372E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.050 | TFLOPs: 40.33 | 15: iteration 58580/ 125429 | consumed samples: 14996480 | consumed tokens: 30712791040 | elapsed time per iteration (s): 1.09 | learning rate: 1.208E-04 | global batch size: 256 | lm loss: 1.985293E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.181 | TFLOPs: 38.87 | 15: iteration 58590/ 125429 | consumed samples: 14999040 | consumed tokens: 30718033920 | elapsed time per iteration (s): 1.05 | learning rate: 1.208E-04 | global batch size: 256 | lm loss: 1.981968E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.628 | TFLOPs: 40.26 | 15: iteration 58600/ 125429 | consumed samples: 15001600 | consumed tokens: 30723276800 | elapsed time per iteration (s): 1.03 | learning rate: 1.208E-04 | global batch size: 256 | lm loss: 1.997739E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.137 | TFLOPs: 41.01 | 15: iteration 58610/ 125429 | 
consumed samples: 15004160 | consumed tokens: 30728519680 | elapsed time per iteration (s): 1.05 | learning rate: 1.207E-04 | global batch size: 256 | lm loss: 1.987663E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.406 | TFLOPs: 40.39 | 15: iteration 58620/ 125429 | consumed samples: 15006720 | consumed tokens: 30733762560 | elapsed time per iteration (s): 1.04 | learning rate: 1.207E-04 | global batch size: 256 | lm loss: 2.005088E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.376 | TFLOPs: 40.55 | 15: iteration 58630/ 125429 | consumed samples: 15009280 | consumed tokens: 30739005440 | elapsed time per iteration (s): 1.03 | learning rate: 1.207E-04 | global batch size: 256 | lm loss: 2.000299E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.629 | TFLOPs: 41.09 | 15: iteration 58640/ 125429 | consumed samples: 15011840 | consumed tokens: 30744248320 | elapsed time per iteration (s): 1.04 | learning rate: 1.207E-04 | global batch size: 256 | lm loss: 1.993827E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.060 | TFLOPs: 40.83 | 15: iteration 58650/ 125429 | consumed samples: 15014400 | consumed tokens: 30749491200 | elapsed time per iteration (s): 1.10 | learning rate: 1.207E-04 | global batch size: 256 | lm loss: 1.990269E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.699 | TFLOPs: 38.62 | 15: iteration 58660/ 125429 | consumed samples: 15016960 | consumed tokens: 30754734080 | elapsed time per iteration (s): 1.03 | learning rate: 1.206E-04 | global batch size: 256 | lm loss: 1.999244E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 249.719 | TFLOPs: 41.27 | 15: iteration 58670/ 125429 | consumed samples: 15019520 | consumed tokens: 30759976960 | elapsed time per iteration (s): 1.05 | learning rate: 1.206E-04 | global batch size: 256 | lm loss: 1.983185E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.308 | TFLOPs: 40.21 | 15: iteration 58680/ 125429 | consumed samples: 15022080 | consumed tokens: 30765219840 | elapsed time per iteration (s): 1.03 | learning rate: 1.206E-04 | global batch size: 256 | lm loss: 2.001142E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.728 | TFLOPs: 40.94 | 15: iteration 58690/ 125429 | consumed samples: 15024640 | consumed tokens: 30770462720 | elapsed time per iteration (s): 1.09 | learning rate: 1.206E-04 | global batch size: 256 | lm loss: 1.973938E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.127 | TFLOPs: 38.69 | 15: iteration 58700/ 125429 | consumed samples: 15027200 | consumed tokens: 30775705600 | elapsed time per iteration (s): 1.03 | learning rate: 1.205E-04 | global batch size: 256 | lm loss: 2.003367E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.910 | TFLOPs: 40.97 | 15: iteration 58710/ 125429 | consumed samples: 15029760 | consumed tokens: 30780948480 | elapsed time per iteration (s): 1.06 | learning rate: 1.205E-04 | global batch size: 256 | lm loss: 1.995245E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.030 | TFLOPs: 39.83 | 15: iteration 58720/ 125429 | consumed samples: 15032320 | consumed tokens: 30786191360 | elapsed time per iteration (s): 1.03 | learning rate: 1.205E-04 | global batch 
size: 256 | lm loss: 1.993061E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.589 | TFLOPs: 40.92 | 15: iteration 58730/ 125429 | consumed samples: 15034880 | consumed tokens: 30791434240 | elapsed time per iteration (s): 1.03 | learning rate: 1.205E-04 | global batch size: 256 | lm loss: 1.997603E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.845 | TFLOPs: 40.96 | 15: iteration 58740/ 125429 | consumed samples: 15037440 | consumed tokens: 30796677120 | elapsed time per iteration (s): 1.05 | learning rate: 1.205E-04 | global batch size: 256 | lm loss: 2.029149E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.630 | TFLOPs: 40.43 | 15: iteration 58750/ 125429 | consumed samples: 15040000 | consumed tokens: 30801920000 | elapsed time per iteration (s): 1.05 | learning rate: 1.204E-04 | global batch size: 256 | lm loss: 1.997674E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.246 | TFLOPs: 40.36 | 15: iteration 58760/ 125429 | consumed samples: 15042560 | consumed tokens: 30807162880 | elapsed time per iteration (s): 1.05 | learning rate: 1.204E-04 | global batch size: 256 | lm loss: 1.991790E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.118 | TFLOPs: 40.18 | 15: iteration 58770/ 125429 | consumed samples: 15045120 | consumed tokens: 30812405760 | elapsed time per iteration (s): 1.03 | learning rate: 1.204E-04 | global batch size: 256 | lm loss: 1.994044E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.519 | TFLOPs: 40.90 | 15: iteration 58780/ 125429 | consumed samples: 15047680 | 
consumed tokens: 30817648640 | elapsed time per iteration (s): 1.04 | learning rate: 1.204E-04 | global batch size: 256 | lm loss: 2.020399E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.414 | TFLOPs: 40.72 | 15: iteration 58790/ 125429 | consumed samples: 15050240 | consumed tokens: 30822891520 | elapsed time per iteration (s): 1.04 | learning rate: 1.203E-04 | global batch size: 256 | lm loss: 2.027799E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.242 | TFLOPs: 40.53 | 15: iteration 58800/ 125429 | consumed samples: 15052800 | consumed tokens: 30828134400 | elapsed time per iteration (s): 1.04 | learning rate: 1.203E-04 | global batch size: 256 | lm loss: 1.995103E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.271 | TFLOPs: 40.53 | 15: iteration 58810/ 125429 | consumed samples: 15055360 | consumed tokens: 30833377280 | elapsed time per iteration (s): 1.05 | learning rate: 1.203E-04 | global batch size: 256 | lm loss: 1.997559E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.466 | TFLOPs: 40.40 | 15: iteration 58820/ 125429 | consumed samples: 15057920 | consumed tokens: 30838620160 | elapsed time per iteration (s): 1.06 | learning rate: 1.203E-04 | global batch size: 256 | lm loss: 2.007949E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.633 | TFLOPs: 39.93 | 15: iteration 58830/ 125429 | consumed samples: 15060480 | consumed tokens: 30843863040 | elapsed time per iteration (s): 1.02 | learning rate: 1.203E-04 | global batch size: 256 | lm loss: 1.982567E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 250.592 | TFLOPs: 41.41 | 15: iteration 58840/ 125429 | consumed samples: 15063040 | consumed tokens: 30849105920 | elapsed time per iteration (s): 1.07 | learning rate: 1.202E-04 | global batch size: 256 | lm loss: 1.998429E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.788 | TFLOPs: 39.63 | 15: iteration 58850/ 125429 | consumed samples: 15065600 | consumed tokens: 30854348800 | elapsed time per iteration (s): 1.05 | learning rate: 1.202E-04 | global batch size: 256 | lm loss: 1.989580E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.518 | TFLOPs: 40.24 | 15: iteration 58860/ 125429 | consumed samples: 15068160 | consumed tokens: 30859591680 | elapsed time per iteration (s): 1.03 | learning rate: 1.202E-04 | global batch size: 256 | lm loss: 1.991570E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.351 | TFLOPs: 41.21 | 15: iteration 58870/ 125429 | consumed samples: 15070720 | consumed tokens: 30864834560 | elapsed time per iteration (s): 1.04 | learning rate: 1.202E-04 | global batch size: 256 | lm loss: 2.005602E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.922 | TFLOPs: 40.64 | 15: iteration 58880/ 125429 | consumed samples: 15073280 | consumed tokens: 30870077440 | elapsed time per iteration (s): 1.02 | learning rate: 1.201E-04 | global batch size: 256 | lm loss: 1.948753E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.407 | TFLOPs: 41.55 | 15: iteration 58890/ 125429 | consumed samples: 15075840 | consumed tokens: 30875320320 | elapsed time per iteration (s): 1.03 | learning rate: 1.201E-04 | global batch size: 256 | lm loss: 
1.968917E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.512 | TFLOPs: 41.07 | 15: iteration 58900/ 125429 | consumed samples: 15078400 | consumed tokens: 30880563200 | elapsed time per iteration (s): 1.05 | learning rate: 1.201E-04 | global batch size: 256 | lm loss: 2.028535E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.962 | TFLOPs: 40.48 | 15: iteration 58910/ 125429 | consumed samples: 15080960 | consumed tokens: 30885806080 | elapsed time per iteration (s): 1.02 | learning rate: 1.201E-04 | global batch size: 256 | lm loss: 2.024952E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.501 | TFLOPs: 41.56 | 15: iteration 58920/ 125429 | consumed samples: 15083520 | consumed tokens: 30891048960 | elapsed time per iteration (s): 1.04 | learning rate: 1.200E-04 | global batch size: 256 | lm loss: 1.997861E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.067 | TFLOPs: 40.50 | 15: iteration 58930/ 125429 | consumed samples: 15086080 | consumed tokens: 30896291840 | elapsed time per iteration (s): 1.06 | learning rate: 1.200E-04 | global batch size: 256 | lm loss: 1.983486E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.545 | TFLOPs: 40.08 | 15: iteration 58940/ 125429 | consumed samples: 15088640 | consumed tokens: 30901534720 | elapsed time per iteration (s): 1.07 | learning rate: 1.200E-04 | global batch size: 256 | lm loss: 2.002541E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.296 | TFLOPs: 39.55 | 15: iteration 58950/ 125429 | consumed samples: 15091200 | consumed tokens: 
30906777600 | elapsed time per iteration (s): 1.03 | learning rate: 1.200E-04 | global batch size: 256 | lm loss: 1.991981E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.815 | TFLOPs: 40.95 | 15: iteration 58960/ 125429 | consumed samples: 15093760 | consumed tokens: 30912020480 | elapsed time per iteration (s): 1.10 | learning rate: 1.200E-04 | global batch size: 256 | lm loss: 1.994763E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.221 | TFLOPs: 38.38 | 15: iteration 58970/ 125429 | consumed samples: 15096320 | consumed tokens: 30917263360 | elapsed time per iteration (s): 1.03 | learning rate: 1.199E-04 | global batch size: 256 | lm loss: 2.024932E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.707 | TFLOPs: 40.94 | 15: iteration 58980/ 125429 | consumed samples: 15098880 | consumed tokens: 30922506240 | elapsed time per iteration (s): 1.04 | learning rate: 1.199E-04 | global batch size: 256 | lm loss: 1.992745E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.198 | TFLOPs: 40.85 | 15: iteration 58990/ 125429 | consumed samples: 15101440 | consumed tokens: 30927749120 | elapsed time per iteration (s): 1.16 | learning rate: 1.199E-04 | global batch size: 256 | lm loss: 1.984335E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 220.580 | TFLOPs: 36.45 | 15: iteration 59000/ 125429 | consumed samples: 15104000 | consumed tokens: 30932992000 | elapsed time per iteration (s): 1.02 | learning rate: 1.199E-04 | global batch size: 256 | lm loss: 1.990913E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 251.524 | TFLOPs: 41.57 |
15: -------------------------------------------------------------------------------------------
15: valid loss at iteration 59000 | lm loss value: 1.949534E+00 | lm loss PPL: 7.025413E+00 |
15: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 59000 to checkpoints_1b5
0: [2022-11-26 13:33:28,854] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step59000 is begin to save!
0: [2022-11-26 13:33:28,864] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_01-model_00-model_states.pt...
0: [2022-11-26 13:33:29,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_01-model_00-model_states.pt.
0: [2022-11-26 13:33:29,121] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_03-model_00-model_states.pt...
0: [2022-11-26 13:33:29,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_03-model_00-model_states.pt.
0: [2022-11-26 13:33:29,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_04-model_00-model_states.pt...
0: [2022-11-26 13:33:29,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_04-model_00-model_states.pt.
0: [2022-11-26 13:33:29,347] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_05-model_00-model_states.pt...
0: [2022-11-26 13:33:29,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_05-model_00-model_states.pt.
0: [2022-11-26 13:33:29,461] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_06-model_00-model_states.pt...
0: [2022-11-26 13:33:29,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_06-model_00-model_states.pt. 0: [2022-11-26 13:33:29,575] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_07-model_00-model_states.pt... 0: [2022-11-26 13:33:29,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_07-model_00-model_states.pt. 0: [2022-11-26 13:33:29,691] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_08-model_00-model_states.pt... 0: [2022-11-26 13:33:29,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_08-model_00-model_states.pt. 0: [2022-11-26 13:33:29,794] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_09-model_00-model_states.pt... 0: [2022-11-26 13:33:29,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_09-model_00-model_states.pt. 0: [2022-11-26 13:33:29,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_10-model_00-model_states.pt... 0: [2022-11-26 13:33:30,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_10-model_00-model_states.pt. 0: [2022-11-26 13:33:30,007] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_11-model_00-model_states.pt... 0: [2022-11-26 13:33:30,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_11-model_00-model_states.pt. 0: [2022-11-26 13:33:30,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_12-model_00-model_states.pt... 
0: [2022-11-26 13:33:30,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_12-model_00-model_states.pt. 0: [2022-11-26 13:33:30,212] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_13-model_00-model_states.pt... 0: [2022-11-26 13:33:30,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_13-model_00-model_states.pt. 0: [2022-11-26 13:33:30,317] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_14-model_00-model_states.pt... 0: [2022-11-26 13:33:30,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_14-model_00-model_states.pt. 0: [2022-11-26 13:33:30,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_15-model_00-model_states.pt... 0: [2022-11-26 13:33:30,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_15-model_00-model_states.pt. 0: [2022-11-26 13:33:30,524] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_16-model_00-model_states.pt... 0: [2022-11-26 13:33:30,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_16-model_00-model_states.pt. 0: [2022-11-26 13:33:30,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_17-model_00-model_states.pt... 0: [2022-11-26 13:33:30,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_17-model_00-model_states.pt. 0: [2022-11-26 13:33:30,732] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_18-model_00-model_states.pt... 
0: [2022-11-26 13:33:30,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_18-model_00-model_states.pt. 0: [2022-11-26 13:33:30,841] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_19-model_00-model_states.pt... 0: [2022-11-26 13:33:30,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_19-model_00-model_states.pt. 0: [2022-11-26 13:33:30,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_20-model_00-model_states.pt... 0: [2022-11-26 13:33:31,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_20-model_00-model_states.pt. 0: [2022-11-26 13:33:31,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_21-model_00-model_states.pt... 0: [2022-11-26 13:33:31,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_21-model_00-model_states.pt. 0: [2022-11-26 13:33:31,148] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_22-model_00-model_states.pt... 0: [2022-11-26 13:33:31,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_22-model_00-model_states.pt. 0: [2022-11-26 13:33:31,253] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_23-model_00-model_states.pt... 0: [2022-11-26 13:33:31,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_23-model_00-model_states.pt. 0: [2022-11-26 13:33:31,356] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_24-model_00-model_states.pt... 
0: [2022-11-26 13:33:31,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_24-model_00-model_states.pt. 0: [2022-11-26 13:33:31,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_25-model_00-model_states.pt... 0: [2022-11-26 13:33:31,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_25-model_00-model_states.pt. 0: [2022-11-26 13:33:31,567] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_26-model_00-model_states.pt... 0: [2022-11-26 13:33:31,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_26-model_00-model_states.pt. 0: [2022-11-26 13:33:31,674] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_27-model_00-model_states.pt... 0: [2022-11-26 13:33:31,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_27-model_00-model_states.pt. 0: [2022-11-26 13:33:31,782] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_28-model_00-model_states.pt... 0: [2022-11-26 13:33:31,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_28-model_00-model_states.pt. 0: [2022-11-26 13:33:31,888] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_29-model_00-model_states.pt... 0: [2022-11-26 13:33:31,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_29-model_00-model_states.pt. 0: [2022-11-26 13:33:31,991] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_30-model_00-model_states.pt... 
0: [2022-11-26 13:33:32,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_30-model_00-model_states.pt. 0: [2022-11-26 13:33:32,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/layer_32-model_00-model_states.pt... 0: [2022-11-26 13:33:32,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/layer_32-model_00-model_states.pt. 0: [2022-11-26 13:33:32,104] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step59000/mp_rank_00_model_states.pt 0: [2022-11-26 13:33:32,104] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/mp_rank_00_model_states.pt... 0: [2022-11-26 13:33:32,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/mp_rank_00_model_states.pt. 0: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
8: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
10: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
13: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
1: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
15: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:33:32,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:33:32,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:33:32,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:33:32,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:33:32,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
14: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
7: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
11: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:33:32,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:33:32,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:33:32,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
11: [2022-11-26 13:33:32,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step59000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:33:32,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:33:32,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 13:33:32,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 5: [2022-11-26 13:33:32,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:33:32,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 13:33:32,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 7: [2022-11-26 13:33:32,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:33:32,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 13:33:32,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 12: [2022-11-26 13:33:32,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 13:33:32,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 13:33:32,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 12: [2022-11-26 13:33:32,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:33:32,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 9: [2022-11-26 13:33:32,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:33:32,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 9: [2022-11-26 13:33:32,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 13:33:32,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 3: [2022-11-26 13:33:32,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:33:32,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 13:33:32,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 5: [2022-11-26 13:33:32,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 13:33:32,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 13:33:32,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 3: [2022-11-26 13:33:32,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:33:32,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:33:32,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 8: [2022-11-26 13:33:32,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 3: [2022-11-26 13:33:32,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 8: [2022-11-26 13:33:32,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 4: [2022-11-26 13:33:32,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:33:32,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 13:33:32,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 4: [2022-11-26 13:33:32,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
8: [2022-11-26 13:33:32,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:33:32,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 13:33:32,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 4: [2022-11-26 13:33:32,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:33:32,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 4: [2022-11-26 13:33:32,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 13:33:32,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 8: [2022-11-26 13:33:32,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 12: [2022-11-26 13:33:32,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:33:32,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 13:33:32,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 5: [2022-11-26 13:33:32,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 13:33:32,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 13:33:32,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 6: [2022-11-26 13:33:32,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:33:32,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 13:33:32,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 3: [2022-11-26 13:33:32,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:33:32,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:33:32,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:33:32,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 13:33:32,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 13:33:32,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
3: [2022-11-26 13:33:32,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 13:33:32,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 3: [2022-11-26 13:33:32,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 6: [2022-11-26 13:33:32,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:33:32,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:33:32,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 8: [2022-11-26 13:33:32,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 6: [2022-11-26 13:33:32,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 8: [2022-11-26 13:33:32,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 6: [2022-11-26 13:33:32,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:33:32,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 13:33:32,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
6: [2022-11-26 13:33:32,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:33:32,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:33:32,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 13:33:32,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 0: [2022-11-26 13:33:32,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:33:32,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 13:33:32,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 5: [2022-11-26 13:33:32,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:33:32,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 13:33:32,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 0: [2022-11-26 13:33:32,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 13:33:32,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 13:33:32,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 12: [2022-11-26 13:33:32,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:33:32,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:33:32,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 12: [2022-11-26 13:33:32,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 13:33:32,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:33:32,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 9: [2022-11-26 13:33:32,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 6: [2022-11-26 13:33:32,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 12: [2022-11-26 13:33:32,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 9: [2022-11-26 13:33:32,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
9: [2022-11-26 13:33:32,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:33:32,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 0: [2022-11-26 13:33:32,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:33:32,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 13:33:32,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 0: [2022-11-26 13:33:32,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 9: [2022-11-26 13:33:32,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:33:32,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 0: [2022-11-26 13:33:32,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 9: [2022-11-26 13:33:32,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 1: [2022-11-26 13:33:32,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:33:32,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 13:33:32,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 13:33:32,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 0: [2022-11-26 13:33:32,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:33:32,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 13:33:32,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 7: [2022-11-26 13:33:32,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:33:32,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 13:33:32,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 7: [2022-11-26 13:33:32,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:33:32,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 13:33:32,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 7: [2022-11-26 13:33:32,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 13:33:32,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 13:33:32,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 11: [2022-11-26 13:33:32,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:33:32,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:33:32,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 13:33:32,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 8: [2022-11-26 13:33:32,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:33:32,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 13:33:32,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 8: [2022-11-26 13:33:32,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:33:32,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 13:33:32,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
8: [2022-11-26 13:33:32,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:33:32,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 13:33:32,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 7: [2022-11-26 13:33:32,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:33:32,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:33:32,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:33:32,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 13:33:32,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 13:33:32,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 13:33:32,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 13:33:32,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 13:33:32,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 7: [2022-11-26 13:33:32,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 7: [2022-11-26 13:33:32,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 7: [2022-11-26 13:33:32,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 5: [2022-11-26 13:33:32,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:33:32,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 13:33:32,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 12: [2022-11-26 13:33:32,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 13:33:32,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 13:33:32,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 12: [2022-11-26 13:33:32,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:33:32,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 13:33:32,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 12: [2022-11-26 13:33:32,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:33:32,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 13:33:32,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 15: [2022-11-26 13:33:32,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:33:32,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 13:33:32,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 1: [2022-11-26 13:33:32,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 13:33:32,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 13:33:32,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 1: [2022-11-26 13:33:32,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:33:32,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:33:32,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:33:32,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 13:33:32,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 13:33:32,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:33:32,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 1: [2022-11-26 13:33:32,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 13:33:32,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
1: [2022-11-26 13:33:32,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 13:33:32,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 1: [2022-11-26 13:33:32,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 11: [2022-11-26 13:33:32,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 13:33:32,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 11: [2022-11-26 13:33:32,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:33:32,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 13:33:32,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 11: [2022-11-26 13:33:32,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:33:32,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 13:33:32,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 11: [2022-11-26 13:33:32,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 13:33:32,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 13:33:32,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 11: [2022-11-26 13:33:32,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:33:32,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 13:33:32,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 11: [2022-11-26 13:33:32,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:33:32,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 13:33:32,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 11: [2022-11-26 13:33:32,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:33:32,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 13:33:32,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 11: [2022-11-26 13:33:32,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-26 13:33:32,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 13:33:32,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 15: [2022-11-26 13:33:32,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 13:33:32,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 15: [2022-11-26 13:33:32,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:33:32,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 13:33:32,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 15: [2022-11-26 13:33:32,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:33:32,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 13:33:32,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 15: [2022-11-26 13:33:32,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 13:33:32,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 13:33:32,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 15: [2022-11-26 13:33:32,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:33:32,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 13:33:32,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 15: [2022-11-26 13:33:32,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:33:32,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 13:33:32,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 15: [2022-11-26 13:33:32,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:33:32,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 13:33:32,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 13:33:32,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 13:33:32,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 15: [2022-11-26 13:33:32,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 5: [2022-11-26 13:33:32,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:33:32,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:33:32,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 13:33:32,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 5: [2022-11-26 13:33:32,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:33:32,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 13:33:32,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 5: [2022-11-26 13:33:32,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 13:33:32,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 13:33:32,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 14: [2022-11-26 13:33:32,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:33:32,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 13:33:32,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 14: [2022-11-26 13:33:32,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:33:32,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:33:32,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 13:33:32,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 13:33:32,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 14: [2022-11-26 13:33:32,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 14: [2022-11-26 13:33:32,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 13:33:32,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:33:32,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:33:32,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 13:33:32,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 13:33:32,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 13:33:32,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 14: [2022-11-26 13:33:32,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 14: [2022-11-26 13:33:32,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 14: [2022-11-26 13:33:32,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:33:32,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 13:33:32,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 14: [2022-11-26 13:33:32,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 13:33:32,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 13:33:32,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 9: [2022-11-26 13:33:32,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:33:32,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 13:33:32,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 9: [2022-11-26 13:33:32,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:33:32,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 13:33:32,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 9: [2022-11-26 13:33:32,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:33:32,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 13:33:32,388] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
1: [2022-11-26 13:33:32,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 13:33:32,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 1: [2022-11-26 13:33:32,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:33:32,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 13:33:32,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 6: [2022-11-26 13:33:32,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:33:32,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:33:32,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 13:33:32,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 13:33:32,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 13:33:32,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 13:33:32,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:33:32,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 6: [2022-11-26 13:33:32,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 6: [2022-11-26 13:33:32,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 6: [2022-11-26 13:33:32,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 13:33:32,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 4: [2022-11-26 13:33:32,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:33:32,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 13:33:32,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
4: [2022-11-26 13:33:32,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:33:32,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 13:33:32,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 4: [2022-11-26 13:33:32,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:33:32,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 13:33:32,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 4: [2022-11-26 13:33:32,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:33:32,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 13:33:32,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 4: [2022-11-26 13:33:32,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:33:32,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 13:33:32,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
3: [2022-11-26 13:33:32,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:33:32,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 13:33:32,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 3: [2022-11-26 13:33:32,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:33:32,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 13:33:32,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 3: [2022-11-26 13:33:32,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:33:32,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 13:33:32,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 0: [2022-11-26 13:33:32,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:33:32,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 13:33:32,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 13:33:32,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 0: [2022-11-26 13:33:32,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:33:32,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 13:33:32,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 0: [2022-11-26 13:33:32,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 13:33:32,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 2: [2022-11-26 13:33:32,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:33:32,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:33:32,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:33:32,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:33:32,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 13:33:32,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:33:32,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:33:32,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:33:32,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 13:33:32,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 13:33:32,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 13:33:32,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 13:33:32,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 13:33:32,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 13:33:32,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 13:33:32,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
2: [2022-11-26 13:33:32,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 2: [2022-11-26 13:33:32,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 13:33:32,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 2: [2022-11-26 13:33:32,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 2: [2022-11-26 13:33:32,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 2: [2022-11-26 13:33:32,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 2: [2022-11-26 13:33:32,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 2: [2022-11-26 13:33:32,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 13: [2022-11-26 13:33:32,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:33:32,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:33:32,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:33:32,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:33:32,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 13:33:32,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:33:32,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:33:32,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:33:32,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 13:33:32,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 13:33:32,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 13:33:32,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 13:33:32,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 13:33:32,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 13:33:32,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 13:33:32,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
13: [2022-11-26 13:33:32,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 13: [2022-11-26 13:33:32,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 13:33:32,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 13: [2022-11-26 13:33:32,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 13: [2022-11-26 13:33:32,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 13: [2022-11-26 13:33:32,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 13: [2022-11-26 13:33:32,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 13: [2022-11-26 13:33:32,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 10: [2022-11-26 13:33:32,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:33:32,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:33:32,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:33:32,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:33:32,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 13:33:32,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:33:32,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 13:33:32,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:33:32,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:33:32,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 13:33:32,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
10: [2022-11-26 13:33:32,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 13:33:32,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 13:33:32,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 13:33:32,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 13:33:32,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 13:33:32,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step59000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 13:33:32,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 10: [2022-11-26 13:33:32,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 10: [2022-11-26 13:33:32,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 10: [2022-11-26 13:33:32,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 10: [2022-11-26 13:33:32,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 10: [2022-11-26 13:33:32,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 10: [2022-11-26 13:33:32,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step59000 is ready now! 
0: successfully saved checkpoint at iteration 59000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3828.20 15: iteration 59010/ 125429 | consumed samples: 15106560 | consumed tokens: 30938234880 | elapsed time per iteration (s): 1.55 | learning rate: 1.198E-04 | global batch size: 256 | lm loss: 1.980319E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 165.433 | TFLOPs: 27.34 | 15: iteration 59020/ 125429 | consumed samples: 15109120 | consumed tokens: 30943477760 | elapsed time per iteration (s): 2.08 | learning rate: 1.198E-04 | global batch size: 256 | lm loss: 1.981353E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 122.927 | TFLOPs: 20.31 | 15: iteration 59030/ 125429 | consumed samples: 15111680 | consumed tokens: 30948720640 | elapsed time per iteration (s): 1.03 | learning rate: 1.198E-04 | global batch size: 256 | lm loss: 1.988809E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.665 | TFLOPs: 40.93 | 15: iteration 59040/ 125429 | consumed samples: 15114240 | consumed tokens: 30953963520 | elapsed time per iteration (s): 1.09 | learning rate: 1.198E-04 | global batch size: 256 | lm loss: 2.000120E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.619 | TFLOPs: 38.94 | 15: iteration 59050/ 125429 | consumed samples: 15116800 | consumed tokens: 30959206400 | elapsed time per iteration (s): 1.05 | learning rate: 1.198E-04 | global batch size: 256 | lm loss: 1.977648E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.040 | TFLOPs: 40.33 | 15: iteration 59060/ 125429 | consumed samples: 15119360 | consumed tokens: 30964449280 | elapsed time per iteration (s): 1.05 | 
learning rate: 1.197E-04 | global batch size: 256 | lm loss: 1.991237E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.415 | TFLOPs: 40.39 | 15: iteration 59070/ 125429 | consumed samples: 15121920 | consumed tokens: 30969692160 | elapsed time per iteration (s): 1.02 | learning rate: 1.197E-04 | global batch size: 256 | lm loss: 1.987221E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.780 | TFLOPs: 41.28 | 15: iteration 59080/ 125429 | consumed samples: 15124480 | consumed tokens: 30974935040 | elapsed time per iteration (s): 1.05 | learning rate: 1.197E-04 | global batch size: 256 | lm loss: 2.023671E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.290 | TFLOPs: 40.21 | 15: iteration 59090/ 125429 | consumed samples: 15127040 | consumed tokens: 30980177920 | elapsed time per iteration (s): 1.14 | learning rate: 1.197E-04 | global batch size: 256 | lm loss: 1.972681E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 223.797 | TFLOPs: 36.98 | 15: iteration 59100/ 125429 | consumed samples: 15129600 | consumed tokens: 30985420800 | elapsed time per iteration (s): 1.04 | learning rate: 1.196E-04 | global batch size: 256 | lm loss: 2.001833E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.110 | TFLOPs: 40.51 | 15: iteration 59110/ 125429 | consumed samples: 15132160 | consumed tokens: 30990663680 | elapsed time per iteration (s): 1.07 | learning rate: 1.196E-04 | global batch size: 256 | lm loss: 2.007528E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.664 | TFLOPs: 39.61 | 15: iteration 59120/ 
125429 | consumed samples: 15134720 | consumed tokens: 30995906560 | elapsed time per iteration (s): 1.18 | learning rate: 1.196E-04 | global batch size: 256 | lm loss: 1.995797E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.420 | TFLOPs: 35.93 | 15: iteration 59130/ 125429 | consumed samples: 15137280 | consumed tokens: 31001149440 | elapsed time per iteration (s): 1.05 | learning rate: 1.196E-04 | global batch size: 256 | lm loss: 1.991345E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.015 | TFLOPs: 40.16 | 15: iteration 59140/ 125429 | consumed samples: 15139840 | consumed tokens: 31006392320 | elapsed time per iteration (s): 1.05 | learning rate: 1.195E-04 | global batch size: 256 | lm loss: 1.973679E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.409 | TFLOPs: 40.39 | 15: iteration 59150/ 125429 | consumed samples: 15142400 | consumed tokens: 31011635200 | elapsed time per iteration (s): 1.04 | learning rate: 1.195E-04 | global batch size: 256 | lm loss: 1.991363E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.863 | TFLOPs: 40.63 | 15: iteration 59160/ 125429 | consumed samples: 15144960 | consumed tokens: 31016878080 | elapsed time per iteration (s): 1.47 | learning rate: 1.195E-04 | global batch size: 256 | lm loss: 1.987556E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 174.174 | TFLOPs: 28.78 | 15: iteration 59170/ 125429 | consumed samples: 15147520 | consumed tokens: 31022120960 | elapsed time per iteration (s): 1.05 | learning rate: 1.195E-04 | global batch size: 256 | lm loss: 2.007438E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 243.623 | TFLOPs: 40.26 | 15: iteration 59180/ 125429 | consumed samples: 15150080 | consumed tokens: 31027363840 | elapsed time per iteration (s): 1.15 | learning rate: 1.195E-04 | global batch size: 256 | lm loss: 1.996441E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.435 | TFLOPs: 36.76 | 15: iteration 59190/ 125429 | consumed samples: 15152640 | consumed tokens: 31032606720 | elapsed time per iteration (s): 1.12 | learning rate: 1.194E-04 | global batch size: 256 | lm loss: 1.988292E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.523 | TFLOPs: 37.93 | 15: iteration 59200/ 125429 | consumed samples: 15155200 | consumed tokens: 31037849600 | elapsed time per iteration (s): 1.05 | learning rate: 1.194E-04 | global batch size: 256 | lm loss: 2.000209E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.848 | TFLOPs: 40.13 | 15: iteration 59210/ 125429 | consumed samples: 15157760 | consumed tokens: 31043092480 | elapsed time per iteration (s): 1.09 | learning rate: 1.194E-04 | global batch size: 256 | lm loss: 2.009106E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.578 | TFLOPs: 38.77 | 15: iteration 59220/ 125429 | consumed samples: 15160320 | consumed tokens: 31048335360 | elapsed time per iteration (s): 1.15 | learning rate: 1.194E-04 | global batch size: 256 | lm loss: 1.959529E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.817 | TFLOPs: 36.66 | 15: iteration 59230/ 125429 | consumed samples: 15162880 | consumed tokens: 31053578240 | elapsed time per iteration (s): 1.05 | learning rate: 
1.193E-04 | global batch size: 256 | lm loss: 2.005666E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.870 | TFLOPs: 40.30 | 15: iteration 59240/ 125429 | consumed samples: 15165440 | consumed tokens: 31058821120 | elapsed time per iteration (s): 1.10 | learning rate: 1.193E-04 | global batch size: 256 | lm loss: 1.968986E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.432 | TFLOPs: 38.58 | 15: iteration 59250/ 125429 | consumed samples: 15168000 | consumed tokens: 31064064000 | elapsed time per iteration (s): 1.04 | learning rate: 1.193E-04 | global batch size: 256 | lm loss: 1.998619E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.759 | TFLOPs: 40.61 | 15: iteration 59260/ 125429 | consumed samples: 15170560 | consumed tokens: 31069306880 | elapsed time per iteration (s): 1.04 | learning rate: 1.193E-04 | global batch size: 256 | lm loss: 1.984838E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.119 | TFLOPs: 40.84 | 15: iteration 59270/ 125429 | consumed samples: 15173120 | consumed tokens: 31074549760 | elapsed time per iteration (s): 1.04 | learning rate: 1.193E-04 | global batch size: 256 | lm loss: 2.006609E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.511 | TFLOPs: 40.74 | 15: iteration 59280/ 125429 | consumed samples: 15175680 | consumed tokens: 31079792640 | elapsed time per iteration (s): 1.03 | learning rate: 1.192E-04 | global batch size: 256 | lm loss: 2.002053E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.957 | TFLOPs: 41.14 | 15: iteration 59290/ 125429 | 
consumed samples: 15178240 | consumed tokens: 31085035520 | elapsed time per iteration (s): 1.09 | learning rate: 1.192E-04 | global batch size: 256 | lm loss: 2.011912E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.300 | TFLOPs: 38.72 | 15: iteration 59300/ 125429 | consumed samples: 15180800 | consumed tokens: 31090278400 | elapsed time per iteration (s): 1.06 | learning rate: 1.192E-04 | global batch size: 256 | lm loss: 2.018616E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.257 | TFLOPs: 40.03 | 15: iteration 59310/ 125429 | consumed samples: 15183360 | consumed tokens: 31095521280 | elapsed time per iteration (s): 1.06 | learning rate: 1.192E-04 | global batch size: 256 | lm loss: 1.985951E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.103 | TFLOPs: 40.01 | 15: iteration 59320/ 125429 | consumed samples: 15185920 | consumed tokens: 31100764160 | elapsed time per iteration (s): 1.07 | learning rate: 1.191E-04 | global batch size: 256 | lm loss: 1.970562E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.648 | TFLOPs: 39.44 | 15: iteration 59330/ 125429 | consumed samples: 15188480 | consumed tokens: 31106007040 | elapsed time per iteration (s): 1.08 | learning rate: 1.191E-04 | global batch size: 256 | lm loss: 1.986809E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.532 | TFLOPs: 39.09 | 15: iteration 59340/ 125429 | consumed samples: 15191040 | consumed tokens: 31111249920 | elapsed time per iteration (s): 1.08 | learning rate: 1.191E-04 | global batch size: 256 | lm loss: 2.005993E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 236.372 | TFLOPs: 39.06 | 15: iteration 59350/ 125429 | consumed samples: 15193600 | consumed tokens: 31116492800 | elapsed time per iteration (s): 1.05 | learning rate: 1.191E-04 | global batch size: 256 | lm loss: 1.986385E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.381 | TFLOPs: 40.22 | 15: iteration 59360/ 125429 | consumed samples: 15196160 | consumed tokens: 31121735680 | elapsed time per iteration (s): 1.06 | learning rate: 1.191E-04 | global batch size: 256 | lm loss: 2.004906E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.652 | TFLOPs: 40.10 | 15: iteration 59370/ 125429 | consumed samples: 15198720 | consumed tokens: 31126978560 | elapsed time per iteration (s): 1.03 | learning rate: 1.190E-04 | global batch size: 256 | lm loss: 1.992776E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.448 | TFLOPs: 41.06 | 15: iteration 59380/ 125429 | consumed samples: 15201280 | consumed tokens: 31132221440 | elapsed time per iteration (s): 1.02 | learning rate: 1.190E-04 | global batch size: 256 | lm loss: 1.991060E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.449 | TFLOPs: 41.55 | 15: iteration 59390/ 125429 | consumed samples: 15203840 | consumed tokens: 31137464320 | elapsed time per iteration (s): 1.05 | learning rate: 1.190E-04 | global batch size: 256 | lm loss: 1.998530E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.736 | TFLOPs: 40.28 | 15: iteration 59400/ 125429 | consumed samples: 15206400 | consumed tokens: 31142707200 | elapsed time per iteration (s): 1.05 | learning rate: 1.190E-04 | global batch 
size: 256 | lm loss: 1.996548E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.679 | TFLOPs: 40.10 | 15: iteration 59410/ 125429 | consumed samples: 15208960 | consumed tokens: 31147950080 | elapsed time per iteration (s): 1.06 | learning rate: 1.189E-04 | global batch size: 256 | lm loss: 1.993681E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.855 | TFLOPs: 39.97 | 15: iteration 59420/ 125429 | consumed samples: 15211520 | consumed tokens: 31153192960 | elapsed time per iteration (s): 1.05 | learning rate: 1.189E-04 | global batch size: 256 | lm loss: 2.014480E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.343 | TFLOPs: 40.21 | 15: iteration 59430/ 125429 | consumed samples: 15214080 | consumed tokens: 31158435840 | elapsed time per iteration (s): 1.04 | learning rate: 1.189E-04 | global batch size: 256 | lm loss: 2.006200E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.511 | TFLOPs: 40.74 | 15: iteration 59440/ 125429 | consumed samples: 15216640 | consumed tokens: 31163678720 | elapsed time per iteration (s): 1.26 | learning rate: 1.189E-04 | global batch size: 256 | lm loss: 1.971357E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 203.499 | TFLOPs: 33.63 | 15: iteration 59450/ 125429 | consumed samples: 15219200 | consumed tokens: 31168921600 | elapsed time per iteration (s): 1.04 | learning rate: 1.188E-04 | global batch size: 256 | lm loss: 1.993883E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.745 | TFLOPs: 40.61 | 15: iteration 59460/ 125429 | consumed samples: 15221760 | 
consumed tokens: 31174164480 | elapsed time per iteration (s): 1.05 | learning rate: 1.188E-04 | global batch size: 256 | lm loss: 1.953464E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.242 | TFLOPs: 40.36 | 15: iteration 59470/ 125429 | consumed samples: 15224320 | consumed tokens: 31179407360 | elapsed time per iteration (s): 1.08 | learning rate: 1.188E-04 | global batch size: 256 | lm loss: 2.000437E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.808 | TFLOPs: 39.13 | 15: iteration 59480/ 125429 | consumed samples: 15226880 | consumed tokens: 31184650240 | elapsed time per iteration (s): 1.05 | learning rate: 1.188E-04 | global batch size: 256 | lm loss: 2.002636E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.858 | TFLOPs: 40.46 | 15: iteration 59490/ 125429 | consumed samples: 15229440 | consumed tokens: 31189893120 | elapsed time per iteration (s): 1.06 | learning rate: 1.188E-04 | global batch size: 256 | lm loss: 1.958294E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.950 | TFLOPs: 39.98 | 15: iteration 59500/ 125429 | consumed samples: 15232000 | consumed tokens: 31195136000 | elapsed time per iteration (s): 1.03 | learning rate: 1.187E-04 | global batch size: 256 | lm loss: 2.009486E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.949 | TFLOPs: 41.14 | 15: iteration 59510/ 125429 | consumed samples: 15234560 | consumed tokens: 31200378880 | elapsed time per iteration (s): 1.05 | learning rate: 1.187E-04 | global batch size: 256 | lm loss: 1.970617E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 243.207 | TFLOPs: 40.19 | 15: iteration 59520/ 125429 | consumed samples: 15237120 | consumed tokens: 31205621760 | elapsed time per iteration (s): 1.11 | learning rate: 1.187E-04 | global batch size: 256 | lm loss: 1.990824E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.730 | TFLOPs: 37.96 | 15: iteration 59530/ 125429 | consumed samples: 15239680 | consumed tokens: 31210864640 | elapsed time per iteration (s): 1.08 | learning rate: 1.187E-04 | global batch size: 256 | lm loss: 1.992880E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.145 | TFLOPs: 39.02 | 15: iteration 59540/ 125429 | consumed samples: 15242240 | consumed tokens: 31216107520 | elapsed time per iteration (s): 1.08 | learning rate: 1.186E-04 | global batch size: 256 | lm loss: 1.993204E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.120 | TFLOPs: 39.19 | 15: iteration 59550/ 125429 | consumed samples: 15244800 | consumed tokens: 31221350400 | elapsed time per iteration (s): 1.04 | learning rate: 1.186E-04 | global batch size: 256 | lm loss: 1.984580E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.653 | TFLOPs: 40.76 | 15: iteration 59560/ 125429 | consumed samples: 15247360 | consumed tokens: 31226593280 | elapsed time per iteration (s): 1.08 | learning rate: 1.186E-04 | global batch size: 256 | lm loss: 2.000426E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.959 | TFLOPs: 39.32 | 15: iteration 59570/ 125429 | consumed samples: 15249920 | consumed tokens: 31231836160 | elapsed time per iteration (s): 1.04 | learning rate: 1.186E-04 | global batch size: 256 | lm loss: 
1.968805E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.261 | TFLOPs: 40.86 | 15: iteration 59580/ 125429 | consumed samples: 15252480 | consumed tokens: 31237079040 | elapsed time per iteration (s): 1.03 | learning rate: 1.186E-04 | global batch size: 256 | lm loss: 1.950797E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.055 | TFLOPs: 41.16 | 15: iteration 59590/ 125429 | consumed samples: 15255040 | consumed tokens: 31242321920 | elapsed time per iteration (s): 1.07 | learning rate: 1.185E-04 | global batch size: 256 | lm loss: 1.993626E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.328 | TFLOPs: 39.55 | 15: iteration 59600/ 125429 | consumed samples: 15257600 | consumed tokens: 31247564800 | elapsed time per iteration (s): 1.03 | learning rate: 1.185E-04 | global batch size: 256 | lm loss: 2.011410E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.717 | TFLOPs: 41.27 | 15: iteration 59610/ 125429 | consumed samples: 15260160 | consumed tokens: 31252807680 | elapsed time per iteration (s): 1.04 | learning rate: 1.185E-04 | global batch size: 256 | lm loss: 1.994286E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.069 | TFLOPs: 40.83 | 15: iteration 59620/ 125429 | consumed samples: 15262720 | consumed tokens: 31258050560 | elapsed time per iteration (s): 1.05 | learning rate: 1.185E-04 | global batch size: 256 | lm loss: 1.974474E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.493 | TFLOPs: 40.24 | 15: iteration 59630/ 125429 | consumed samples: 15265280 | consumed tokens: 
31263293440 | elapsed time per iteration (s): 1.06 | learning rate: 1.184E-04 | global batch size: 256 | lm loss: 1.987473E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.841 | TFLOPs: 39.80 | 15: iteration 59640/ 125429 | consumed samples: 15267840 | consumed tokens: 31268536320 | elapsed time per iteration (s): 1.06 | learning rate: 1.184E-04 | global batch size: 256 | lm loss: 2.009409E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.407 | TFLOPs: 39.89 | 15: iteration 59650/ 125429 | consumed samples: 15270400 | consumed tokens: 31273779200 | elapsed time per iteration (s): 1.05 | learning rate: 1.184E-04 | global batch size: 256 | lm loss: 2.005468E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.040 | TFLOPs: 40.33 | 15: iteration 59660/ 125429 | consumed samples: 15272960 | consumed tokens: 31279022080 | elapsed time per iteration (s): 1.04 | learning rate: 1.184E-04 | global batch size: 256 | lm loss: 1.986959E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.597 | TFLOPs: 40.75 | 15: iteration 59670/ 125429 | consumed samples: 15275520 | consumed tokens: 31284264960 | elapsed time per iteration (s): 1.07 | learning rate: 1.183E-04 | global batch size: 256 | lm loss: 1.978668E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.344 | TFLOPs: 39.39 | 15: iteration 59680/ 125429 | consumed samples: 15278080 | consumed tokens: 31289507840 | elapsed time per iteration (s): 1.09 | learning rate: 1.183E-04 | global batch size: 256 | lm loss: 1.999851E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 234.779 | TFLOPs: 38.80 | 15: iteration 59690/ 125429 | consumed samples: 15280640 | consumed tokens: 31294750720 | elapsed time per iteration (s): 1.04 | learning rate: 1.183E-04 | global batch size: 256 | lm loss: 2.017989E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.951 | TFLOPs: 40.81 | 15: iteration 59700/ 125429 | consumed samples: 15283200 | consumed tokens: 31299993600 | elapsed time per iteration (s): 1.06 | learning rate: 1.183E-04 | global batch size: 256 | lm loss: 1.992780E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.666 | TFLOPs: 39.77 | 15: iteration 59710/ 125429 | consumed samples: 15285760 | consumed tokens: 31305236480 | elapsed time per iteration (s): 1.03 | learning rate: 1.183E-04 | global batch size: 256 | lm loss: 1.961046E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.255 | TFLOPs: 41.19 | 15: iteration 59720/ 125429 | consumed samples: 15288320 | consumed tokens: 31310479360 | elapsed time per iteration (s): 1.02 | learning rate: 1.182E-04 | global batch size: 256 | lm loss: 1.991751E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.219 | TFLOPs: 41.35 | 15: iteration 59730/ 125429 | consumed samples: 15290880 | consumed tokens: 31315722240 | elapsed time per iteration (s): 1.08 | learning rate: 1.182E-04 | global batch size: 256 | lm loss: 1.984937E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.495 | TFLOPs: 39.25 | 15: iteration 59740/ 125429 | consumed samples: 15293440 | consumed tokens: 31320965120 | elapsed time per iteration (s): 1.07 | learning rate: 1.182E-04 | global batch size: 256 | lm loss: 1.993540E+00 | grad 
norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.209 | TFLOPs: 39.70 | 15: iteration 59750/ 125429 | consumed samples: 15296000 | consumed tokens: 31326208000 | elapsed time per iteration (s): 1.03 | learning rate: 1.182E-04 | global batch size: 256 | lm loss: 1.956650E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.635 | TFLOPs: 41.09 | 15: iteration 59760/ 125429 | consumed samples: 15298560 | consumed tokens: 31331450880 | elapsed time per iteration (s): 1.03 | learning rate: 1.181E-04 | global batch size: 256 | lm loss: 1.993268E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.775 | TFLOPs: 41.11 | 15: iteration 59770/ 125429 | consumed samples: 15301120 | consumed tokens: 31336693760 | elapsed time per iteration (s): 1.08 | learning rate: 1.181E-04 | global batch size: 256 | lm loss: 1.992779E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.862 | TFLOPs: 39.14 | 15: iteration 59780/ 125429 | consumed samples: 15303680 | consumed tokens: 31341936640 | elapsed time per iteration (s): 1.03 | learning rate: 1.181E-04 | global batch size: 256 | lm loss: 1.999637E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.367 | TFLOPs: 41.21 | 15: iteration 59790/ 125429 | consumed samples: 15306240 | consumed tokens: 31347179520 | elapsed time per iteration (s): 1.06 | learning rate: 1.181E-04 | global batch size: 256 | lm loss: 1.981883E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.620 | TFLOPs: 40.09 | 15: iteration 59800/ 125429 | consumed samples: 15308800 | consumed tokens: 31352422400 | elapsed time 
per iteration (s): 1.03 | learning rate: 1.181E-04 | global batch size: 256 | lm loss: 1.989140E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.403 | TFLOPs: 41.05 | 15: iteration 59810/ 125429 | consumed samples: 15311360 | consumed tokens: 31357665280 | elapsed time per iteration (s): 1.05 | learning rate: 1.180E-04 | global batch size: 256 | lm loss: 2.002796E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.946 | TFLOPs: 40.15 | 15: iteration 59820/ 125429 | consumed samples: 15313920 | consumed tokens: 31362908160 | elapsed time per iteration (s): 1.06 | learning rate: 1.180E-04 | global batch size: 256 | lm loss: 1.988600E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.212 | TFLOPs: 40.03 | 15: iteration 59830/ 125429 | consumed samples: 15316480 | consumed tokens: 31368151040 | elapsed time per iteration (s): 1.03 | learning rate: 1.180E-04 | global batch size: 256 | lm loss: 2.001560E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.981 | TFLOPs: 41.15 | 15: iteration 59840/ 125429 | consumed samples: 15319040 | consumed tokens: 31373393920 | elapsed time per iteration (s): 1.05 | learning rate: 1.180E-04 | global batch size: 256 | lm loss: 1.987255E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.130 | TFLOPs: 40.18 | 15: iteration 59850/ 125429 | consumed samples: 15321600 | consumed tokens: 31378636800 | elapsed time per iteration (s): 1.04 | learning rate: 1.179E-04 | global batch size: 256 | lm loss: 1.998211E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.652 | TFLOPs: 
40.60 | 15: iteration 59860/ 125429 | consumed samples: 15324160 | consumed tokens: 31383879680 | elapsed time per iteration (s): 1.04 | learning rate: 1.179E-04 | global batch size: 256 | lm loss: 1.983755E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.749 | TFLOPs: 40.61 | 15: iteration 59870/ 125429 | consumed samples: 15326720 | consumed tokens: 31389122560 | elapsed time per iteration (s): 1.06 | learning rate: 1.179E-04 | global batch size: 256 | lm loss: 1.997315E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.851 | TFLOPs: 39.80 | 15: iteration 59880/ 125429 | consumed samples: 15329280 | consumed tokens: 31394365440 | elapsed time per iteration (s): 1.05 | learning rate: 1.179E-04 | global batch size: 256 | lm loss: 1.982055E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.709 | TFLOPs: 40.11 | 15: iteration 59890/ 125429 | consumed samples: 15331840 | consumed tokens: 31399608320 | elapsed time per iteration (s): 1.08 | learning rate: 1.178E-04 | global batch size: 256 | lm loss: 1.970617E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.807 | TFLOPs: 39.13 | 15: iteration 59900/ 125429 | consumed samples: 15334400 | consumed tokens: 31404851200 | elapsed time per iteration (s): 1.03 | learning rate: 1.178E-04 | global batch size: 256 | lm loss: 2.004396E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.108 | TFLOPs: 41.17 | 15: iteration 59910/ 125429 | consumed samples: 15336960 | consumed tokens: 31410094080 | elapsed time per iteration (s): 1.07 | learning rate: 1.178E-04 | global batch size: 256 | lm loss: 2.008888E+00 | grad norm: 0.134 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.002 | TFLOPs: 39.50 | 15: iteration 59920/ 125429 | consumed samples: 15339520 | consumed tokens: 31415336960 | elapsed time per iteration (s): 1.05 | learning rate: 1.178E-04 | global batch size: 256 | lm loss: 1.997035E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.582 | TFLOPs: 40.25 | 15: iteration 59930/ 125429 | consumed samples: 15342080 | consumed tokens: 31420579840 | elapsed time per iteration (s): 1.02 | learning rate: 1.178E-04 | global batch size: 256 | lm loss: 1.968058E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.998 | TFLOPs: 41.31 | 15: iteration 59940/ 125429 | consumed samples: 15344640 | consumed tokens: 31425822720 | elapsed time per iteration (s): 1.05 | learning rate: 1.177E-04 | global batch size: 256 | lm loss: 1.979958E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.869 | TFLOPs: 40.30 | 15: iteration 59950/ 125429 | consumed samples: 15347200 | consumed tokens: 31431065600 | elapsed time per iteration (s): 1.08 | learning rate: 1.177E-04 | global batch size: 256 | lm loss: 2.002363E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.020 | TFLOPs: 39.17 | 15: iteration 59960/ 125429 | consumed samples: 15349760 | consumed tokens: 31436308480 | elapsed time per iteration (s): 1.03 | learning rate: 1.177E-04 | global batch size: 256 | lm loss: 1.999763E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.247 | TFLOPs: 41.19 | 15: iteration 59970/ 125429 | consumed samples: 15352320 | consumed tokens: 31441551360 | elapsed time per iteration (s): 1.05 | 
learning rate: 1.177E-04 | global batch size: 256 | lm loss: 1.992483E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.926 | TFLOPs: 40.31 | 15: iteration 59980/ 125429 | consumed samples: 15354880 | consumed tokens: 31446794240 | elapsed time per iteration (s): 1.05 | learning rate: 1.176E-04 | global batch size: 256 | lm loss: 1.986273E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.440 | TFLOPs: 40.23 | 15: iteration 59990/ 125429 | consumed samples: 15357440 | consumed tokens: 31452037120 | elapsed time per iteration (s): 1.03 | learning rate: 1.176E-04 | global batch size: 256 | lm loss: 2.001257E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.661 | TFLOPs: 41.09 | 0: [2022-11-26 13:51:27,488] [INFO] [logging.py:68:log_dist] [Rank 0] step=60000, skipped=0, lr=[0.00011759985999730801, 0.00011759985999730801, 0.00011759985999730801], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 60000/ 125429 | consumed samples: 15360000 | consumed tokens: 31457280000 | elapsed time per iteration (s): 1.06 | learning rate: 1.176E-04 | global batch size: 256 | lm loss: 1.966232E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.257 | TFLOPs: 39.87 | 0: steps: 60000 loss: 2.0483 iter time (s): 1.062 samples/sec: 241.080 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 60000 | lm loss value: 1.938465E+00 | lm loss PPL: 6.948077E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 60000 to checkpoints_1b5 0: [2022-11-26 13:51:27,865] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step60000 is begin to save! 0: [2022-11-26 13:51:27,875] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_01-model_00-model_states.pt... 0: [2022-11-26 13:51:28,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_01-model_00-model_states.pt. 0: [2022-11-26 13:51:28,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_03-model_00-model_states.pt... 0: [2022-11-26 13:51:28,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_03-model_00-model_states.pt. 0: [2022-11-26 13:51:28,241] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_04-model_00-model_states.pt... 0: [2022-11-26 13:51:28,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_04-model_00-model_states.pt. 0: [2022-11-26 13:51:28,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_05-model_00-model_states.pt... 0: [2022-11-26 13:51:28,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_05-model_00-model_states.pt. 0: [2022-11-26 13:51:28,471] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_06-model_00-model_states.pt... 0: [2022-11-26 13:51:28,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_06-model_00-model_states.pt. 0: [2022-11-26 13:51:28,586] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_07-model_00-model_states.pt... 0: [2022-11-26 13:51:28,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 13:51:28,699] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_08-model_00-model_states.pt... 0: [2022-11-26 13:51:28,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_08-model_00-model_states.pt. 0: [2022-11-26 13:51:28,816] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_09-model_00-model_states.pt... 0: [2022-11-26 13:51:28,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_09-model_00-model_states.pt. 0: [2022-11-26 13:51:28,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_10-model_00-model_states.pt... 0: [2022-11-26 13:51:29,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_10-model_00-model_states.pt. 0: [2022-11-26 13:51:29,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_11-model_00-model_states.pt... 0: [2022-11-26 13:51:29,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_11-model_00-model_states.pt. 0: [2022-11-26 13:51:29,160] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_12-model_00-model_states.pt... 0: [2022-11-26 13:51:29,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_12-model_00-model_states.pt. 0: [2022-11-26 13:51:29,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_13-model_00-model_states.pt... 0: [2022-11-26 13:51:29,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 13:51:29,390] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_14-model_00-model_states.pt... 0: [2022-11-26 13:51:29,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_14-model_00-model_states.pt. 0: [2022-11-26 13:51:29,506] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_15-model_00-model_states.pt... 0: [2022-11-26 13:51:29,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_15-model_00-model_states.pt. 0: [2022-11-26 13:51:29,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_16-model_00-model_states.pt... 0: [2022-11-26 13:51:29,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_16-model_00-model_states.pt. 0: [2022-11-26 13:51:29,735] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_17-model_00-model_states.pt... 0: [2022-11-26 13:51:29,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_17-model_00-model_states.pt. 0: [2022-11-26 13:51:29,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_18-model_00-model_states.pt... 0: [2022-11-26 13:51:29,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_18-model_00-model_states.pt. 0: [2022-11-26 13:51:29,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_19-model_00-model_states.pt... 0: [2022-11-26 13:51:30,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 13:51:30,054] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_20-model_00-model_states.pt... 0: [2022-11-26 13:51:30,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_20-model_00-model_states.pt. 0: [2022-11-26 13:51:30,165] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_21-model_00-model_states.pt... 0: [2022-11-26 13:51:30,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_21-model_00-model_states.pt. 0: [2022-11-26 13:51:30,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_22-model_00-model_states.pt... 0: [2022-11-26 13:51:30,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_22-model_00-model_states.pt. 0: [2022-11-26 13:51:30,377] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_23-model_00-model_states.pt... 0: [2022-11-26 13:51:30,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_23-model_00-model_states.pt. 0: [2022-11-26 13:51:30,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_24-model_00-model_states.pt... 0: [2022-11-26 13:51:30,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_24-model_00-model_states.pt. 0: [2022-11-26 13:51:30,587] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_25-model_00-model_states.pt... 0: [2022-11-26 13:51:30,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 13:51:30,695] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_26-model_00-model_states.pt... 0: [2022-11-26 13:51:30,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_26-model_00-model_states.pt. 0: [2022-11-26 13:51:30,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_27-model_00-model_states.pt... 0: [2022-11-26 13:51:30,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_27-model_00-model_states.pt. 0: [2022-11-26 13:51:30,904] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_28-model_00-model_states.pt... 0: [2022-11-26 13:51:31,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_28-model_00-model_states.pt. 0: [2022-11-26 13:51:31,011] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_29-model_00-model_states.pt... 0: [2022-11-26 13:51:31,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_29-model_00-model_states.pt. 0: [2022-11-26 13:51:31,113] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_30-model_00-model_states.pt... 0: [2022-11-26 13:51:31,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_30-model_00-model_states.pt. 0: [2022-11-26 13:51:31,217] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/layer_32-model_00-model_states.pt... 0: [2022-11-26 13:51:31,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 13:51:31,223] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step60000/mp_rank_00_model_states.pt 0: [2022-11-26 13:51:31,224] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/mp_rank_00_model_states.pt... 0: [2022-11-26 13:51:31,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/mp_rank_00_model_states.pt. 0: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
5: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
9: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
2: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
12: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
8: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
6: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
8: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 0: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 6: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 12: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 11: [2022-11-26 13:51:31,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step60000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 9: [2022-11-26 13:51:31,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 13:51:31,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 13:51:31,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 9: [2022-11-26 13:51:31,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:51:31,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 13:51:31,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 4: [2022-11-26 13:51:31,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:51:31,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:51:31,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 13:51:31,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 8: [2022-11-26 13:51:31,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 13:51:31,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 4: [2022-11-26 13:51:31,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 13:51:31,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 8: [2022-11-26 13:51:31,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 12: [2022-11-26 13:51:31,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:51:31,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 13:51:31,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 1: [2022-11-26 13:51:31,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:51:31,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 13:51:31,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 8: [2022-11-26 13:51:31,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 13:51:31,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 13:51:31,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 8: [2022-11-26 13:51:31,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:51:31,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 13:51:31,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:51:31,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 8: [2022-11-26 13:51:31,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 13:51:31,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 5: [2022-11-26 13:51:31,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:51:31,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 13:51:31,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 6: [2022-11-26 13:51:31,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 13:51:31,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 13:51:31,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 2: [2022-11-26 13:51:31,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:51:31,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 13:51:31,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 8: [2022-11-26 13:51:31,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:51:31,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 13:51:31,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 3: [2022-11-26 13:51:31,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:51:31,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 13:51:31,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 13:51:31,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 13:51:31,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 3: [2022-11-26 13:51:31,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 11: [2022-11-26 13:51:31,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:51:31,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:51:31,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:51:31,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 13:51:31,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 13:51:31,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 2: [2022-11-26 13:51:31,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 12: [2022-11-26 13:51:31,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 13:51:31,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 5: [2022-11-26 13:51:31,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:51:31,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 5: [2022-11-26 13:51:31,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 13:51:31,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 2: [2022-11-26 13:51:31,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:51:31,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:51:31,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:51:31,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 13:51:31,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 
12: [2022-11-26 13:51:31,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 13:51:31,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 8: [2022-11-26 13:51:31,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:51:31,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 12: [2022-11-26 13:51:31,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 12: [2022-11-26 13:51:31,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 8: [2022-11-26 13:51:31,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 13: [2022-11-26 13:51:31,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:51:31,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:51:31,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 10: [2022-11-26 13:51:31,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:51:31,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 13:51:31,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:51:31,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 10: [2022-11-26 13:51:31,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:51:31,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 13:51:31,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 13:51:31,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 13:51:31,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 10: [2022-11-26 13:51:31,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 13:51:31,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 10: [2022-11-26 13:51:31,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 10: [2022-11-26 13:51:31,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 9: [2022-11-26 13:51:31,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 13:51:31,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 13:51:31,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 9: [2022-11-26 13:51:31,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:51:31,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 13:51:31,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 13: [2022-11-26 13:51:31,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:51:31,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 13:51:31,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 13: [2022-11-26 13:51:31,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:51:31,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 13:51:31,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 
1: [2022-11-26 13:51:31,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 13:51:31,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 1: [2022-11-26 13:51:31,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:51:31,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 13:51:31,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 1: [2022-11-26 13:51:31,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:51:31,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 13:51:31,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 10: [2022-11-26 13:51:31,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:51:31,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 13:51:31,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 5: [2022-11-26 13:51:31,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 13:51:31,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 6: [2022-11-26 13:51:31,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:51:31,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 6: [2022-11-26 13:51:31,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 13:51:31,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 1: [2022-11-26 13:51:31,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 13:51:31,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 13:51:31,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 5: [2022-11-26 13:51:31,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:51:31,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 13:51:31,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 5: [2022-11-26 13:51:31,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 13:51:31,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 13:51:31,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 13: [2022-11-26 13:51:31,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:51:31,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:51:31,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 8: [2022-11-26 13:51:31,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 13: [2022-11-26 13:51:31,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 8: [2022-11-26 13:51:31,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 9: [2022-11-26 13:51:31,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:51:31,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 13:51:31,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 5: [2022-11-26 13:51:31,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 13:51:31,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:51:31,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 13:51:31,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 13:51:31,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 5: [2022-11-26 13:51:31,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 0: [2022-11-26 13:51:31,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:51:31,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 13:51:31,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 0: [2022-11-26 13:51:31,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:51:31,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 13:51:31,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 13:51:31,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 9: [2022-11-26 13:51:31,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:51:31,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 0: [2022-11-26 13:51:31,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 9: [2022-11-26 13:51:31,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 13:51:31,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 0: [2022-11-26 13:51:31,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:51:31,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 13:51:31,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 4: [2022-11-26 13:51:31,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 13:51:31,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 13:51:31,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 0: [2022-11-26 13:51:31,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:51:31,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:51:31,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 0: [2022-11-26 13:51:31,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 4: [2022-11-26 13:51:31,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 4: [2022-11-26 13:51:31,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:51:31,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 0: [2022-11-26 13:51:31,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 4: [2022-11-26 13:51:31,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 4: [2022-11-26 13:51:31,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 13:51:31,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 13:51:31,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 4: [2022-11-26 13:51:31,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:51:31,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 13:51:31,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 4: [2022-11-26 13:51:31,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:51:31,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 13:51:31,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 12: [2022-11-26 13:51:31,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:51:31,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 13:51:31,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 12: [2022-11-26 13:51:31,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 13:51:31,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 13:51:31,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 3: [2022-11-26 13:51:31,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:51:31,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 13:51:31,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 10: [2022-11-26 13:51:31,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:51:31,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:51:31,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 13:51:31,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 13:51:31,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 10: [2022-11-26 13:51:31,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 6: [2022-11-26 13:51:31,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 13:51:31,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:51:31,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 13:51:31,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 12: [2022-11-26 13:51:31,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:51:31,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 12: [2022-11-26 13:51:31,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 6: [2022-11-26 13:51:31,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 12: [2022-11-26 13:51:31,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 10: [2022-11-26 13:51:31,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 13:51:31,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 13:51:31,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 1: [2022-11-26 13:51:31,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 13:51:31,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 13:51:31,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 3: [2022-11-26 13:51:31,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:51:31,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 13:51:31,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 3: [2022-11-26 13:51:31,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:51:31,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:51:31,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 13:51:31,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 13:51:31,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 13:51:31,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 13:51:31,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 
3: [2022-11-26 13:51:31,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 3: [2022-11-26 13:51:31,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 3: [2022-11-26 13:51:31,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:51:31,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 13:51:31,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 11: [2022-11-26 13:51:31,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:51:31,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 13:51:31,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 11: [2022-11-26 13:51:31,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:51:31,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 13:51:31,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 11: [2022-11-26 13:51:31,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 13:51:31,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 13:51:31,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 3: [2022-11-26 13:51:31,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 11: [2022-11-26 13:51:31,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:51:31,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 13:51:31,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 3: [2022-11-26 13:51:31,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 11: [2022-11-26 13:51:31,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 13:51:31,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 13:51:31,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 11: [2022-11-26 13:51:31,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 13:51:31,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 13:51:31,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 11: [2022-11-26 13:51:31,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 8: [2022-11-26 13:51:31,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:51:31,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:51:31,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 13:51:31,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 13: [2022-11-26 13:51:31,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:51:31,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 13:51:31,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 
11: [2022-11-26 13:51:31,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 8: [2022-11-26 13:51:31,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 11: [2022-11-26 13:51:31,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 8: [2022-11-26 13:51:31,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 2: [2022-11-26 13:51:31,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:51:31,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:51:31,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 13:51:31,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 13:51:31,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 2: [2022-11-26 13:51:31,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 1: [2022-11-26 13:51:31,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 13:51:31,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 13:51:31,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 9: [2022-11-26 13:51:31,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:51:31,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 13:51:31,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 6: [2022-11-26 13:51:31,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:51:31,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 13:51:31,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 9: [2022-11-26 13:51:31,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 13:51:31,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 13:51:31,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 1: [2022-11-26 13:51:31,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 13:51:31,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 13:51:31,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 12: [2022-11-26 13:51:31,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 13:51:31,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 13:51:31,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 13: [2022-11-26 13:51:31,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 13:51:31,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 13:51:31,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 2: [2022-11-26 13:51:31,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 13:51:31,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 13:51:31,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 2: [2022-11-26 13:51:31,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 13:51:31,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 13:51:31,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 0: [2022-11-26 13:51:31,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:51:31,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:51:31,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 13:51:31,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 13:51:31,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 13:51:31,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 0: [2022-11-26 13:51:31,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 5: [2022-11-26 13:51:31,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 13:51:31,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 13:51:31,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 
6: [2022-11-26 13:51:31,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:51:31,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 13:51:31,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 6: [2022-11-26 13:51:31,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:51:31,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 13:51:31,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 6: [2022-11-26 13:51:31,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 13:51:31,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 13:51:31,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 4: [2022-11-26 13:51:31,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 13:51:31,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 13:51:31,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 
0: [2022-11-26 13:51:31,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 13:51:31,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 7: [2022-11-26 13:51:31,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:51:31,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:51:31,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:51:31,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:51:31,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 13:51:31,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 13:51:31,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 13:51:31,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 13:51:31,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 13:51:31,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 7: [2022-11-26 13:51:31,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 7: [2022-11-26 13:51:31,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 13:51:31,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 7: [2022-11-26 13:51:31,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 7: [2022-11-26 13:51:31,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 7: [2022-11-26 13:51:31,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 13:51:31,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 13:51:31,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 7: [2022-11-26 13:51:31,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:51:31,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 13:51:31,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 7: [2022-11-26 13:51:31,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 13:51:31,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 13:51:31,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 15: [2022-11-26 13:51:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:51:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:51:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:51:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 13:51:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:51:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:51:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 13:51:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:51:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 13:51:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 13:51:31,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 13:51:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 13:51:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 15: [2022-11-26 13:51:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 
15: [2022-11-26 13:51:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 13:51:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 13:51:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 15: [2022-11-26 13:51:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 13:51:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 15: [2022-11-26 13:51:31,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 13:51:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 15: [2022-11-26 13:51:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 15: [2022-11-26 13:51:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 15: [2022-11-26 13:51:31,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 14: [2022-11-26 13:51:31,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:51:31,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 13:51:31,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 13:51:31,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:51:31,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 14: [2022-11-26 13:51:31,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 13:51:31,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 13:51:31,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 14: [2022-11-26 13:51:31,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:51:31,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 14: [2022-11-26 13:51:31,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:51:31,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 13:51:31,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 13:51:31,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 13:51:31,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 13:51:31,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 14: [2022-11-26 13:51:31,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 14: [2022-11-26 13:51:31,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 14: [2022-11-26 13:51:31,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:51:31,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 13:51:31,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 13:51:31,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step60000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 13:51:31,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 14: [2022-11-26 13:51:31,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step60000 is ready now! 
0: successfully saved checkpoint at iteration 60000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3930.98 15: iteration 60010/ 125429 | consumed samples: 15362560 | consumed tokens: 31462522880 | elapsed time per iteration (s): 1.49 | learning rate: 1.176E-04 | global batch size: 256 | lm loss: 1.967852E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 171.249 | TFLOPs: 28.30 | 15: iteration 60020/ 125429 | consumed samples: 15365120 | consumed tokens: 31467765760 | elapsed time per iteration (s): 1.05 | learning rate: 1.176E-04 | global batch size: 256 | lm loss: 1.963909E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.912 | TFLOPs: 40.14 | 15: iteration 60030/ 125429 | consumed samples: 15367680 | consumed tokens: 31473008640 | elapsed time per iteration (s): 1.14 | learning rate: 1.175E-04 | global batch size: 256 | lm loss: 2.001793E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 224.079 | TFLOPs: 37.03 | 15: iteration 60040/ 125429 | consumed samples: 15370240 | consumed tokens: 31478251520 | elapsed time per iteration (s): 1.06 | learning rate: 1.175E-04 | global batch size: 256 | lm loss: 1.984225E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.652 | TFLOPs: 39.93 | 15: iteration 60050/ 125429 | consumed samples: 15372800 | consumed tokens: 31483494400 | elapsed time per iteration (s): 1.06 | learning rate: 1.175E-04 | global batch size: 256 | lm loss: 1.948870E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.236 | TFLOPs: 40.03 | 15: iteration 60060/ 125429 | consumed samples: 15375360 | consumed tokens: 31488737280 | elapsed time per iteration (s): 1.08 | 
learning rate: 1.175E-04 | global batch size: 256 | lm loss: 1.993724E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.957 | TFLOPs: 39.32 | 15: iteration 60070/ 125429 | consumed samples: 15377920 | consumed tokens: 31493980160 | elapsed time per iteration (s): 1.05 | learning rate: 1.174E-04 | global batch size: 256 | lm loss: 2.000262E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.113 | TFLOPs: 40.18 | 15: iteration 60080/ 125429 | consumed samples: 15380480 | consumed tokens: 31499223040 | elapsed time per iteration (s): 1.04 | learning rate: 1.174E-04 | global batch size: 256 | lm loss: 1.963304E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.746 | TFLOPs: 40.61 | 15: iteration 60090/ 125429 | consumed samples: 15383040 | consumed tokens: 31504465920 | elapsed time per iteration (s): 1.03 | learning rate: 1.174E-04 | global batch size: 256 | lm loss: 2.003131E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.617 | TFLOPs: 40.92 | 15: iteration 60100/ 125429 | consumed samples: 15385600 | consumed tokens: 31509708800 | elapsed time per iteration (s): 1.04 | learning rate: 1.174E-04 | global batch size: 256 | lm loss: 1.989365E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.116 | TFLOPs: 40.51 | 15: iteration 60110/ 125429 | consumed samples: 15388160 | consumed tokens: 31514951680 | elapsed time per iteration (s): 1.04 | learning rate: 1.174E-04 | global batch size: 256 | lm loss: 2.003031E+00 | grad norm: 0.123 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.016 | TFLOPs: 40.66 | 15: iteration 60120/ 
125429 | consumed samples: 15390720 | consumed tokens: 31520194560 | elapsed time per iteration (s): 1.02 | learning rate: 1.173E-04 | global batch size: 256 | lm loss: 1.959873E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.583 | TFLOPs: 41.58 | 15: iteration 60130/ 125429 | consumed samples: 15393280 | consumed tokens: 31525437440 | elapsed time per iteration (s): 1.05 | learning rate: 1.173E-04 | global batch size: 256 | lm loss: 1.992879E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.779 | TFLOPs: 40.45 | 15: iteration 60140/ 125429 | consumed samples: 15395840 | consumed tokens: 31530680320 | elapsed time per iteration (s): 1.05 | learning rate: 1.173E-04 | global batch size: 256 | lm loss: 1.998846E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.768 | TFLOPs: 40.12 | 15: iteration 60150/ 125429 | consumed samples: 15398400 | consumed tokens: 31535923200 | elapsed time per iteration (s): 1.05 | learning rate: 1.173E-04 | global batch size: 256 | lm loss: 1.993678E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.724 | TFLOPs: 40.28 | 15: iteration 60160/ 125429 | consumed samples: 15400960 | consumed tokens: 31541166080 | elapsed time per iteration (s): 1.03 | learning rate: 1.172E-04 | global batch size: 256 | lm loss: 1.995591E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.043 | TFLOPs: 41.16 | 15: iteration 60170/ 125429 | consumed samples: 15403520 | consumed tokens: 31546408960 | elapsed time per iteration (s): 1.02 | learning rate: 1.172E-04 | global batch size: 256 | lm loss: 1.994145E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 249.941 | TFLOPs: 41.30 | 15: iteration 60180/ 125429 | consumed samples: 15406080 | consumed tokens: 31551651840 | elapsed time per iteration (s): 1.04 | learning rate: 1.172E-04 | global batch size: 256 | lm loss: 1.987236E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.379 | TFLOPs: 40.55 | 15: iteration 60190/ 125429 | consumed samples: 15408640 | consumed tokens: 31556894720 | elapsed time per iteration (s): 1.02 | learning rate: 1.172E-04 | global batch size: 256 | lm loss: 2.002552E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.470 | TFLOPs: 41.56 | 15: iteration 60200/ 125429 | consumed samples: 15411200 | consumed tokens: 31562137600 | elapsed time per iteration (s): 1.05 | learning rate: 1.171E-04 | global batch size: 256 | lm loss: 2.021976E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.793 | TFLOPs: 40.29 | 15: iteration 60210/ 125429 | consumed samples: 15413760 | consumed tokens: 31567380480 | elapsed time per iteration (s): 1.05 | learning rate: 1.171E-04 | global batch size: 256 | lm loss: 1.968774E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.823 | TFLOPs: 40.46 | 15: iteration 60220/ 125429 | consumed samples: 15416320 | consumed tokens: 31572623360 | elapsed time per iteration (s): 1.03 | learning rate: 1.171E-04 | global batch size: 256 | lm loss: 2.016377E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.547 | TFLOPs: 41.24 | 15: iteration 60230/ 125429 | consumed samples: 15418880 | consumed tokens: 31577866240 | elapsed time per iteration (s): 1.03 | learning rate: 
1.171E-04 | global batch size: 256 | lm loss: 1.979925E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.490 | TFLOPs: 41.06 | 15: iteration 60240/ 125429 | consumed samples: 15421440 | consumed tokens: 31583109120 | elapsed time per iteration (s): 1.06 | learning rate: 1.171E-04 | global batch size: 256 | lm loss: 1.973860E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.264 | TFLOPs: 39.87 | 15: iteration 60250/ 125429 | consumed samples: 15424000 | consumed tokens: 31588352000 | elapsed time per iteration (s): 1.06 | learning rate: 1.170E-04 | global batch size: 256 | lm loss: 1.977030E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.452 | TFLOPs: 39.90 | 15: iteration 60260/ 125429 | consumed samples: 15426560 | consumed tokens: 31593594880 | elapsed time per iteration (s): 1.04 | learning rate: 1.170E-04 | global batch size: 256 | lm loss: 1.980230E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.562 | TFLOPs: 40.75 | 15: iteration 60270/ 125429 | consumed samples: 15429120 | consumed tokens: 31598837760 | elapsed time per iteration (s): 1.05 | learning rate: 1.170E-04 | global batch size: 256 | lm loss: 2.017157E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.783 | TFLOPs: 40.45 | 15: iteration 60280/ 125429 | consumed samples: 15431680 | consumed tokens: 31604080640 | elapsed time per iteration (s): 1.05 | learning rate: 1.170E-04 | global batch size: 256 | lm loss: 1.966509E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.862 | TFLOPs: 40.13 | 15: iteration 60290/ 125429 | 
consumed samples: 15434240 | consumed tokens: 31609323520 | elapsed time per iteration (s): 1.04 | learning rate: 1.169E-04 | global batch size: 256 | lm loss: 1.970934E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.201 | TFLOPs: 40.85 | 15: iteration 60300/ 125429 | consumed samples: 15436800 | consumed tokens: 31614566400 | elapsed time per iteration (s): 1.03 | learning rate: 1.169E-04 | global batch size: 256 | lm loss: 1.999745E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.469 | TFLOPs: 41.06 | 15: iteration 60310/ 125429 | consumed samples: 15439360 | consumed tokens: 31619809280 | elapsed time per iteration (s): 1.04 | learning rate: 1.169E-04 | global batch size: 256 | lm loss: 1.965168E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.645 | TFLOPs: 40.59 | 15: iteration 60320/ 125429 | consumed samples: 15441920 | consumed tokens: 31625052160 | elapsed time per iteration (s): 1.03 | learning rate: 1.169E-04 | global batch size: 256 | lm loss: 1.981729E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.648 | TFLOPs: 40.93 | 15: iteration 60330/ 125429 | consumed samples: 15444480 | consumed tokens: 31630295040 | elapsed time per iteration (s): 1.07 | learning rate: 1.169E-04 | global batch size: 256 | lm loss: 2.010034E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.380 | TFLOPs: 39.39 | 15: iteration 60340/ 125429 | consumed samples: 15447040 | consumed tokens: 31635537920 | elapsed time per iteration (s): 1.05 | learning rate: 1.168E-04 | global batch size: 256 | lm loss: 1.984736E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 243.270 | TFLOPs: 40.20 | 15: iteration 60350/ 125429 | consumed samples: 15449600 | consumed tokens: 31640780800 | elapsed time per iteration (s): 1.03 | learning rate: 1.168E-04 | global batch size: 256 | lm loss: 1.988519E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.846 | TFLOPs: 40.96 | 15: iteration 60360/ 125429 | consumed samples: 15452160 | consumed tokens: 31646023680 | elapsed time per iteration (s): 1.13 | learning rate: 1.168E-04 | global batch size: 256 | lm loss: 1.983632E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.171 | TFLOPs: 37.38 | 15: iteration 60370/ 125429 | consumed samples: 15454720 | consumed tokens: 31651266560 | elapsed time per iteration (s): 1.10 | learning rate: 1.168E-04 | global batch size: 256 | lm loss: 1.993927E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.418 | TFLOPs: 38.57 | 15: iteration 60380/ 125429 | consumed samples: 15457280 | consumed tokens: 31656509440 | elapsed time per iteration (s): 1.05 | learning rate: 1.167E-04 | global batch size: 256 | lm loss: 1.969794E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.834 | TFLOPs: 40.30 | 15: iteration 60390/ 125429 | consumed samples: 15459840 | consumed tokens: 31661752320 | elapsed time per iteration (s): 1.03 | learning rate: 1.167E-04 | global batch size: 256 | lm loss: 2.023932E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.031 | TFLOPs: 40.99 | 15: iteration 60400/ 125429 | consumed samples: 15462400 | consumed tokens: 31666995200 | elapsed time per iteration (s): 1.03 | learning rate: 1.167E-04 | global batch 
size: 256 | lm loss: 1.988358E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.561 | TFLOPs: 41.08 | 15: iteration 60410/ 125429 | consumed samples: 15464960 | consumed tokens: 31672238080 | elapsed time per iteration (s): 1.06 | learning rate: 1.167E-04 | global batch size: 256 | lm loss: 1.999500E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.773 | TFLOPs: 39.95 | 15: iteration 60420/ 125429 | consumed samples: 15467520 | consumed tokens: 31677480960 | elapsed time per iteration (s): 1.05 | learning rate: 1.166E-04 | global batch size: 256 | lm loss: 1.979432E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.441 | TFLOPs: 40.23 | 15: iteration 60430/ 125429 | consumed samples: 15470080 | consumed tokens: 31682723840 | elapsed time per iteration (s): 1.04 | learning rate: 1.166E-04 | global batch size: 256 | lm loss: 1.968534E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.242 | TFLOPs: 40.53 | 15: iteration 60440/ 125429 | consumed samples: 15472640 | consumed tokens: 31687966720 | elapsed time per iteration (s): 1.06 | learning rate: 1.166E-04 | global batch size: 256 | lm loss: 1.988863E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.296 | TFLOPs: 40.04 | 15: iteration 60450/ 125429 | consumed samples: 15475200 | consumed tokens: 31693209600 | elapsed time per iteration (s): 1.03 | learning rate: 1.166E-04 | global batch size: 256 | lm loss: 1.985719E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.700 | TFLOPs: 40.93 | 15: iteration 60460/ 125429 | consumed samples: 15477760 | 
consumed tokens: 31698452480 | elapsed time per iteration (s): 1.04 | learning rate: 1.166E-04 | global batch size: 256 | lm loss: 2.008235E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.120 | TFLOPs: 40.67 | 15: iteration 60470/ 125429 | consumed samples: 15480320 | consumed tokens: 31703695360 | elapsed time per iteration (s): 1.07 | learning rate: 1.165E-04 | global batch size: 256 | lm loss: 2.001222E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.369 | TFLOPs: 39.56 | 15: iteration 60480/ 125429 | consumed samples: 15482880 | consumed tokens: 31708938240 | elapsed time per iteration (s): 1.02 | learning rate: 1.165E-04 | global batch size: 256 | lm loss: 1.994805E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.898 | TFLOPs: 41.46 | 15: iteration 60490/ 125429 | consumed samples: 15485440 | consumed tokens: 31714181120 | elapsed time per iteration (s): 1.05 | learning rate: 1.165E-04 | global batch size: 256 | lm loss: 1.984904E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.527 | TFLOPs: 40.41 | 15: iteration 60500/ 125429 | consumed samples: 15488000 | consumed tokens: 31719424000 | elapsed time per iteration (s): 1.04 | learning rate: 1.165E-04 | global batch size: 256 | lm loss: 1.986535E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.245 | TFLOPs: 40.86 | 15: iteration 60510/ 125429 | consumed samples: 15490560 | consumed tokens: 31724666880 | elapsed time per iteration (s): 1.05 | learning rate: 1.164E-04 | global batch size: 256 | lm loss: 1.973008E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 244.693 | TFLOPs: 40.44 | 15: iteration 60520/ 125429 | consumed samples: 15493120 | consumed tokens: 31729909760 | elapsed time per iteration (s): 1.03 | learning rate: 1.164E-04 | global batch size: 256 | lm loss: 1.949443E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.018 | TFLOPs: 40.99 | 15: iteration 60530/ 125429 | consumed samples: 15495680 | consumed tokens: 31735152640 | elapsed time per iteration (s): 1.04 | learning rate: 1.164E-04 | global batch size: 256 | lm loss: 1.968127E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.574 | TFLOPs: 40.58 | 15: iteration 60540/ 125429 | consumed samples: 15498240 | consumed tokens: 31740395520 | elapsed time per iteration (s): 1.04 | learning rate: 1.164E-04 | global batch size: 256 | lm loss: 1.981124E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.569 | TFLOPs: 40.58 | 15: iteration 60550/ 125429 | consumed samples: 15500800 | consumed tokens: 31745638400 | elapsed time per iteration (s): 1.02 | learning rate: 1.164E-04 | global batch size: 256 | lm loss: 2.024438E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.397 | TFLOPs: 41.38 | 15: iteration 60560/ 125429 | consumed samples: 15503360 | consumed tokens: 31750881280 | elapsed time per iteration (s): 1.05 | learning rate: 1.163E-04 | global batch size: 256 | lm loss: 1.973325E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.159 | TFLOPs: 40.18 | 15: iteration 60570/ 125429 | consumed samples: 15505920 | consumed tokens: 31756124160 | elapsed time per iteration (s): 1.06 | learning rate: 1.163E-04 | global batch size: 256 | lm loss: 
1.999674E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.733 | TFLOPs: 39.95 | 15: iteration 60580/ 125429 | consumed samples: 15508480 | consumed tokens: 31761367040 | elapsed time per iteration (s): 1.05 | learning rate: 1.163E-04 | global batch size: 256 | lm loss: 1.976275E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.709 | TFLOPs: 40.11 | 15: iteration 60590/ 125429 | consumed samples: 15511040 | consumed tokens: 31766609920 | elapsed time per iteration (s): 1.07 | learning rate: 1.163E-04 | global batch size: 256 | lm loss: 1.982126E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.234 | TFLOPs: 39.70 | 15: iteration 60600/ 125429 | consumed samples: 15513600 | consumed tokens: 31771852800 | elapsed time per iteration (s): 1.03 | learning rate: 1.162E-04 | global batch size: 256 | lm loss: 1.979260E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.895 | TFLOPs: 40.97 | 15: iteration 60610/ 125429 | consumed samples: 15516160 | consumed tokens: 31777095680 | elapsed time per iteration (s): 1.06 | learning rate: 1.162E-04 | global batch size: 256 | lm loss: 2.017273E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.887 | TFLOPs: 39.97 | 15: iteration 60620/ 125429 | consumed samples: 15518720 | consumed tokens: 31782338560 | elapsed time per iteration (s): 1.04 | learning rate: 1.162E-04 | global batch size: 256 | lm loss: 1.979370E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.555 | TFLOPs: 40.58 | 15: iteration 60630/ 125429 | consumed samples: 15521280 | consumed tokens: 
31787581440 | elapsed time per iteration (s): 1.06 | learning rate: 1.162E-04 | global batch size: 256 | lm loss: 1.998161E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.891 | TFLOPs: 39.81 | 15: iteration 60640/ 125429 | consumed samples: 15523840 | consumed tokens: 31792824320 | elapsed time per iteration (s): 1.07 | learning rate: 1.161E-04 | global batch size: 256 | lm loss: 1.981200E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.099 | TFLOPs: 39.51 | 15: iteration 60650/ 125429 | consumed samples: 15526400 | consumed tokens: 31798067200 | elapsed time per iteration (s): 1.04 | learning rate: 1.161E-04 | global batch size: 256 | lm loss: 1.984660E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.217 | TFLOPs: 40.85 | 15: iteration 60660/ 125429 | consumed samples: 15528960 | consumed tokens: 31803310080 | elapsed time per iteration (s): 1.05 | learning rate: 1.161E-04 | global batch size: 256 | lm loss: 1.975958E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.518 | TFLOPs: 40.24 | 15: iteration 60670/ 125429 | consumed samples: 15531520 | consumed tokens: 31808552960 | elapsed time per iteration (s): 1.05 | learning rate: 1.161E-04 | global batch size: 256 | lm loss: 1.971210E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.077 | TFLOPs: 40.34 | 15: iteration 60680/ 125429 | consumed samples: 15534080 | consumed tokens: 31813795840 | elapsed time per iteration (s): 1.05 | learning rate: 1.161E-04 | global batch size: 256 | lm loss: 1.991897E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 243.225 | TFLOPs: 40.19 | 15: iteration 60690/ 125429 | consumed samples: 15536640 | consumed tokens: 31819038720 | elapsed time per iteration (s): 1.07 | learning rate: 1.160E-04 | global batch size: 256 | lm loss: 1.996743E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.120 | TFLOPs: 39.68 | 15: iteration 60700/ 125429 | consumed samples: 15539200 | consumed tokens: 31824281600 | elapsed time per iteration (s): 1.04 | learning rate: 1.160E-04 | global batch size: 256 | lm loss: 1.999151E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.988 | TFLOPs: 40.49 | 15: iteration 60710/ 125429 | consumed samples: 15541760 | consumed tokens: 31829524480 | elapsed time per iteration (s): 1.17 | learning rate: 1.160E-04 | global batch size: 256 | lm loss: 1.989939E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 218.769 | TFLOPs: 36.15 | 15: iteration 60720/ 125429 | consumed samples: 15544320 | consumed tokens: 31834767360 | elapsed time per iteration (s): 1.03 | learning rate: 1.160E-04 | global batch size: 256 | lm loss: 2.004819E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.100 | TFLOPs: 41.00 | 15: iteration 60730/ 125429 | consumed samples: 15546880 | consumed tokens: 31840010240 | elapsed time per iteration (s): 1.07 | learning rate: 1.159E-04 | global batch size: 256 | lm loss: 1.991807E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.091 | TFLOPs: 39.68 | 15: iteration 60740/ 125429 | consumed samples: 15549440 | consumed tokens: 31845253120 | elapsed time per iteration (s): 1.04 | learning rate: 1.159E-04 | global batch size: 256 | lm loss: 1.989741E+00 | grad 
norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.080 | TFLOPs: 40.50 | 15: iteration 60750/ 125429 | consumed samples: 15552000 | consumed tokens: 31850496000 | elapsed time per iteration (s): 1.09 | learning rate: 1.159E-04 | global batch size: 256 | lm loss: 1.976280E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.961 | TFLOPs: 38.83 | 15: iteration 60760/ 125429 | consumed samples: 15554560 | consumed tokens: 31855738880 | elapsed time per iteration (s): 1.03 | learning rate: 1.159E-04 | global batch size: 256 | lm loss: 2.013768E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.442 | TFLOPs: 41.22 | 15: iteration 60770/ 125429 | consumed samples: 15557120 | consumed tokens: 31860981760 | elapsed time per iteration (s): 1.04 | learning rate: 1.159E-04 | global batch size: 256 | lm loss: 1.997881E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.247 | TFLOPs: 40.53 | 15: iteration 60780/ 125429 | consumed samples: 15559680 | consumed tokens: 31866224640 | elapsed time per iteration (s): 1.03 | learning rate: 1.158E-04 | global batch size: 256 | lm loss: 1.974931E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.758 | TFLOPs: 40.94 | 15: iteration 60790/ 125429 | consumed samples: 15562240 | consumed tokens: 31871467520 | elapsed time per iteration (s): 1.05 | learning rate: 1.158E-04 | global batch size: 256 | lm loss: 1.976401E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.676 | TFLOPs: 40.43 | 15: iteration 60800/ 125429 | consumed samples: 15564800 | consumed tokens: 31876710400 | elapsed time 
per iteration (s): 1.04 | learning rate: 1.158E-04 | global batch size: 256 | lm loss: 1.971051E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.968 | TFLOPs: 40.81 | 15: iteration 60810/ 125429 | consumed samples: 15567360 | consumed tokens: 31881953280 | elapsed time per iteration (s): 1.05 | learning rate: 1.158E-04 | global batch size: 256 | lm loss: 1.987026E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.743 | TFLOPs: 40.45 | 15: iteration 60820/ 125429 | consumed samples: 15569920 | consumed tokens: 31887196160 | elapsed time per iteration (s): 1.06 | learning rate: 1.157E-04 | global batch size: 256 | lm loss: 1.990296E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.595 | TFLOPs: 39.76 | 15: iteration 60830/ 125429 | consumed samples: 15572480 | consumed tokens: 31892439040 | elapsed time per iteration (s): 1.07 | learning rate: 1.157E-04 | global batch size: 256 | lm loss: 1.979635E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.177 | TFLOPs: 39.53 | 15: iteration 60840/ 125429 | consumed samples: 15575040 | consumed tokens: 31897681920 | elapsed time per iteration (s): 1.07 | learning rate: 1.157E-04 | global batch size: 256 | lm loss: 1.974225E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.188 | TFLOPs: 39.53 | 15: iteration 60850/ 125429 | consumed samples: 15577600 | consumed tokens: 31902924800 | elapsed time per iteration (s): 1.04 | learning rate: 1.157E-04 | global batch size: 256 | lm loss: 1.990713E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.680 | TFLOPs: 
40.77 | 15: iteration 60860/ 125429 | consumed samples: 15580160 | consumed tokens: 31908167680 | elapsed time per iteration (s): 1.05 | learning rate: 1.156E-04 | global batch size: 256 | lm loss: 1.957195E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.742 | TFLOPs: 40.28 | 15: iteration 60870/ 125429 | consumed samples: 15582720 | consumed tokens: 31913410560 | elapsed time per iteration (s): 1.04 | learning rate: 1.156E-04 | global batch size: 256 | lm loss: 2.009332E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.572 | TFLOPs: 40.58 | 15: iteration 60880/ 125429 | consumed samples: 15585280 | consumed tokens: 31918653440 | elapsed time per iteration (s): 1.09 | learning rate: 1.156E-04 | global batch size: 256 | lm loss: 1.973438E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.879 | TFLOPs: 38.82 | 15: iteration 60890/ 125429 | consumed samples: 15587840 | consumed tokens: 31923896320 | elapsed time per iteration (s): 1.05 | learning rate: 1.156E-04 | global batch size: 256 | lm loss: 2.008863E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.766 | TFLOPs: 40.12 | 15: iteration 60900/ 125429 | consumed samples: 15590400 | consumed tokens: 31929139200 | elapsed time per iteration (s): 1.08 | learning rate: 1.156E-04 | global batch size: 256 | lm loss: 2.003028E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.511 | TFLOPs: 39.09 | 15: iteration 60910/ 125429 | consumed samples: 15592960 | consumed tokens: 31934382080 | elapsed time per iteration (s): 1.07 | learning rate: 1.155E-04 | global batch size: 256 | lm loss: 1.999358E+00 | grad norm: 0.130 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.173 | TFLOPs: 39.53 | 15: iteration 60920/ 125429 | consumed samples: 15595520 | consumed tokens: 31939624960 | elapsed time per iteration (s): 1.08 | learning rate: 1.155E-04 | global batch size: 256 | lm loss: 1.994431E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.952 | TFLOPs: 39.16 | 15: iteration 60930/ 125429 | consumed samples: 15598080 | consumed tokens: 31944867840 | elapsed time per iteration (s): 1.08 | learning rate: 1.155E-04 | global batch size: 256 | lm loss: 1.976406E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.144 | TFLOPs: 39.02 | 15: iteration 60940/ 125429 | consumed samples: 15600640 | consumed tokens: 31950110720 | elapsed time per iteration (s): 1.04 | learning rate: 1.155E-04 | global batch size: 256 | lm loss: 1.988498E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.880 | TFLOPs: 40.63 | 15: iteration 60950/ 125429 | consumed samples: 15603200 | consumed tokens: 31955353600 | elapsed time per iteration (s): 1.06 | learning rate: 1.154E-04 | global batch size: 256 | lm loss: 1.973246E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.419 | TFLOPs: 39.90 | 15: iteration 60960/ 125429 | consumed samples: 15605760 | consumed tokens: 31960596480 | elapsed time per iteration (s): 1.06 | learning rate: 1.154E-04 | global batch size: 256 | lm loss: 1.964527E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.114 | TFLOPs: 39.85 | 15: iteration 60970/ 125429 | consumed samples: 15608320 | consumed tokens: 31965839360 | elapsed time per iteration (s): 1.03 | 
learning rate: 1.154E-04 | global batch size: 256 | lm loss: 1.988744E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.974 | TFLOPs: 40.98 | 15: iteration 60980/ 125429 | consumed samples: 15610880 | consumed tokens: 31971082240 | elapsed time per iteration (s): 1.06 | learning rate: 1.154E-04 | global batch size: 256 | lm loss: 2.001664E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.783 | TFLOPs: 39.96 | 15: iteration 60990/ 125429 | consumed samples: 15613440 | consumed tokens: 31976325120 | elapsed time per iteration (s): 1.06 | learning rate: 1.154E-04 | global batch size: 256 | lm loss: 1.996263E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.512 | TFLOPs: 40.08 | 15: iteration 61000/ 125429 | consumed samples: 15616000 | consumed tokens: 31981568000 | elapsed time per iteration (s): 1.08 | learning rate: 1.153E-04 | global batch size: 256 | lm loss: 1.963855E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.807 | TFLOPs: 39.13 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 61000 | lm loss value: 1.959888E+00 | lm loss PPL: 7.098528E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 61000 to checkpoints_1b5 0: [2022-11-26 14:09:04,779] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step61000 is begin to save! 0: [2022-11-26 14:09:04,786] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 14:09:05,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_01-model_00-model_states.pt. 0: [2022-11-26 14:09:05,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_03-model_00-model_states.pt... 0: [2022-11-26 14:09:05,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_03-model_00-model_states.pt. 0: [2022-11-26 14:09:05,121] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_04-model_00-model_states.pt... 0: [2022-11-26 14:09:05,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_04-model_00-model_states.pt. 0: [2022-11-26 14:09:05,225] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_05-model_00-model_states.pt... 0: [2022-11-26 14:09:05,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_05-model_00-model_states.pt. 0: [2022-11-26 14:09:05,328] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_06-model_00-model_states.pt... 0: [2022-11-26 14:09:05,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_06-model_00-model_states.pt. 0: [2022-11-26 14:09:05,430] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_07-model_00-model_states.pt... 0: [2022-11-26 14:09:05,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_07-model_00-model_states.pt. 0: [2022-11-26 14:09:05,539] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 14:09:05,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_08-model_00-model_states.pt. 0: [2022-11-26 14:09:05,639] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_09-model_00-model_states.pt... 0: [2022-11-26 14:09:05,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_09-model_00-model_states.pt. 0: [2022-11-26 14:09:05,746] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_10-model_00-model_states.pt... 0: [2022-11-26 14:09:05,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_10-model_00-model_states.pt. 0: [2022-11-26 14:09:05,847] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_11-model_00-model_states.pt... 0: [2022-11-26 14:09:05,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_11-model_00-model_states.pt. 0: [2022-11-26 14:09:05,955] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_12-model_00-model_states.pt... 0: [2022-11-26 14:09:06,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_12-model_00-model_states.pt. 0: [2022-11-26 14:09:06,060] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_13-model_00-model_states.pt... 0: [2022-11-26 14:09:06,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_13-model_00-model_states.pt. 0: [2022-11-26 14:09:06,160] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 14:09:06,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_14-model_00-model_states.pt. 0: [2022-11-26 14:09:06,264] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_15-model_00-model_states.pt... 0: [2022-11-26 14:09:06,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_15-model_00-model_states.pt. 0: [2022-11-26 14:09:06,367] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_16-model_00-model_states.pt... 0: [2022-11-26 14:09:06,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_16-model_00-model_states.pt. 0: [2022-11-26 14:09:06,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_17-model_00-model_states.pt... 0: [2022-11-26 14:09:06,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_17-model_00-model_states.pt. 0: [2022-11-26 14:09:06,574] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_18-model_00-model_states.pt... 0: [2022-11-26 14:09:06,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_18-model_00-model_states.pt. 0: [2022-11-26 14:09:06,678] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_19-model_00-model_states.pt... 0: [2022-11-26 14:09:06,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_19-model_00-model_states.pt. 0: [2022-11-26 14:09:06,780] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 14:09:06,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_20-model_00-model_states.pt. 0: [2022-11-26 14:09:06,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_21-model_00-model_states.pt... 0: [2022-11-26 14:09:06,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_21-model_00-model_states.pt. 0: [2022-11-26 14:09:06,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_22-model_00-model_states.pt... 0: [2022-11-26 14:09:07,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_22-model_00-model_states.pt. 0: [2022-11-26 14:09:07,090] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_23-model_00-model_states.pt... 0: [2022-11-26 14:09:07,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_23-model_00-model_states.pt. 0: [2022-11-26 14:09:07,195] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_24-model_00-model_states.pt... 0: [2022-11-26 14:09:07,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_24-model_00-model_states.pt. 0: [2022-11-26 14:09:07,296] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_25-model_00-model_states.pt... 0: [2022-11-26 14:09:07,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_25-model_00-model_states.pt. 0: [2022-11-26 14:09:07,400] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 14:09:07,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_26-model_00-model_states.pt. 0: [2022-11-26 14:09:07,504] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_27-model_00-model_states.pt... 0: [2022-11-26 14:09:07,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_27-model_00-model_states.pt. 0: [2022-11-26 14:09:07,608] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_28-model_00-model_states.pt... 0: [2022-11-26 14:09:07,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_28-model_00-model_states.pt. 0: [2022-11-26 14:09:07,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_29-model_00-model_states.pt... 0: [2022-11-26 14:09:07,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_29-model_00-model_states.pt. 0: [2022-11-26 14:09:07,819] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_30-model_00-model_states.pt... 0: [2022-11-26 14:09:07,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_30-model_00-model_states.pt. 0: [2022-11-26 14:09:07,924] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/layer_32-model_00-model_states.pt... 0: [2022-11-26 14:09:07,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 14:09:07,930] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step61000/mp_rank_00_model_states.pt 0: [2022-11-26 14:09:07,930] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/mp_rank_00_model_states.pt... 0: [2022-11-26 14:09:07,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/mp_rank_00_model_states.pt. 0: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
5: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
9: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
6: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
2: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
12: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
11: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
14: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
12: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:09:07,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step61000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:09:08,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
8: [2022-11-26 14:09:08,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:09:08,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 8: [2022-11-26 14:09:08,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 6: [2022-11-26 14:09:08,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 8: [2022-11-26 14:09:08,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 8: [2022-11-26 14:09:08,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:09:08,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 14:09:08,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 2: [2022-11-26 14:09:08,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:09:08,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 14:09:08,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 3: [2022-11-26 14:09:08,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 14:09:08,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 14:09:08,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 12: [2022-11-26 14:09:08,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:09:08,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 14:09:08,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 1: [2022-11-26 14:09:08,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:09:08,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 13: [2022-11-26 14:09:08,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:09:08,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 14:09:08,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:09:08,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 13: [2022-11-26 14:09:08,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
13: [2022-11-26 14:09:08,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 14:09:08,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 9: [2022-11-26 14:09:08,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:09:08,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 14:09:08,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 8: [2022-11-26 14:09:08,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:09:08,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 14:09:08,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 15: [2022-11-26 14:09:08,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:09:08,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:09:08,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 14:09:08,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
5: [2022-11-26 14:09:08,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:09:08,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 14:09:08,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 7: [2022-11-26 14:09:08,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:09:08,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:09:08,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 14:09:08,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 14:09:08,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 7: [2022-11-26 14:09:08,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 15: [2022-11-26 14:09:08,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 14:09:08,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 15: [2022-11-26 14:09:08,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 14:09:08,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 14:09:08,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 4: [2022-11-26 14:09:08,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:09:08,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:09:08,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:09:08,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 14:09:08,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 12: [2022-11-26 14:09:08,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:09:08,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 4: [2022-11-26 14:09:08,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 12: [2022-11-26 14:09:08,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 14:09:08,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 14:09:08,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 14:09:08,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 12: [2022-11-26 14:09:08,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 6: [2022-11-26 14:09:08,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:09:08,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 14:09:08,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 4: [2022-11-26 14:09:08,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:09:08,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 14:09:08,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 15: [2022-11-26 14:09:08,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:09:08,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 14:09:08,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 14:09:08,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 2: [2022-11-26 14:09:08,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:09:08,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 14:09:08,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 9: [2022-11-26 14:09:08,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:09:08,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 14:09:08,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 1: [2022-11-26 14:09:08,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:09:08,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 14:09:08,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 9: [2022-11-26 14:09:08,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 14:09:08,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 14:09:08,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 7: [2022-11-26 14:09:08,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:09:08,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 14:09:08,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 7: [2022-11-26 14:09:08,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:09:08,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 14:09:08,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 7: [2022-11-26 14:09:08,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:09:08,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 14:09:08,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 5: [2022-11-26 14:09:08,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-26 14:09:08,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 14:09:08,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 2: [2022-11-26 14:09:08,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:09:08,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 14:09:08,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 4: [2022-11-26 14:09:08,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:09:08,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 14:09:08,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 9: [2022-11-26 14:09:08,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:09:08,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 2: [2022-11-26 14:09:08,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:09:08,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
2: [2022-11-26 14:09:08,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 4: [2022-11-26 14:09:08,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:09:08,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 4: [2022-11-26 14:09:08,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 14:09:08,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 9: [2022-11-26 14:09:08,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:09:08,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 14:09:08,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 1: [2022-11-26 14:09:08,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:09:08,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 14:09:08,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 13: [2022-11-26 14:09:08,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
8: [2022-11-26 14:09:08,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:09:08,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 13: [2022-11-26 14:09:08,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 14:09:08,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 8: [2022-11-26 14:09:08,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 13: [2022-11-26 14:09:08,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:09:08,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:09:08,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 12: [2022-11-26 14:09:08,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:09:08,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 13: [2022-11-26 14:09:08,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 8: [2022-11-26 14:09:08,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
8: [2022-11-26 14:09:08,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:09:08,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 14:09:08,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 12: [2022-11-26 14:09:08,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 14:09:08,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 12: [2022-11-26 14:09:08,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:09:08,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 14:09:08,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 12: [2022-11-26 14:09:08,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:09:08,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 14:09:08,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 10: [2022-11-26 14:09:08,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 14:09:08,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:09:08,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:09:08,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 14:09:08,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 14:09:08,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 14:09:08,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 10: [2022-11-26 14:09:08,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 10: [2022-11-26 14:09:08,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 7: [2022-11-26 14:09:08,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:09:08,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 14:09:08,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 6: [2022-11-26 14:09:08,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 14:09:08,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 14:09:08,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 3: [2022-11-26 14:09:08,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:09:08,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 14:09:08,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 1: [2022-11-26 14:09:08,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:09:08,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 14:09:08,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 15: [2022-11-26 14:09:08,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 14:09:08,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 15: [2022-11-26 14:09:08,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 14:09:08,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 14:09:08,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 10: [2022-11-26 14:09:08,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:09:08,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:09:08,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 15: [2022-11-26 14:09:08,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:09:08,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 10: [2022-11-26 14:09:08,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 2: [2022-11-26 14:09:08,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 4: [2022-11-26 14:09:08,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:09:08,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 10: [2022-11-26 14:09:08,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
4: [2022-11-26 14:09:08,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 10: [2022-11-26 14:09:08,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 14:09:08,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 0: [2022-11-26 14:09:08,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:09:08,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 14:09:08,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 0: [2022-11-26 14:09:08,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:09:08,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 14:09:08,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 6: [2022-11-26 14:09:08,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:09:08,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
6: [2022-11-26 14:09:08,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 14:09:08,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 6: [2022-11-26 14:09:08,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:09:08,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 14:09:08,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 0: [2022-11-26 14:09:08,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 14:09:08,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 15: [2022-11-26 14:09:08,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 14:09:08,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 15: [2022-11-26 14:09:08,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:09:08,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 14:09:08,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
5: [2022-11-26 14:09:08,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:09:08,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 14:09:08,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 3: [2022-11-26 14:09:08,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:09:08,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 14:09:08,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 2: [2022-11-26 14:09:08,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:09:08,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 14:09:08,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 1: [2022-11-26 14:09:08,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:09:08,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 14:09:08,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
3: [2022-11-26 14:09:08,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:09:08,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 14:09:08,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 3: [2022-11-26 14:09:08,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:09:08,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 14:09:08,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 5: [2022-11-26 14:09:08,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:09:08,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 14:09:08,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 6: [2022-11-26 14:09:08,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:09:08,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 14:09:08,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
5: [2022-11-26 14:09:08,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:09:08,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 14:09:08,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 5: [2022-11-26 14:09:08,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:09:08,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 14:09:08,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 1: [2022-11-26 14:09:08,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:09:08,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 14:09:08,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 7: [2022-11-26 14:09:08,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:09:08,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:09:08,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
7: [2022-11-26 14:09:08,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 13: [2022-11-26 14:09:08,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 8: [2022-11-26 14:09:08,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 13: [2022-11-26 14:09:08,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 8: [2022-11-26 14:09:08,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 7: [2022-11-26 14:09:08,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 13: [2022-11-26 14:09:08,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:09:08,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 14:09:08,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 3: [2022-11-26 14:09:08,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:09:08,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 14:09:08,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 
2: [2022-11-26 14:09:08,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:09:08,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 14:09:08,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 11: [2022-11-26 14:09:08,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 14:09:08,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 11: [2022-11-26 14:09:08,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:09:08,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 14:09:08,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 11: [2022-11-26 14:09:08,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:09:08,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 14:09:08,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 11: [2022-11-26 14:09:08,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 14:09:08,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 14:09:08,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 11: [2022-11-26 14:09:08,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:09:08,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 14:09:08,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 11: [2022-11-26 14:09:08,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:09:08,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 14:09:08,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 14: [2022-11-26 14:09:08,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:09:08,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 14:09:08,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 14: [2022-11-26 14:09:08,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 14:09:08,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:09:08,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:09:08,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 14:09:08,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 14:09:08,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 14:09:08,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 14: [2022-11-26 14:09:08,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 14: [2022-11-26 14:09:08,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 14: [2022-11-26 14:09:08,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:09:08,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 14:09:08,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 14: [2022-11-26 14:09:08,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 14:09:08,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:09:08,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 14:09:08,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 14:09:08,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 14: [2022-11-26 14:09:08,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 6: [2022-11-26 14:09:08,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:09:08,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 14:09:08,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 0: [2022-11-26 14:09:08,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:09:08,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:09:08,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:09:08,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
0: [2022-11-26 14:09:08,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 14:09:08,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 14:09:08,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 14:09:08,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 0: [2022-11-26 14:09:08,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 0: [2022-11-26 14:09:08,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 0: [2022-11-26 14:09:08,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 14:09:08,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 1: [2022-11-26 14:09:08,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:09:08,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 14:09:08,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 9: [2022-11-26 14:09:08,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 14:09:08,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 14:09:08,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 15: [2022-11-26 14:09:08,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:09:08,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 14:09:08,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 15: [2022-11-26 14:09:08,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:09:08,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 14:09:08,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 7: [2022-11-26 14:09:08,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:09:08,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 14:09:08,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 11: [2022-11-26 14:09:08,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 14:09:08,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 14:09:08,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 9: [2022-11-26 14:09:08,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:09:08,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 14:09:08,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 9: [2022-11-26 14:09:08,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:09:08,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 14:09:08,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 2: [2022-11-26 14:09:08,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:09:08,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 14:09:08,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 14: [2022-11-26 14:09:08,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 14:09:08,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 14:09:08,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 8: [2022-11-26 14:09:08,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:09:08,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 14:09:08,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 5: [2022-11-26 14:09:08,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:09:08,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 14:09:08,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 12: [2022-11-26 14:09:08,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:09:08,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 14:09:08,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 6: [2022-11-26 14:09:08,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 14:09:08,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 14:09:08,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 12: [2022-11-26 14:09:08,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:09:08,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 14:09:08,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 13: [2022-11-26 14:09:08,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:09:08,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 14:09:08,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 13: [2022-11-26 14:09:08,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:09:08,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 14:09:08,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 1: [2022-11-26 14:09:08,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 14:09:08,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 14:09:08,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 4: [2022-11-26 14:09:08,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:09:08,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 14:09:08,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 4: [2022-11-26 14:09:08,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:09:08,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 14:09:08,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 10: [2022-11-26 14:09:08,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:09:08,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 14:09:08,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 5: [2022-11-26 14:09:08,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 14:09:08,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 14:09:08,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 10: [2022-11-26 14:09:08,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:09:08,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 14:09:08,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 0: [2022-11-26 14:09:08,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:09:08,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 14:09:08,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 3: [2022-11-26 14:09:08,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:09:08,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 14:09:08,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 11: [2022-11-26 14:09:08,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-26 14:09:08,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step61000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 14:09:08,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step61000 is ready now! 0: successfully saved checkpoint at iteration 61000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3683.68 15: iteration 61010/ 125429 | consumed samples: 15618560 | consumed tokens: 31986810880 | elapsed time per iteration (s): 1.58 | learning rate: 1.153E-04 | global batch size: 256 | lm loss: 1.963797E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 161.838 | TFLOPs: 26.75 | 15: iteration 61020/ 125429 | consumed samples: 15621120 | consumed tokens: 31992053760 | elapsed time per iteration (s): 1.04 | learning rate: 1.153E-04 | global batch size: 256 | lm loss: 1.985396E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.262 | TFLOPs: 40.86 | 15: iteration 61030/ 125429 | consumed samples: 15623680 | consumed tokens: 31997296640 | elapsed time per iteration (s): 1.03 | learning rate: 1.153E-04 | global batch size: 256 | lm loss: 2.006405E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.746 | TFLOPs: 41.27 | 15: iteration 61040/ 125429 | consumed samples: 15626240 | consumed tokens: 32002539520 | elapsed time per iteration (s): 1.05 | learning rate: 1.152E-04 | global batch size: 256 | lm loss: 1.976018E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.056 | TFLOPs: 40.33 | 15: iteration 61050/ 125429 | consumed samples: 15628800 | consumed tokens: 32007782400 | elapsed time per iteration (s): 1.04 | learning rate: 1.152E-04 | global batch 
size: 256 | lm loss: 1.986838E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.038 | TFLOPs: 40.66 | 15: iteration 61060/ 125429 | consumed samples: 15631360 | consumed tokens: 32013025280 | elapsed time per iteration (s): 1.11 | learning rate: 1.152E-04 | global batch size: 256 | lm loss: 1.983378E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.224 | TFLOPs: 38.21 | 15: iteration 61070/ 125429 | consumed samples: 15633920 | consumed tokens: 32018268160 | elapsed time per iteration (s): 1.07 | learning rate: 1.152E-04 | global batch size: 256 | lm loss: 1.971459E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.240 | TFLOPs: 39.70 | 15: iteration 61080/ 125429 | consumed samples: 15636480 | consumed tokens: 32023511040 | elapsed time per iteration (s): 1.08 | learning rate: 1.151E-04 | global batch size: 256 | lm loss: 2.006675E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.060 | TFLOPs: 39.18 | 15: iteration 61090/ 125429 | consumed samples: 15639040 | consumed tokens: 32028753920 | elapsed time per iteration (s): 1.07 | learning rate: 1.151E-04 | global batch size: 256 | lm loss: 1.992227E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.170 | TFLOPs: 39.52 | 15: iteration 61100/ 125429 | consumed samples: 15641600 | consumed tokens: 32033996800 | elapsed time per iteration (s): 1.05 | learning rate: 1.151E-04 | global batch size: 256 | lm loss: 2.005721E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.388 | TFLOPs: 40.22 | 15: iteration 61110/ 125429 | consumed samples: 15644160 | 
consumed tokens: 32039239680 | elapsed time per iteration (s): 1.06 | learning rate: 1.151E-04 | global batch size: 256 | lm loss: 1.982274E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.161 | TFLOPs: 39.85 | 15: iteration 61120/ 125429 | consumed samples: 15646720 | consumed tokens: 32044482560 | elapsed time per iteration (s): 1.07 | learning rate: 1.151E-04 | global batch size: 256 | lm loss: 1.962914E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.673 | TFLOPs: 39.61 | 15: iteration 61130/ 125429 | consumed samples: 15649280 | consumed tokens: 32049725440 | elapsed time per iteration (s): 1.03 | learning rate: 1.150E-04 | global batch size: 256 | lm loss: 1.972639E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.135 | TFLOPs: 41.17 | 15: iteration 61140/ 125429 | consumed samples: 15651840 | consumed tokens: 32054968320 | elapsed time per iteration (s): 1.02 | learning rate: 1.150E-04 | global batch size: 256 | lm loss: 1.963426E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.769 | TFLOPs: 41.44 | 15: iteration 61150/ 125429 | consumed samples: 15654400 | consumed tokens: 32060211200 | elapsed time per iteration (s): 1.05 | learning rate: 1.150E-04 | global batch size: 256 | lm loss: 1.992137E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.011 | TFLOPs: 40.32 | 15: iteration 61160/ 125429 | consumed samples: 15656960 | consumed tokens: 32065454080 | elapsed time per iteration (s): 1.07 | learning rate: 1.150E-04 | global batch size: 256 | lm loss: 1.981222E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 238.912 | TFLOPs: 39.48 | 15: iteration 61170/ 125429 | consumed samples: 15659520 | consumed tokens: 32070696960 | elapsed time per iteration (s): 1.05 | learning rate: 1.149E-04 | global batch size: 256 | lm loss: 1.997984E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.400 | TFLOPs: 40.22 | 15: iteration 61180/ 125429 | consumed samples: 15662080 | consumed tokens: 32075939840 | elapsed time per iteration (s): 1.08 | learning rate: 1.149E-04 | global batch size: 256 | lm loss: 2.001590E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.476 | TFLOPs: 39.24 | 15: iteration 61190/ 125429 | consumed samples: 15664640 | consumed tokens: 32081182720 | elapsed time per iteration (s): 1.03 | learning rate: 1.149E-04 | global batch size: 256 | lm loss: 1.995344E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.185 | TFLOPs: 41.01 | 15: iteration 61200/ 125429 | consumed samples: 15667200 | consumed tokens: 32086425600 | elapsed time per iteration (s): 1.07 | learning rate: 1.149E-04 | global batch size: 256 | lm loss: 1.994452E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.074 | TFLOPs: 39.67 | 15: iteration 61210/ 125429 | consumed samples: 15669760 | consumed tokens: 32091668480 | elapsed time per iteration (s): 1.05 | learning rate: 1.149E-04 | global batch size: 256 | lm loss: 2.001840E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.029 | TFLOPs: 40.16 | 15: iteration 61220/ 125429 | consumed samples: 15672320 | consumed tokens: 32096911360 | elapsed time per iteration (s): 1.07 | learning rate: 1.148E-04 | global batch size: 256 | lm loss: 
1.991943E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.894 | TFLOPs: 39.48 | 15: iteration 61230/ 125429 | consumed samples: 15674880 | consumed tokens: 32102154240 | elapsed time per iteration (s): 1.05 | learning rate: 1.148E-04 | global batch size: 256 | lm loss: 1.984586E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.511 | TFLOPs: 40.41 | 15: iteration 61240/ 125429 | consumed samples: 15677440 | consumed tokens: 32107397120 | elapsed time per iteration (s): 1.06 | learning rate: 1.148E-04 | global batch size: 256 | lm loss: 1.985667E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.578 | TFLOPs: 39.92 | 15: iteration 61250/ 125429 | consumed samples: 15680000 | consumed tokens: 32112640000 | elapsed time per iteration (s): 1.04 | learning rate: 1.148E-04 | global batch size: 256 | lm loss: 1.981977E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.162 | TFLOPs: 40.85 | 15: iteration 61260/ 125429 | consumed samples: 15682560 | consumed tokens: 32117882880 | elapsed time per iteration (s): 1.04 | learning rate: 1.147E-04 | global batch size: 256 | lm loss: 1.986222E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.944 | TFLOPs: 40.64 | 15: iteration 61270/ 125429 | consumed samples: 15685120 | consumed tokens: 32123125760 | elapsed time per iteration (s): 1.03 | learning rate: 1.147E-04 | global batch size: 256 | lm loss: 1.955190E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.194 | TFLOPs: 41.02 | 15: iteration 61280/ 125429 | consumed samples: 15687680 | consumed tokens: 
32128368640 | elapsed time per iteration (s): 1.02 | learning rate: 1.147E-04 | global batch size: 256 | lm loss: 1.991444E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.891 | TFLOPs: 41.30 | 15: iteration 61290/ 125429 | consumed samples: 15690240 | consumed tokens: 32133611520 | elapsed time per iteration (s): 1.02 | learning rate: 1.147E-04 | global batch size: 256 | lm loss: 1.980640E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.766 | TFLOPs: 41.44 | 15: iteration 61300/ 125429 | consumed samples: 15692800 | consumed tokens: 32138854400 | elapsed time per iteration (s): 1.03 | learning rate: 1.146E-04 | global batch size: 256 | lm loss: 1.973183E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.589 | TFLOPs: 41.25 | 15: iteration 61310/ 125429 | consumed samples: 15695360 | consumed tokens: 32144097280 | elapsed time per iteration (s): 1.06 | learning rate: 1.146E-04 | global batch size: 256 | lm loss: 1.991500E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.808 | TFLOPs: 39.80 | 15: iteration 61320/ 125429 | consumed samples: 15697920 | consumed tokens: 32149340160 | elapsed time per iteration (s): 1.05 | learning rate: 1.146E-04 | global batch size: 256 | lm loss: 1.992625E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.014 | TFLOPs: 40.33 | 15: iteration 61330/ 125429 | consumed samples: 15700480 | consumed tokens: 32154583040 | elapsed time per iteration (s): 1.05 | learning rate: 1.146E-04 | global batch size: 256 | lm loss: 2.007384E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 244.802 | TFLOPs: 40.46 | 15: iteration 61340/ 125429 | consumed samples: 15703040 | consumed tokens: 32159825920 | elapsed time per iteration (s): 1.06 | learning rate: 1.146E-04 | global batch size: 256 | lm loss: 1.969317E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.479 | TFLOPs: 39.91 | 15: iteration 61350/ 125429 | consumed samples: 15705600 | consumed tokens: 32165068800 | elapsed time per iteration (s): 1.04 | learning rate: 1.145E-04 | global batch size: 256 | lm loss: 1.984663E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.396 | TFLOPs: 40.55 | 15: iteration 61360/ 125429 | consumed samples: 15708160 | consumed tokens: 32170311680 | elapsed time per iteration (s): 1.06 | learning rate: 1.145E-04 | global batch size: 256 | lm loss: 1.974192E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.379 | TFLOPs: 40.06 | 15: iteration 61370/ 125429 | consumed samples: 15710720 | consumed tokens: 32175554560 | elapsed time per iteration (s): 1.06 | learning rate: 1.145E-04 | global batch size: 256 | lm loss: 2.003871E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.545 | TFLOPs: 40.08 | 15: iteration 61380/ 125429 | consumed samples: 15713280 | consumed tokens: 32180797440 | elapsed time per iteration (s): 1.07 | learning rate: 1.145E-04 | global batch size: 256 | lm loss: 1.977703E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.055 | TFLOPs: 39.51 | 15: iteration 61390/ 125429 | consumed samples: 15715840 | consumed tokens: 32186040320 | elapsed time per iteration (s): 1.25 | learning rate: 1.144E-04 | global batch size: 256 | lm loss: 2.008975E+00 | grad 
norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 204.107 | TFLOPs: 33.73 | 15: iteration 61400/ 125429 | consumed samples: 15718400 | consumed tokens: 32191283200 | elapsed time per iteration (s): 1.04 | learning rate: 1.144E-04 | global batch size: 256 | lm loss: 1.981824E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.065 | TFLOPs: 40.66 | 15: iteration 61410/ 125429 | consumed samples: 15720960 | consumed tokens: 32196526080 | elapsed time per iteration (s): 1.06 | learning rate: 1.144E-04 | global batch size: 256 | lm loss: 1.983607E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.888 | TFLOPs: 39.97 | 15: iteration 61420/ 125429 | consumed samples: 15723520 | consumed tokens: 32201768960 | elapsed time per iteration (s): 1.05 | learning rate: 1.144E-04 | global batch size: 256 | lm loss: 1.993700E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.883 | TFLOPs: 40.47 | 15: iteration 61430/ 125429 | consumed samples: 15726080 | consumed tokens: 32207011840 | elapsed time per iteration (s): 1.07 | learning rate: 1.144E-04 | global batch size: 256 | lm loss: 1.994466E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.589 | TFLOPs: 39.59 | 15: iteration 61440/ 125429 | consumed samples: 15728640 | consumed tokens: 32212254720 | elapsed time per iteration (s): 1.08 | learning rate: 1.143E-04 | global batch size: 256 | lm loss: 1.973325E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.100 | TFLOPs: 39.18 | 15: iteration 61450/ 125429 | consumed samples: 15731200 | consumed tokens: 32217497600 | elapsed time 
per iteration (s): 1.07 | learning rate: 1.143E-04 | global batch size: 256 | lm loss: 1.946152E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.377 | TFLOPs: 39.39 | 15: iteration 61460/ 125429 | consumed samples: 15733760 | consumed tokens: 32222740480 | elapsed time per iteration (s): 1.04 | learning rate: 1.143E-04 | global batch size: 256 | lm loss: 2.012030E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.133 | TFLOPs: 40.68 | 15: iteration 61470/ 125429 | consumed samples: 15736320 | consumed tokens: 32227983360 | elapsed time per iteration (s): 1.03 | learning rate: 1.143E-04 | global batch size: 256 | lm loss: 1.989894E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.618 | TFLOPs: 40.92 | 15: iteration 61480/ 125429 | consumed samples: 15738880 | consumed tokens: 32233226240 | elapsed time per iteration (s): 1.04 | learning rate: 1.142E-04 | global batch size: 256 | lm loss: 1.974949E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.791 | TFLOPs: 40.78 | 15: iteration 61490/ 125429 | consumed samples: 15741440 | consumed tokens: 32238469120 | elapsed time per iteration (s): 1.05 | learning rate: 1.142E-04 | global batch size: 256 | lm loss: 1.960586E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.498 | TFLOPs: 40.24 | 15: iteration 61500/ 125429 | consumed samples: 15744000 | consumed tokens: 32243712000 | elapsed time per iteration (s): 1.04 | learning rate: 1.142E-04 | global batch size: 256 | lm loss: 1.965501E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.562 | TFLOPs: 
40.75 | 15: iteration 61510/ 125429 | consumed samples: 15746560 | consumed tokens: 32248954880 | elapsed time per iteration (s): 1.06 | learning rate: 1.142E-04 | global batch size: 256 | lm loss: 1.982581E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.477 | TFLOPs: 39.91 | 15: iteration 61520/ 125429 | consumed samples: 15749120 | consumed tokens: 32254197760 | elapsed time per iteration (s): 1.04 | learning rate: 1.141E-04 | global batch size: 256 | lm loss: 1.967688E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.001 | TFLOPs: 40.82 | 15: iteration 61530/ 125429 | consumed samples: 15751680 | consumed tokens: 32259440640 | elapsed time per iteration (s): 1.09 | learning rate: 1.141E-04 | global batch size: 256 | lm loss: 2.002720E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.844 | TFLOPs: 38.81 | 15: iteration 61540/ 125429 | consumed samples: 15754240 | consumed tokens: 32264683520 | elapsed time per iteration (s): 1.04 | learning rate: 1.141E-04 | global batch size: 256 | lm loss: 1.996378E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.578 | TFLOPs: 40.75 | 15: iteration 61550/ 125429 | consumed samples: 15756800 | consumed tokens: 32269926400 | elapsed time per iteration (s): 1.08 | learning rate: 1.141E-04 | global batch size: 256 | lm loss: 1.960699E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.628 | TFLOPs: 39.10 | 15: iteration 61560/ 125429 | consumed samples: 15759360 | consumed tokens: 32275169280 | elapsed time per iteration (s): 1.06 | learning rate: 1.141E-04 | global batch size: 256 | lm loss: 1.979426E+00 | grad norm: 0.135 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.627 | TFLOPs: 40.10 | 15: iteration 61570/ 125429 | consumed samples: 15761920 | consumed tokens: 32280412160 | elapsed time per iteration (s): 1.06 | learning rate: 1.140E-04 | global batch size: 256 | lm loss: 1.996894E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.621 | TFLOPs: 39.93 | 15: iteration 61580/ 125429 | consumed samples: 15764480 | consumed tokens: 32285655040 | elapsed time per iteration (s): 1.05 | learning rate: 1.140E-04 | global batch size: 256 | lm loss: 1.977470E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.379 | TFLOPs: 40.22 | 15: iteration 61590/ 125429 | consumed samples: 15767040 | consumed tokens: 32290897920 | elapsed time per iteration (s): 1.04 | learning rate: 1.140E-04 | global batch size: 256 | lm loss: 2.009450E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.230 | TFLOPs: 40.53 | 15: iteration 61600/ 125429 | consumed samples: 15769600 | consumed tokens: 32296140800 | elapsed time per iteration (s): 1.09 | learning rate: 1.140E-04 | global batch size: 256 | lm loss: 2.001366E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.967 | TFLOPs: 38.83 | 15: iteration 61610/ 125429 | consumed samples: 15772160 | consumed tokens: 32301383680 | elapsed time per iteration (s): 1.09 | learning rate: 1.139E-04 | global batch size: 256 | lm loss: 1.960691E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.837 | TFLOPs: 38.81 | 15: iteration 61620/ 125429 | consumed samples: 15774720 | consumed tokens: 32306626560 | elapsed time per iteration (s): 1.04 | 
learning rate: 1.139E-04 | global batch size: 256 | lm loss: 1.969831E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.920 | TFLOPs: 40.64 | 15: iteration 61630/ 125429 | consumed samples: 15777280 | consumed tokens: 32311869440 | elapsed time per iteration (s): 1.08 | learning rate: 1.139E-04 | global batch size: 256 | lm loss: 2.005162E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.843 | TFLOPs: 39.14 | 15: iteration 61640/ 125429 | consumed samples: 15779840 | consumed tokens: 32317112320 | elapsed time per iteration (s): 1.11 | learning rate: 1.139E-04 | global batch size: 256 | lm loss: 1.990229E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.746 | TFLOPs: 37.97 | 15: iteration 61650/ 125429 | consumed samples: 15782400 | consumed tokens: 32322355200 | elapsed time per iteration (s): 1.03 | learning rate: 1.139E-04 | global batch size: 256 | lm loss: 2.028892E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.348 | TFLOPs: 41.04 | 15: iteration 61660/ 125429 | consumed samples: 15784960 | consumed tokens: 32327598080 | elapsed time per iteration (s): 1.07 | learning rate: 1.138E-04 | global batch size: 256 | lm loss: 1.962010E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.759 | TFLOPs: 39.46 | 15: iteration 61670/ 125429 | consumed samples: 15787520 | consumed tokens: 32332840960 | elapsed time per iteration (s): 1.06 | learning rate: 1.138E-04 | global batch size: 256 | lm loss: 1.970163E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.886 | TFLOPs: 39.97 | 15: iteration 61680/ 
125429 | consumed samples: 15790080 | consumed tokens: 32338083840 | elapsed time per iteration (s): 1.03 | learning rate: 1.138E-04 | global batch size: 256 | lm loss: 1.992303E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.612 | TFLOPs: 40.92 | 15: iteration 61690/ 125429 | consumed samples: 15792640 | consumed tokens: 32343326720 | elapsed time per iteration (s): 1.12 | learning rate: 1.138E-04 | global batch size: 256 | lm loss: 1.968316E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.926 | TFLOPs: 37.83 | 15: iteration 61700/ 125429 | consumed samples: 15795200 | consumed tokens: 32348569600 | elapsed time per iteration (s): 1.03 | learning rate: 1.137E-04 | global batch size: 256 | lm loss: 2.005646E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.534 | TFLOPs: 40.91 | 15: iteration 61710/ 125429 | consumed samples: 15797760 | consumed tokens: 32353812480 | elapsed time per iteration (s): 3.11 | learning rate: 1.137E-04 | global batch size: 256 | lm loss: 1.990311E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 82.257 | TFLOPs: 13.59 | 15: iteration 61720/ 125429 | consumed samples: 15800320 | consumed tokens: 32359055360 | elapsed time per iteration (s): 1.19 | learning rate: 1.137E-04 | global batch size: 256 | lm loss: 1.969242E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.933 | TFLOPs: 35.52 | 15: iteration 61730/ 125429 | consumed samples: 15802880 | consumed tokens: 32364298240 | elapsed time per iteration (s): 1.07 | learning rate: 1.137E-04 | global batch size: 256 | lm loss: 1.990275E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 240.374 | TFLOPs: 39.72 | 15: iteration 61740/ 125429 | consumed samples: 15805440 | consumed tokens: 32369541120 | elapsed time per iteration (s): 1.05 | learning rate: 1.136E-04 | global batch size: 256 | lm loss: 1.986489E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.526 | TFLOPs: 40.41 | 15: iteration 61750/ 125429 | consumed samples: 15808000 | consumed tokens: 32374784000 | elapsed time per iteration (s): 1.07 | learning rate: 1.136E-04 | global batch size: 256 | lm loss: 1.994444E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.282 | TFLOPs: 39.54 | 15: iteration 61760/ 125429 | consumed samples: 15810560 | consumed tokens: 32380026880 | elapsed time per iteration (s): 1.04 | learning rate: 1.136E-04 | global batch size: 256 | lm loss: 1.986699E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.766 | TFLOPs: 40.61 | 15: iteration 61770/ 125429 | consumed samples: 15813120 | consumed tokens: 32385269760 | elapsed time per iteration (s): 1.06 | learning rate: 1.136E-04 | global batch size: 256 | lm loss: 1.989209E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.742 | TFLOPs: 39.95 | 15: iteration 61780/ 125429 | consumed samples: 15815680 | consumed tokens: 32390512640 | elapsed time per iteration (s): 1.04 | learning rate: 1.136E-04 | global batch size: 256 | lm loss: 1.988887E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.090 | TFLOPs: 40.83 | 15: iteration 61790/ 125429 | consumed samples: 15818240 | consumed tokens: 32395755520 | elapsed time per iteration (s): 1.10 | learning rate: 
1.135E-04 | global batch size: 256 | lm loss: 1.983442E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.765 | TFLOPs: 38.47 | 15: iteration 61800/ 125429 | consumed samples: 15820800 | consumed tokens: 32400998400 | elapsed time per iteration (s): 1.07 | learning rate: 1.135E-04 | global batch size: 256 | lm loss: 1.985985E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.799 | TFLOPs: 39.63 | 15: iteration 61810/ 125429 | consumed samples: 15823360 | consumed tokens: 32406241280 | elapsed time per iteration (s): 1.06 | learning rate: 1.135E-04 | global batch size: 256 | lm loss: 1.988081E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.396 | TFLOPs: 39.89 | 15: iteration 61820/ 125429 | consumed samples: 15825920 | consumed tokens: 32411484160 | elapsed time per iteration (s): 1.06 | learning rate: 1.135E-04 | global batch size: 256 | lm loss: 1.981162E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.659 | TFLOPs: 39.77 | 15: iteration 61830/ 125429 | consumed samples: 15828480 | consumed tokens: 32416727040 | elapsed time per iteration (s): 1.06 | learning rate: 1.134E-04 | global batch size: 256 | lm loss: 1.995568E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.376 | TFLOPs: 39.72 | 15: iteration 61840/ 125429 | consumed samples: 15831040 | consumed tokens: 32421969920 | elapsed time per iteration (s): 1.07 | learning rate: 1.134E-04 | global batch size: 256 | lm loss: 1.982476E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.669 | TFLOPs: 39.61 | 15: iteration 61850/ 125429 | 
consumed samples: 15833600 | consumed tokens: 32427212800 | elapsed time per iteration (s): 1.05 | learning rate: 1.134E-04 | global batch size: 256 | lm loss: 1.989406E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.714 | TFLOPs: 40.44 | 15: iteration 61860/ 125429 | consumed samples: 15836160 | consumed tokens: 32432455680 | elapsed time per iteration (s): 1.07 | learning rate: 1.134E-04 | global batch size: 256 | lm loss: 1.983450E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.664 | TFLOPs: 39.44 | 15: iteration 61870/ 125429 | consumed samples: 15838720 | consumed tokens: 32437698560 | elapsed time per iteration (s): 1.05 | learning rate: 1.134E-04 | global batch size: 256 | lm loss: 1.984304E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.153 | TFLOPs: 40.35 | 15: iteration 61880/ 125429 | consumed samples: 15841280 | consumed tokens: 32442941440 | elapsed time per iteration (s): 1.05 | learning rate: 1.133E-04 | global batch size: 256 | lm loss: 2.012815E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.945 | TFLOPs: 40.48 | 15: iteration 61890/ 125429 | consumed samples: 15843840 | consumed tokens: 32448184320 | elapsed time per iteration (s): 1.06 | learning rate: 1.133E-04 | global batch size: 256 | lm loss: 1.998400E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.898 | TFLOPs: 39.81 | 15: iteration 61900/ 125429 | consumed samples: 15846400 | consumed tokens: 32453427200 | elapsed time per iteration (s): 1.06 | learning rate: 1.133E-04 | global batch size: 256 | lm loss: 2.005293E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 242.565 | TFLOPs: 40.09 | 15: iteration 61910/ 125429 | consumed samples: 15848960 | consumed tokens: 32458670080 | elapsed time per iteration (s): 1.07 | learning rate: 1.133E-04 | global batch size: 256 | lm loss: 1.983759E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.736 | TFLOPs: 39.62 | 15: iteration 61920/ 125429 | consumed samples: 15851520 | consumed tokens: 32463912960 | elapsed time per iteration (s): 1.03 | learning rate: 1.132E-04 | global batch size: 256 | lm loss: 1.962645E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.609 | TFLOPs: 40.92 | 15: iteration 61930/ 125429 | consumed samples: 15854080 | consumed tokens: 32469155840 | elapsed time per iteration (s): 1.04 | learning rate: 1.132E-04 | global batch size: 256 | lm loss: 1.987710E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.508 | TFLOPs: 40.57 | 15: iteration 61940/ 125429 | consumed samples: 15856640 | consumed tokens: 32474398720 | elapsed time per iteration (s): 1.06 | learning rate: 1.132E-04 | global batch size: 256 | lm loss: 1.988708E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.880 | TFLOPs: 39.81 | 15: iteration 61950/ 125429 | consumed samples: 15859200 | consumed tokens: 32479641600 | elapsed time per iteration (s): 1.10 | learning rate: 1.132E-04 | global batch size: 256 | lm loss: 1.998962E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.150 | TFLOPs: 38.53 | 15: iteration 61960/ 125429 | consumed samples: 15861760 | consumed tokens: 32484884480 | elapsed time per iteration (s): 1.07 | learning rate: 1.131E-04 | global batch 
size: 256 | lm loss: 1.982621E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.975 | TFLOPs: 39.49 | 15: iteration 61970/ 125429 | consumed samples: 15864320 | consumed tokens: 32490127360 | elapsed time per iteration (s): 1.03 | learning rate: 1.131E-04 | global batch size: 256 | lm loss: 1.998725E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.708 | TFLOPs: 40.94 | 15: iteration 61980/ 125429 | consumed samples: 15866880 | consumed tokens: 32495370240 | elapsed time per iteration (s): 1.07 | learning rate: 1.131E-04 | global batch size: 256 | lm loss: 1.989720E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.918 | TFLOPs: 39.65 | 15: iteration 61990/ 125429 | consumed samples: 15869440 | consumed tokens: 32500613120 | elapsed time per iteration (s): 1.09 | learning rate: 1.131E-04 | global batch size: 256 | lm loss: 1.979421E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.535 | TFLOPs: 38.92 | 0: [2022-11-26 14:27:10,441] [INFO] [logging.py:68:log_dist] [Rank 0] step=62000, skipped=0, lr=[0.00011305437647153478, 0.00011305437647153478, 0.00011305437647153478], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 62000/ 125429 | consumed samples: 15872000 | consumed tokens: 32505856000 | elapsed time per iteration (s): 1.07 | learning rate: 1.131E-04 | global batch size: 256 | lm loss: 1.975744E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.248 | TFLOPs: 39.54 | 0: steps: 62000 loss: 2.0176 iter time (s): 1.065 samples/sec: 240.451 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 
62000 | lm loss value: 1.988142E+00 | lm loss PPL: 7.301953E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 62000 to checkpoints_1b5 0: [2022-11-26 14:27:10,821] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step62000 is begin to save! 0: [2022-11-26 14:27:10,829] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_01-model_00-model_states.pt... 0: [2022-11-26 14:27:11,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_01-model_00-model_states.pt. 0: [2022-11-26 14:27:11,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_03-model_00-model_states.pt... 0: [2022-11-26 14:27:11,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_03-model_00-model_states.pt. 0: [2022-11-26 14:27:11,200] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_04-model_00-model_states.pt... 0: [2022-11-26 14:27:11,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_04-model_00-model_states.pt. 0: [2022-11-26 14:27:11,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_05-model_00-model_states.pt... 0: [2022-11-26 14:27:11,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_05-model_00-model_states.pt. 0: [2022-11-26 14:27:11,402] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_06-model_00-model_states.pt... 0: [2022-11-26 14:27:11,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_06-model_00-model_states.pt. 
0: [2022-11-26 14:27:11,503] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_07-model_00-model_states.pt... 0: [2022-11-26 14:27:11,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_07-model_00-model_states.pt. 0: [2022-11-26 14:27:11,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_08-model_00-model_states.pt... 0: [2022-11-26 14:27:11,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_08-model_00-model_states.pt. 0: [2022-11-26 14:27:11,703] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_09-model_00-model_states.pt... 0: [2022-11-26 14:27:11,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_09-model_00-model_states.pt. 0: [2022-11-26 14:27:11,809] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_10-model_00-model_states.pt... 0: [2022-11-26 14:27:11,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_10-model_00-model_states.pt. 0: [2022-11-26 14:27:11,908] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_11-model_00-model_states.pt... 0: [2022-11-26 14:27:12,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_11-model_00-model_states.pt. 0: [2022-11-26 14:27:12,005] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_12-model_00-model_states.pt... 0: [2022-11-26 14:27:12,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_12-model_00-model_states.pt. 
0: [2022-11-26 14:27:12,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_13-model_00-model_states.pt... 0: [2022-11-26 14:27:12,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_13-model_00-model_states.pt. 0: [2022-11-26 14:27:12,211] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_14-model_00-model_states.pt... 0: [2022-11-26 14:27:12,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_14-model_00-model_states.pt. 0: [2022-11-26 14:27:12,312] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_15-model_00-model_states.pt... 0: [2022-11-26 14:27:12,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_15-model_00-model_states.pt. 0: [2022-11-26 14:27:12,415] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_16-model_00-model_states.pt... 0: [2022-11-26 14:27:12,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_16-model_00-model_states.pt. 0: [2022-11-26 14:27:12,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_17-model_00-model_states.pt... 0: [2022-11-26 14:27:12,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_17-model_00-model_states.pt. 0: [2022-11-26 14:27:12,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_18-model_00-model_states.pt... 0: [2022-11-26 14:27:12,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_18-model_00-model_states.pt. 
0: [2022-11-26 14:27:12,718] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_19-model_00-model_states.pt... 0: [2022-11-26 14:27:12,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_19-model_00-model_states.pt. 0: [2022-11-26 14:27:12,823] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_20-model_00-model_states.pt... 0: [2022-11-26 14:27:12,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_20-model_00-model_states.pt. 0: [2022-11-26 14:27:12,923] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_21-model_00-model_states.pt... 0: [2022-11-26 14:27:13,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_21-model_00-model_states.pt. 0: [2022-11-26 14:27:13,028] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_22-model_00-model_states.pt... 0: [2022-11-26 14:27:13,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_22-model_00-model_states.pt. 0: [2022-11-26 14:27:13,129] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_23-model_00-model_states.pt... 0: [2022-11-26 14:27:13,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_23-model_00-model_states.pt. 0: [2022-11-26 14:27:13,228] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_24-model_00-model_states.pt... 0: [2022-11-26 14:27:13,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_24-model_00-model_states.pt. 
0: [2022-11-26 14:27:13,328] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_25-model_00-model_states.pt... 0: [2022-11-26 14:27:13,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_25-model_00-model_states.pt. 0: [2022-11-26 14:27:13,431] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_26-model_00-model_states.pt... 0: [2022-11-26 14:27:13,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_26-model_00-model_states.pt. 0: [2022-11-26 14:27:13,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_27-model_00-model_states.pt... 0: [2022-11-26 14:27:13,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_27-model_00-model_states.pt. 0: [2022-11-26 14:27:13,633] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_28-model_00-model_states.pt... 0: [2022-11-26 14:27:13,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_28-model_00-model_states.pt. 0: [2022-11-26 14:27:13,736] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_29-model_00-model_states.pt... 0: [2022-11-26 14:27:13,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_29-model_00-model_states.pt. 0: [2022-11-26 14:27:13,839] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_30-model_00-model_states.pt... 0: [2022-11-26 14:27:13,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 14:27:13,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/layer_32-model_00-model_states.pt... 0: [2022-11-26 14:27:13,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/layer_32-model_00-model_states.pt. 0: [2022-11-26 14:27:13,946] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step62000/mp_rank_00_model_states.pt 0: [2022-11-26 14:27:13,946] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/mp_rank_00_model_states.pt... 0: [2022-11-26 14:27:13,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/mp_rank_00_model_states.pt. 0: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
8: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
1: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
15: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
14: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
7: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
6: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
2: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
11: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
9: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
2: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:27:13,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step62000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
7: [2022-11-26 14:27:14,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:27:14,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:27:14,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 6: [2022-11-26 14:27:14,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 7: [2022-11-26 14:27:14,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 6: [2022-11-26 14:27:14,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 8: [2022-11-26 14:27:14,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:27:14,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 14:27:14,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 12: [2022-11-26 14:27:14,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:27:14,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 14:27:14,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 14:27:14,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 14:27:14,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 12: [2022-11-26 14:27:14,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 12: [2022-11-26 14:27:14,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:27:14,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 14:27:14,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 7: [2022-11-26 14:27:14,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:27:14,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 14:27:14,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 2: [2022-11-26 14:27:14,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 14:27:14,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 14:27:14,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 15: [2022-11-26 14:27:14,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:27:14,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:27:14,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 14:27:14,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 4: [2022-11-26 14:27:14,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:27:14,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 14:27:14,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: [2022-11-26 14:27:14,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:27:14,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 14:27:14,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
2: [2022-11-26 14:27:14,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:27:14,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 14:27:14,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 5: [2022-11-26 14:27:14,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:27:14,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 14:27:14,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 11: [2022-11-26 14:27:14,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:27:14,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:27:14,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 14:27:14,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 15: [2022-11-26 14:27:14,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 14:27:14,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
4: [2022-11-26 14:27:14,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:27:14,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 14:27:14,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 2: [2022-11-26 14:27:14,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:27:14,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 14:27:14,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 12: [2022-11-26 14:27:14,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:27:14,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:27:14,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 12: [2022-11-26 14:27:14,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 11: [2022-11-26 14:27:14,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 11: [2022-11-26 14:27:14,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
12: [2022-11-26 14:27:14,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 11: [2022-11-26 14:27:14,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 14:27:14,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 8: [2022-11-26 14:27:14,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 14:27:14,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 8: [2022-11-26 14:27:14,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:27:14,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 14:27:14,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 8: [2022-11-26 14:27:14,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:27:14,157] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 14:27:14,157] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 13: [2022-11-26 14:27:14,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 14:27:14,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 14:27:14,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 13: [2022-11-26 14:27:14,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:27:14,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:27:14,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 13: [2022-11-26 14:27:14,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 14:27:14,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 2: [2022-11-26 14:27:14,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 3: [2022-11-26 14:27:14,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:27:14,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 14:27:14,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: [2022-11-26 14:27:14,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
7: [2022-11-26 14:27:14,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:27:14,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 7: [2022-11-26 14:27:14,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 14:27:14,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: [2022-11-26 14:27:14,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 15: [2022-11-26 14:27:14,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:27:14,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 14:27:14,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 8: [2022-11-26 14:27:14,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:27:14,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 14:27:14,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 7: [2022-11-26 14:27:14,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 14:27:14,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 14: [2022-11-26 14:27:14,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:27:14,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:27:14,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:27:14,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 14: [2022-11-26 14:27:14,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 14:27:14,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 14:27:14,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 14:27:14,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 14: [2022-11-26 14:27:14,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 14: [2022-11-26 14:27:14,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 14: [2022-11-26 14:27:14,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 14:27:14,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 14:27:14,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 1: [2022-11-26 14:27:14,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:27:14,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 14:27:14,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 11: [2022-11-26 14:27:14,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:27:14,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:27:14,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 14:27:14,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 10: [2022-11-26 14:27:14,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:27:14,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 14:27:14,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
6: [2022-11-26 14:27:14,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:27:14,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 14:27:14,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 1: [2022-11-26 14:27:14,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:27:14,166] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 14:27:14,166] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 5: [2022-11-26 14:27:14,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:27:14,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 14:27:14,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 12: [2022-11-26 14:27:14,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:27:14,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 14:27:14,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
7: [2022-11-26 14:27:14,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:27:14,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 14:27:14,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 1: [2022-11-26 14:27:14,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:27:14,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 14:27:14,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 11: [2022-11-26 14:27:14,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 14:27:14,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 15: [2022-11-26 14:27:14,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:27:14,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 14:27:14,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 8: [2022-11-26 14:27:14,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
15: [2022-11-26 14:27:14,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:27:14,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 14: [2022-11-26 14:27:14,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:27:14,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 1: [2022-11-26 14:27:14,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:27:14,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 1: [2022-11-26 14:27:14,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 14: [2022-11-26 14:27:14,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 1: [2022-11-26 14:27:14,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 15: [2022-11-26 14:27:14,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 14:27:14,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 3: [2022-11-26 14:27:14,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 14:27:14,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 14:27:14,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 13: [2022-11-26 14:27:14,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:27:14,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:27:14,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 8: [2022-11-26 14:27:14,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 13: [2022-11-26 14:27:14,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 8: [2022-11-26 14:27:14,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: [2022-11-26 14:27:14,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:27:14,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:27:14,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
5: [2022-11-26 14:27:14,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 0: [2022-11-26 14:27:14,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 5: [2022-11-26 14:27:14,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: [2022-11-26 14:27:14,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 6: [2022-11-26 14:27:14,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:27:14,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:27:14,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 14:27:14,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 14:27:14,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 6: [2022-11-26 14:27:14,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 12: [2022-11-26 14:27:14,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:27:14,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 14:27:14,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 14:27:14,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 12: [2022-11-26 14:27:14,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 14:27:14,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 15: [2022-11-26 14:27:14,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:27:14,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 14:27:14,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 2: [2022-11-26 14:27:14,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:27:14,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:27:14,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 14:27:14,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
15: [2022-11-26 14:27:14,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 14:27:14,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 3: [2022-11-26 14:27:14,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:27:14,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:27:14,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 14:27:14,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 14:27:14,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 3: [2022-11-26 14:27:14,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 7: [2022-11-26 14:27:14,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:27:14,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 14:27:14,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 3: [2022-11-26 14:27:14,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
15: [2022-11-26 14:27:14,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:27:14,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 14:27:14,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 11: [2022-11-26 14:27:14,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:27:14,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 14:27:14,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: [2022-11-26 14:27:14,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:27:14,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 14:27:14,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 14: [2022-11-26 14:27:14,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:27:14,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 14:27:14,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 14:27:14,180] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 11: [2022-11-26 14:27:14,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:27:14,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 14: [2022-11-26 14:27:14,180] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 15: [2022-11-26 14:27:14,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 14:27:14,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 14:27:14,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 10: [2022-11-26 14:27:14,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:27:14,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:27:14,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 14:27:14,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 14:27:14,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 14:27:14,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 14:27:14,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 10: [2022-11-26 14:27:14,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 10: [2022-11-26 14:27:14,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 10: [2022-11-26 14:27:14,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:27:14,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 14:27:14,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 3: [2022-11-26 14:27:14,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:27:14,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:27:14,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 14:27:14,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:27:14,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:27:14,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:27:14,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 14:27:14,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:27:14,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 6: [2022-11-26 14:27:14,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 6: [2022-11-26 14:27:14,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 3: [2022-11-26 14:27:14,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 6: [2022-11-26 14:27:14,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 3: [2022-11-26 14:27:14,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
4: [2022-11-26 14:27:14,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 14:27:14,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 14:27:14,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 14:27:14,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 3: [2022-11-26 14:27:14,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 4: [2022-11-26 14:27:14,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 4: [2022-11-26 14:27:14,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 4: [2022-11-26 14:27:14,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 4: [2022-11-26 14:27:14,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 9: [2022-11-26 14:27:14,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:27:14,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:27:14,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 14:27:14,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:27:14,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 14:27:14,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 14:27:14,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 9: [2022-11-26 14:27:14,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 14:27:14,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 14:27:14,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 9: [2022-11-26 14:27:14,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 9: [2022-11-26 14:27:14,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 10: [2022-11-26 14:27:14,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:27:14,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 14:27:14,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
10: [2022-11-26 14:27:14,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:27:14,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 14:27:14,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 7: [2022-11-26 14:27:14,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:27:14,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 14:27:14,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: [2022-11-26 14:27:14,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:27:14,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 14:27:14,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: [2022-11-26 14:27:14,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:27:14,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 14:27:14,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
11: [2022-11-26 14:27:14,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 14:27:14,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 11: [2022-11-26 14:27:14,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:27:14,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 14:27:14,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 11: [2022-11-26 14:27:14,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:27:14,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 14:27:14,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 5: [2022-11-26 14:27:14,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:27:14,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
5: [2022-11-26 14:27:14,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 11: [2022-11-26 14:27:14,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 5: [2022-11-26 14:27:14,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 11: [2022-11-26 14:27:14,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 4: [2022-11-26 14:27:14,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:27:14,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 14:27:14,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 5: [2022-11-26 14:27:14,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:27:14,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:27:14,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
5: [2022-11-26 14:27:14,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 14:27:14,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 14:27:14,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 7: [2022-11-26 14:27:14,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 8: [2022-11-26 14:27:14,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:27:14,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 7: [2022-11-26 14:27:14,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 8: [2022-11-26 14:27:14,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 14:27:14,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 1: [2022-11-26 14:27:14,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:27:14,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 14:27:14,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
2: [2022-11-26 14:27:14,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:27:14,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 14:27:14,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 3: [2022-11-26 14:27:14,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:27:14,194] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 14:27:14,194] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 2: [2022-11-26 14:27:14,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:27:14,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 14:27:14,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 1: [2022-11-26 14:27:14,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:27:14,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 14:27:14,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
6: [2022-11-26 14:27:14,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:27:14,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 14:27:14,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 1: [2022-11-26 14:27:14,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:27:14,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 14:27:14,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 1: [2022-11-26 14:27:14,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:27:14,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 14:27:14,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 5: [2022-11-26 14:27:14,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:27:14,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 14:27:14,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
3: [2022-11-26 14:27:14,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:27:14,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 14:27:14,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 10: [2022-11-26 14:27:14,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:27:14,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 14:27:14,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 14: [2022-11-26 14:27:14,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:27:14,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 14:27:14,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 13: [2022-11-26 14:27:14,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:27:14,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 14:27:14,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
13: [2022-11-26 14:27:14,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:27:14,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 14:27:14,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 13: [2022-11-26 14:27:14,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:27:14,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:27:14,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:27:14,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 14:27:14,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 14:27:14,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 13: [2022-11-26 14:27:14,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 14:27:14,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 13: [2022-11-26 14:27:14,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
5: [2022-11-26 14:27:14,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:27:14,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 14:27:14,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 12: [2022-11-26 14:27:14,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:27:14,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 14:27:14,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 9: [2022-11-26 14:27:14,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:27:14,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:27:14,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:27:14,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 14:27:14,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 14:27:14,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 14:27:14,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 14:27:14,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 14:27:14,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 9: [2022-11-26 14:27:14,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 9: [2022-11-26 14:27:14,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 9: [2022-11-26 14:27:14,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: [2022-11-26 14:27:14,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:27:14,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 14:27:14,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 
0: [2022-11-26 14:27:14,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step62000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 14:27:14,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step62000 is ready now! 0: successfully saved checkpoint at iteration 62000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3469.11 15: iteration 62010/ 125429 | consumed samples: 15874560 | consumed tokens: 32511098880 | elapsed time per iteration (s): 1.44 | learning rate: 1.130E-04 | global batch size: 256 | lm loss: 1.982564E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.931 | TFLOPs: 29.40 | 15: iteration 62020/ 125429 | consumed samples: 15877120 | consumed tokens: 32516341760 | elapsed time per iteration (s): 1.09 | learning rate: 1.130E-04 | global batch size: 256 | lm loss: 1.963952E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.465 | TFLOPs: 38.91 | 15: iteration 62030/ 125429 | consumed samples: 15879680 | consumed tokens: 32521584640 | elapsed time per iteration (s): 1.05 | learning rate: 1.130E-04 | global batch size: 256 | lm loss: 1.971695E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.839 | TFLOPs: 40.13 | 15: iteration 62040/ 125429 | consumed samples: 15882240 | consumed tokens: 32526827520 | elapsed time per iteration (s): 1.08 | learning rate: 1.130E-04 | global batch size: 256 | lm loss: 1.980303E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.695 | TFLOPs: 39.28 | 15: iteration 62050/ 125429 | consumed samples: 15884800 | consumed tokens: 32532070400 | elapsed time per iteration (s): 1.08 | learning rate: 1.129E-04 | global batch size: 
256 | lm loss: 1.969126E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.610 | TFLOPs: 39.10 | 15: iteration 62060/ 125429 | consumed samples: 15887360 | consumed tokens: 32537313280 | elapsed time per iteration (s): 1.04 | learning rate: 1.129E-04 | global batch size: 256 | lm loss: 1.989256E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.494 | TFLOPs: 40.74 | 15: iteration 62070/ 125429 | consumed samples: 15889920 | consumed tokens: 32542556160 | elapsed time per iteration (s): 1.09 | learning rate: 1.129E-04 | global batch size: 256 | lm loss: 1.999645E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.660 | TFLOPs: 38.94 | 15: iteration 62080/ 125429 | consumed samples: 15892480 | consumed tokens: 32547799040 | elapsed time per iteration (s): 1.11 | learning rate: 1.129E-04 | global batch size: 256 | lm loss: 2.007615E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.516 | TFLOPs: 38.09 | 15: iteration 62090/ 125429 | consumed samples: 15895040 | consumed tokens: 32553041920 | elapsed time per iteration (s): 1.06 | learning rate: 1.128E-04 | global batch size: 256 | lm loss: 1.988951E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.241 | TFLOPs: 40.03 | 15: iteration 62100/ 125429 | consumed samples: 15897600 | consumed tokens: 32558284800 | elapsed time per iteration (s): 1.05 | learning rate: 1.128E-04 | global batch size: 256 | lm loss: 1.952935E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.890 | TFLOPs: 40.47 | 15: iteration 62110/ 125429 | consumed samples: 15900160 | consumed 
tokens: 32563527680 | elapsed time per iteration (s): 1.04 | learning rate: 1.128E-04 | global batch size: 256 | lm loss: 1.972309E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.898 | TFLOPs: 40.64 | 15: iteration 62120/ 125429 | consumed samples: 15902720 | consumed tokens: 32568770560 | elapsed time per iteration (s): 1.04 | learning rate: 1.128E-04 | global batch size: 256 | lm loss: 1.990153E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.865 | TFLOPs: 40.63 | 15: iteration 62130/ 125429 | consumed samples: 15905280 | consumed tokens: 32574013440 | elapsed time per iteration (s): 1.05 | learning rate: 1.128E-04 | global batch size: 256 | lm loss: 1.998503E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.945 | TFLOPs: 40.31 | 15: iteration 62140/ 125429 | consumed samples: 15907840 | consumed tokens: 32579256320 | elapsed time per iteration (s): 1.06 | learning rate: 1.127E-04 | global batch size: 256 | lm loss: 1.981347E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.646 | TFLOPs: 40.10 | 15: iteration 62150/ 125429 | consumed samples: 15910400 | consumed tokens: 32584499200 | elapsed time per iteration (s): 1.06 | learning rate: 1.127E-04 | global batch size: 256 | lm loss: 1.982856E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.489 | TFLOPs: 40.07 | 15: iteration 62160/ 125429 | consumed samples: 15912960 | consumed tokens: 32589742080 | elapsed time per iteration (s): 1.05 | learning rate: 1.127E-04 | global batch size: 256 | lm loss: 1.968417E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 243.895 | TFLOPs: 40.31 | 15: iteration 62170/ 125429 | consumed samples: 15915520 | consumed tokens: 32594984960 | elapsed time per iteration (s): 1.10 | learning rate: 1.127E-04 | global batch size: 256 | lm loss: 2.001203E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.729 | TFLOPs: 38.63 | 15: iteration 62180/ 125429 | consumed samples: 15918080 | consumed tokens: 32600227840 | elapsed time per iteration (s): 1.06 | learning rate: 1.126E-04 | global batch size: 256 | lm loss: 2.003438E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.113 | TFLOPs: 40.01 | 15: iteration 62190/ 125429 | consumed samples: 15920640 | consumed tokens: 32605470720 | elapsed time per iteration (s): 1.03 | learning rate: 1.126E-04 | global batch size: 256 | lm loss: 2.003754E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.793 | TFLOPs: 41.11 | 15: iteration 62200/ 125429 | consumed samples: 15923200 | consumed tokens: 32610713600 | elapsed time per iteration (s): 1.04 | learning rate: 1.126E-04 | global batch size: 256 | lm loss: 1.970872E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.555 | TFLOPs: 40.58 | 15: iteration 62210/ 125429 | consumed samples: 15925760 | consumed tokens: 32615956480 | elapsed time per iteration (s): 1.04 | learning rate: 1.126E-04 | global batch size: 256 | lm loss: 1.985257E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.119 | TFLOPs: 40.51 | 15: iteration 62220/ 125429 | consumed samples: 15928320 | consumed tokens: 32621199360 | elapsed time per iteration (s): 1.08 | learning rate: 1.126E-04 | global batch size: 256 | lm loss: 2.012070E+00 | 
grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.097 | TFLOPs: 39.35 | 15: iteration 62230/ 125429 | consumed samples: 15930880 | consumed tokens: 32626442240 | elapsed time per iteration (s): 1.05 | learning rate: 1.125E-04 | global batch size: 256 | lm loss: 1.988250E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.313 | TFLOPs: 40.21 | 15: iteration 62240/ 125429 | consumed samples: 15933440 | consumed tokens: 32631685120 | elapsed time per iteration (s): 1.06 | learning rate: 1.125E-04 | global batch size: 256 | lm loss: 1.994619E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.934 | TFLOPs: 39.98 | 15: iteration 62250/ 125429 | consumed samples: 15936000 | consumed tokens: 32636928000 | elapsed time per iteration (s): 1.05 | learning rate: 1.125E-04 | global batch size: 256 | lm loss: 1.969657E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.369 | TFLOPs: 40.22 | 15: iteration 62260/ 125429 | consumed samples: 15938560 | consumed tokens: 32642170880 | elapsed time per iteration (s): 1.04 | learning rate: 1.125E-04 | global batch size: 256 | lm loss: 1.989644E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.449 | TFLOPs: 40.56 | 15: iteration 62270/ 125429 | consumed samples: 15941120 | consumed tokens: 32647413760 | elapsed time per iteration (s): 1.05 | learning rate: 1.124E-04 | global batch size: 256 | lm loss: 1.964276E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.709 | TFLOPs: 40.11 | 15: iteration 62280/ 125429 | consumed samples: 15943680 | consumed tokens: 32652656640 | elapsed 
time per iteration (s): 1.05 | learning rate: 1.124E-04 | global batch size: 256 | lm loss: 1.998923E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.735 | TFLOPs: 40.44 | 15: iteration 62290/ 125429 | consumed samples: 15946240 | consumed tokens: 32657899520 | elapsed time per iteration (s): 1.05 | learning rate: 1.124E-04 | global batch size: 256 | lm loss: 1.994680E+00 | grad norm: 0.357 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.509 | TFLOPs: 40.24 | 15: iteration 62300/ 125429 | consumed samples: 15948800 | consumed tokens: 32663142400 | elapsed time per iteration (s): 1.04 | learning rate: 1.124E-04 | global batch size: 256 | lm loss: 2.001205E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.449 | TFLOPs: 40.73 | 15: iteration 62310/ 125429 | consumed samples: 15951360 | consumed tokens: 32668385280 | elapsed time per iteration (s): 1.04 | learning rate: 1.123E-04 | global batch size: 256 | lm loss: 2.025950E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.967 | TFLOPs: 40.81 | 15: iteration 62320/ 125429 | consumed samples: 15953920 | consumed tokens: 32673628160 | elapsed time per iteration (s): 1.07 | learning rate: 1.123E-04 | global batch size: 256 | lm loss: 1.982625E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.553 | TFLOPs: 39.59 | 15: iteration 62330/ 125429 | consumed samples: 15956480 | consumed tokens: 32678871040 | elapsed time per iteration (s): 1.04 | learning rate: 1.123E-04 | global batch size: 256 | lm loss: 1.998011E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.080 | TFLOPs: 
40.67 | 15: iteration 62340/ 125429 | consumed samples: 15959040 | consumed tokens: 32684113920 | elapsed time per iteration (s): 1.04 | learning rate: 1.123E-04 | global batch size: 256 | lm loss: 1.974442E+00 | grad norm: 0.240 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.214 | TFLOPs: 40.52 | 15: iteration 62350/ 125429 | consumed samples: 15961600 | consumed tokens: 32689356800 | elapsed time per iteration (s): 1.05 | learning rate: 1.123E-04 | global batch size: 256 | lm loss: 1.992343E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.190 | TFLOPs: 40.19 | 15: iteration 62360/ 125429 | consumed samples: 15964160 | consumed tokens: 32694599680 | elapsed time per iteration (s): 1.09 | learning rate: 1.122E-04 | global batch size: 256 | lm loss: 2.006198E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.004 | TFLOPs: 38.84 | 15: iteration 62370/ 125429 | consumed samples: 15966720 | consumed tokens: 32699842560 | elapsed time per iteration (s): 1.07 | learning rate: 1.122E-04 | global batch size: 256 | lm loss: 1.998032E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.160 | TFLOPs: 39.52 | 15: iteration 62380/ 125429 | consumed samples: 15969280 | consumed tokens: 32705085440 | elapsed time per iteration (s): 1.04 | learning rate: 1.122E-04 | global batch size: 256 | lm loss: 1.990022E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.063 | TFLOPs: 40.66 | 15: iteration 62390/ 125429 | consumed samples: 15971840 | consumed tokens: 32710328320 | elapsed time per iteration (s): 1.07 | learning rate: 1.122E-04 | global batch size: 256 | lm loss: 1.978494E+00 | grad norm: 0.129 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.052 | TFLOPs: 39.51 | 15: iteration 62400/ 125429 | consumed samples: 15974400 | consumed tokens: 32715571200 | elapsed time per iteration (s): 1.10 | learning rate: 1.121E-04 | global batch size: 256 | lm loss: 1.980593E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.675 | TFLOPs: 38.45 | 15: iteration 62410/ 125429 | consumed samples: 15976960 | consumed tokens: 32720814080 | elapsed time per iteration (s): 1.03 | learning rate: 1.121E-04 | global batch size: 256 | lm loss: 1.980100E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.730 | TFLOPs: 41.10 | 15: iteration 62420/ 125429 | consumed samples: 15979520 | consumed tokens: 32726056960 | elapsed time per iteration (s): 1.05 | learning rate: 1.121E-04 | global batch size: 256 | lm loss: 1.999691E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.407 | TFLOPs: 40.39 | 15: iteration 62430/ 125429 | consumed samples: 15982080 | consumed tokens: 32731299840 | elapsed time per iteration (s): 1.03 | learning rate: 1.121E-04 | global batch size: 256 | lm loss: 1.988292E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.898 | TFLOPs: 41.13 | 15: iteration 62440/ 125429 | consumed samples: 15984640 | consumed tokens: 32736542720 | elapsed time per iteration (s): 1.05 | learning rate: 1.121E-04 | global batch size: 256 | lm loss: 1.986000E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.003 | TFLOPs: 40.32 | 15: iteration 62450/ 125429 | consumed samples: 15987200 | consumed tokens: 32741785600 | elapsed time per iteration (s): 1.04 | 
learning rate: 1.120E-04 | global batch size: 256 | lm loss: 1.979986E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.244 | TFLOPs: 40.53 | 15: iteration 62460/ 125429 | consumed samples: 15989760 | consumed tokens: 32747028480 | elapsed time per iteration (s): 1.03 | learning rate: 1.120E-04 | global batch size: 256 | lm loss: 1.958816E+00 | grad norm: 0.889 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.701 | TFLOPs: 41.10 | 15: iteration 62470/ 125429 | consumed samples: 15992320 | consumed tokens: 32752271360 | elapsed time per iteration (s): 1.05 | learning rate: 1.120E-04 | global batch size: 256 | lm loss: 1.980814E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.644 | TFLOPs: 40.43 | 15: iteration 62480/ 125429 | consumed samples: 15994880 | consumed tokens: 32757514240 | elapsed time per iteration (s): 1.06 | learning rate: 1.120E-04 | global batch size: 256 | lm loss: 1.976397E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.453 | TFLOPs: 39.90 | 15: iteration 62490/ 125429 | consumed samples: 15997440 | consumed tokens: 32762757120 | elapsed time per iteration (s): 1.05 | learning rate: 1.119E-04 | global batch size: 256 | lm loss: 2.009677E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.703 | TFLOPs: 40.44 | 15: iteration 62500/ 125429 | consumed samples: 16000000 | consumed tokens: 32768000000 | elapsed time per iteration (s): 1.08 | learning rate: 1.119E-04 | global batch size: 256 | lm loss: 1.994896E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.007 | TFLOPs: 39.33 | 15: iteration 62510/ 
125429 | consumed samples: 16002560 | consumed tokens: 32773242880 | elapsed time per iteration (s): 1.03 | learning rate: 1.119E-04 | global batch size: 256 | lm loss: 1.994138E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.481 | TFLOPs: 41.23 | 15: iteration 62520/ 125429 | consumed samples: 16005120 | consumed tokens: 32778485760 | elapsed time per iteration (s): 1.04 | learning rate: 1.119E-04 | global batch size: 256 | lm loss: 1.971626E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.996 | TFLOPs: 40.65 | 15: iteration 62530/ 125429 | consumed samples: 16007680 | consumed tokens: 32783728640 | elapsed time per iteration (s): 1.08 | learning rate: 1.118E-04 | global batch size: 256 | lm loss: 2.003037E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.560 | TFLOPs: 39.26 | 15: iteration 62540/ 125429 | consumed samples: 16010240 | consumed tokens: 32788971520 | elapsed time per iteration (s): 1.23 | learning rate: 1.118E-04 | global batch size: 256 | lm loss: 2.018052E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 207.659 | TFLOPs: 34.32 | 15: iteration 62550/ 125429 | consumed samples: 16012800 | consumed tokens: 32794214400 | elapsed time per iteration (s): 1.04 | learning rate: 1.118E-04 | global batch size: 256 | lm loss: 1.965564E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.221 | TFLOPs: 40.52 | 15: iteration 62560/ 125429 | consumed samples: 16015360 | consumed tokens: 32799457280 | elapsed time per iteration (s): 1.04 | learning rate: 1.118E-04 | global batch size: 256 | lm loss: 2.027374E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 245.802 | TFLOPs: 40.62 | 15: iteration 62570/ 125429 | consumed samples: 16017920 | consumed tokens: 32804700160 | elapsed time per iteration (s): 1.05 | learning rate: 1.118E-04 | global batch size: 256 | lm loss: 2.003321E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.119 | TFLOPs: 40.18 | 15: iteration 62580/ 125429 | consumed samples: 16020480 | consumed tokens: 32809943040 | elapsed time per iteration (s): 1.04 | learning rate: 1.117E-04 | global batch size: 256 | lm loss: 1.964952E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.939 | TFLOPs: 40.81 | 15: iteration 62590/ 125429 | consumed samples: 16023040 | consumed tokens: 32815185920 | elapsed time per iteration (s): 1.05 | learning rate: 1.117E-04 | global batch size: 256 | lm loss: 2.008243E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.430 | TFLOPs: 40.23 | 15: iteration 62600/ 125429 | consumed samples: 16025600 | consumed tokens: 32820428800 | elapsed time per iteration (s): 1.08 | learning rate: 1.117E-04 | global batch size: 256 | lm loss: 2.012656E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.056 | TFLOPs: 39.34 | 15: iteration 62610/ 125429 | consumed samples: 16028160 | consumed tokens: 32825671680 | elapsed time per iteration (s): 1.07 | learning rate: 1.117E-04 | global batch size: 256 | lm loss: 1.989703E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.351 | TFLOPs: 39.55 | 15: iteration 62620/ 125429 | consumed samples: 16030720 | consumed tokens: 32830914560 | elapsed time per iteration (s): 1.03 | learning rate: 
1.116E-04 | global batch size: 256 | lm loss: 1.998612E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.362 | TFLOPs: 41.04 | 15: iteration 62630/ 125429 | consumed samples: 16033280 | consumed tokens: 32836157440 | elapsed time per iteration (s): 1.18 | learning rate: 1.116E-04 | global batch size: 256 | lm loss: 1.997860E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.528 | TFLOPs: 35.95 | 15: iteration 62640/ 125429 | consumed samples: 16035840 | consumed tokens: 32841400320 | elapsed time per iteration (s): 1.10 | learning rate: 1.116E-04 | global batch size: 256 | lm loss: 1.987024E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.224 | TFLOPs: 38.38 | 15: iteration 62650/ 125429 | consumed samples: 16038400 | consumed tokens: 32846643200 | elapsed time per iteration (s): 1.03 | learning rate: 1.116E-04 | global batch size: 256 | lm loss: 1.974478E+00 | grad norm: 0.490 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.976 | TFLOPs: 41.15 | 15: iteration 62660/ 125429 | consumed samples: 16040960 | consumed tokens: 32851886080 | elapsed time per iteration (s): 1.14 | learning rate: 1.116E-04 | global batch size: 256 | lm loss: 1.973229E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.128 | TFLOPs: 37.20 | 15: iteration 62670/ 125429 | consumed samples: 16043520 | consumed tokens: 32857128960 | elapsed time per iteration (s): 1.15 | learning rate: 1.115E-04 | global batch size: 256 | lm loss: 1.956566E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 223.474 | TFLOPs: 36.93 | 15: iteration 62680/ 125429 | 
consumed samples: 16046080 | consumed tokens: 32862371840 | elapsed time per iteration (s): 1.02 | learning rate: 1.115E-04 | global batch size: 256 | lm loss: 2.017275E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.205 | TFLOPs: 41.35 | 15: iteration 62690/ 125429 | consumed samples: 16048640 | consumed tokens: 32867614720 | elapsed time per iteration (s): 1.04 | learning rate: 1.115E-04 | global batch size: 256 | lm loss: 2.005731E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.620 | TFLOPs: 40.76 | 15: iteration 62700/ 125429 | consumed samples: 16051200 | consumed tokens: 32872857600 | elapsed time per iteration (s): 1.07 | learning rate: 1.115E-04 | global batch size: 256 | lm loss: 1.991751E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.186 | TFLOPs: 39.53 | 15: iteration 62710/ 125429 | consumed samples: 16053760 | consumed tokens: 32878100480 | elapsed time per iteration (s): 1.02 | learning rate: 1.114E-04 | global batch size: 256 | lm loss: 1.979343E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.842 | TFLOPs: 41.45 | 15: iteration 62720/ 125429 | consumed samples: 16056320 | consumed tokens: 32883343360 | elapsed time per iteration (s): 1.05 | learning rate: 1.114E-04 | global batch size: 256 | lm loss: 1.985057E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.244 | TFLOPs: 40.20 | 15: iteration 62730/ 125429 | consumed samples: 16058880 | consumed tokens: 32888586240 | elapsed time per iteration (s): 1.05 | learning rate: 1.114E-04 | global batch size: 256 | lm loss: 1.991427E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 243.962 | TFLOPs: 40.32 | 15: iteration 62740/ 125429 | consumed samples: 16061440 | consumed tokens: 32893829120 | elapsed time per iteration (s): 1.05 | learning rate: 1.114E-04 | global batch size: 256 | lm loss: 1.955911E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.777 | TFLOPs: 40.29 | 15: iteration 62750/ 125429 | consumed samples: 16064000 | consumed tokens: 32899072000 | elapsed time per iteration (s): 1.05 | learning rate: 1.113E-04 | global batch size: 256 | lm loss: 1.994485E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.356 | TFLOPs: 40.38 | 15: iteration 62760/ 125429 | consumed samples: 16066560 | consumed tokens: 32904314880 | elapsed time per iteration (s): 1.04 | learning rate: 1.113E-04 | global batch size: 256 | lm loss: 1.969250E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.487 | TFLOPs: 40.73 | 15: iteration 62770/ 125429 | consumed samples: 16069120 | consumed tokens: 32909557760 | elapsed time per iteration (s): 1.16 | learning rate: 1.113E-04 | global batch size: 256 | lm loss: 1.977728E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.272 | TFLOPs: 36.57 | 15: iteration 62780/ 125429 | consumed samples: 16071680 | consumed tokens: 32914800640 | elapsed time per iteration (s): 1.06 | learning rate: 1.113E-04 | global batch size: 256 | lm loss: 1.970443E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.145 | TFLOPs: 40.02 | 15: iteration 62790/ 125429 | consumed samples: 16074240 | consumed tokens: 32920043520 | elapsed time per iteration (s): 1.11 | learning rate: 1.113E-04 | global batch 
size: 256 | lm loss: 1.976889E+00 | grad norm: 0.572 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.232 | TFLOPs: 38.05 | 15: iteration 62800/ 125429 | consumed samples: 16076800 | consumed tokens: 32925286400 | elapsed time per iteration (s): 1.03 | learning rate: 1.112E-04 | global batch size: 256 | lm loss: 2.005832E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.946 | TFLOPs: 41.14 | 15: iteration 62810/ 125429 | consumed samples: 16079360 | consumed tokens: 32930529280 | elapsed time per iteration (s): 1.13 | learning rate: 1.112E-04 | global batch size: 256 | lm loss: 2.023820E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.843 | TFLOPs: 37.49 | 15: iteration 62820/ 125429 | consumed samples: 16081920 | consumed tokens: 32935772160 | elapsed time per iteration (s): 1.18 | learning rate: 1.112E-04 | global batch size: 256 | lm loss: 2.037262E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.733 | TFLOPs: 35.98 | 15: iteration 62830/ 125429 | consumed samples: 16084480 | consumed tokens: 32941015040 | elapsed time per iteration (s): 1.18 | learning rate: 1.112E-04 | global batch size: 256 | lm loss: 2.013346E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.368 | TFLOPs: 35.92 | 15: iteration 62840/ 125429 | consumed samples: 16087040 | consumed tokens: 32946257920 | elapsed time per iteration (s): 1.05 | learning rate: 1.111E-04 | global batch size: 256 | lm loss: 2.010974E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.802 | TFLOPs: 40.46 | 15: iteration 62850/ 125429 | consumed samples: 16089600 | 
consumed tokens: 32951500800 | elapsed time per iteration (s): 1.06 | learning rate: 1.111E-04 | global batch size: 256 | lm loss: 1.972970E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.256 | TFLOPs: 39.87 | 15: iteration 62860/ 125429 | consumed samples: 16092160 | consumed tokens: 32956743680 | elapsed time per iteration (s): 1.03 | learning rate: 1.111E-04 | global batch size: 256 | lm loss: 1.968630E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.922 | TFLOPs: 41.14 | 15: iteration 62870/ 125429 | consumed samples: 16094720 | consumed tokens: 32961986560 | elapsed time per iteration (s): 1.04 | learning rate: 1.111E-04 | global batch size: 256 | lm loss: 1.970542E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.600 | TFLOPs: 40.75 | 15: iteration 62880/ 125429 | consumed samples: 16097280 | consumed tokens: 32967229440 | elapsed time per iteration (s): 1.14 | learning rate: 1.111E-04 | global batch size: 256 | lm loss: 2.004181E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 224.529 | TFLOPs: 37.11 | 15: iteration 62890/ 125429 | consumed samples: 16099840 | consumed tokens: 32972472320 | elapsed time per iteration (s): 1.03 | learning rate: 1.110E-04 | global batch size: 256 | lm loss: 1.974970E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.874 | TFLOPs: 40.96 | 15: iteration 62900/ 125429 | consumed samples: 16102400 | consumed tokens: 32977715200 | elapsed time per iteration (s): 1.08 | learning rate: 1.110E-04 | global batch size: 256 | lm loss: 2.011496E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 237.657 | TFLOPs: 39.27 | 15: iteration 62910/ 125429 | consumed samples: 16104960 | consumed tokens: 32982958080 | elapsed time per iteration (s): 1.06 | learning rate: 1.110E-04 | global batch size: 256 | lm loss: 1.979043E+00 | grad norm: 0.126 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.627 | TFLOPs: 40.10 | 15: iteration 62920/ 125429 | consumed samples: 16107520 | consumed tokens: 32988200960 | elapsed time per iteration (s): 1.09 | learning rate: 1.110E-04 | global batch size: 256 | lm loss: 2.001619E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.824 | TFLOPs: 38.81 | 15: iteration 62930/ 125429 | consumed samples: 16110080 | consumed tokens: 32993443840 | elapsed time per iteration (s): 1.07 | learning rate: 1.109E-04 | global batch size: 256 | lm loss: 1.958557E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.450 | TFLOPs: 39.57 | 15: iteration 62940/ 125429 | consumed samples: 16112640 | consumed tokens: 32998686720 | elapsed time per iteration (s): 1.06 | learning rate: 1.109E-04 | global batch size: 256 | lm loss: 1.964369E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.743 | TFLOPs: 39.95 | 15: iteration 62950/ 125429 | consumed samples: 16115200 | consumed tokens: 33003929600 | elapsed time per iteration (s): 1.04 | learning rate: 1.109E-04 | global batch size: 256 | lm loss: 1.974223E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.251 | TFLOPs: 40.53 | 15: iteration 62960/ 125429 | consumed samples: 16117760 | consumed tokens: 33009172480 | elapsed time per iteration (s): 1.08 | learning rate: 1.109E-04 | global batch size: 256 | lm loss: 
2.003145E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.531 | TFLOPs: 39.25 | 15: iteration 62970/ 125429 | consumed samples: 16120320 | consumed tokens: 33014415360 | elapsed time per iteration (s): 1.10 | learning rate: 1.108E-04 | global batch size: 256 | lm loss: 1.978682E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.973 | TFLOPs: 38.50 | 15: iteration 62980/ 125429 | consumed samples: 16122880 | consumed tokens: 33019658240 | elapsed time per iteration (s): 1.03 | learning rate: 1.108E-04 | global batch size: 256 | lm loss: 1.990429E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.145 | TFLOPs: 41.01 | 15: iteration 62990/ 125429 | consumed samples: 16125440 | consumed tokens: 33024901120 | elapsed time per iteration (s): 1.03 | learning rate: 1.108E-04 | global batch size: 256 | lm loss: 2.005460E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.848 | TFLOPs: 41.12 | 15: iteration 63000/ 125429 | consumed samples: 16128000 | consumed tokens: 33030144000 | elapsed time per iteration (s): 1.07 | learning rate: 1.108E-04 | global batch size: 256 | lm loss: 1.962093E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.972 | TFLOPs: 39.66 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 63000 | lm loss value: 1.847091E+00 | lm loss PPL: 6.341344E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 63000 to checkpoints_1b5 0: [2022-11-26 14:44:59,003] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step63000 is begin to save! 0: [2022-11-26 14:44:59,011] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_01-model_00-model_states.pt... 0: [2022-11-26 14:44:59,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_01-model_00-model_states.pt. 0: [2022-11-26 14:44:59,267] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_03-model_00-model_states.pt... 0: [2022-11-26 14:44:59,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_03-model_00-model_states.pt. 0: [2022-11-26 14:44:59,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_04-model_00-model_states.pt... 0: [2022-11-26 14:44:59,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_04-model_00-model_states.pt. 0: [2022-11-26 14:44:59,487] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_05-model_00-model_states.pt... 0: [2022-11-26 14:44:59,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_05-model_00-model_states.pt. 0: [2022-11-26 14:44:59,598] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_06-model_00-model_states.pt... 0: [2022-11-26 14:44:59,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_06-model_00-model_states.pt. 0: [2022-11-26 14:44:59,708] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_07-model_00-model_states.pt... 0: [2022-11-26 14:44:59,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 14:44:59,816] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_08-model_00-model_states.pt... 0: [2022-11-26 14:44:59,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_08-model_00-model_states.pt. 0: [2022-11-26 14:44:59,923] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_09-model_00-model_states.pt... 0: [2022-11-26 14:45:00,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_09-model_00-model_states.pt. 0: [2022-11-26 14:45:00,030] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_10-model_00-model_states.pt... 0: [2022-11-26 14:45:00,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_10-model_00-model_states.pt. 0: [2022-11-26 14:45:00,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_11-model_00-model_states.pt... 0: [2022-11-26 14:45:00,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_11-model_00-model_states.pt. 0: [2022-11-26 14:45:00,242] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_12-model_00-model_states.pt... 0: [2022-11-26 14:45:00,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_12-model_00-model_states.pt. 0: [2022-11-26 14:45:00,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_13-model_00-model_states.pt... 0: [2022-11-26 14:45:00,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 14:45:00,452] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_14-model_00-model_states.pt... 0: [2022-11-26 14:45:00,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_14-model_00-model_states.pt. 0: [2022-11-26 14:45:00,552] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_15-model_00-model_states.pt... 0: [2022-11-26 14:45:00,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_15-model_00-model_states.pt. 0: [2022-11-26 14:45:00,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_16-model_00-model_states.pt... 0: [2022-11-26 14:45:00,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_16-model_00-model_states.pt. 0: [2022-11-26 14:45:00,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_17-model_00-model_states.pt... 0: [2022-11-26 14:45:00,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_17-model_00-model_states.pt. 0: [2022-11-26 14:45:00,864] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_18-model_00-model_states.pt... 0: [2022-11-26 14:45:00,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_18-model_00-model_states.pt. 0: [2022-11-26 14:45:00,974] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_19-model_00-model_states.pt... 0: [2022-11-26 14:45:01,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 14:45:01,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_20-model_00-model_states.pt... 0: [2022-11-26 14:45:01,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_20-model_00-model_states.pt. 0: [2022-11-26 14:45:01,177] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_21-model_00-model_states.pt... 0: [2022-11-26 14:45:01,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_21-model_00-model_states.pt. 0: [2022-11-26 14:45:01,281] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_22-model_00-model_states.pt... 0: [2022-11-26 14:45:01,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_22-model_00-model_states.pt. 0: [2022-11-26 14:45:01,384] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_23-model_00-model_states.pt... 0: [2022-11-26 14:45:01,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_23-model_00-model_states.pt. 0: [2022-11-26 14:45:01,489] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_24-model_00-model_states.pt... 0: [2022-11-26 14:45:01,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_24-model_00-model_states.pt. 0: [2022-11-26 14:45:01,591] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_25-model_00-model_states.pt... 0: [2022-11-26 14:45:01,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 14:45:01,696] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_26-model_00-model_states.pt... 0: [2022-11-26 14:45:01,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_26-model_00-model_states.pt. 0: [2022-11-26 14:45:01,799] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_27-model_00-model_states.pt... 0: [2022-11-26 14:45:01,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_27-model_00-model_states.pt. 0: [2022-11-26 14:45:01,905] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_28-model_00-model_states.pt... 0: [2022-11-26 14:45:02,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_28-model_00-model_states.pt. 0: [2022-11-26 14:45:02,008] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_29-model_00-model_states.pt... 0: [2022-11-26 14:45:02,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_29-model_00-model_states.pt. 0: [2022-11-26 14:45:02,110] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_30-model_00-model_states.pt... 0: [2022-11-26 14:45:02,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_30-model_00-model_states.pt. 0: [2022-11-26 14:45:02,215] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/layer_32-model_00-model_states.pt... 0: [2022-11-26 14:45:02,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 14:45:02,220] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step63000/mp_rank_00_model_states.pt 0: [2022-11-26 14:45:02,220] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/mp_rank_00_model_states.pt... 0: [2022-11-26 14:45:02,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/mp_rank_00_model_states.pt. 0: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
10: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
8: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
15: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
9: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
6: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
3: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 4: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
11: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 5: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 2: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 11: [2022-11-26 14:45:02,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step63000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 12: [2022-11-26 14:45:02,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 14:45:02,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 14:45:02,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 3: [2022-11-26 14:45:02,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:45:02,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 14:45:02,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 0: [2022-11-26 14:45:02,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:45:02,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 14:45:02,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 11: [2022-11-26 14:45:02,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:45:02,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:45:02,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 14:45:02,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
0: [2022-11-26 14:45:02,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:45:02,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 14:45:02,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 6: [2022-11-26 14:45:02,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:45:02,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:45:02,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 14:45:02,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 14:45:02,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 6: [2022-11-26 14:45:02,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 10: [2022-11-26 14:45:02,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:45:02,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 14:45:02,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 14:45:02,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 14:45:02,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 10: [2022-11-26 14:45:02,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 10: [2022-11-26 14:45:02,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:45:02,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 14:45:02,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 10: [2022-11-26 14:45:02,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:45:02,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 14:45:02,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 7: [2022-11-26 14:45:02,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 14:45:02,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 14:45:02,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:45:02,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 7: [2022-11-26 14:45:02,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 14:45:02,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:45:02,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 7: [2022-11-26 14:45:02,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 14:45:02,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 10: [2022-11-26 14:45:02,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:45:02,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 14:45:02,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 0: [2022-11-26 14:45:02,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 14:45:02,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 14:45:02,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 10: [2022-11-26 14:45:02,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:45:02,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 14:45:02,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 7: [2022-11-26 14:45:02,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:45:02,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:45:02,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 14:45:02,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 7: [2022-11-26 14:45:02,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 14:45:02,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 7: [2022-11-26 14:45:02,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 14:45:02,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 14:45:02,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 9: [2022-11-26 14:45:02,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:45:02,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 14:45:02,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 9: [2022-11-26 14:45:02,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:45:02,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 14:45:02,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 9: [2022-11-26 14:45:02,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:45:02,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:45:02,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:45:02,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 14:45:02,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:45:02,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 14:45:02,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 14:45:02,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 14:45:02,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 9: [2022-11-26 14:45:02,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 1: [2022-11-26 14:45:02,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 1: [2022-11-26 14:45:02,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 1: [2022-11-26 14:45:02,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 1: [2022-11-26 14:45:02,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 9: [2022-11-26 14:45:02,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 9: [2022-11-26 14:45:02,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 14:45:02,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 14:45:02,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 5: [2022-11-26 14:45:02,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:45:02,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 14:45:02,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 5: [2022-11-26 14:45:02,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:45:02,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 14:45:02,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 1: [2022-11-26 14:45:02,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:45:02,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:45:02,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 14:45:02,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
5: [2022-11-26 14:45:02,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:45:02,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 14:45:02,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 9: [2022-11-26 14:45:02,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:45:02,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 14:45:02,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 5: [2022-11-26 14:45:02,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:45:02,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 14:45:02,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 12: [2022-11-26 14:45:02,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:45:02,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 3: [2022-11-26 14:45:02,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
12: [2022-11-26 14:45:02,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 3: [2022-11-26 14:45:02,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 14:45:02,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 3: [2022-11-26 14:45:02,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:45:02,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:45:02,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 14:45:02,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 14:45:02,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 3: [2022-11-26 14:45:02,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 12: [2022-11-26 14:45:02,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:45:02,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 14:45:02,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
12: [2022-11-26 14:45:02,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:45:02,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 14:45:02,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 3: [2022-11-26 14:45:02,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:45:02,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 14:45:02,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 3: [2022-11-26 14:45:02,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:45:02,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 14:45:02,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 3: [2022-11-26 14:45:02,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:45:02,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 14:45:02,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
3: [2022-11-26 14:45:02,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 14:45:02,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 14:45:02,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 7: [2022-11-26 14:45:02,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:45:02,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 14:45:02,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 7: [2022-11-26 14:45:02,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 14:45:02,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 14:45:02,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 6: [2022-11-26 14:45:02,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:45:02,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 14:45:02,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 14:45:02,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:45:02,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 6: [2022-11-26 14:45:02,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 14:45:02,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 14:45:02,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 6: [2022-11-26 14:45:02,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 4: [2022-11-26 14:45:02,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:45:02,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 14:45:02,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 4: [2022-11-26 14:45:02,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 14:45:02,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 14:45:02,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 4: [2022-11-26 14:45:02,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:45:02,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 14:45:02,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 4: [2022-11-26 14:45:02,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:45:02,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 14:45:02,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 4: [2022-11-26 14:45:02,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:45:02,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 14:45:02,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 4: [2022-11-26 14:45:02,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 14:45:02,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 14:45:02,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 4: [2022-11-26 14:45:02,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:45:02,450] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 14:45:02,450] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 4: [2022-11-26 14:45:02,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 14:45:02,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 14:45:02,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 0: [2022-11-26 14:45:02,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:45:02,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 14:45:02,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 0: [2022-11-26 14:45:02,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 14:45:02,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 14:45:02,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 5: [2022-11-26 14:45:02,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:45:02,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 14:45:02,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 5: [2022-11-26 14:45:02,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:45:02,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 14:45:02,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 5: [2022-11-26 14:45:02,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 14:45:02,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 14:45:02,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 10: [2022-11-26 14:45:02,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 14:45:02,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 14:45:02,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 12: [2022-11-26 14:45:02,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:45:02,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 14:45:02,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 12: [2022-11-26 14:45:02,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:45:02,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 14:45:02,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 12: [2022-11-26 14:45:02,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 14:45:02,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 14:45:02,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 12: [2022-11-26 14:45:02,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 14:45:02,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 14:45:02,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 11: [2022-11-26 14:45:02,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 14:45:02,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 11: [2022-11-26 14:45:02,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:45:02,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 14:45:02,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 11: [2022-11-26 14:45:02,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:45:02,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 14:45:02,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 11: [2022-11-26 14:45:02,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 14:45:02,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 14:45:02,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 11: [2022-11-26 14:45:02,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:45:02,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 14:45:02,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 11: [2022-11-26 14:45:02,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:45:02,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 14:45:02,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 6: [2022-11-26 14:45:02,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:45:02,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 14:45:02,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 6: [2022-11-26 14:45:02,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 14:45:02,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 14:45:02,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 6: [2022-11-26 14:45:02,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 14:45:02,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 14:45:02,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 10: [2022-11-26 14:45:02,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 14:45:02,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 14:45:02,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 9: [2022-11-26 14:45:02,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:45:02,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 14:45:02,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 9: [2022-11-26 14:45:02,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 14:45:02,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 14:45:02,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 9: [2022-11-26 14:45:02,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 14:45:02,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 14:45:02,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 11: [2022-11-26 14:45:02,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:45:02,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 14:45:02,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 14:45:02,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 14:45:02,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 11: [2022-11-26 14:45:02,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
1: [2022-11-26 14:45:02,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 14:45:02,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 1: [2022-11-26 14:45:02,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:45:02,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 14:45:02,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 1: [2022-11-26 14:45:02,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:45:02,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 14:45:02,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 0: [2022-11-26 14:45:02,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 14:45:02,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 14:45:02,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 0: [2022-11-26 14:45:02,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
1: [2022-11-26 14:45:02,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 14:45:02,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 14:45:02,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 0: [2022-11-26 14:45:02,594] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 14:45:02,594] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 13: [2022-11-26 14:45:02,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:45:02,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:45:02,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:45:02,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:45:02,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:45:02,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:45:02,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 14:45:02,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 14:45:02,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 14:45:02,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 14:45:02,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 14:45:02,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 14:45:02,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 14:45:02,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 14:45:02,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 13: [2022-11-26 14:45:02,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 13: [2022-11-26 14:45:02,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 13: [2022-11-26 14:45:02,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 13: [2022-11-26 14:45:02,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
13: [2022-11-26 14:45:02,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 13: [2022-11-26 14:45:02,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 13: [2022-11-26 14:45:02,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 14:45:02,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 14:45:02,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 14: [2022-11-26 14:45:02,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:45:02,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:45:02,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:45:02,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:45:02,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:45:02,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:45:02,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 14:45:02,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 14:45:02,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 14:45:02,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 14:45:02,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 14:45:02,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 14:45:02,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 14:45:02,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 14:45:02,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 14:45:02,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 14: [2022-11-26 14:45:02,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 14: [2022-11-26 14:45:02,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 14: [2022-11-26 14:45:02,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
14: [2022-11-26 14:45:02,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 14: [2022-11-26 14:45:02,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 14: [2022-11-26 14:45:02,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 14:45:02,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 14: [2022-11-26 14:45:02,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 8: [2022-11-26 14:45:02,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:45:02,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:45:02,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:45:02,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:45:02,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:45:02,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:45:02,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 14:45:02,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 14:45:02,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 14:45:02,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 14:45:02,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 14:45:02,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 14:45:02,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 14:45:02,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 8: [2022-11-26 14:45:02,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 14:45:02,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 8: [2022-11-26 14:45:02,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 8: [2022-11-26 14:45:02,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 8: [2022-11-26 14:45:02,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 
8: [2022-11-26 14:45:02,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 8: [2022-11-26 14:45:02,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 8: [2022-11-26 14:45:02,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 14:45:02,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 14:45:02,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 2: [2022-11-26 14:45:02,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:45:02,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:45:02,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 14:45:02,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 14:45:02,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 14:45:02,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now! 2: [2022-11-26 14:45:02,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 14:45:02,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt
2: [2022-11-26 14:45:02,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt.
2: [2022-11-26 14:45:02,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now!
2: [2022-11-26 14:45:02,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now!
2: [2022-11-26 14:45:02,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt
2: [2022-11-26 14:45:02,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt
2: [2022-11-26 14:45:02,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now!
2: [2022-11-26 14:45:02,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now!
2: [2022-11-26 14:45:02,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt.
2: [2022-11-26 14:45:02,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt.
2: [2022-11-26 14:45:02,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt.
2: [2022-11-26 14:45:02,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt
2: [2022-11-26 14:45:02,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt
2: [2022-11-26 14:45:02,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now!
2: [2022-11-26 14:45:02,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt
2: [2022-11-26 14:45:02,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now!
2: [2022-11-26 14:45:02,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now!
15: [2022-11-26 14:45:02,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt.
15: [2022-11-26 14:45:02,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt.
15: [2022-11-26 14:45:02,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt.
15: [2022-11-26 14:45:02,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt.
15: [2022-11-26 14:45:02,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt.
15: [2022-11-26 14:45:02,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt.
15: [2022-11-26 14:45:02,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt.
15: [2022-11-26 14:45:02,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt.
15: [2022-11-26 14:45:02,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt
15: [2022-11-26 14:45:02,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt
15: [2022-11-26 14:45:02,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt
15: [2022-11-26 14:45:02,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt
15: [2022-11-26 14:45:02,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt
15: [2022-11-26 14:45:02,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt
15: [2022-11-26 14:45:02,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now!
15: [2022-11-26 14:45:02,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt
15: [2022-11-26 14:45:02,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step63000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt
15: [2022-11-26 14:45:02,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now!
15: [2022-11-26 14:45:02,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now!
15: [2022-11-26 14:45:02,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now!
15: [2022-11-26 14:45:02,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now!
15: [2022-11-26 14:45:02,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now!
15: [2022-11-26 14:45:02,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now!
15: [2022-11-26 14:45:02,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step63000 is ready now!
0: successfully saved checkpoint at iteration 63000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3757.22
15: iteration 63010/ 125429 | consumed samples: 16130560 | consumed tokens: 33035386880 | elapsed time per iteration (s): 1.44 | learning rate: 1.108E-04 | global batch size: 256 | lm loss: 1.973305E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.632 | TFLOPs: 29.36 |
15: iteration 63020/ 125429 | consumed samples: 16133120 | consumed tokens: 33040629760 | elapsed time per iteration (s): 1.03 | learning rate: 1.107E-04 | global batch size: 256 | lm loss: 1.945246E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.452 | TFLOPs: 41.22 |
15: iteration 63030/ 125429 | consumed samples: 16135680 | consumed tokens: 33045872640 | elapsed time per iteration (s): 1.03 | learning rate: 1.107E-04 | global batch size: 256 | lm loss: 2.030341E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.398 | TFLOPs: 41.05 |
15: iteration 63040/ 125429 | consumed samples: 16138240 | consumed tokens: 33051115520 | elapsed time per iteration (s): 1.03 | learning rate: 1.107E-04 | global batch size: 256 | lm loss: 1.985161E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.145 | TFLOPs: 41.01 |
15: iteration 63050/ 125429 | consumed samples: 16140800 | consumed tokens: 33056358400 | elapsed time per iteration (s): 1.06 | learning rate: 1.107E-04 | global batch size: 256 | lm loss: 1.982751E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.747 | TFLOPs: 39.95 |
15: iteration 63060/ 125429 | consumed samples: 16143360 | consumed tokens: 33061601280 | elapsed time per iteration (s): 1.03 | learning rate: 1.106E-04 | global batch size: 256 | lm loss: 2.022340E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.326 | TFLOPs: 41.20 |
15: iteration 63070/ 125429 | consumed samples: 16145920 | consumed tokens: 33066844160 | elapsed time per iteration (s): 1.03 | learning rate: 1.106E-04 | global batch size: 256 | lm loss: 2.004533E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.312 | TFLOPs: 41.20 |
15: iteration 63080/ 125429 | consumed samples: 16148480 | consumed tokens: 33072087040 | elapsed time per iteration (s): 1.03 | learning rate: 1.106E-04 | global batch size: 256 | lm loss: 1.968291E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.554 | TFLOPs: 41.08 |
15: iteration 63090/ 125429 | consumed samples: 16151040 | consumed tokens: 33077329920 | elapsed time per iteration (s): 1.03 | learning rate: 1.106E-04 | global batch size: 256 | lm loss: 1.974458E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.662 | TFLOPs: 41.09 |
15: iteration 63100/ 125429 | consumed samples: 16153600 | consumed tokens: 33082572800 | elapsed time per iteration (s): 1.05 | learning rate: 1.106E-04 | global batch size: 256 | lm loss: 2.002952E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.517 | TFLOPs: 40.24 |
15: iteration 63110/ 125429 | consumed samples: 16156160 | consumed tokens: 33087815680 | elapsed time per iteration (s): 1.08 | learning rate: 1.105E-04 | global batch size: 256 | lm loss: 1.961386E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.994 | TFLOPs: 39.00 |
15: iteration 63120/ 125429 | consumed samples: 16158720 | consumed tokens: 33093058560 | elapsed time per iteration (s): 1.08 | learning rate: 1.105E-04 | global batch size: 256 | lm loss: 1.964764E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.219 | TFLOPs: 39.20 |
15: iteration 63130/ 125429 | consumed samples: 16161280 | consumed tokens: 33098301440 | elapsed time per iteration (s): 1.04 | learning rate: 1.105E-04 | global batch size: 256 | lm loss: 1.979277E+00 | grad norm: 0.361 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.140 | TFLOPs: 40.68 |
15: iteration 63140/ 125429 | consumed samples: 16163840 | consumed tokens: 33103544320 | elapsed time per iteration (s): 1.05 | learning rate: 1.105E-04 | global batch size: 256 | lm loss: 1.996666E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.370 | TFLOPs: 40.38 |
15: iteration 63150/ 125429 | consumed samples: 16166400 | consumed tokens: 33108787200 | elapsed time per iteration (s): 1.06 | learning rate: 1.104E-04 | global batch size: 256 | lm loss: 1.983044E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.989 | TFLOPs: 39.99 |
15: iteration 63160/ 125429 | consumed samples: 16168960 | consumed tokens: 33114030080 | elapsed time per iteration (s): 1.04 | learning rate: 1.104E-04 | global batch size: 256 | lm loss: 1.980758E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.157 | TFLOPs: 40.68 |
15: iteration 63170/ 125429 | consumed samples: 16171520 | consumed tokens: 33119272960 | elapsed time per iteration (s): 1.03 | learning rate: 1.104E-04 | global batch size: 256 | lm loss: 1.950393E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.939 | TFLOPs: 41.14 |
15: iteration 63180/ 125429 | consumed samples: 16174080 | consumed tokens: 33124515840 | elapsed time per iteration (s): 1.04 | learning rate: 1.104E-04 | global batch size: 256 | lm loss: 2.000949E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.459 | TFLOPs: 40.56 |
15: iteration 63190/ 125429 | consumed samples: 16176640 | consumed tokens: 33129758720 | elapsed time per iteration (s): 1.04 | learning rate: 1.103E-04 | global batch size: 256 | lm loss: 1.988593E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.742 | TFLOPs: 40.61 |
15: iteration 63200/ 125429 | consumed samples: 16179200 | consumed tokens: 33135001600 | elapsed time per iteration (s): 1.04 | learning rate: 1.103E-04 | global batch size: 256 | lm loss: 1.986570E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.998 | TFLOPs: 40.65 |
15: iteration 63210/ 125429 | consumed samples: 16181760 | consumed tokens: 33140244480 | elapsed time per iteration (s): 1.06 | learning rate: 1.103E-04 | global batch size: 256 | lm loss: 2.014505E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.799 | TFLOPs: 39.79 |
15: iteration 63220/ 125429 | consumed samples: 16184320 | consumed tokens: 33145487360 | elapsed time per iteration (s): 1.02 | learning rate: 1.103E-04 | global batch size: 256 | lm loss: 1.953053E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.667 | TFLOPs: 41.42 |
15: iteration 63230/ 125429 | consumed samples: 16186880 | consumed tokens: 33150730240 | elapsed time per iteration (s): 1.03 | learning rate: 1.103E-04 | global batch size: 256 | lm loss: 1.990505E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.790 | TFLOPs: 41.11 |
15: iteration 63240/ 125429 | consumed samples: 16189440 | consumed tokens: 33155973120 | elapsed time per iteration (s): 1.05 | learning rate: 1.102E-04 | global batch size: 256 | lm loss: 1.977996E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.930 | TFLOPs: 40.48 |
15: iteration 63250/ 125429 | consumed samples: 16192000 | consumed tokens: 33161216000 | elapsed time per iteration (s): 1.04 | learning rate: 1.102E-04 | global batch size: 256 | lm loss: 1.947513E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.807 | TFLOPs: 40.62 |
15: iteration 63260/ 125429 | consumed samples: 16194560 | consumed tokens: 33166458880 | elapsed time per iteration (s): 1.03 | learning rate: 1.102E-04 | global batch size: 256 | lm loss: 1.967374E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.402 | TFLOPs: 41.22 |
15: iteration 63270/ 125429 | consumed samples: 16197120 | consumed tokens: 33171701760 | elapsed time per iteration (s): 1.03 | learning rate: 1.102E-04 | global batch size: 256 | lm loss: 1.962486E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.457 | TFLOPs: 41.22 |
15: iteration 63280/ 125429 | consumed samples: 16199680 | consumed tokens: 33176944640 | elapsed time per iteration (s): 1.07 | learning rate: 1.101E-04 | global batch size: 256 | lm loss: 1.981450E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.914 | TFLOPs: 39.65 |
15: iteration 63290/ 125429 | consumed samples: 16202240 | consumed tokens: 33182187520 | elapsed time per iteration (s): 1.02 | learning rate: 1.101E-04 | global batch size: 256 | lm loss: 1.970917E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.890 | TFLOPs: 41.30 |
15: iteration 63300/ 125429 | consumed samples: 16204800 | consumed tokens: 33187430400 | elapsed time per iteration (s): 1.04 | learning rate: 1.101E-04 | global batch size: 256 | lm loss: 1.974990E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.785 | TFLOPs: 40.62 |
15: iteration 63310/ 125429 | consumed samples: 16207360 | consumed tokens: 33192673280 | elapsed time per iteration (s): 1.05 | learning rate: 1.101E-04 | global batch size: 256 | lm loss: 1.991416E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.876 | TFLOPs: 40.14 |
15: iteration 63320/ 125429 | consumed samples: 16209920 | consumed tokens: 33197916160 | elapsed time per iteration (s): 1.06 | learning rate: 1.100E-04 | global batch size: 256 | lm loss: 1.980605E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.721 | TFLOPs: 39.78 |
15: iteration 63330/ 125429 | consumed samples: 16212480 | consumed tokens: 33203159040 | elapsed time per iteration (s): 1.06 | learning rate: 1.100E-04 | global batch size: 256 | lm loss: 1.983902E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.188 | TFLOPs: 39.86 |
15: iteration 63340/ 125429 | consumed samples: 16215040 | consumed tokens: 33208401920 | elapsed time per iteration (s): 1.05 | learning rate: 1.100E-04 | global batch size: 256 | lm loss: 1.963248E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.732 | TFLOPs: 40.11 |
15: iteration 63350/ 125429 | consumed samples: 16217600 | consumed tokens: 33213644800 | elapsed time per iteration (s): 1.05 | learning rate: 1.100E-04 | global batch size: 256 | lm loss: 1.990283E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.412 | TFLOPs: 40.39 |
15: iteration 63360/ 125429 | consumed samples: 16220160 | consumed tokens: 33218887680 | elapsed time per iteration (s): 1.03 | learning rate: 1.100E-04 | global batch size: 256 | lm loss: 1.989087E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.292 | TFLOPs: 41.20 |
15: iteration 63370/ 125429 | consumed samples: 16222720 | consumed tokens: 33224130560 | elapsed time per iteration (s): 1.07 | learning rate: 1.099E-04 | global batch size: 256 | lm loss: 1.994046E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.329 | TFLOPs: 39.72 |
15: iteration 63380/ 125429 | consumed samples: 16225280 | consumed tokens: 33229373440 | elapsed time per iteration (s): 1.05 | learning rate: 1.099E-04 | global batch size: 256 | lm loss: 1.962202E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.151 | TFLOPs: 40.35 |
15: iteration 63390/ 125429 | consumed samples: 16227840 | consumed tokens: 33234616320 | elapsed time per iteration (s): 1.04 | learning rate: 1.099E-04 | global batch size: 256 | lm loss: 1.982700E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.048 | TFLOPs: 40.66 |
15: iteration 63400/ 125429 | consumed samples: 16230400 | consumed tokens: 33239859200 | elapsed time per iteration (s): 1.04 | learning rate: 1.099E-04 | global batch size: 256 | lm loss: 1.990952E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.359 | TFLOPs: 40.55 |
15: iteration 63410/ 125429 | consumed samples: 16232960 | consumed tokens: 33245102080 | elapsed time per iteration (s): 1.06 | learning rate: 1.098E-04 | global batch size: 256 | lm loss: 1.978299E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.837 | TFLOPs: 39.97 |
15: iteration 63420/ 125429 | consumed samples: 16235520 | consumed tokens: 33250344960 | elapsed time per iteration (s): 1.02 | learning rate: 1.098E-04 | global batch size: 256 | lm loss: 1.989259E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.348 | TFLOPs: 41.37 |
15: iteration 63430/ 125429 | consumed samples: 16238080 | consumed tokens: 33255587840 | elapsed time per iteration (s): 1.08 | learning rate: 1.098E-04 | global batch size: 256 | lm loss: 1.990419E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.032 | TFLOPs: 39.34 |
15: iteration 63440/ 125429 | consumed samples: 16240640 | consumed tokens: 33260830720 | elapsed time per iteration (s): 1.04 | learning rate: 1.098E-04 | global batch size: 256 | lm loss: 1.994064E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.197 | TFLOPs: 40.85 |
15: iteration 63450/ 125429 | consumed samples: 16243200 | consumed tokens: 33266073600 | elapsed time per iteration (s): 1.03 | learning rate: 1.098E-04 | global batch size: 256 | lm loss: 1.963565E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.373 | TFLOPs: 40.88 |
15: iteration 63460/ 125429 | consumed samples: 16245760 | consumed tokens: 33271316480 | elapsed time per iteration (s): 1.03 | learning rate: 1.097E-04 | global batch size: 256 | lm loss: 1.973540E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.550 | TFLOPs: 40.91 |
15: iteration 63470/ 125429 | consumed samples: 16248320 | consumed tokens: 33276559360 | elapsed time per iteration (s): 1.08 | learning rate: 1.097E-04 | global batch size: 256 | lm loss: 1.994363E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.067 | TFLOPs: 39.34 |
15: iteration 63480/ 125429 | consumed samples: 16250880 | consumed tokens: 33281802240 | elapsed time per iteration (s): 1.05 | learning rate: 1.097E-04 | global batch size: 256 | lm loss: 1.975065E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.956 | TFLOPs: 40.48 |
15: iteration 63490/ 125429 | consumed samples: 16253440 | consumed tokens: 33287045120 | elapsed time per iteration (s): 1.05 | learning rate: 1.097E-04 | global batch size: 256 | lm loss: 1.994853E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.943 | TFLOPs: 40.48 |
15: iteration 63500/ 125429 | consumed samples: 16256000 | consumed tokens: 33292288000 | elapsed time per iteration (s): 1.03 | learning rate: 1.096E-04 | global batch size: 256 | lm loss: 1.997750E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.065 | TFLOPs: 40.99 |
15: iteration 63510/ 125429 | consumed samples: 16258560 | consumed tokens: 33297530880 | elapsed time per iteration (s): 1.03 | learning rate: 1.096E-04 | global batch size: 256 | lm loss: 1.991607E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.601 | TFLOPs: 41.08 |
15: iteration 63520/ 125429 | consumed samples: 16261120 | consumed tokens: 33302773760 | elapsed time per iteration (s): 1.02 | learning rate: 1.096E-04 | global batch size: 256 | lm loss: 1.986237E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.573 | TFLOPs: 41.57 |
15: iteration 63530/ 125429 | consumed samples: 16263680 | consumed tokens: 33308016640 | elapsed time per iteration (s): 1.04 | learning rate: 1.096E-04 | global batch size: 256 | lm loss: 1.982114E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.818 | TFLOPs: 40.62 |
15: iteration 63540/ 125429 | consumed samples: 16266240 | consumed tokens: 33313259520 | elapsed time per iteration (s): 1.03 | learning rate: 1.095E-04 | global batch size: 256 | lm loss: 1.967146E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.672 | TFLOPs: 41.26 |
15: iteration 63550/ 125429 | consumed samples: 16268800 | consumed tokens: 33318502400 | elapsed time per iteration (s): 1.05 | learning rate: 1.095E-04 | global batch size: 256 | lm loss: 1.999370E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.505 | TFLOPs: 40.41 |
15: iteration 63560/ 125429 | consumed samples: 16271360 | consumed tokens: 33323745280 | elapsed time per iteration (s): 1.06 | learning rate: 1.095E-04 | global batch size: 256 | lm loss: 2.006608E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.515 | TFLOPs: 40.08 |
15: iteration 63570/ 125429 | consumed samples: 16273920 | consumed tokens: 33328988160 | elapsed time per iteration (s): 1.05 | learning rate: 1.095E-04 | global batch size: 256 | lm loss: 1.971671E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.072 | TFLOPs: 40.17 |
15: iteration 63580/ 125429 | consumed samples: 16276480 | consumed tokens: 33334231040 | elapsed time per iteration (s): 1.10 | learning rate: 1.095E-04 | global batch size: 256 | lm loss: 2.000999E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.486 | TFLOPs: 38.59 |
15: iteration 63590/ 125429 | consumed samples: 16279040 | consumed tokens: 33339473920 | elapsed time per iteration (s): 1.03 | learning rate: 1.094E-04 | global batch size: 256 | lm loss: 1.988977E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.834 | TFLOPs: 40.96 |
15: iteration 63600/ 125429 | consumed samples: 16281600 | consumed tokens: 33344716800 | elapsed time per iteration (s): 1.07 | learning rate: 1.094E-04 | global batch size: 256 | lm loss: 1.971521E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.346 | TFLOPs: 39.39 |
15: iteration 63610/ 125429 | consumed samples: 16284160 | consumed tokens: 33349959680 | elapsed time per iteration (s): 1.06 | learning rate: 1.094E-04 | global batch size: 256 | lm loss: 1.992058E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.442 | TFLOPs: 39.90 |
15: iteration 63620/ 125429 | consumed samples: 16286720 | consumed tokens: 33355202560 | elapsed time per iteration (s): 1.04 | learning rate: 1.094E-04 | global batch size: 256 | lm loss: 2.025047E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.224 | TFLOPs: 40.69 |
15: iteration 63630/ 125429 | consumed samples: 16289280 | consumed tokens: 33360445440 | elapsed time per iteration (s): 1.05 | learning rate: 1.093E-04 | global batch size: 256 | lm loss: 1.996293E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.954 | TFLOPs: 40.32 |
15: iteration 63640/ 125429 | consumed samples: 16291840 | consumed tokens: 33365688320 | elapsed time per iteration (s): 1.04 | learning rate: 1.093E-04 | global batch size: 256 | lm loss: 1.974902E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.132 | TFLOPs: 40.68 |
15: iteration 63650/ 125429 | consumed samples: 16294400 | consumed tokens: 33370931200 | elapsed time per iteration (s): 1.04 | learning rate: 1.093E-04 | global batch size: 256 | lm loss: 1.980158E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.604 | TFLOPs: 40.75 |
15: iteration 63660/ 125429 | consumed samples: 16296960 | consumed tokens: 33376174080 | elapsed time per iteration (s): 1.05 | learning rate: 1.093E-04 | global batch size: 256 | lm loss: 1.969815E+00 | grad norm: 0.124 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.795 | TFLOPs: 40.29 |
15: iteration 63670/ 125429 | consumed samples: 16299520 | consumed tokens: 33381416960 | elapsed time per iteration (s): 1.04 | learning rate: 1.093E-04 | global batch size: 256 | lm loss: 2.002820E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.324 | TFLOPs: 40.54 |
15: iteration 63680/ 125429 | consumed samples: 16302080 | consumed tokens: 33386659840 | elapsed time per iteration (s): 1.03 | learning rate: 1.092E-04 | global batch size: 256 | lm loss: 1.940634E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.599 | TFLOPs: 41.25 |
15: iteration 63690/ 125429 | consumed samples: 16304640 | consumed tokens: 33391902720 | elapsed time per iteration (s): 1.03 | learning rate: 1.092E-04 | global batch size: 256 | lm loss: 1.972394E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.945 | TFLOPs: 40.97 |
15: iteration 63700/ 125429 | consumed samples: 16307200 | consumed tokens: 33397145600 | elapsed time per iteration (s): 1.07 | learning rate: 1.092E-04 | global batch size: 256 | lm loss: 1.966917E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.694 | TFLOPs: 39.61 |
15: iteration 63710/ 125429 | consumed samples: 16309760 | consumed tokens: 33402388480 | elapsed time per iteration (s): 1.07 | learning rate: 1.092E-04 | global batch size: 256 | lm loss: 1.999061E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.008 | TFLOPs: 39.50 |
15: iteration 63720/ 125429 | consumed samples: 16312320 | consumed tokens: 33407631360 | elapsed time per iteration (s): 1.02 | learning rate: 1.091E-04 | global batch size: 256 | lm loss: 1.974982E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.512 | TFLOPs: 41.56 |
15: iteration 63730/ 125429 | consumed samples: 16314880 | consumed tokens: 33412874240 | elapsed time per iteration (s): 1.04 | learning rate: 1.091E-04 | global batch size: 256 | lm loss: 1.975643E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.334 | TFLOPs: 40.54 |
15: iteration 63740/ 125429 | consumed samples: 16317440 | consumed tokens: 33418117120 | elapsed time per iteration (s): 1.05 | learning rate: 1.091E-04 | global batch size: 256 | lm loss: 1.976891E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.532 | TFLOPs: 40.41 |
15: iteration 63750/ 125429 | consumed samples: 16320000 | consumed tokens: 33423360000 | elapsed time per iteration (s): 1.03 | learning rate: 1.091E-04 | global batch size: 256 | lm loss: 2.010498E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.348 | TFLOPs: 40.88 |
15: iteration 63760/ 125429 | consumed samples: 16322560 | consumed tokens: 33428602880 | elapsed time per iteration (s): 1.06 | learning rate: 1.090E-04 | global batch size: 256 | lm loss: 1.972965E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.235 | TFLOPs: 40.03 |
15: iteration 63770/ 125429 | consumed samples: 16325120 | consumed tokens: 33433845760 | elapsed time per iteration (s): 1.03 | learning rate: 1.090E-04 | global batch size: 256 | lm loss: 1.967755E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.532 | TFLOPs: 40.91 |
15: iteration 63780/ 125429 | consumed samples: 16327680 | consumed tokens: 33439088640 | elapsed time per iteration (s): 1.03 | learning rate: 1.090E-04 | global batch size: 256 | lm loss: 1.993749E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.790 | TFLOPs: 41.11 |
15: iteration 63790/ 125429 | consumed samples: 16330240 | consumed tokens: 33444331520 | elapsed time per iteration (s): 1.02 | learning rate: 1.090E-04 | global batch size: 256 | lm loss: 1.996737E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.477 | TFLOPs: 41.39 |
15: iteration 63800/ 125429 | consumed samples: 16332800 | consumed tokens: 33449574400 | elapsed time 
per iteration (s): 1.04 | learning rate: 1.090E-04 | global batch size: 256 | lm loss: 1.999826E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.189 | TFLOPs: 40.52 | 15: iteration 63810/ 125429 | consumed samples: 16335360 | consumed tokens: 33454817280 | elapsed time per iteration (s): 1.03 | learning rate: 1.089E-04 | global batch size: 256 | lm loss: 1.981885E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.387 | TFLOPs: 40.88 | 15: iteration 63820/ 125429 | consumed samples: 16337920 | consumed tokens: 33460060160 | elapsed time per iteration (s): 1.02 | learning rate: 1.089E-04 | global batch size: 256 | lm loss: 1.962286E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.845 | TFLOPs: 41.45 | 15: iteration 63830/ 125429 | consumed samples: 16340480 | consumed tokens: 33465303040 | elapsed time per iteration (s): 1.03 | learning rate: 1.089E-04 | global batch size: 256 | lm loss: 1.987101E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.093 | TFLOPs: 41.00 | 15: iteration 63840/ 125429 | consumed samples: 16343040 | consumed tokens: 33470545920 | elapsed time per iteration (s): 1.03 | learning rate: 1.089E-04 | global batch size: 256 | lm loss: 2.001418E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.791 | TFLOPs: 41.11 | 15: iteration 63850/ 125429 | consumed samples: 16345600 | consumed tokens: 33475788800 | elapsed time per iteration (s): 1.07 | learning rate: 1.088E-04 | global batch size: 256 | lm loss: 1.989387E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.773 | TFLOPs: 
39.46 | 15: iteration 63860/ 125429 | consumed samples: 16348160 | consumed tokens: 33481031680 | elapsed time per iteration (s): 1.07 | learning rate: 1.088E-04 | global batch size: 256 | lm loss: 1.971760E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.732 | TFLOPs: 39.62 | 15: iteration 63870/ 125429 | consumed samples: 16350720 | consumed tokens: 33486274560 | elapsed time per iteration (s): 1.04 | learning rate: 1.088E-04 | global batch size: 256 | lm loss: 1.988727E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.839 | TFLOPs: 40.79 | 15: iteration 63880/ 125429 | consumed samples: 16353280 | consumed tokens: 33491517440 | elapsed time per iteration (s): 1.04 | learning rate: 1.088E-04 | global batch size: 256 | lm loss: 1.984089E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.784 | TFLOPs: 40.78 | 15: iteration 63890/ 125429 | consumed samples: 16355840 | consumed tokens: 33496760320 | elapsed time per iteration (s): 1.05 | learning rate: 1.088E-04 | global batch size: 256 | lm loss: 1.973674E+00 | grad norm: 0.125 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.803 | TFLOPs: 40.46 | 15: iteration 63900/ 125429 | consumed samples: 16358400 | consumed tokens: 33502003200 | elapsed time per iteration (s): 1.04 | learning rate: 1.087E-04 | global batch size: 256 | lm loss: 1.984347E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.013 | TFLOPs: 40.66 | 15: iteration 63910/ 125429 | consumed samples: 16360960 | consumed tokens: 33507246080 | elapsed time per iteration (s): 1.04 | learning rate: 1.087E-04 | global batch size: 256 | lm loss: 1.986298E+00 | grad norm: 0.141 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.198 | TFLOPs: 40.85 | 15: iteration 63920/ 125429 | consumed samples: 16363520 | consumed tokens: 33512488960 | elapsed time per iteration (s): 1.05 | learning rate: 1.087E-04 | global batch size: 256 | lm loss: 1.981785E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.172 | TFLOPs: 40.35 | 15: iteration 63930/ 125429 | consumed samples: 16366080 | consumed tokens: 33517731840 | elapsed time per iteration (s): 1.03 | learning rate: 1.087E-04 | global batch size: 256 | lm loss: 1.972505E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.185 | TFLOPs: 41.01 | 15: iteration 63940/ 125429 | consumed samples: 16368640 | consumed tokens: 33522974720 | elapsed time per iteration (s): 1.08 | learning rate: 1.086E-04 | global batch size: 256 | lm loss: 1.965734E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.954 | TFLOPs: 38.99 | 15: iteration 63950/ 125429 | consumed samples: 16371200 | consumed tokens: 33528217600 | elapsed time per iteration (s): 1.07 | learning rate: 1.086E-04 | global batch size: 256 | lm loss: 2.001746E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.625 | TFLOPs: 39.43 | 15: iteration 63960/ 125429 | consumed samples: 16373760 | consumed tokens: 33533460480 | elapsed time per iteration (s): 1.06 | learning rate: 1.086E-04 | global batch size: 256 | lm loss: 1.963664E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.865 | TFLOPs: 39.80 | 15: iteration 63970/ 125429 | consumed samples: 16376320 | consumed tokens: 33538703360 | elapsed time per iteration (s): 1.05 | 
learning rate: 1.086E-04 | global batch size: 256 | lm loss: 1.965792E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.570 | TFLOPs: 40.42 | 15: iteration 63980/ 125429 | consumed samples: 16378880 | consumed tokens: 33543946240 | elapsed time per iteration (s): 1.04 | learning rate: 1.085E-04 | global batch size: 256 | lm loss: 1.965070E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.404 | TFLOPs: 40.55 | 15: iteration 63990/ 125429 | consumed samples: 16381440 | consumed tokens: 33549189120 | elapsed time per iteration (s): 1.03 | learning rate: 1.085E-04 | global batch size: 256 | lm loss: 1.996314E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.008 | TFLOPs: 40.99 | 0: [2022-11-26 15:02:28,030] [INFO] [logging.py:68:log_dist] [Rank 0] step=64000, skipped=0, lr=[0.0001085010744552345, 0.0001085010744552345, 0.0001085010744552345], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 64000/ 125429 | consumed samples: 16384000 | consumed tokens: 33554432000 | elapsed time per iteration (s): 1.14 | learning rate: 1.085E-04 | global batch size: 256 | lm loss: 1.985041E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.195 | TFLOPs: 37.22 | 0: steps: 64000 loss: 2.0252 iter time (s): 1.052 samples/sec: 243.276 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 64000 | lm loss value: 2.055081E+00 | lm loss PPL: 7.807473E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 64000 to checkpoints_1b5 0: [2022-11-26 15:02:28,387] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step64000 is begin to save! 0: [2022-11-26 15:02:28,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_01-model_00-model_states.pt... 0: [2022-11-26 15:02:28,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_01-model_00-model_states.pt. 0: [2022-11-26 15:02:28,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_03-model_00-model_states.pt... 0: [2022-11-26 15:02:28,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_03-model_00-model_states.pt. 0: [2022-11-26 15:02:28,760] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_04-model_00-model_states.pt... 0: [2022-11-26 15:02:28,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_04-model_00-model_states.pt. 0: [2022-11-26 15:02:28,872] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_05-model_00-model_states.pt... 0: [2022-11-26 15:02:28,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_05-model_00-model_states.pt. 0: [2022-11-26 15:02:28,984] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_06-model_00-model_states.pt... 0: [2022-11-26 15:02:29,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_06-model_00-model_states.pt. 0: [2022-11-26 15:02:29,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_07-model_00-model_states.pt... 0: [2022-11-26 15:02:29,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 15:02:29,206] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_08-model_00-model_states.pt... 0: [2022-11-26 15:02:29,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_08-model_00-model_states.pt. 0: [2022-11-26 15:02:29,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_09-model_00-model_states.pt... 0: [2022-11-26 15:02:29,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_09-model_00-model_states.pt. 0: [2022-11-26 15:02:29,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_10-model_00-model_states.pt... 0: [2022-11-26 15:02:29,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_10-model_00-model_states.pt. 0: [2022-11-26 15:02:29,532] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_11-model_00-model_states.pt... 0: [2022-11-26 15:02:29,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_11-model_00-model_states.pt. 0: [2022-11-26 15:02:29,641] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_12-model_00-model_states.pt... 0: [2022-11-26 15:02:29,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_12-model_00-model_states.pt. 0: [2022-11-26 15:02:29,749] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_13-model_00-model_states.pt... 0: [2022-11-26 15:02:29,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 15:02:29,858] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_14-model_00-model_states.pt... 0: [2022-11-26 15:02:29,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_14-model_00-model_states.pt. 0: [2022-11-26 15:02:29,968] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_15-model_00-model_states.pt... 0: [2022-11-26 15:02:30,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_15-model_00-model_states.pt. 0: [2022-11-26 15:02:30,077] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_16-model_00-model_states.pt... 0: [2022-11-26 15:02:30,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_16-model_00-model_states.pt. 0: [2022-11-26 15:02:30,186] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_17-model_00-model_states.pt... 0: [2022-11-26 15:02:30,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_17-model_00-model_states.pt. 0: [2022-11-26 15:02:30,295] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_18-model_00-model_states.pt... 0: [2022-11-26 15:02:30,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_18-model_00-model_states.pt. 0: [2022-11-26 15:02:30,409] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_19-model_00-model_states.pt... 0: [2022-11-26 15:02:30,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 15:02:30,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_20-model_00-model_states.pt... 0: [2022-11-26 15:02:30,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_20-model_00-model_states.pt. 0: [2022-11-26 15:02:30,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_21-model_00-model_states.pt... 0: [2022-11-26 15:02:30,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_21-model_00-model_states.pt. 0: [2022-11-26 15:02:30,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_22-model_00-model_states.pt... 0: [2022-11-26 15:02:30,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_22-model_00-model_states.pt. 0: [2022-11-26 15:02:30,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_23-model_00-model_states.pt... 0: [2022-11-26 15:02:30,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_23-model_00-model_states.pt. 0: [2022-11-26 15:02:30,936] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_24-model_00-model_states.pt... 0: [2022-11-26 15:02:31,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_24-model_00-model_states.pt. 0: [2022-11-26 15:02:31,043] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_25-model_00-model_states.pt... 0: [2022-11-26 15:02:31,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 15:02:31,149] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_26-model_00-model_states.pt... 0: [2022-11-26 15:02:31,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_26-model_00-model_states.pt. 0: [2022-11-26 15:02:31,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_27-model_00-model_states.pt... 0: [2022-11-26 15:02:31,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_27-model_00-model_states.pt. 0: [2022-11-26 15:02:31,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_28-model_00-model_states.pt... 0: [2022-11-26 15:02:31,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_28-model_00-model_states.pt. 0: [2022-11-26 15:02:31,467] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_29-model_00-model_states.pt... 0: [2022-11-26 15:02:31,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_29-model_00-model_states.pt. 0: [2022-11-26 15:02:31,574] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_30-model_00-model_states.pt... 0: [2022-11-26 15:02:31,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_30-model_00-model_states.pt. 0: [2022-11-26 15:02:31,678] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/layer_32-model_00-model_states.pt... 0: [2022-11-26 15:02:31,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 15:02:31,684] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step64000/mp_rank_00_model_states.pt 0: [2022-11-26 15:02:31,684] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/mp_rank_00_model_states.pt... 0: [2022-11-26 15:02:31,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/mp_rank_00_model_states.pt. 0: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
1: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
5: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
9: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
4: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
13: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
3: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
11: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
15: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
6: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:02:31,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step64000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:02:31,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 15:02:31,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 15:02:31,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 8: [2022-11-26 15:02:31,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:02:31,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 15:02:31,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 10: [2022-11-26 15:02:31,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:02:31,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 15:02:31,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 15: [2022-11-26 15:02:31,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:02:31,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:02:31,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 15:02:31,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
0: [2022-11-26 15:02:31,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:02:31,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 15:02:31,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 8: [2022-11-26 15:02:31,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:02:31,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 15:02:31,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 0: [2022-11-26 15:02:31,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:02:31,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 15:02:31,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 2: [2022-11-26 15:02:31,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:02:31,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 15:02:31,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
3: [2022-11-26 15:02:31,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:02:31,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 15:02:31,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 7: [2022-11-26 15:02:31,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:02:31,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 15:02:31,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 2: [2022-11-26 15:02:31,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:02:31,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 15:02:31,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 0: [2022-11-26 15:02:31,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:02:31,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 15:02:31,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 15:02:31,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 3: [2022-11-26 15:02:31,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:02:31,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:02:31,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 5: [2022-11-26 15:02:31,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 15:02:31,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 3: [2022-11-26 15:02:31,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 9: [2022-11-26 15:02:31,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:02:31,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 15:02:31,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 1: [2022-11-26 15:02:31,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
15: [2022-11-26 15:02:31,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 15:02:31,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 15: [2022-11-26 15:02:31,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:02:31,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 15:02:31,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 15: [2022-11-26 15:02:31,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:02:31,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 15:02:31,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 9: [2022-11-26 15:02:31,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:02:31,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 15:02:31,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 1: [2022-11-26 15:02:31,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
3: [2022-11-26 15:02:31,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:02:31,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 15:02:31,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 2: [2022-11-26 15:02:31,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:02:31,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 15:02:31,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 9: [2022-11-26 15:02:31,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:02:31,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 15:02:31,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 12: [2022-11-26 15:02:31,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:02:31,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 15:02:31,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
12: [2022-11-26 15:02:31,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:02:31,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 15:02:31,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 12: [2022-11-26 15:02:31,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:02:31,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 15:02:31,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 2: [2022-11-26 15:02:31,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:02:31,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 15:02:31,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 5: [2022-11-26 15:02:31,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:02:31,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 15:02:31,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
2: [2022-11-26 15:02:31,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:02:31,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 15:02:31,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 7: [2022-11-26 15:02:31,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:02:31,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 15:02:31,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 3: [2022-11-26 15:02:31,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:02:31,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 15:02:31,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 5: [2022-11-26 15:02:31,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:02:31,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 15:02:31,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
13: [2022-11-26 15:02:31,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:02:31,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:02:31,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 15:02:31,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 8: [2022-11-26 15:02:31,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:02:31,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 15:02:31,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 0: [2022-11-26 15:02:31,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:02:31,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 15:02:31,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 13: [2022-11-26 15:02:31,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:02:31,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 15:02:31,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 15:02:31,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 0: [2022-11-26 15:02:31,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 13: [2022-11-26 15:02:31,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 13: [2022-11-26 15:02:31,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 13: [2022-11-26 15:02:31,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:02:31,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 13: [2022-11-26 15:02:31,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 15:02:31,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 13: [2022-11-26 15:02:31,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:02:31,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 15:02:31,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
8: [2022-11-26 15:02:31,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:02:31,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:02:31,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 0: [2022-11-26 15:02:31,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 8: [2022-11-26 15:02:31,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 0: [2022-11-26 15:02:31,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 5: [2022-11-26 15:02:31,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:02:31,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 15:02:31,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 5: [2022-11-26 15:02:31,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:02:31,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 15:02:31,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
12: [2022-11-26 15:02:31,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:02:31,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 15:02:31,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 1: [2022-11-26 15:02:31,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 15:02:31,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 1: [2022-11-26 15:02:31,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:02:31,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 15:02:31,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 1: [2022-11-26 15:02:31,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:02:31,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 15:02:31,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 1: [2022-11-26 15:02:31,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
5: [2022-11-26 15:02:31,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:02:31,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 0: [2022-11-26 15:02:31,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:02:31,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 0: [2022-11-26 15:02:31,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 15:02:31,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 12: [2022-11-26 15:02:31,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:02:31,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 15:02:31,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 10: [2022-11-26 15:02:31,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:02:31,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 15:02:31,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 15:02:31,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 11: [2022-11-26 15:02:31,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:02:31,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 10: [2022-11-26 15:02:31,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 11: [2022-11-26 15:02:31,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 15:02:31,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 11: [2022-11-26 15:02:31,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:02:31,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 15:02:31,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 5: [2022-11-26 15:02:31,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 15:02:31,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 15:02:31,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 1: [2022-11-26 15:02:31,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 15:02:31,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 2: [2022-11-26 15:02:31,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:02:31,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 15:02:31,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 2: [2022-11-26 15:02:31,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:02:31,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 15:02:31,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 7: [2022-11-26 15:02:31,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:02:31,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 15:02:31,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 15:02:31,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 15:02:31,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 7: [2022-11-26 15:02:31,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 9: [2022-11-26 15:02:31,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:02:31,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 15:02:31,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 9: [2022-11-26 15:02:31,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:02:31,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 15:02:31,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 1: [2022-11-26 15:02:31,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 15:02:31,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 15:02:31,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 11: [2022-11-26 15:02:31,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:02:31,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 15:02:31,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 7: [2022-11-26 15:02:31,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:02:31,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 15:02:31,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 3: [2022-11-26 15:02:31,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:02:31,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 15:02:31,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 3: [2022-11-26 15:02:31,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 15:02:31,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 15:02:31,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 15: [2022-11-26 15:02:31,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 15:02:31,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 15: [2022-11-26 15:02:31,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:02:31,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 15:02:31,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 15: [2022-11-26 15:02:31,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:02:31,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:02:31,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 15:02:31,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 15:02:31,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
15: [2022-11-26 15:02:31,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 15: [2022-11-26 15:02:31,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:02:31,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:02:31,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 15:02:31,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 15:02:31,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 15: [2022-11-26 15:02:31,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 12: [2022-11-26 15:02:31,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:02:31,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 15:02:31,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 3: [2022-11-26 15:02:31,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 15:02:31,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 15:02:31,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 11: [2022-11-26 15:02:31,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:02:31,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:02:31,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 15:02:31,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 15:02:31,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 9: [2022-11-26 15:02:31,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:02:31,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 9: [2022-11-26 15:02:31,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 15:02:31,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 13: [2022-11-26 15:02:31,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
12: [2022-11-26 15:02:31,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:02:31,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 12: [2022-11-26 15:02:31,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 13: [2022-11-26 15:02:31,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 12: [2022-11-26 15:02:31,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 8: [2022-11-26 15:02:31,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:02:31,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:02:31,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 15:02:31,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 15:02:31,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 8: [2022-11-26 15:02:31,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 13: [2022-11-26 15:02:31,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 15:02:31,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 15:02:31,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 8: [2022-11-26 15:02:31,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:02:31,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 2: [2022-11-26 15:02:31,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:02:31,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 15:02:31,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 7: [2022-11-26 15:02:31,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:02:31,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 15:02:31,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 10: [2022-11-26 15:02:31,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 15:02:31,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 15:02:31,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 7: [2022-11-26 15:02:31,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:02:31,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 15:02:31,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 11: [2022-11-26 15:02:31,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:02:31,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 11: [2022-11-26 15:02:31,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 15:02:31,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 0: [2022-11-26 15:02:31,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:02:31,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 15:02:31,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
4: [2022-11-26 15:02:31,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:02:31,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:02:31,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:02:31,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:02:31,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 15:02:31,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 15:02:31,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 15:02:31,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 15:02:31,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 4: [2022-11-26 15:02:31,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 4: [2022-11-26 15:02:31,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 4: [2022-11-26 15:02:31,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
4: [2022-11-26 15:02:31,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:02:31,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 15:02:31,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 0: [2022-11-26 15:02:31,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:02:31,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:02:31,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 15:02:31,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 0: [2022-11-26 15:02:31,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 15:02:31,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 11: [2022-11-26 15:02:31,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:02:31,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:02:31,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 15:02:31,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:02:31,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 15:02:31,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 15:02:31,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 15:02:31,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 10: [2022-11-26 15:02:31,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 10: [2022-11-26 15:02:31,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 11: [2022-11-26 15:02:31,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 15:02:31,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 4: [2022-11-26 15:02:31,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:02:31,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 15:02:31,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
12: [2022-11-26 15:02:31,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:02:31,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 15:02:31,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 5: [2022-11-26 15:02:31,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:02:31,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 15:02:31,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 10: [2022-11-26 15:02:31,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:02:31,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 15:02:31,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 1: [2022-11-26 15:02:31,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:02:31,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 15:02:31,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
7: [2022-11-26 15:02:31,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:02:31,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:02:31,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 15:02:31,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 1: [2022-11-26 15:02:31,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:02:31,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 15:02:31,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 11: [2022-11-26 15:02:31,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 15:02:31,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 4: [2022-11-26 15:02:31,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:02:31,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
4: [2022-11-26 15:02:31,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 9: [2022-11-26 15:02:31,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 4: [2022-11-26 15:02:31,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 9: [2022-11-26 15:02:31,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 6: [2022-11-26 15:02:32,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:02:32,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:02:32,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:02:32,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 15:02:32,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 15:02:32,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 15:02:32,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 15:02:32,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 15:02:32,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 6: [2022-11-26 15:02:32,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 6: [2022-11-26 15:02:32,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 6: [2022-11-26 15:02:32,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 6: [2022-11-26 15:02:32,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:02:32,005] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 15:02:32,005] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 6: [2022-11-26 15:02:32,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 15:02:32,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:02:32,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 15:02:32,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 15:02:32,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 6: [2022-11-26 15:02:32,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 6: [2022-11-26 15:02:32,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:02:32,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 15:02:32,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 14: [2022-11-26 15:02:32,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:02:32,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:02:32,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:02:32,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 15:02:32,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 15:02:32,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 15:02:32,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 15:02:32,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 15:02:32,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 14: [2022-11-26 15:02:32,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 14: [2022-11-26 15:02:32,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 14: [2022-11-26 15:02:32,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 14: [2022-11-26 15:02:32,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:02:32,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 15:02:32,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 14: [2022-11-26 15:02:32,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 15:02:32,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 15:02:32,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 14: [2022-11-26 15:02:32,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:02:32,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 15:02:32,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 14: [2022-11-26 15:02:32,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:02:32,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 15:02:32,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 0: [2022-11-26 15:02:32,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step64000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 15:02:32,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step64000 is ready now! 
0: successfully saved checkpoint at iteration 64000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3775.86 15: iteration 64010/ 125429 | consumed samples: 16386560 | consumed tokens: 33559674880 | elapsed time per iteration (s): 1.46 | learning rate: 1.085E-04 | global batch size: 256 | lm loss: 1.974908E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 174.916 | TFLOPs: 28.91 | 15: iteration 64020/ 125429 | consumed samples: 16389120 | consumed tokens: 33564917760 | elapsed time per iteration (s): 1.09 | learning rate: 1.085E-04 | global batch size: 256 | lm loss: 1.965099E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.295 | TFLOPs: 38.72 | 15: iteration 64030/ 125429 | consumed samples: 16391680 | consumed tokens: 33570160640 | elapsed time per iteration (s): 1.03 | learning rate: 1.084E-04 | global batch size: 256 | lm loss: 1.986082E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.260 | TFLOPs: 41.03 | 15: iteration 64040/ 125429 | consumed samples: 16394240 | consumed tokens: 33575403520 | elapsed time per iteration (s): 1.07 | learning rate: 1.084E-04 | global batch size: 256 | lm loss: 1.962096E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.029 | TFLOPs: 39.67 | 15: iteration 64050/ 125429 | consumed samples: 16396800 | consumed tokens: 33580646400 | elapsed time per iteration (s): 1.05 | learning rate: 1.084E-04 | global batch size: 256 | lm loss: 2.008241E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.076 | TFLOPs: 40.34 | 15: iteration 64060/ 125429 | consumed samples: 16399360 | consumed tokens: 33585889280 | elapsed time per iteration (s): 1.06 | 
learning rate: 1.084E-04 | global batch size: 256 | lm loss: 1.987992E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.696 | TFLOPs: 39.94 | 15: iteration 64070/ 125429 | consumed samples: 16401920 | consumed tokens: 33591132160 | elapsed time per iteration (s): 1.05 | learning rate: 1.083E-04 | global batch size: 256 | lm loss: 2.034221E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.172 | TFLOPs: 40.19 | 15: iteration 64080/ 125429 | consumed samples: 16404480 | consumed tokens: 33596375040 | elapsed time per iteration (s): 1.05 | learning rate: 1.083E-04 | global batch size: 256 | lm loss: 1.954818E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.877 | TFLOPs: 40.30 | 15: iteration 64090/ 125429 | consumed samples: 16407040 | consumed tokens: 33601617920 | elapsed time per iteration (s): 1.04 | learning rate: 1.083E-04 | global batch size: 256 | lm loss: 1.989446E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.112 | TFLOPs: 40.67 | 15: iteration 64100/ 125429 | consumed samples: 16409600 | consumed tokens: 33606860800 | elapsed time per iteration (s): 1.11 | learning rate: 1.083E-04 | global batch size: 256 | lm loss: 1.934960E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.533 | TFLOPs: 38.10 | 15: iteration 64110/ 125429 | consumed samples: 16412160 | consumed tokens: 33612103680 | elapsed time per iteration (s): 1.06 | learning rate: 1.083E-04 | global batch size: 256 | lm loss: 1.982786E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.541 | TFLOPs: 40.08 | 15: iteration 64120/ 
125429 | consumed samples: 16414720 | consumed tokens: 33617346560 | elapsed time per iteration (s): 1.05 | learning rate: 1.082E-04 | global batch size: 256 | lm loss: 1.973771E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.378 | TFLOPs: 40.39 | 15: iteration 64130/ 125429 | consumed samples: 16417280 | consumed tokens: 33622589440 | elapsed time per iteration (s): 1.05 | learning rate: 1.082E-04 | global batch size: 256 | lm loss: 1.971534E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.429 | TFLOPs: 40.23 | 15: iteration 64140/ 125429 | consumed samples: 16419840 | consumed tokens: 33627832320 | elapsed time per iteration (s): 1.03 | learning rate: 1.082E-04 | global batch size: 256 | lm loss: 1.972158E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.587 | TFLOPs: 40.92 | 15: iteration 64150/ 125429 | consumed samples: 16422400 | consumed tokens: 33633075200 | elapsed time per iteration (s): 1.06 | learning rate: 1.082E-04 | global batch size: 256 | lm loss: 1.986015E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.504 | TFLOPs: 40.08 | 15: iteration 64160/ 125429 | consumed samples: 16424960 | consumed tokens: 33638318080 | elapsed time per iteration (s): 1.04 | learning rate: 1.081E-04 | global batch size: 256 | lm loss: 1.972925E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.924 | TFLOPs: 40.64 | 15: iteration 64170/ 125429 | consumed samples: 16427520 | consumed tokens: 33643560960 | elapsed time per iteration (s): 1.05 | learning rate: 1.081E-04 | global batch size: 256 | lm loss: 1.979752E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 244.930 | TFLOPs: 40.48 | 15: iteration 64180/ 125429 | consumed samples: 16430080 | consumed tokens: 33648803840 | elapsed time per iteration (s): 1.05 | learning rate: 1.081E-04 | global batch size: 256 | lm loss: 1.980877E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.556 | TFLOPs: 40.25 | 15: iteration 64190/ 125429 | consumed samples: 16432640 | consumed tokens: 33654046720 | elapsed time per iteration (s): 1.05 | learning rate: 1.081E-04 | global batch size: 256 | lm loss: 1.971437E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.928 | TFLOPs: 40.48 | 15: iteration 64200/ 125429 | consumed samples: 16435200 | consumed tokens: 33659289600 | elapsed time per iteration (s): 1.03 | learning rate: 1.080E-04 | global batch size: 256 | lm loss: 1.986744E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.301 | TFLOPs: 41.20 | 15: iteration 64210/ 125429 | consumed samples: 16437760 | consumed tokens: 33664532480 | elapsed time per iteration (s): 1.03 | learning rate: 1.080E-04 | global batch size: 256 | lm loss: 1.998501E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.442 | TFLOPs: 40.89 | 15: iteration 64220/ 125429 | consumed samples: 16440320 | consumed tokens: 33669775360 | elapsed time per iteration (s): 1.06 | learning rate: 1.080E-04 | global batch size: 256 | lm loss: 2.004904E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.285 | TFLOPs: 40.04 | 15: iteration 64230/ 125429 | consumed samples: 16442880 | consumed tokens: 33675018240 | elapsed time per iteration (s): 1.03 | learning rate: 
1.080E-04 | global batch size: 256 | lm loss: 1.970412E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.793 | TFLOPs: 40.95 | 15: iteration 64240/ 125429 | consumed samples: 16445440 | consumed tokens: 33680261120 | elapsed time per iteration (s): 1.08 | learning rate: 1.080E-04 | global batch size: 256 | lm loss: 1.980836E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.620 | TFLOPs: 39.27 | 15: iteration 64250/ 125429 | consumed samples: 16448000 | consumed tokens: 33685504000 | elapsed time per iteration (s): 1.09 | learning rate: 1.079E-04 | global batch size: 256 | lm loss: 1.965622E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.133 | TFLOPs: 38.86 | 15: iteration 64260/ 125429 | consumed samples: 16450560 | consumed tokens: 33690746880 | elapsed time per iteration (s): 1.05 | learning rate: 1.079E-04 | global batch size: 256 | lm loss: 1.963619E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.279 | TFLOPs: 40.20 | 15: iteration 64270/ 125429 | consumed samples: 16453120 | consumed tokens: 33695989760 | elapsed time per iteration (s): 1.06 | learning rate: 1.079E-04 | global batch size: 256 | lm loss: 1.995740E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.888 | TFLOPs: 39.97 | 15: iteration 64280/ 125429 | consumed samples: 16455680 | consumed tokens: 33701232640 | elapsed time per iteration (s): 1.02 | learning rate: 1.079E-04 | global batch size: 256 | lm loss: 1.986940E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.470 | TFLOPs: 41.39 | 15: iteration 64290/ 125429 | 
consumed samples: 16458240 | consumed tokens: 33706475520 | elapsed time per iteration (s): 1.04 | learning rate: 1.078E-04 | global batch size: 256 | lm loss: 1.978345E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.268 | TFLOPs: 40.86 | 15: iteration 64300/ 125429 | consumed samples: 16460800 | consumed tokens: 33711718400 | elapsed time per iteration (s): 1.02 | learning rate: 1.078E-04 | global batch size: 256 | lm loss: 1.994424E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.819 | TFLOPs: 41.28 | 15: iteration 64310/ 125429 | consumed samples: 16463360 | consumed tokens: 33716961280 | elapsed time per iteration (s): 1.02 | learning rate: 1.078E-04 | global batch size: 256 | lm loss: 1.998288E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.964 | TFLOPs: 41.31 | 15: iteration 64320/ 125429 | consumed samples: 16465920 | consumed tokens: 33722204160 | elapsed time per iteration (s): 1.03 | learning rate: 1.078E-04 | global batch size: 256 | lm loss: 1.984729E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.566 | TFLOPs: 40.91 | 15: iteration 64330/ 125429 | consumed samples: 16468480 | consumed tokens: 33727447040 | elapsed time per iteration (s): 1.04 | learning rate: 1.077E-04 | global batch size: 256 | lm loss: 1.986052E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.389 | TFLOPs: 40.55 | 15: iteration 64340/ 125429 | consumed samples: 16471040 | consumed tokens: 33732689920 | elapsed time per iteration (s): 1.07 | learning rate: 1.077E-04 | global batch size: 256 | lm loss: 1.987315E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 239.097 | TFLOPs: 39.51 | 15: iteration 64350/ 125429 | consumed samples: 16473600 | consumed tokens: 33737932800 | elapsed time per iteration (s): 1.05 | learning rate: 1.077E-04 | global batch size: 256 | lm loss: 1.966370E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.346 | TFLOPs: 40.21 | 15: iteration 64360/ 125429 | consumed samples: 16476160 | consumed tokens: 33743175680 | elapsed time per iteration (s): 1.05 | learning rate: 1.077E-04 | global batch size: 256 | lm loss: 2.006293E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.685 | TFLOPs: 40.44 | 15: iteration 64370/ 125429 | consumed samples: 16478720 | consumed tokens: 33748418560 | elapsed time per iteration (s): 1.07 | learning rate: 1.077E-04 | global batch size: 256 | lm loss: 1.987193E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.331 | TFLOPs: 39.72 | 15: iteration 64380/ 125429 | consumed samples: 16481280 | consumed tokens: 33753661440 | elapsed time per iteration (s): 1.03 | learning rate: 1.076E-04 | global batch size: 256 | lm loss: 1.974260E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.919 | TFLOPs: 40.97 | 15: iteration 64390/ 125429 | consumed samples: 16483840 | consumed tokens: 33758904320 | elapsed time per iteration (s): 1.07 | learning rate: 1.076E-04 | global batch size: 256 | lm loss: 2.014059E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.045 | TFLOPs: 39.67 | 15: iteration 64400/ 125429 | consumed samples: 16486400 | consumed tokens: 33764147200 | elapsed time per iteration (s): 1.03 | learning rate: 1.076E-04 | global batch 
size: 256 | lm loss: 1.975275E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.722 | TFLOPs: 40.94 | 15: iteration 64410/ 125429 | consumed samples: 16488960 | consumed tokens: 33769390080 | elapsed time per iteration (s): 1.05 | learning rate: 1.076E-04 | global batch size: 256 | lm loss: 1.955433E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.754 | TFLOPs: 40.45 | 15: iteration 64420/ 125429 | consumed samples: 16491520 | consumed tokens: 33774632960 | elapsed time per iteration (s): 1.05 | learning rate: 1.075E-04 | global batch size: 256 | lm loss: 2.008479E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.438 | TFLOPs: 40.40 | 15: iteration 64430/ 125429 | consumed samples: 16494080 | consumed tokens: 33779875840 | elapsed time per iteration (s): 1.06 | learning rate: 1.075E-04 | global batch size: 256 | lm loss: 1.972717E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.577 | TFLOPs: 40.09 | 15: iteration 64440/ 125429 | consumed samples: 16496640 | consumed tokens: 33785118720 | elapsed time per iteration (s): 1.05 | learning rate: 1.075E-04 | global batch size: 256 | lm loss: 1.967834E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.200 | TFLOPs: 40.36 | 15: iteration 64450/ 125429 | consumed samples: 16499200 | consumed tokens: 33790361600 | elapsed time per iteration (s): 1.02 | learning rate: 1.075E-04 | global batch size: 256 | lm loss: 1.965359E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.332 | TFLOPs: 41.37 | 15: iteration 64460/ 125429 | consumed samples: 16501760 | 
consumed tokens: 33795604480 | elapsed time per iteration (s): 1.03 | learning rate: 1.075E-04 | global batch size: 256 | lm loss: 1.971660E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.084 | TFLOPs: 41.16 | 15: iteration 64470/ 125429 | consumed samples: 16504320 | consumed tokens: 33800847360 | elapsed time per iteration (s): 1.03 | learning rate: 1.074E-04 | global batch size: 256 | lm loss: 1.992060E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.261 | TFLOPs: 41.03 | 15: iteration 64480/ 125429 | consumed samples: 16506880 | consumed tokens: 33806090240 | elapsed time per iteration (s): 1.05 | learning rate: 1.074E-04 | global batch size: 256 | lm loss: 1.975371E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.373 | TFLOPs: 40.38 | 15: iteration 64490/ 125429 | consumed samples: 16509440 | consumed tokens: 33811333120 | elapsed time per iteration (s): 1.04 | learning rate: 1.074E-04 | global batch size: 256 | lm loss: 1.994388E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.745 | TFLOPs: 40.78 | 15: iteration 64500/ 125429 | consumed samples: 16512000 | consumed tokens: 33816576000 | elapsed time per iteration (s): 1.04 | learning rate: 1.074E-04 | global batch size: 256 | lm loss: 1.974330E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.164 | TFLOPs: 40.68 | 15: iteration 64510/ 125429 | consumed samples: 16514560 | consumed tokens: 33821818880 | elapsed time per iteration (s): 1.04 | learning rate: 1.073E-04 | global batch size: 256 | lm loss: 1.980628E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 245.569 | TFLOPs: 40.58 | 15: iteration 64520/ 125429 | consumed samples: 16517120 | consumed tokens: 33827061760 | elapsed time per iteration (s): 1.02 | learning rate: 1.073E-04 | global batch size: 256 | lm loss: 1.995201E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.615 | TFLOPs: 41.58 | 15: iteration 64530/ 125429 | consumed samples: 16519680 | consumed tokens: 33832304640 | elapsed time per iteration (s): 1.04 | learning rate: 1.073E-04 | global batch size: 256 | lm loss: 1.968721E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.714 | TFLOPs: 40.77 | 15: iteration 64540/ 125429 | consumed samples: 16522240 | consumed tokens: 33837547520 | elapsed time per iteration (s): 1.05 | learning rate: 1.073E-04 | global batch size: 256 | lm loss: 2.008381E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.268 | TFLOPs: 40.20 | 15: iteration 64550/ 125429 | consumed samples: 16524800 | consumed tokens: 33842790400 | elapsed time per iteration (s): 1.03 | learning rate: 1.072E-04 | global batch size: 256 | lm loss: 1.979293E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.006 | TFLOPs: 41.15 | 15: iteration 64560/ 125429 | consumed samples: 16527360 | consumed tokens: 33848033280 | elapsed time per iteration (s): 1.08 | learning rate: 1.072E-04 | global batch size: 256 | lm loss: 1.959966E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.190 | TFLOPs: 39.20 | 15: iteration 64570/ 125429 | consumed samples: 16529920 | consumed tokens: 33853276160 | elapsed time per iteration (s): 1.03 | learning rate: 1.072E-04 | global batch size: 256 | lm loss: 
2.000887E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.789 | TFLOPs: 41.11 | 15: iteration 64580/ 125429 | consumed samples: 16532480 | consumed tokens: 33858519040 | elapsed time per iteration (s): 1.04 | learning rate: 1.072E-04 | global batch size: 256 | lm loss: 1.982484E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.015 | TFLOPs: 40.82 | 15: iteration 64590/ 125429 | consumed samples: 16535040 | consumed tokens: 33863761920 | elapsed time per iteration (s): 1.05 | learning rate: 1.072E-04 | global batch size: 256 | lm loss: 1.939095E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.857 | TFLOPs: 40.30 | 15: iteration 64600/ 125429 | consumed samples: 16537600 | consumed tokens: 33869004800 | elapsed time per iteration (s): 1.05 | learning rate: 1.071E-04 | global batch size: 256 | lm loss: 1.969188E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.704 | TFLOPs: 40.11 | 15: iteration 64610/ 125429 | consumed samples: 16540160 | consumed tokens: 33874247680 | elapsed time per iteration (s): 1.04 | learning rate: 1.071E-04 | global batch size: 256 | lm loss: 1.966035E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.803 | TFLOPs: 40.62 | 15: iteration 64620/ 125429 | consumed samples: 16542720 | consumed tokens: 33879490560 | elapsed time per iteration (s): 1.04 | learning rate: 1.071E-04 | global batch size: 256 | lm loss: 1.956164E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.645 | TFLOPs: 40.76 | 15: iteration 64630/ 125429 | consumed samples: 16545280 | consumed tokens: 
33884733440 | elapsed time per iteration (s): 1.05 | learning rate: 1.071E-04 | global batch size: 256 | lm loss: 1.975999E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.007 | TFLOPs: 40.32 | 15: iteration 64640/ 125429 | consumed samples: 16547840 | consumed tokens: 33889976320 | elapsed time per iteration (s): 1.04 | learning rate: 1.070E-04 | global batch size: 256 | lm loss: 1.979321E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.237 | TFLOPs: 40.69 | 15: iteration 64650/ 125429 | consumed samples: 16550400 | consumed tokens: 33895219200 | elapsed time per iteration (s): 1.02 | learning rate: 1.070E-04 | global batch size: 256 | lm loss: 1.998253E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.671 | TFLOPs: 41.43 | 15: iteration 64660/ 125429 | consumed samples: 16552960 | consumed tokens: 33900462080 | elapsed time per iteration (s): 1.08 | learning rate: 1.070E-04 | global batch size: 256 | lm loss: 1.988995E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.676 | TFLOPs: 39.28 | 15: iteration 64670/ 125429 | consumed samples: 16555520 | consumed tokens: 33905704960 | elapsed time per iteration (s): 1.05 | learning rate: 1.070E-04 | global batch size: 256 | lm loss: 1.945169E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.956 | TFLOPs: 40.48 | 15: iteration 64680/ 125429 | consumed samples: 16558080 | consumed tokens: 33910947840 | elapsed time per iteration (s): 1.15 | learning rate: 1.070E-04 | global batch size: 256 | lm loss: 1.971962E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 222.784 | TFLOPs: 36.82 | 15: iteration 64690/ 125429 | consumed samples: 16560640 | consumed tokens: 33916190720 | elapsed time per iteration (s): 1.04 | learning rate: 1.069E-04 | global batch size: 256 | lm loss: 1.988288E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.847 | TFLOPs: 40.63 | 15: iteration 64700/ 125429 | consumed samples: 16563200 | consumed tokens: 33921433600 | elapsed time per iteration (s): 1.03 | learning rate: 1.069E-04 | global batch size: 256 | lm loss: 2.004985E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.444 | TFLOPs: 40.89 | 15: iteration 64710/ 125429 | consumed samples: 16565760 | consumed tokens: 33926676480 | elapsed time per iteration (s): 1.06 | learning rate: 1.069E-04 | global batch size: 256 | lm loss: 1.992817E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.770 | TFLOPs: 39.79 | 15: iteration 64720/ 125429 | consumed samples: 16568320 | consumed tokens: 33931919360 | elapsed time per iteration (s): 1.04 | learning rate: 1.069E-04 | global batch size: 256 | lm loss: 1.945973E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.727 | TFLOPs: 40.61 | 15: iteration 64730/ 125429 | consumed samples: 16570880 | consumed tokens: 33937162240 | elapsed time per iteration (s): 1.04 | learning rate: 1.068E-04 | global batch size: 256 | lm loss: 1.970348E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.676 | TFLOPs: 40.77 | 15: iteration 64740/ 125429 | consumed samples: 16573440 | consumed tokens: 33942405120 | elapsed time per iteration (s): 1.02 | learning rate: 1.068E-04 | global batch size: 256 | lm loss: 1.965873E+00 | grad 
norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.360 | TFLOPs: 41.37 | 15: iteration 64750/ 125429 | consumed samples: 16576000 | consumed tokens: 33947648000 | elapsed time per iteration (s): 1.06 | learning rate: 1.068E-04 | global batch size: 256 | lm loss: 1.960023E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.676 | TFLOPs: 39.77 | 15: iteration 64760/ 125429 | consumed samples: 16578560 | consumed tokens: 33952890880 | elapsed time per iteration (s): 1.03 | learning rate: 1.068E-04 | global batch size: 256 | lm loss: 1.966781E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.885 | TFLOPs: 41.13 | 15: iteration 64770/ 125429 | consumed samples: 16581120 | consumed tokens: 33958133760 | elapsed time per iteration (s): 1.04 | learning rate: 1.067E-04 | global batch size: 256 | lm loss: 1.980153E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.548 | TFLOPs: 40.58 | 15: iteration 64780/ 125429 | consumed samples: 16583680 | consumed tokens: 33963376640 | elapsed time per iteration (s): 1.06 | learning rate: 1.067E-04 | global batch size: 256 | lm loss: 1.976735E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.017 | TFLOPs: 40.00 | 15: iteration 64790/ 125429 | consumed samples: 16586240 | consumed tokens: 33968619520 | elapsed time per iteration (s): 1.08 | learning rate: 1.067E-04 | global batch size: 256 | lm loss: 1.971981E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.880 | TFLOPs: 39.15 | 15: iteration 64800/ 125429 | consumed samples: 16588800 | consumed tokens: 33973862400 | elapsed time 
per iteration (s): 1.03 | learning rate: 1.067E-04 | global batch size: 256 | lm loss: 1.960068E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.668 | TFLOPs: 41.09 | 15: iteration 64810/ 125429 | consumed samples: 16591360 | consumed tokens: 33979105280 | elapsed time per iteration (s): 1.03 | learning rate: 1.067E-04 | global batch size: 256 | lm loss: 1.978370E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.738 | TFLOPs: 40.94 | 15: iteration 64820/ 125429 | consumed samples: 16593920 | consumed tokens: 33984348160 | elapsed time per iteration (s): 1.05 | learning rate: 1.066E-04 | global batch size: 256 | lm loss: 1.984053E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.952 | TFLOPs: 40.48 | 15: iteration 64830/ 125429 | consumed samples: 16596480 | consumed tokens: 33989591040 | elapsed time per iteration (s): 1.03 | learning rate: 1.066E-04 | global batch size: 256 | lm loss: 1.973419E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.229 | TFLOPs: 41.02 | 15: iteration 64840/ 125429 | consumed samples: 16599040 | consumed tokens: 33994833920 | elapsed time per iteration (s): 1.05 | learning rate: 1.066E-04 | global batch size: 256 | lm loss: 1.942777E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.078 | TFLOPs: 40.17 | 15: iteration 64850/ 125429 | consumed samples: 16601600 | consumed tokens: 34000076800 | elapsed time per iteration (s): 1.05 | learning rate: 1.066E-04 | global batch size: 256 | lm loss: 1.979075E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.104 | TFLOPs: 
40.17 | 15: iteration 64860/ 125429 | consumed samples: 16604160 | consumed tokens: 34005319680 | elapsed time per iteration (s): 1.05 | learning rate: 1.065E-04 | global batch size: 256 | lm loss: 1.970569E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.822 | TFLOPs: 40.46 | 15: iteration 64870/ 125429 | consumed samples: 16606720 | consumed tokens: 34010562560 | elapsed time per iteration (s): 1.07 | learning rate: 1.065E-04 | global batch size: 256 | lm loss: 1.995092E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.753 | TFLOPs: 39.46 | 15: iteration 64880/ 125429 | consumed samples: 16609280 | consumed tokens: 34015805440 | elapsed time per iteration (s): 1.03 | learning rate: 1.065E-04 | global batch size: 256 | lm loss: 1.954622E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.156 | TFLOPs: 41.01 | 15: iteration 64890/ 125429 | consumed samples: 16611840 | consumed tokens: 34021048320 | elapsed time per iteration (s): 1.09 | learning rate: 1.065E-04 | global batch size: 256 | lm loss: 1.993083E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.200 | TFLOPs: 38.70 | 15: iteration 64900/ 125429 | consumed samples: 16614400 | consumed tokens: 34026291200 | elapsed time per iteration (s): 1.03 | learning rate: 1.065E-04 | global batch size: 256 | lm loss: 1.951407E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.899 | TFLOPs: 40.97 | 15: iteration 64910/ 125429 | consumed samples: 16616960 | consumed tokens: 34031534080 | elapsed time per iteration (s): 1.04 | learning rate: 1.064E-04 | global batch size: 256 | lm loss: 1.992137E+00 | grad norm: 0.133 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.153 | TFLOPs: 40.84 | 15: iteration 64920/ 125429 | consumed samples: 16619520 | consumed tokens: 34036776960 | elapsed time per iteration (s): 1.04 | learning rate: 1.064E-04 | global batch size: 256 | lm loss: 1.984838E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.325 | TFLOPs: 40.87 | 15: iteration 64930/ 125429 | consumed samples: 16622080 | consumed tokens: 34042019840 | elapsed time per iteration (s): 1.07 | learning rate: 1.064E-04 | global batch size: 256 | lm loss: 1.963320E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.798 | TFLOPs: 39.63 | 15: iteration 64940/ 125429 | consumed samples: 16624640 | consumed tokens: 34047262720 | elapsed time per iteration (s): 1.02 | learning rate: 1.064E-04 | global batch size: 256 | lm loss: 1.993905E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.829 | TFLOPs: 41.45 | 15: iteration 64950/ 125429 | consumed samples: 16627200 | consumed tokens: 34052505600 | elapsed time per iteration (s): 1.03 | learning rate: 1.063E-04 | global batch size: 256 | lm loss: 1.985112E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.814 | TFLOPs: 40.95 | 15: iteration 64960/ 125429 | consumed samples: 16629760 | consumed tokens: 34057748480 | elapsed time per iteration (s): 1.02 | learning rate: 1.063E-04 | global batch size: 256 | lm loss: 1.958843E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.301 | TFLOPs: 41.36 | 15: iteration 64970/ 125429 | consumed samples: 16632320 | consumed tokens: 34062991360 | elapsed time per iteration (s): 1.03 | 
learning rate: 1.063E-04 | global batch size: 256 | lm loss: 2.004248E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.096 | TFLOPs: 41.00 | 15: iteration 64980/ 125429 | consumed samples: 16634880 | consumed tokens: 34068234240 | elapsed time per iteration (s): 1.02 | learning rate: 1.063E-04 | global batch size: 256 | lm loss: 1.948697E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.616 | TFLOPs: 41.58 | 15: iteration 64990/ 125429 | consumed samples: 16637440 | consumed tokens: 34073477120 | elapsed time per iteration (s): 1.06 | learning rate: 1.062E-04 | global batch size: 256 | lm loss: 1.997473E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.251 | TFLOPs: 39.87 | 15: iteration 65000/ 125429 | consumed samples: 16640000 | consumed tokens: 34078720000 | elapsed time per iteration (s): 1.03 | learning rate: 1.062E-04 | global batch size: 256 | lm loss: 1.957758E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.400 | TFLOPs: 41.22 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 65000 | lm loss value: 1.889791E+00 | lm loss PPL: 6.617989E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 65000 to checkpoints_1b5 0: [2022-11-26 15:19:59,396] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step65000 is begin to save! 0: [2022-11-26 15:19:59,403] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 15:19:59,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_01-model_00-model_states.pt. 0: [2022-11-26 15:19:59,640] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_03-model_00-model_states.pt... 0: [2022-11-26 15:19:59,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_03-model_00-model_states.pt. 0: [2022-11-26 15:19:59,741] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_04-model_00-model_states.pt... 0: [2022-11-26 15:19:59,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_04-model_00-model_states.pt. 0: [2022-11-26 15:19:59,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_05-model_00-model_states.pt... 0: [2022-11-26 15:19:59,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_05-model_00-model_states.pt. 0: [2022-11-26 15:19:59,951] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_06-model_00-model_states.pt... 0: [2022-11-26 15:20:00,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_06-model_00-model_states.pt. 0: [2022-11-26 15:20:00,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_07-model_00-model_states.pt... 0: [2022-11-26 15:20:00,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_07-model_00-model_states.pt. 0: [2022-11-26 15:20:00,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 15:20:00,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_08-model_00-model_states.pt. 0: [2022-11-26 15:20:00,266] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_09-model_00-model_states.pt... 0: [2022-11-26 15:20:00,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_09-model_00-model_states.pt. 0: [2022-11-26 15:20:00,372] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_10-model_00-model_states.pt... 0: [2022-11-26 15:20:00,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_10-model_00-model_states.pt. 0: [2022-11-26 15:20:00,474] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_11-model_00-model_states.pt... 0: [2022-11-26 15:20:00,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_11-model_00-model_states.pt. 0: [2022-11-26 15:20:00,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_12-model_00-model_states.pt... 0: [2022-11-26 15:20:00,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_12-model_00-model_states.pt. 0: [2022-11-26 15:20:00,689] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_13-model_00-model_states.pt... 0: [2022-11-26 15:20:00,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_13-model_00-model_states.pt. 0: [2022-11-26 15:20:00,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 15:20:00,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_14-model_00-model_states.pt. 0: [2022-11-26 15:20:00,899] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_15-model_00-model_states.pt... 0: [2022-11-26 15:20:00,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_15-model_00-model_states.pt. 0: [2022-11-26 15:20:01,000] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_16-model_00-model_states.pt... 0: [2022-11-26 15:20:01,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_16-model_00-model_states.pt. 0: [2022-11-26 15:20:01,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_17-model_00-model_states.pt... 0: [2022-11-26 15:20:01,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_17-model_00-model_states.pt. 0: [2022-11-26 15:20:01,207] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_18-model_00-model_states.pt... 0: [2022-11-26 15:20:01,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_18-model_00-model_states.pt. 0: [2022-11-26 15:20:01,312] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_19-model_00-model_states.pt... 0: [2022-11-26 15:20:01,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_19-model_00-model_states.pt. 0: [2022-11-26 15:20:01,419] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 15:20:01,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_20-model_00-model_states.pt. 0: [2022-11-26 15:20:01,522] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_21-model_00-model_states.pt... 0: [2022-11-26 15:20:01,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_21-model_00-model_states.pt. 0: [2022-11-26 15:20:01,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_22-model_00-model_states.pt... 0: [2022-11-26 15:20:01,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_22-model_00-model_states.pt. 0: [2022-11-26 15:20:01,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_23-model_00-model_states.pt... 0: [2022-11-26 15:20:01,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_23-model_00-model_states.pt. 0: [2022-11-26 15:20:01,838] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_24-model_00-model_states.pt... 0: [2022-11-26 15:20:01,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_24-model_00-model_states.pt. 0: [2022-11-26 15:20:01,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_25-model_00-model_states.pt... 0: [2022-11-26 15:20:02,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_25-model_00-model_states.pt. 0: [2022-11-26 15:20:02,049] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 15:20:02,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_26-model_00-model_states.pt. 0: [2022-11-26 15:20:02,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_27-model_00-model_states.pt... 0: [2022-11-26 15:20:02,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_27-model_00-model_states.pt. 0: [2022-11-26 15:20:02,252] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_28-model_00-model_states.pt... 0: [2022-11-26 15:20:02,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_28-model_00-model_states.pt. 0: [2022-11-26 15:20:02,356] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_29-model_00-model_states.pt... 0: [2022-11-26 15:20:02,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_29-model_00-model_states.pt. 0: [2022-11-26 15:20:02,464] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_30-model_00-model_states.pt... 0: [2022-11-26 15:20:02,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_30-model_00-model_states.pt. 0: [2022-11-26 15:20:02,566] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/layer_32-model_00-model_states.pt... 0: [2022-11-26 15:20:02,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 15:20:02,571] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step65000/mp_rank_00_model_states.pt 0: [2022-11-26 15:20:02,571] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/mp_rank_00_model_states.pt... 0: [2022-11-26 15:20:02,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/mp_rank_00_model_states.pt. 0: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
12: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
1: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
5: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
9: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
7: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
2: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:20:02,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:20:02,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:20:02,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:20:02,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:20:02,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:20:02,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
5: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
2: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:20:02,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:20:02,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:20:02,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:20:02,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step65000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:20:02,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 15:20:02,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 15:20:02,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 8: [2022-11-26 15:20:02,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:20:02,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 15:20:02,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 11: [2022-11-26 15:20:02,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:20:02,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:20:02,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 15:20:02,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 9: [2022-11-26 15:20:02,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:20:02,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 15:20:02,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 15:20:02,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 15:20:02,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 9: [2022-11-26 15:20:02,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 0: [2022-11-26 15:20:02,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:20:02,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:20:02,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 15:20:02,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 3: [2022-11-26 15:20:02,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:20:02,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 15:20:02,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 7: [2022-11-26 15:20:02,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 15:20:02,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 15:20:02,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 15: [2022-11-26 15:20:02,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:20:02,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 15:20:02,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 10: [2022-11-26 15:20:02,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:20:02,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 15:20:02,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 3: [2022-11-26 15:20:02,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:20:02,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 15:20:02,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 3: [2022-11-26 15:20:02,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 15:20:02,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:20:02,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 15:20:02,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 15:20:02,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 3: [2022-11-26 15:20:02,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 6: [2022-11-26 15:20:02,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:20:02,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 15:20:02,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 7: [2022-11-26 15:20:02,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:20:02,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 15:20:02,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 5: [2022-11-26 15:20:02,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 15:20:02,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 15:20:02,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 8: [2022-11-26 15:20:02,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:20:02,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 15:20:02,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 8: [2022-11-26 15:20:02,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:20:02,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 15:20:02,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 8: [2022-11-26 15:20:02,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:20:02,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:20:02,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 0: [2022-11-26 15:20:02,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
8: [2022-11-26 15:20:02,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 2: [2022-11-26 15:20:02,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:20:02,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 2: [2022-11-26 15:20:02,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 15:20:02,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 0: [2022-11-26 15:20:02,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 2: [2022-11-26 15:20:02,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 2: [2022-11-26 15:20:02,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 14: [2022-11-26 15:20:02,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:20:02,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
14: [2022-11-26 15:20:02,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 2: [2022-11-26 15:20:02,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 14: [2022-11-26 15:20:02,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 14: [2022-11-26 15:20:02,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:20:02,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 15:20:02,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 2: [2022-11-26 15:20:02,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 14: [2022-11-26 15:20:02,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:20:02,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 15:20:02,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 12: [2022-11-26 15:20:02,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 15:20:02,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 15:20:02,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 15: [2022-11-26 15:20:02,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:20:02,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 15:20:02,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 5: [2022-11-26 15:20:02,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:20:02,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 15:20:02,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 15: [2022-11-26 15:20:02,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:20:02,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:20:02,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
12: [2022-11-26 15:20:02,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 6: [2022-11-26 15:20:02,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 12: [2022-11-26 15:20:02,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 6: [2022-11-26 15:20:02,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 3: [2022-11-26 15:20:02,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:20:02,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 15:20:02,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 3: [2022-11-26 15:20:02,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:20:02,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 15:20:02,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 11: [2022-11-26 15:20:02,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 15:20:02,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
12: [2022-11-26 15:20:02,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:20:02,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 15:20:02,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 11: [2022-11-26 15:20:02,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:20:02,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:20:02,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 15:20:02,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 2: [2022-11-26 15:20:02,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:20:02,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 15:20:02,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 7: [2022-11-26 15:20:02,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 15:20:02,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 15:20:02,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 14: [2022-11-26 15:20:02,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:20:02,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 15:20:02,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 7: [2022-11-26 15:20:02,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:20:02,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 15:20:02,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 14: [2022-11-26 15:20:02,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:20:02,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 15:20:02,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
11: [2022-11-26 15:20:02,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 15:20:02,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 8: [2022-11-26 15:20:02,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:20:02,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:20:02,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 15:20:02,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 15:20:02,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 8: [2022-11-26 15:20:02,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 11: [2022-11-26 15:20:02,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:20:02,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 15:20:02,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 5: [2022-11-26 15:20:02,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 15:20:02,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 15:20:02,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 8: [2022-11-26 15:20:02,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:20:02,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:20:02,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 15:20:02,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 15:20:02,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 8: [2022-11-26 15:20:02,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 14: [2022-11-26 15:20:02,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:20:02,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:20:02,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
10: [2022-11-26 15:20:02,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 7: [2022-11-26 15:20:02,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 10: [2022-11-26 15:20:02,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 7: [2022-11-26 15:20:02,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 10: [2022-11-26 15:20:02,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:20:02,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 15:20:02,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 7: [2022-11-26 15:20:02,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:20:02,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 15:20:02,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 0: [2022-11-26 15:20:02,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 15:20:02,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 15:20:02,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 0: [2022-11-26 15:20:02,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:20:02,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 15:20:02,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 0: [2022-11-26 15:20:02,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:20:02,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 15:20:02,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 0: [2022-11-26 15:20:02,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:20:02,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:20:02,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
6: [2022-11-26 15:20:02,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 9: [2022-11-26 15:20:02,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 14: [2022-11-26 15:20:02,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 15:20:02,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 14: [2022-11-26 15:20:02,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:20:02,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 6: [2022-11-26 15:20:02,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 0: [2022-11-26 15:20:02,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 9: [2022-11-26 15:20:02,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 14: [2022-11-26 15:20:02,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 0: [2022-11-26 15:20:02,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 1: [2022-11-26 15:20:02,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 15:20:02,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:20:02,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:20:02,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 15:20:02,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 15:20:02,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 15:20:02,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 1: [2022-11-26 15:20:02,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 1: [2022-11-26 15:20:02,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 13: [2022-11-26 15:20:02,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:20:02,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:20:02,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 15:20:02,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 15:20:02,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 15:20:02,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 15:20:02,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 13: [2022-11-26 15:20:02,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 13: [2022-11-26 15:20:02,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 0: [2022-11-26 15:20:02,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:20:02,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 15:20:02,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 13: [2022-11-26 15:20:02,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:20:02,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 15:20:02,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 15:20:02,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 15:20:02,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 13: [2022-11-26 15:20:02,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 5: [2022-11-26 15:20:02,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:20:02,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 15:20:02,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 5: [2022-11-26 15:20:02,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:20:02,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 15:20:02,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 9: [2022-11-26 15:20:02,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:20:02,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
7: [2022-11-26 15:20:02,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:20:02,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 12: [2022-11-26 15:20:02,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:20:02,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 15:20:02,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 15:20:02,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:20:02,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 12: [2022-11-26 15:20:02,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 9: [2022-11-26 15:20:02,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:20:02,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 9: [2022-11-26 15:20:02,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
9: [2022-11-26 15:20:02,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 12: [2022-11-26 15:20:02,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 9: [2022-11-26 15:20:02,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 9: [2022-11-26 15:20:02,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 15:20:02,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 7: [2022-11-26 15:20:02,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:20:02,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 15:20:02,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 11: [2022-11-26 15:20:02,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:20:02,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:20:02,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 15:20:02,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 12: [2022-11-26 15:20:02,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 0: [2022-11-26 15:20:02,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 12: [2022-11-26 15:20:02,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 2: [2022-11-26 15:20:02,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:20:02,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 15:20:02,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 3: [2022-11-26 15:20:02,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:20:02,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 15:20:02,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 15: [2022-11-26 15:20:02,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 15:20:02,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
1: [2022-11-26 15:20:02,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:20:02,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:20:02,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 1: [2022-11-26 15:20:02,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 15: [2022-11-26 15:20:02,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 15: [2022-11-26 15:20:02,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:20:02,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 15:20:02,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 1: [2022-11-26 15:20:02,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 15: [2022-11-26 15:20:02,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:20:02,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 15:20:02,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
15: [2022-11-26 15:20:02,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:20:02,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 15:20:02,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 15: [2022-11-26 15:20:02,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:20:02,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 15:20:02,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 3: [2022-11-26 15:20:02,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:20:02,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 15:20:02,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 1: [2022-11-26 15:20:02,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:20:02,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 15:20:02,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 15:20:02,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 14: [2022-11-26 15:20:02,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:20:02,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 15:20:02,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 11: [2022-11-26 15:20:02,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 15:20:02,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 11: [2022-11-26 15:20:02,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:20:02,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 15:20:02,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 11: [2022-11-26 15:20:02,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 15:20:02,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 15:20:02,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 11: [2022-11-26 15:20:02,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:20:02,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 15:20:02,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 6: [2022-11-26 15:20:02,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:20:02,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 15:20:02,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 4: [2022-11-26 15:20:02,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:20:02,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 15:20:02,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 10: [2022-11-26 15:20:02,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
4: [2022-11-26 15:20:02,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:20:02,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:20:02,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 10: [2022-11-26 15:20:02,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 4: [2022-11-26 15:20:02,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 15:20:02,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 4: [2022-11-26 15:20:02,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 10: [2022-11-26 15:20:02,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 4: [2022-11-26 15:20:02,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:20:02,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:20:02,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 15:20:02,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 15:20:02,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 15:20:02,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 15:20:02,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 4: [2022-11-26 15:20:02,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 4: [2022-11-26 15:20:02,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 4: [2022-11-26 15:20:02,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:20:02,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 15:20:02,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 12: [2022-11-26 15:20:02,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:20:02,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 15:20:02,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
12: [2022-11-26 15:20:02,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:20:02,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 15:20:02,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 12: [2022-11-26 15:20:02,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:20:02,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 15:20:02,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 6: [2022-11-26 15:20:02,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:20:02,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 15:20:02,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 6: [2022-11-26 15:20:02,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:20:02,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 15:20:02,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
6: [2022-11-26 15:20:02,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:20:02,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 15:20:02,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 6: [2022-11-26 15:20:02,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:20:02,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 15:20:02,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 10: [2022-11-26 15:20:02,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:20:02,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 15:20:02,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 10: [2022-11-26 15:20:02,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:20:02,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 15:20:02,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 15:20:02,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:20:02,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 15:20:02,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 10: [2022-11-26 15:20:02,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 10: [2022-11-26 15:20:02,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 15:20:02,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 1: [2022-11-26 15:20:02,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 15:20:02,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 1: [2022-11-26 15:20:02,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:20:02,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 15:20:02,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 
1: [2022-11-26 15:20:02,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:20:02,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 15:20:02,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 1: [2022-11-26 15:20:02,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:20:02,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 15:20:02,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 0: [2022-11-26 15:20:02,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 13: [2022-11-26 15:20:02,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:20:02,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 13: [2022-11-26 15:20:02,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 15:20:02,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 13: [2022-11-26 15:20:02,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
5: [2022-11-26 15:20:02,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:20:02,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 15:20:02,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 5: [2022-11-26 15:20:02,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:20:02,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 15:20:02,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 4: [2022-11-26 15:20:02,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:20:02,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 15:20:02,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 13: [2022-11-26 15:20:02,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 15:20:02,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 13: [2022-11-26 15:20:02,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 15:20:02,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 15:20:02,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 11: [2022-11-26 15:20:02,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:20:02,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step65000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 15:20:02,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step65000 is ready now! 0: successfully saved checkpoint at iteration 65000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3499.92 15: iteration 65010/ 125429 | consumed samples: 16642560 | consumed tokens: 34083962880 | elapsed time per iteration (s): 1.44 | learning rate: 1.062E-04 | global batch size: 256 | lm loss: 1.962312E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 178.229 | TFLOPs: 29.45 | 15: iteration 65020/ 125429 | consumed samples: 16645120 | consumed tokens: 34089205760 | elapsed time per iteration (s): 1.08 | learning rate: 1.062E-04 | global batch size: 256 | lm loss: 1.984648E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.214 | TFLOPs: 39.20 | 15: iteration 65030/ 125429 | consumed samples: 16647680 | consumed tokens: 34094448640 | elapsed time per iteration (s): 1.05 | learning rate: 1.062E-04 | global batch size: 256 | lm loss: 1.958413E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.302 | TFLOPs: 40.21 | 15: iteration 65040/ 125429 | consumed 
samples: 16650240 | consumed tokens: 34099691520 | elapsed time per iteration (s): 1.03 | learning rate: 1.061E-04 | global batch size: 256 | lm loss: 1.988337E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.627 | TFLOPs: 41.25 | 15: iteration 65050/ 125429 | consumed samples: 16652800 | consumed tokens: 34104934400 | elapsed time per iteration (s): 1.08 | learning rate: 1.061E-04 | global batch size: 256 | lm loss: 1.996403E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.292 | TFLOPs: 39.21 | 15: iteration 65060/ 125429 | consumed samples: 16655360 | consumed tokens: 34110177280 | elapsed time per iteration (s): 1.06 | learning rate: 1.061E-04 | global batch size: 256 | lm loss: 1.988054E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.377 | TFLOPs: 40.05 | 15: iteration 65070/ 125429 | consumed samples: 16657920 | consumed tokens: 34115420160 | elapsed time per iteration (s): 1.04 | learning rate: 1.061E-04 | global batch size: 256 | lm loss: 1.983764E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.154 | TFLOPs: 40.51 | 15: iteration 65080/ 125429 | consumed samples: 16660480 | consumed tokens: 34120663040 | elapsed time per iteration (s): 1.03 | learning rate: 1.060E-04 | global batch size: 256 | lm loss: 1.994810E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.976 | TFLOPs: 41.15 | 15: iteration 65090/ 125429 | consumed samples: 16663040 | consumed tokens: 34125905920 | elapsed time per iteration (s): 1.05 | learning rate: 1.060E-04 | global batch size: 256 | lm loss: 1.985468E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number 
of nan iterations: 0 | samples per second: 243.043 | TFLOPs: 40.16 | 15: iteration 65100/ 125429 | consumed samples: 16665600 | consumed tokens: 34131148800 | elapsed time per iteration (s): 1.06 | learning rate: 1.060E-04 | global batch size: 256 | lm loss: 1.975022E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.424 | TFLOPs: 39.90 | 15: iteration 65110/ 125429 | consumed samples: 16668160 | consumed tokens: 34136391680 | elapsed time per iteration (s): 1.03 | learning rate: 1.060E-04 | global batch size: 256 | lm loss: 1.959275E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.602 | TFLOPs: 40.92 | 15: iteration 65120/ 125429 | consumed samples: 16670720 | consumed tokens: 34141634560 | elapsed time per iteration (s): 1.05 | learning rate: 1.060E-04 | global batch size: 256 | lm loss: 1.932016E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.517 | TFLOPs: 40.41 | 15: iteration 65130/ 125429 | consumed samples: 16673280 | consumed tokens: 34146877440 | elapsed time per iteration (s): 1.03 | learning rate: 1.059E-04 | global batch size: 256 | lm loss: 1.992235E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.929 | TFLOPs: 41.14 | 15: iteration 65140/ 125429 | consumed samples: 16675840 | consumed tokens: 34152120320 | elapsed time per iteration (s): 1.05 | learning rate: 1.059E-04 | global batch size: 256 | lm loss: 2.009150E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.221 | TFLOPs: 40.19 | 15: iteration 65150/ 125429 | consumed samples: 16678400 | consumed tokens: 34157363200 | elapsed time per iteration (s): 1.05 | learning rate: 1.059E-04 | global batch size: 
256 | lm loss: 1.990329E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.970 | TFLOPs: 40.48 | 15: iteration 65160/ 125429 | consumed samples: 16680960 | consumed tokens: 34162606080 | elapsed time per iteration (s): 1.03 | learning rate: 1.059E-04 | global batch size: 256 | lm loss: 1.980891E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.831 | TFLOPs: 41.12 | 15: iteration 65170/ 125429 | consumed samples: 16683520 | consumed tokens: 34167848960 | elapsed time per iteration (s): 1.03 | learning rate: 1.058E-04 | global batch size: 256 | lm loss: 1.992250E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.062 | TFLOPs: 40.99 | 15: iteration 65180/ 125429 | consumed samples: 16686080 | consumed tokens: 34173091840 | elapsed time per iteration (s): 1.03 | learning rate: 1.058E-04 | global batch size: 256 | lm loss: 1.979335E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.570 | TFLOPs: 40.91 | 15: iteration 65190/ 125429 | consumed samples: 16688640 | consumed tokens: 34178334720 | elapsed time per iteration (s): 1.03 | learning rate: 1.058E-04 | global batch size: 256 | lm loss: 1.994928E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.713 | TFLOPs: 41.10 | 15: iteration 65200/ 125429 | consumed samples: 16691200 | consumed tokens: 34183577600 | elapsed time per iteration (s): 1.02 | learning rate: 1.058E-04 | global batch size: 256 | lm loss: 2.001353E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.711 | TFLOPs: 41.43 | 15: iteration 65210/ 125429 | consumed samples: 16693760 | consumed 
tokens: 34188820480 | elapsed time per iteration (s): 1.03 | learning rate: 1.057E-04 | global batch size: 256 | lm loss: 1.950407E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.566 | TFLOPs: 41.08 | 15: iteration 65220/ 125429 | consumed samples: 16696320 | consumed tokens: 34194063360 | elapsed time per iteration (s): 1.02 | learning rate: 1.057E-04 | global batch size: 256 | lm loss: 1.990891E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.999 | TFLOPs: 41.48 | 15: iteration 65230/ 125429 | consumed samples: 16698880 | consumed tokens: 34199306240 | elapsed time per iteration (s): 1.03 | learning rate: 1.057E-04 | global batch size: 256 | lm loss: 1.984189E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.690 | TFLOPs: 41.10 | 15: iteration 65240/ 125429 | consumed samples: 16701440 | consumed tokens: 34204549120 | elapsed time per iteration (s): 1.03 | learning rate: 1.057E-04 | global batch size: 256 | lm loss: 1.972266E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.917 | TFLOPs: 41.14 | 15: iteration 65250/ 125429 | consumed samples: 16704000 | consumed tokens: 34209792000 | elapsed time per iteration (s): 1.03 | learning rate: 1.057E-04 | global batch size: 256 | lm loss: 1.998340E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.545 | TFLOPs: 41.07 | 15: iteration 65260/ 125429 | consumed samples: 16706560 | consumed tokens: 34215034880 | elapsed time per iteration (s): 1.05 | learning rate: 1.056E-04 | global batch size: 256 | lm loss: 1.995030E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 243.697 | TFLOPs: 40.27 | 15: iteration 65270/ 125429 | consumed samples: 16709120 | consumed tokens: 34220277760 | elapsed time per iteration (s): 1.03 | learning rate: 1.056E-04 | global batch size: 256 | lm loss: 1.985416E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.239 | TFLOPs: 41.02 | 15: iteration 65280/ 125429 | consumed samples: 16711680 | consumed tokens: 34225520640 | elapsed time per iteration (s): 1.03 | learning rate: 1.056E-04 | global batch size: 256 | lm loss: 1.990105E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.640 | TFLOPs: 40.92 | 15: iteration 65290/ 125429 | consumed samples: 16714240 | consumed tokens: 34230763520 | elapsed time per iteration (s): 1.06 | learning rate: 1.056E-04 | global batch size: 256 | lm loss: 1.984443E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.893 | TFLOPs: 39.97 | 15: iteration 65300/ 125429 | consumed samples: 16716800 | consumed tokens: 34236006400 | elapsed time per iteration (s): 1.03 | learning rate: 1.055E-04 | global batch size: 256 | lm loss: 1.998634E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.637 | TFLOPs: 40.92 | 15: iteration 65310/ 125429 | consumed samples: 16719360 | consumed tokens: 34241249280 | elapsed time per iteration (s): 1.03 | learning rate: 1.055E-04 | global batch size: 256 | lm loss: 1.985646E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.419 | TFLOPs: 40.89 | 15: iteration 65320/ 125429 | consumed samples: 16721920 | consumed tokens: 34246492160 | elapsed time per iteration (s): 1.03 | learning rate: 1.055E-04 | global batch size: 256 | lm loss: 1.977099E+00 | 
grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.381 | TFLOPs: 40.88 | 15: iteration 65330/ 125429 | consumed samples: 16724480 | consumed tokens: 34251735040 | elapsed time per iteration (s): 1.03 | learning rate: 1.055E-04 | global batch size: 256 | lm loss: 1.970471E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.211 | TFLOPs: 41.18 | 15: iteration 65340/ 125429 | consumed samples: 16727040 | consumed tokens: 34256977920 | elapsed time per iteration (s): 1.05 | learning rate: 1.055E-04 | global batch size: 256 | lm loss: 1.970147E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.260 | TFLOPs: 40.37 | 15: iteration 65350/ 125429 | consumed samples: 16729600 | consumed tokens: 34262220800 | elapsed time per iteration (s): 1.03 | learning rate: 1.054E-04 | global batch size: 256 | lm loss: 1.966024E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.051 | TFLOPs: 40.99 | 15: iteration 65360/ 125429 | consumed samples: 16732160 | consumed tokens: 34267463680 | elapsed time per iteration (s): 1.08 | learning rate: 1.054E-04 | global batch size: 256 | lm loss: 1.968016E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.002 | TFLOPs: 39.33 | 15: iteration 65370/ 125429 | consumed samples: 16734720 | consumed tokens: 34272706560 | elapsed time per iteration (s): 1.04 | learning rate: 1.054E-04 | global batch size: 256 | lm loss: 1.992870E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.352 | TFLOPs: 40.55 | 15: iteration 65380/ 125429 | consumed samples: 16737280 | consumed tokens: 34277949440 | elapsed 
time per iteration (s): 1.03 | learning rate: 1.054E-04 | global batch size: 256 | lm loss: 1.994313E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.123 | TFLOPs: 41.17 | 15: iteration 65390/ 125429 | consumed samples: 16739840 | consumed tokens: 34283192320 | elapsed time per iteration (s): 1.02 | learning rate: 1.053E-04 | global batch size: 256 | lm loss: 1.956721E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.363 | TFLOPs: 41.54 | 15: iteration 65400/ 125429 | consumed samples: 16742400 | consumed tokens: 34288435200 | elapsed time per iteration (s): 1.04 | learning rate: 1.053E-04 | global batch size: 256 | lm loss: 1.969421E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.305 | TFLOPs: 40.70 | 15: iteration 65410/ 125429 | consumed samples: 16744960 | consumed tokens: 34293678080 | elapsed time per iteration (s): 1.02 | learning rate: 1.053E-04 | global batch size: 256 | lm loss: 1.971609E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.156 | TFLOPs: 41.34 | 15: iteration 65420/ 125429 | consumed samples: 16747520 | consumed tokens: 34298920960 | elapsed time per iteration (s): 1.05 | learning rate: 1.053E-04 | global batch size: 256 | lm loss: 1.996143E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.528 | TFLOPs: 40.41 | 15: iteration 65430/ 125429 | consumed samples: 16750080 | consumed tokens: 34304163840 | elapsed time per iteration (s): 1.05 | learning rate: 1.052E-04 | global batch size: 256 | lm loss: 1.982166E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.709 | TFLOPs: 
40.27 | 15: iteration 65440/ 125429 | consumed samples: 16752640 | consumed tokens: 34309406720 | elapsed time per iteration (s): 1.03 | learning rate: 1.052E-04 | global batch size: 256 | lm loss: 1.994451E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.948 | TFLOPs: 40.98 | 15: iteration 65450/ 125429 | consumed samples: 16755200 | consumed tokens: 34314649600 | elapsed time per iteration (s): 1.05 | learning rate: 1.052E-04 | global batch size: 256 | lm loss: 1.961459E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.394 | TFLOPs: 40.39 | 15: iteration 65460/ 125429 | consumed samples: 16757760 | consumed tokens: 34319892480 | elapsed time per iteration (s): 1.06 | learning rate: 1.052E-04 | global batch size: 256 | lm loss: 1.987853E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.501 | TFLOPs: 39.91 | 15: iteration 65470/ 125429 | consumed samples: 16760320 | consumed tokens: 34325135360 | elapsed time per iteration (s): 1.02 | learning rate: 1.052E-04 | global batch size: 256 | lm loss: 1.957555E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.456 | TFLOPs: 41.39 | 15: iteration 65480/ 125429 | consumed samples: 16762880 | consumed tokens: 34330378240 | elapsed time per iteration (s): 1.43 | learning rate: 1.051E-04 | global batch size: 256 | lm loss: 2.002294E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 179.631 | TFLOPs: 29.69 | 15: iteration 65490/ 125429 | consumed samples: 16765440 | consumed tokens: 34335621120 | elapsed time per iteration (s): 1.04 | learning rate: 1.051E-04 | global batch size: 256 | lm loss: 1.980568E+00 | grad norm: 0.138 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.855 | TFLOPs: 40.63 | 15: iteration 65500/ 125429 | consumed samples: 16768000 | consumed tokens: 34340864000 | elapsed time per iteration (s): 1.03 | learning rate: 1.051E-04 | global batch size: 256 | lm loss: 1.983355E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.885 | TFLOPs: 41.13 | 15: iteration 65510/ 125429 | consumed samples: 16770560 | consumed tokens: 34346106880 | elapsed time per iteration (s): 1.02 | learning rate: 1.051E-04 | global batch size: 256 | lm loss: 1.968144E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.978 | TFLOPs: 41.48 | 15: iteration 65520/ 125429 | consumed samples: 16773120 | consumed tokens: 34351349760 | elapsed time per iteration (s): 1.04 | learning rate: 1.050E-04 | global batch size: 256 | lm loss: 1.965583E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.821 | TFLOPs: 40.79 | 15: iteration 65530/ 125429 | consumed samples: 16775680 | consumed tokens: 34356592640 | elapsed time per iteration (s): 1.03 | learning rate: 1.050E-04 | global batch size: 256 | lm loss: 1.970475E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.214 | TFLOPs: 41.02 | 15: iteration 65540/ 125429 | consumed samples: 16778240 | consumed tokens: 34361835520 | elapsed time per iteration (s): 1.04 | learning rate: 1.050E-04 | global batch size: 256 | lm loss: 1.961644E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.543 | TFLOPs: 40.74 | 15: iteration 65550/ 125429 | consumed samples: 16780800 | consumed tokens: 34367078400 | elapsed time per iteration (s): 1.03 | 
learning rate: 1.050E-04 | global batch size: 256 | lm loss: 1.973431E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.981 | TFLOPs: 40.98 | 15: iteration 65560/ 125429 | consumed samples: 16783360 | consumed tokens: 34372321280 | elapsed time per iteration (s): 1.04 | learning rate: 1.050E-04 | global batch size: 256 | lm loss: 1.961861E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.763 | TFLOPs: 40.78 | 15: iteration 65570/ 125429 | consumed samples: 16785920 | consumed tokens: 34377564160 | elapsed time per iteration (s): 1.04 | learning rate: 1.049E-04 | global batch size: 256 | lm loss: 2.002019E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.348 | TFLOPs: 40.55 | 15: iteration 65580/ 125429 | consumed samples: 16788480 | consumed tokens: 34382807040 | elapsed time per iteration (s): 1.05 | learning rate: 1.049E-04 | global batch size: 256 | lm loss: 1.967418E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.267 | TFLOPs: 40.37 | 15: iteration 65590/ 125429 | consumed samples: 16791040 | consumed tokens: 34388049920 | elapsed time per iteration (s): 1.10 | learning rate: 1.049E-04 | global batch size: 256 | lm loss: 1.985973E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.912 | TFLOPs: 38.33 | 15: iteration 65600/ 125429 | consumed samples: 16793600 | consumed tokens: 34393292800 | elapsed time per iteration (s): 1.09 | learning rate: 1.049E-04 | global batch size: 256 | lm loss: 1.979492E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.489 | TFLOPs: 38.75 | 15: iteration 65610/ 
125429 | consumed samples: 16796160 | consumed tokens: 34398535680 | elapsed time per iteration (s): 1.05 | learning rate: 1.048E-04 | global batch size: 256 | lm loss: 1.954886E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.906 | TFLOPs: 40.14 | 15: iteration 65620/ 125429 | consumed samples: 16798720 | consumed tokens: 34403778560 | elapsed time per iteration (s): 1.03 | learning rate: 1.048E-04 | global batch size: 256 | lm loss: 1.994356E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.358 | TFLOPs: 41.21 | 15: iteration 65630/ 125429 | consumed samples: 16801280 | consumed tokens: 34409021440 | elapsed time per iteration (s): 1.03 | learning rate: 1.048E-04 | global batch size: 256 | lm loss: 1.950776E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.404 | TFLOPs: 40.89 | 15: iteration 65640/ 125429 | consumed samples: 16803840 | consumed tokens: 34414264320 | elapsed time per iteration (s): 1.04 | learning rate: 1.048E-04 | global batch size: 256 | lm loss: 1.963630E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.291 | TFLOPs: 40.54 | 15: iteration 65650/ 125429 | consumed samples: 16806400 | consumed tokens: 34419507200 | elapsed time per iteration (s): 1.04 | learning rate: 1.047E-04 | global batch size: 256 | lm loss: 1.984755E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.768 | TFLOPs: 40.78 | 15: iteration 65660/ 125429 | consumed samples: 16808960 | consumed tokens: 34424750080 | elapsed time per iteration (s): 1.04 | learning rate: 1.047E-04 | global batch size: 256 | lm loss: 1.968494E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 245.113 | TFLOPs: 40.51 | 15: iteration 65670/ 125429 | consumed samples: 16811520 | consumed tokens: 34429992960 | elapsed time per iteration (s): 1.02 | learning rate: 1.047E-04 | global batch size: 256 | lm loss: 1.966663E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.533 | TFLOPs: 41.40 | 15: iteration 65680/ 125429 | consumed samples: 16814080 | consumed tokens: 34435235840 | elapsed time per iteration (s): 1.04 | learning rate: 1.047E-04 | global batch size: 256 | lm loss: 1.979530E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.185 | TFLOPs: 40.52 | 15: iteration 65690/ 125429 | consumed samples: 16816640 | consumed tokens: 34440478720 | elapsed time per iteration (s): 1.06 | learning rate: 1.047E-04 | global batch size: 256 | lm loss: 1.973986E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.608 | TFLOPs: 40.09 | 15: iteration 65700/ 125429 | consumed samples: 16819200 | consumed tokens: 34445721600 | elapsed time per iteration (s): 1.06 | learning rate: 1.046E-04 | global batch size: 256 | lm loss: 1.971266E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.999 | TFLOPs: 39.99 | 15: iteration 65710/ 125429 | consumed samples: 16821760 | consumed tokens: 34450964480 | elapsed time per iteration (s): 1.05 | learning rate: 1.046E-04 | global batch size: 256 | lm loss: 1.950558E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.233 | TFLOPs: 40.20 | 15: iteration 65720/ 125429 | consumed samples: 16824320 | consumed tokens: 34456207360 | elapsed time per iteration (s): 1.02 | learning rate: 
1.046E-04 | global batch size: 256 | lm loss: 1.965302E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.890 | TFLOPs: 41.46 | 15: iteration 65730/ 125429 | consumed samples: 16826880 | consumed tokens: 34461450240 | elapsed time per iteration (s): 1.03 | learning rate: 1.046E-04 | global batch size: 256 | lm loss: 1.964113E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.375 | TFLOPs: 41.21 | 15: iteration 65740/ 125429 | consumed samples: 16829440 | consumed tokens: 34466693120 | elapsed time per iteration (s): 1.07 | learning rate: 1.045E-04 | global batch size: 256 | lm loss: 2.003237E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.158 | TFLOPs: 39.52 | 15: iteration 65750/ 125429 | consumed samples: 16832000 | consumed tokens: 34471936000 | elapsed time per iteration (s): 1.06 | learning rate: 1.045E-04 | global batch size: 256 | lm loss: 1.997301E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.513 | TFLOPs: 39.75 | 15: iteration 65760/ 125429 | consumed samples: 16834560 | consumed tokens: 34477178880 | elapsed time per iteration (s): 1.06 | learning rate: 1.045E-04 | global batch size: 256 | lm loss: 1.988587E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.858 | TFLOPs: 39.80 | 15: iteration 65770/ 125429 | consumed samples: 16837120 | consumed tokens: 34482421760 | elapsed time per iteration (s): 1.02 | learning rate: 1.045E-04 | global batch size: 256 | lm loss: 1.966806E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.678 | TFLOPs: 41.43 | 15: iteration 65780/ 125429 | 
consumed samples: 16839680 | consumed tokens: 34487664640 | elapsed time per iteration (s): 1.03 | learning rate: 1.045E-04 | global batch size: 256 | lm loss: 1.964047E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.263 | TFLOPs: 41.19 | 15: iteration 65790/ 125429 | consumed samples: 16842240 | consumed tokens: 34492907520 | elapsed time per iteration (s): 1.07 | learning rate: 1.044E-04 | global batch size: 256 | lm loss: 1.986061E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.387 | TFLOPs: 39.40 | 15: iteration 65800/ 125429 | consumed samples: 16844800 | consumed tokens: 34498150400 | elapsed time per iteration (s): 1.02 | learning rate: 1.044E-04 | global batch size: 256 | lm loss: 1.954917E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.879 | TFLOPs: 41.29 | 15: iteration 65810/ 125429 | consumed samples: 16847360 | consumed tokens: 34503393280 | elapsed time per iteration (s): 1.10 | learning rate: 1.044E-04 | global batch size: 256 | lm loss: 1.982973E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.786 | TFLOPs: 38.47 | 15: iteration 65820/ 125429 | consumed samples: 16849920 | consumed tokens: 34508636160 | elapsed time per iteration (s): 1.08 | learning rate: 1.044E-04 | global batch size: 256 | lm loss: 1.988027E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.961 | TFLOPs: 39.16 | 15: iteration 65830/ 125429 | consumed samples: 16852480 | consumed tokens: 34513879040 | elapsed time per iteration (s): 1.04 | learning rate: 1.043E-04 | global batch size: 256 | lm loss: 1.962669E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 245.918 | TFLOPs: 40.64 |
15: iteration 65840/ 125429 | consumed samples: 16855040 | consumed tokens: 34519121920 | elapsed time per iteration (s): 1.06 | learning rate: 1.043E-04 | global batch size: 256 | lm loss: 1.971248E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.480 | TFLOPs: 39.91 |
15: iteration 65850/ 125429 | consumed samples: 16857600 | consumed tokens: 34524364800 | elapsed time per iteration (s): 1.04 | learning rate: 1.043E-04 | global batch size: 256 | lm loss: 1.963476E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.867 | TFLOPs: 40.63 |
15: iteration 65860/ 125429 | consumed samples: 16860160 | consumed tokens: 34529607680 | elapsed time per iteration (s): 1.05 | learning rate: 1.043E-04 | global batch size: 256 | lm loss: 1.976248E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.466 | TFLOPs: 40.40 |
15: iteration 65870/ 125429 | consumed samples: 16862720 | consumed tokens: 34534850560 | elapsed time per iteration (s): 1.02 | learning rate: 1.042E-04 | global batch size: 256 | lm loss: 1.986603E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.245 | TFLOPs: 41.35 |
15: iteration 65880/ 125429 | consumed samples: 16865280 | consumed tokens: 34540093440 | elapsed time per iteration (s): 1.04 | learning rate: 1.042E-04 | global batch size: 256 | lm loss: 1.992357E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.005 | TFLOPs: 40.82 |
15: iteration 65890/ 125429 | consumed samples: 16867840 | consumed tokens: 34545336320 | elapsed time per iteration (s): 1.05 | learning rate: 1.042E-04 | global batch size: 256 | lm loss: 1.967257E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.773 | TFLOPs: 40.45 |
15: iteration 65900/ 125429 | consumed samples: 16870400 | consumed tokens: 34550579200 | elapsed time per iteration (s): 1.03 | learning rate: 1.042E-04 | global batch size: 256 | lm loss: 1.977116E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.116 | TFLOPs: 41.00 |
15: iteration 65910/ 125429 | consumed samples: 16872960 | consumed tokens: 34555822080 | elapsed time per iteration (s): 1.03 | learning rate: 1.042E-04 | global batch size: 256 | lm loss: 1.951306E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.814 | TFLOPs: 41.12 |
15: iteration 65920/ 125429 | consumed samples: 16875520 | consumed tokens: 34561064960 | elapsed time per iteration (s): 1.05 | learning rate: 1.041E-04 | global batch size: 256 | lm loss: 1.998950E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.590 | TFLOPs: 40.42 |
15: iteration 65930/ 125429 | consumed samples: 16878080 | consumed tokens: 34566307840 | elapsed time per iteration (s): 1.03 | learning rate: 1.041E-04 | global batch size: 256 | lm loss: 1.975665E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.293 | TFLOPs: 41.03 |
15: iteration 65940/ 125429 | consumed samples: 16880640 | consumed tokens: 34571550720 | elapsed time per iteration (s): 1.06 | learning rate: 1.041E-04 | global batch size: 256 | lm loss: 2.001228E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.014 | TFLOPs: 39.83 |
15: iteration 65950/ 125429 | consumed samples: 16883200 | consumed tokens: 34576793600 | elapsed time per iteration (s): 1.06 | learning rate: 1.041E-04 | global batch size: 256 | lm loss: 1.987469E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.810 | TFLOPs: 39.80 |
15: iteration 65960/ 125429 | consumed samples: 16885760 | consumed tokens: 34582036480 | elapsed time per iteration (s): 1.05 | learning rate: 1.040E-04 | global batch size: 256 | lm loss: 1.981894E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.909 | TFLOPs: 40.47 |
15: iteration 65970/ 125429 | consumed samples: 16888320 | consumed tokens: 34587279360 | elapsed time per iteration (s): 1.09 | learning rate: 1.040E-04 | global batch size: 256 | lm loss: 1.983027E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.224 | TFLOPs: 38.71 |
15: iteration 65980/ 125429 | consumed samples: 16890880 | consumed tokens: 34592522240 | elapsed time per iteration (s): 1.05 | learning rate: 1.040E-04 | global batch size: 256 | lm loss: 1.973718E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.312 | TFLOPs: 40.21 |
15: iteration 65990/ 125429 | consumed samples: 16893440 | consumed tokens: 34597765120 | elapsed time per iteration (s): 1.05 | learning rate: 1.040E-04 | global batch size: 256 | lm loss: 1.976029E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.776 | TFLOPs: 40.29 |
0: [2022-11-26 15:37:30,250] [INFO] [logging.py:68:log_dist] [Rank 0] step=66000, skipped=0, lr=[0.00010395160933830851, 0.00010395160933830851, 0.00010395160933830851], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
15: iteration 66000/ 125429 | consumed samples: 16896000 | consumed tokens: 34603008000 | elapsed time per iteration (s): 1.03 | learning rate: 1.040E-04 | global batch size: 256 | lm loss: 1.978606E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.542 | TFLOPs: 41.24 |
0: steps: 66000 loss: 1.9981 iter time (s): 1.044 samples/sec: 245.106
15: -------------------------------------------------------------------------------------------
15: valid loss at iteration 66000 | lm loss value: 1.954060E+00 | lm loss PPL: 7.057283E+00 |
15: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 66000 to checkpoints_1b5
0: [2022-11-26 15:37:30,597] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step66000 is begin to save!
0: [2022-11-26 15:37:30,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_01-model_00-model_states.pt...
0: [2022-11-26 15:37:30,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_01-model_00-model_states.pt.
0: [2022-11-26 15:37:30,882] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_03-model_00-model_states.pt...
0: [2022-11-26 15:37:30,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_03-model_00-model_states.pt.
0: [2022-11-26 15:37:30,984] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_04-model_00-model_states.pt...
0: [2022-11-26 15:37:31,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_04-model_00-model_states.pt.
0: [2022-11-26 15:37:31,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_05-model_00-model_states.pt...
0: [2022-11-26 15:37:31,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_05-model_00-model_states.pt. 0: [2022-11-26 15:37:31,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_06-model_00-model_states.pt... 0: [2022-11-26 15:37:31,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_06-model_00-model_states.pt. 0: [2022-11-26 15:37:31,296] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_07-model_00-model_states.pt... 0: [2022-11-26 15:37:31,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_07-model_00-model_states.pt. 0: [2022-11-26 15:37:31,400] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_08-model_00-model_states.pt... 0: [2022-11-26 15:37:31,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_08-model_00-model_states.pt. 0: [2022-11-26 15:37:31,507] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_09-model_00-model_states.pt... 0: [2022-11-26 15:37:31,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_09-model_00-model_states.pt. 0: [2022-11-26 15:37:31,609] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_10-model_00-model_states.pt... 0: [2022-11-26 15:37:31,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_10-model_00-model_states.pt. 0: [2022-11-26 15:37:31,716] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_11-model_00-model_states.pt... 
0: [2022-11-26 15:37:31,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_11-model_00-model_states.pt. 0: [2022-11-26 15:37:31,816] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_12-model_00-model_states.pt... 0: [2022-11-26 15:37:31,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_12-model_00-model_states.pt. 0: [2022-11-26 15:37:31,921] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_13-model_00-model_states.pt... 0: [2022-11-26 15:37:32,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_13-model_00-model_states.pt. 0: [2022-11-26 15:37:32,034] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_14-model_00-model_states.pt... 0: [2022-11-26 15:37:32,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_14-model_00-model_states.pt. 0: [2022-11-26 15:37:32,140] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_15-model_00-model_states.pt... 0: [2022-11-26 15:37:32,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_15-model_00-model_states.pt. 0: [2022-11-26 15:37:32,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_16-model_00-model_states.pt... 0: [2022-11-26 15:37:32,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_16-model_00-model_states.pt. 0: [2022-11-26 15:37:32,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_17-model_00-model_states.pt... 
0: [2022-11-26 15:37:32,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_17-model_00-model_states.pt. 0: [2022-11-26 15:37:32,475] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_18-model_00-model_states.pt... 0: [2022-11-26 15:37:32,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_18-model_00-model_states.pt. 0: [2022-11-26 15:37:32,584] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_19-model_00-model_states.pt... 0: [2022-11-26 15:37:32,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_19-model_00-model_states.pt. 0: [2022-11-26 15:37:32,698] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_20-model_00-model_states.pt... 0: [2022-11-26 15:37:32,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_20-model_00-model_states.pt. 0: [2022-11-26 15:37:32,807] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_21-model_00-model_states.pt... 0: [2022-11-26 15:37:32,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_21-model_00-model_states.pt. 0: [2022-11-26 15:37:32,917] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_22-model_00-model_states.pt... 0: [2022-11-26 15:37:33,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_22-model_00-model_states.pt. 0: [2022-11-26 15:37:33,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_23-model_00-model_states.pt... 
0: [2022-11-26 15:37:33,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_23-model_00-model_states.pt. 0: [2022-11-26 15:37:33,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_24-model_00-model_states.pt... 0: [2022-11-26 15:37:33,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_24-model_00-model_states.pt. 0: [2022-11-26 15:37:33,250] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_25-model_00-model_states.pt... 0: [2022-11-26 15:37:33,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_25-model_00-model_states.pt. 0: [2022-11-26 15:37:33,361] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_26-model_00-model_states.pt... 0: [2022-11-26 15:37:33,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_26-model_00-model_states.pt. 0: [2022-11-26 15:37:33,469] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_27-model_00-model_states.pt... 0: [2022-11-26 15:37:33,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_27-model_00-model_states.pt. 0: [2022-11-26 15:37:33,575] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_28-model_00-model_states.pt... 0: [2022-11-26 15:37:33,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_28-model_00-model_states.pt. 0: [2022-11-26 15:37:33,689] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_29-model_00-model_states.pt... 
0: [2022-11-26 15:37:33,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_29-model_00-model_states.pt. 0: [2022-11-26 15:37:33,795] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_30-model_00-model_states.pt... 0: [2022-11-26 15:37:33,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_30-model_00-model_states.pt. 0: [2022-11-26 15:37:33,900] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/layer_32-model_00-model_states.pt... 0: [2022-11-26 15:37:33,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/layer_32-model_00-model_states.pt. 0: [2022-11-26 15:37:33,905] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step66000/mp_rank_00_model_states.pt 0: [2022-11-26 15:37:33,905] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/mp_rank_00_model_states.pt... 0: [2022-11-26 15:37:33,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/mp_rank_00_model_states.pt. 0: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
0: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
8: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
15: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
6: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
3: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
8: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
9: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
10: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:37:33,947] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step66000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:37:34,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:37:34,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:37:34,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 15:37:34,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 0: [2022-11-26 15:37:34,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:37:34,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 15:37:34,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 12: [2022-11-26 15:37:34,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 15:37:34,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 15:37:34,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 8: [2022-11-26 15:37:34,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:37:34,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 15:37:34,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 14: [2022-11-26 15:37:34,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:37:34,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 15:37:34,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 6: [2022-11-26 15:37:34,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:37:34,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 15:37:34,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 6: [2022-11-26 15:37:34,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 15:37:34,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 15:37:34,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 15: [2022-11-26 15:37:34,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 15:37:34,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 10: [2022-11-26 15:37:34,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:37:34,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:37:34,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 15:37:34,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 15:37:34,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 10: [2022-11-26 15:37:34,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 14: [2022-11-26 15:37:34,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:37:34,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
14: [2022-11-26 15:37:34,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 15:37:34,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 4: [2022-11-26 15:37:34,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 15:37:34,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 4: [2022-11-26 15:37:34,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:37:34,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:37:34,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:37:34,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 15:37:34,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 10: [2022-11-26 15:37:34,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:37:34,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 15:37:34,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
6: [2022-11-26 15:37:34,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:37:34,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 15:37:34,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 1: [2022-11-26 15:37:34,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:37:34,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 15:37:34,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 9: [2022-11-26 15:37:34,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:37:34,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:37:34,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 15:37:34,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 15:37:34,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 9: [2022-11-26 15:37:34,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
4: [2022-11-26 15:37:34,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:37:34,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 15:37:34,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 4: [2022-11-26 15:37:34,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:37:34,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 15:37:34,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 2: [2022-11-26 15:37:34,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:37:34,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 15:37:34,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 14: [2022-11-26 15:37:34,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:37:34,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 11: [2022-11-26 15:37:34,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
14: [2022-11-26 15:37:34,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 12: [2022-11-26 15:37:34,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:37:34,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 15:37:34,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 1: [2022-11-26 15:37:34,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 15:37:34,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 1: [2022-11-26 15:37:34,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:37:34,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 10: [2022-11-26 15:37:34,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:37:34,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 10: [2022-11-26 15:37:34,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 15:37:34,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
1: [2022-11-26 15:37:34,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:37:34,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:37:34,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 15:37:34,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 15:37:34,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 1: [2022-11-26 15:37:34,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 15: [2022-11-26 15:37:34,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 15:37:34,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 12: [2022-11-26 15:37:34,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:37:34,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 15:37:34,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 9: [2022-11-26 15:37:34,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 15:37:34,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:37:34,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 15:37:34,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 14: [2022-11-26 15:37:34,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:37:34,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 9: [2022-11-26 15:37:34,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 9: [2022-11-26 15:37:34,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 14: [2022-11-26 15:37:34,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 10: [2022-11-26 15:37:34,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:37:34,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 15:37:34,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 1: [2022-11-26 15:37:34,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 15:37:34,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 15:37:34,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 9: [2022-11-26 15:37:34,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:37:34,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 15:37:34,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 7: [2022-11-26 15:37:34,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:37:34,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 15:37:34,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 11: [2022-11-26 15:37:34,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 7: [2022-11-26 15:37:34,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:37:34,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 11: [2022-11-26 15:37:34,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 15:37:34,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 15:37:34,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 7: [2022-11-26 15:37:34,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 15:37:34,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 9: [2022-11-26 15:37:34,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:37:34,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 15:37:34,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 11: [2022-11-26 15:37:34,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:37:34,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:37:34,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 15:37:34,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 6: [2022-11-26 15:37:34,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 15:37:34,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:37:34,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:37:34,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:37:34,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 15:37:34,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 15:37:34,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 6: [2022-11-26 15:37:34,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 15:37:34,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 15:37:34,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 6: [2022-11-26 15:37:34,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 6: [2022-11-26 15:37:34,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 7: [2022-11-26 15:37:34,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 15:37:34,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 15:37:34,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 14: [2022-11-26 15:37:34,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:37:34,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 15:37:34,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 15: [2022-11-26 15:37:34,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:37:34,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 15: [2022-11-26 15:37:34,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:37:34,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 15: [2022-11-26 15:37:34,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 15:37:34,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 15:37:34,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
15: [2022-11-26 15:37:34,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 11: [2022-11-26 15:37:34,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:37:34,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 15:37:34,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 1: [2022-11-26 15:37:34,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:37:34,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:37:34,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 7: [2022-11-26 15:37:34,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 15:37:34,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 1: [2022-11-26 15:37:34,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 11: [2022-11-26 15:37:34,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 15:37:34,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 15:37:34,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 11: [2022-11-26 15:37:34,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:37:34,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 1: [2022-11-26 15:37:34,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:37:34,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 11: [2022-11-26 15:37:34,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:37:34,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 15:37:34,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 14: [2022-11-26 15:37:34,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:37:34,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 15:37:34,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
7: [2022-11-26 15:37:34,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:37:34,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 15:37:34,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 7: [2022-11-26 15:37:34,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:37:34,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 15:37:34,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 8: [2022-11-26 15:37:34,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:37:34,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:37:34,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 15:37:34,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 15:37:34,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 8: [2022-11-26 15:37:34,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
8: [2022-11-26 15:37:34,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:37:34,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 15:37:34,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 0: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:37:34,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 15:37:34,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 9: [2022-11-26 15:37:34,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:37:34,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:37:34,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 7: [2022-11-26 15:37:34,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 3: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:37:34,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 15:37:34,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 9: [2022-11-26 15:37:34,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
3: [2022-11-26 15:37:34,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 15:37:34,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 15:37:34,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 7: [2022-11-26 15:37:34,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 3: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 3: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 3: [2022-11-26 15:37:34,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 3: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 3: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 3: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 4: [2022-11-26 15:37:34,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 15:37:34,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 15:37:34,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 4: [2022-11-26 15:37:34,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:37:34,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 15:37:34,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 4: [2022-11-26 15:37:34,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:37:34,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 15:37:34,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 4: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:37:34,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 10: [2022-11-26 15:37:34,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 15:37:34,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 15:37:34,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 8: [2022-11-26 15:37:34,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:37:34,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:37:34,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:37:34,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:37:34,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 15:37:34,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
10: [2022-11-26 15:37:34,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 12: [2022-11-26 15:37:34,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 15:37:34,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 8: [2022-11-26 15:37:34,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 8: [2022-11-26 15:37:34,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 12: [2022-11-26 15:37:34,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 10: [2022-11-26 15:37:34,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 12: [2022-11-26 15:37:34,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 8: [2022-11-26 15:37:34,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 2: [2022-11-26 15:37:34,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:37:34,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 15:37:34,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
8: [2022-11-26 15:37:34,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:37:34,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 12: [2022-11-26 15:37:34,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:37:34,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 8: [2022-11-26 15:37:34,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 12: [2022-11-26 15:37:34,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 12: [2022-11-26 15:37:34,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:37:34,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 15:37:34,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 12: [2022-11-26 15:37:34,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:37:34,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 15:37:34,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
3: [2022-11-26 15:37:34,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:37:34,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 15:37:34,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 9: [2022-11-26 15:37:34,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:37:34,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 15:37:34,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 3: [2022-11-26 15:37:34,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:37:34,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 15:37:34,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 15: [2022-11-26 15:37:34,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:37:34,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 15:37:34,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 14: [2022-11-26 15:37:34,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:37:34,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 14: [2022-11-26 15:37:34,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 15: [2022-11-26 15:37:34,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 15: [2022-11-26 15:37:34,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 14: [2022-11-26 15:37:34,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 6: [2022-11-26 15:37:34,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:37:34,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 15:37:34,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 0: [2022-11-26 15:37:34,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:37:34,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 15:37:34,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 15:37:34,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 15:37:34,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 0: [2022-11-26 15:37:34,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 15: [2022-11-26 15:37:34,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:37:34,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 15:37:34,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 15: [2022-11-26 15:37:34,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:37:34,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 15:37:34,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 8: [2022-11-26 15:37:34,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 15:37:34,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 15:37:34,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 0: [2022-11-26 15:37:34,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:37:34,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 1: [2022-11-26 15:37:34,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 15:37:34,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 0: [2022-11-26 15:37:34,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 11: [2022-11-26 15:37:34,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:37:34,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:37:34,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 15:37:34,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
11: [2022-11-26 15:37:34,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 15:37:34,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 2: [2022-11-26 15:37:34,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:37:34,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 15:37:34,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 2: [2022-11-26 15:37:34,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:37:34,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 15:37:34,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 2: [2022-11-26 15:37:34,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:37:34,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:37:34,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 0: [2022-11-26 15:37:34,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 15:37:34,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 2: [2022-11-26 15:37:34,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 0: [2022-11-26 15:37:34,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 0: [2022-11-26 15:37:34,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 15:37:34,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 1: [2022-11-26 15:37:34,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:37:34,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:37:34,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 15:37:34,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 2: [2022-11-26 15:37:34,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:37:34,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 15:37:34,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
13: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:37:34,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 15:37:34,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 13: [2022-11-26 15:37:34,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 13: [2022-11-26 15:37:34,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:37:34,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 15:37:34,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:37:34,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 13: [2022-11-26 15:37:34,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 15:37:34,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
13: [2022-11-26 15:37:34,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:37:34,155] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 15:37:34,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 13: [2022-11-26 15:37:34,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:37:34,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 15:37:34,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 13: [2022-11-26 15:37:34,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:37:34,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 15:37:34,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:37:34,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 13: [2022-11-26 15:37:34,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 15:37:34,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
1: [2022-11-26 15:37:34,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 15:37:34,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 0: [2022-11-26 15:37:34,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 15:37:34,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 5: [2022-11-26 15:37:34,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:37:34,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:37:34,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:37:34,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:37:34,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:37:34,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:37:34,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:37:34,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 15:37:34,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 15:37:34,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 15:37:34,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 15:37:34,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 15:37:34,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 15:37:34,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 15:37:34,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 15:37:34,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step66000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 15:37:34,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 5: [2022-11-26 15:37:34,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 5: [2022-11-26 15:37:34,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 5: [2022-11-26 15:37:34,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 
5: [2022-11-26 15:37:34,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 5: [2022-11-26 15:37:34,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 5: [2022-11-26 15:37:34,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 5: [2022-11-26 15:37:34,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step66000 is ready now! 0: successfully saved checkpoint at iteration 66000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3759.75 15: iteration 66010/ 125429 | consumed samples: 16898560 | consumed tokens: 34608250880 | elapsed time per iteration (s): 1.50 | learning rate: 1.039E-04 | global batch size: 256 | lm loss: 1.987801E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 170.631 | TFLOPs: 28.20 | 15: iteration 66020/ 125429 | consumed samples: 16901120 | consumed tokens: 34613493760 | elapsed time per iteration (s): 1.02 | learning rate: 1.039E-04 | global batch size: 256 | lm loss: 1.961190E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.830 | TFLOPs: 41.29 | 15: iteration 66030/ 125429 | consumed samples: 16903680 | consumed tokens: 34618736640 | elapsed time per iteration (s): 1.03 | learning rate: 1.039E-04 | global batch size: 256 | lm loss: 1.980842E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.969 | TFLOPs: 40.98 | 15: iteration 66040/ 125429 | consumed samples: 16906240 | consumed tokens: 34623979520 | elapsed time per iteration (s): 1.03 | learning rate: 1.039E-04 | global batch size: 256 | lm loss: 1.998083E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.434 | 
TFLOPs: 41.22 | 15: iteration 66050/ 125429 | consumed samples: 16908800 | consumed tokens: 34629222400 | elapsed time per iteration (s): 1.03 | learning rate: 1.038E-04 | global batch size: 256 | lm loss: 2.006696E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.350 | TFLOPs: 41.04 | 15: iteration 66060/ 125429 | consumed samples: 16911360 | consumed tokens: 34634465280 | elapsed time per iteration (s): 1.03 | learning rate: 1.038E-04 | global batch size: 256 | lm loss: 1.969323E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.456 | TFLOPs: 40.89 | 15: iteration 66070/ 125429 | consumed samples: 16913920 | consumed tokens: 34639708160 | elapsed time per iteration (s): 1.03 | learning rate: 1.038E-04 | global batch size: 256 | lm loss: 1.957491E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.654 | TFLOPs: 41.26 | 15: iteration 66080/ 125429 | consumed samples: 16916480 | consumed tokens: 34644951040 | elapsed time per iteration (s): 1.05 | learning rate: 1.038E-04 | global batch size: 256 | lm loss: 2.006765E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.834 | TFLOPs: 40.30 | 15: iteration 66090/ 125429 | consumed samples: 16919040 | consumed tokens: 34650193920 | elapsed time per iteration (s): 1.02 | learning rate: 1.037E-04 | global batch size: 256 | lm loss: 1.971730E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.163 | TFLOPs: 41.51 | 15: iteration 66100/ 125429 | consumed samples: 16921600 | consumed tokens: 34655436800 | elapsed time per iteration (s): 1.05 | learning rate: 1.037E-04 | global batch size: 256 | lm loss: 1.986698E+00 | grad norm: 0.136 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.585 | TFLOPs: 40.42 | 15: iteration 66110/ 125429 | consumed samples: 16924160 | consumed tokens: 34660679680 | elapsed time per iteration (s): 1.04 | learning rate: 1.037E-04 | global batch size: 256 | lm loss: 1.988951E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.542 | TFLOPs: 40.58 | 15: iteration 66120/ 125429 | consumed samples: 16926720 | consumed tokens: 34665922560 | elapsed time per iteration (s): 1.03 | learning rate: 1.037E-04 | global batch size: 256 | lm loss: 1.956300E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.070 | TFLOPs: 41.00 | 15: iteration 66130/ 125429 | consumed samples: 16929280 | consumed tokens: 34671165440 | elapsed time per iteration (s): 1.03 | learning rate: 1.037E-04 | global batch size: 256 | lm loss: 1.969010E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.524 | TFLOPs: 41.24 | 15: iteration 66140/ 125429 | consumed samples: 16931840 | consumed tokens: 34676408320 | elapsed time per iteration (s): 1.03 | learning rate: 1.036E-04 | global batch size: 256 | lm loss: 1.974123E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.277 | TFLOPs: 41.19 | 15: iteration 66150/ 125429 | consumed samples: 16934400 | consumed tokens: 34681651200 | elapsed time per iteration (s): 1.04 | learning rate: 1.036E-04 | global batch size: 256 | lm loss: 1.960970E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.214 | TFLOPs: 40.85 | 15: iteration 66160/ 125429 | consumed samples: 16936960 | consumed tokens: 34686894080 | elapsed time per iteration (s): 
1.02 | learning rate: 1.036E-04 | global batch size: 256 | lm loss: 1.986878E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.802 | TFLOPs: 41.28 | 15: iteration 66170/ 125429 | consumed samples: 16939520 | consumed tokens: 34692136960 | elapsed time per iteration (s): 1.05 | learning rate: 1.036E-04 | global batch size: 256 | lm loss: 1.957879E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.637 | TFLOPs: 40.26 | 15: iteration 66180/ 125429 | consumed samples: 16942080 | consumed tokens: 34697379840 | elapsed time per iteration (s): 1.02 | learning rate: 1.035E-04 | global batch size: 256 | lm loss: 1.967706E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.149 | TFLOPs: 41.50 | 15: iteration 66190/ 125429 | consumed samples: 16944640 | consumed tokens: 34702622720 | elapsed time per iteration (s): 1.04 | learning rate: 1.035E-04 | global batch size: 256 | lm loss: 1.958178E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.361 | TFLOPs: 40.71 | 15: iteration 66200/ 125429 | consumed samples: 16947200 | consumed tokens: 34707865600 | elapsed time per iteration (s): 1.03 | learning rate: 1.035E-04 | global batch size: 256 | lm loss: 1.989628E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.161 | TFLOPs: 41.01 | 15: iteration 66210/ 125429 | consumed samples: 16949760 | consumed tokens: 34713108480 | elapsed time per iteration (s): 1.03 | learning rate: 1.035E-04 | global batch size: 256 | lm loss: 1.967846E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.346 | TFLOPs: 40.88 | 15: iteration 
66220/ 125429 | consumed samples: 16952320 | consumed tokens: 34718351360 | elapsed time per iteration (s): 1.06 | learning rate: 1.035E-04 | global batch size: 256 | lm loss: 1.988020E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.208 | TFLOPs: 40.03 | 15: iteration 66230/ 125429 | consumed samples: 16954880 | consumed tokens: 34723594240 | elapsed time per iteration (s): 1.04 | learning rate: 1.034E-04 | global batch size: 256 | lm loss: 1.982362E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.491 | TFLOPs: 40.57 | 15: iteration 66240/ 125429 | consumed samples: 16957440 | consumed tokens: 34728837120 | elapsed time per iteration (s): 1.15 | learning rate: 1.034E-04 | global batch size: 256 | lm loss: 2.009361E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.103 | TFLOPs: 36.70 | 15: iteration 66250/ 125429 | consumed samples: 16960000 | consumed tokens: 34734080000 | elapsed time per iteration (s): 1.04 | learning rate: 1.034E-04 | global batch size: 256 | lm loss: 1.968352E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.679 | TFLOPs: 40.77 | 15: iteration 66260/ 125429 | consumed samples: 16962560 | consumed tokens: 34739322880 | elapsed time per iteration (s): 1.04 | learning rate: 1.034E-04 | global batch size: 256 | lm loss: 1.985355E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.616 | TFLOPs: 40.59 | 15: iteration 66270/ 125429 | consumed samples: 16965120 | consumed tokens: 34744565760 | elapsed time per iteration (s): 1.05 | learning rate: 1.033E-04 | global batch size: 256 | lm loss: 1.966641E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 244.208 | TFLOPs: 40.36 | 15: iteration 66280/ 125429 | consumed samples: 16967680 | consumed tokens: 34749808640 | elapsed time per iteration (s): 1.06 | learning rate: 1.033E-04 | global batch size: 256 | lm loss: 1.948384E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.428 | TFLOPs: 40.06 | 15: iteration 66290/ 125429 | consumed samples: 16970240 | consumed tokens: 34755051520 | elapsed time per iteration (s): 1.07 | learning rate: 1.033E-04 | global batch size: 256 | lm loss: 1.984052E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.570 | TFLOPs: 39.59 | 15: iteration 66300/ 125429 | consumed samples: 16972800 | consumed tokens: 34760294400 | elapsed time per iteration (s): 1.04 | learning rate: 1.033E-04 | global batch size: 256 | lm loss: 1.969314E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.107 | TFLOPs: 40.51 | 15: iteration 66310/ 125429 | consumed samples: 16975360 | consumed tokens: 34765537280 | elapsed time per iteration (s): 1.07 | learning rate: 1.032E-04 | global batch size: 256 | lm loss: 1.955915E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.165 | TFLOPs: 39.69 | 15: iteration 66320/ 125429 | consumed samples: 16977920 | consumed tokens: 34770780160 | elapsed time per iteration (s): 1.09 | learning rate: 1.032E-04 | global batch size: 256 | lm loss: 1.967499E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.269 | TFLOPs: 38.71 | 15: iteration 66330/ 125429 | consumed samples: 16980480 | consumed tokens: 34776023040 | elapsed time per iteration (s): 1.15 | learning rate: 
1.032E-04 | global batch size: 256 | lm loss: 1.992727E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.079 | TFLOPs: 36.70 | 15: iteration 66340/ 125429 | consumed samples: 16983040 | consumed tokens: 34781265920 | elapsed time per iteration (s): 1.16 | learning rate: 1.032E-04 | global batch size: 256 | lm loss: 1.964680E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 220.619 | TFLOPs: 36.46 | 15: iteration 66350/ 125429 | consumed samples: 16985600 | consumed tokens: 34786508800 | elapsed time per iteration (s): 1.17 | learning rate: 1.032E-04 | global batch size: 256 | lm loss: 1.995197E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 218.115 | TFLOPs: 36.05 | 15: iteration 66360/ 125429 | consumed samples: 16988160 | consumed tokens: 34791751680 | elapsed time per iteration (s): 1.16 | learning rate: 1.031E-04 | global batch size: 256 | lm loss: 1.988938E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 220.113 | TFLOPs: 36.38 | 15: iteration 66370/ 125429 | consumed samples: 16990720 | consumed tokens: 34796994560 | elapsed time per iteration (s): 1.07 | learning rate: 1.031E-04 | global batch size: 256 | lm loss: 1.980659E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.424 | TFLOPs: 39.57 | 15: iteration 66380/ 125429 | consumed samples: 16993280 | consumed tokens: 34802237440 | elapsed time per iteration (s): 1.05 | learning rate: 1.031E-04 | global batch size: 256 | lm loss: 2.005622E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.644 | TFLOPs: 40.26 | 15: iteration 66390/ 125429 | 
consumed samples: 16995840 | consumed tokens: 34807480320 | elapsed time per iteration (s): 1.06 | learning rate: 1.031E-04 | global batch size: 256 | lm loss: 1.980955E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.347 | TFLOPs: 39.88 | 15: iteration 66400/ 125429 | consumed samples: 16998400 | consumed tokens: 34812723200 | elapsed time per iteration (s): 1.04 | learning rate: 1.030E-04 | global batch size: 256 | lm loss: 1.967253E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.377 | TFLOPs: 40.55 | 15: iteration 66410/ 125429 | consumed samples: 17000960 | consumed tokens: 34817966080 | elapsed time per iteration (s): 1.05 | learning rate: 1.030E-04 | global batch size: 256 | lm loss: 1.969254E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.939 | TFLOPs: 40.48 | 15: iteration 66420/ 125429 | consumed samples: 17003520 | consumed tokens: 34823208960 | elapsed time per iteration (s): 1.05 | learning rate: 1.030E-04 | global batch size: 256 | lm loss: 1.958665E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.650 | TFLOPs: 40.27 | 15: iteration 66430/ 125429 | consumed samples: 17006080 | consumed tokens: 34828451840 | elapsed time per iteration (s): 1.02 | learning rate: 1.030E-04 | global batch size: 256 | lm loss: 1.996300E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.818 | TFLOPs: 41.45 | 15: iteration 66440/ 125429 | consumed samples: 17008640 | consumed tokens: 34833694720 | elapsed time per iteration (s): 1.05 | learning rate: 1.030E-04 | global batch size: 256 | lm loss: 1.997898E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 243.329 | TFLOPs: 40.21 | 15: iteration 66450/ 125429 | consumed samples: 17011200 | consumed tokens: 34838937600 | elapsed time per iteration (s): 1.04 | learning rate: 1.029E-04 | global batch size: 256 | lm loss: 1.957193E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.169 | TFLOPs: 40.68 | 15: iteration 66460/ 125429 | consumed samples: 17013760 | consumed tokens: 34844180480 | elapsed time per iteration (s): 1.06 | learning rate: 1.029E-04 | global batch size: 256 | lm loss: 1.987494E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.399 | TFLOPs: 39.89 | 15: iteration 66470/ 125429 | consumed samples: 17016320 | consumed tokens: 34849423360 | elapsed time per iteration (s): 1.07 | learning rate: 1.029E-04 | global batch size: 256 | lm loss: 1.966142E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.672 | TFLOPs: 39.61 | 15: iteration 66480/ 125429 | consumed samples: 17018880 | consumed tokens: 34854666240 | elapsed time per iteration (s): 1.03 | learning rate: 1.029E-04 | global batch size: 256 | lm loss: 1.971360E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.605 | TFLOPs: 41.25 | 15: iteration 66490/ 125429 | consumed samples: 17021440 | consumed tokens: 34859909120 | elapsed time per iteration (s): 1.07 | learning rate: 1.028E-04 | global batch size: 256 | lm loss: 1.959640E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.166 | TFLOPs: 39.52 | 15: iteration 66500/ 125429 | consumed samples: 17024000 | consumed tokens: 34865152000 | elapsed time per iteration (s): 1.10 | learning rate: 1.028E-04 | global batch 
size: 256 | lm loss: 2.006839E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.289 | TFLOPs: 38.55 | 15: iteration 66510/ 125429 | consumed samples: 17026560 | consumed tokens: 34870394880 | elapsed time per iteration (s): 1.03 | learning rate: 1.028E-04 | global batch size: 256 | lm loss: 1.981877E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.866 | TFLOPs: 41.13 | 15: iteration 66520/ 125429 | consumed samples: 17029120 | consumed tokens: 34875637760 | elapsed time per iteration (s): 1.03 | learning rate: 1.028E-04 | global batch size: 256 | lm loss: 1.999725E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.394 | TFLOPs: 40.88 | 15: iteration 66530/ 125429 | consumed samples: 17031680 | consumed tokens: 34880880640 | elapsed time per iteration (s): 1.02 | learning rate: 1.027E-04 | global batch size: 256 | lm loss: 1.968923E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.002 | TFLOPs: 41.31 | 15: iteration 66540/ 125429 | consumed samples: 17034240 | consumed tokens: 34886123520 | elapsed time per iteration (s): 1.11 | learning rate: 1.027E-04 | global batch size: 256 | lm loss: 1.994950E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.923 | TFLOPs: 38.16 | 15: iteration 66550/ 125429 | consumed samples: 17036800 | consumed tokens: 34891366400 | elapsed time per iteration (s): 1.05 | learning rate: 1.027E-04 | global batch size: 256 | lm loss: 1.989081E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.932 | TFLOPs: 40.48 | 15: iteration 66560/ 125429 | consumed samples: 17039360 | 
consumed tokens: 34896609280 | elapsed time per iteration (s): 1.07 | learning rate: 1.027E-04 | global batch size: 256 | lm loss: 2.006192E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.732 | TFLOPs: 39.62 | 15: iteration 66570/ 125429 | consumed samples: 17041920 | consumed tokens: 34901852160 | elapsed time per iteration (s): 1.03 | learning rate: 1.027E-04 | global batch size: 256 | lm loss: 1.959869E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.953 | TFLOPs: 40.98 | 15: iteration 66580/ 125429 | consumed samples: 17044480 | consumed tokens: 34907095040 | elapsed time per iteration (s): 1.15 | learning rate: 1.026E-04 | global batch size: 256 | lm loss: 1.970354E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.642 | TFLOPs: 36.79 | 15: iteration 66590/ 125429 | consumed samples: 17047040 | consumed tokens: 34912337920 | elapsed time per iteration (s): 1.03 | learning rate: 1.026E-04 | global batch size: 256 | lm loss: 1.982233E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.488 | TFLOPs: 41.06 | 15: iteration 66600/ 125429 | consumed samples: 17049600 | consumed tokens: 34917580800 | elapsed time per iteration (s): 1.05 | learning rate: 1.026E-04 | global batch size: 256 | lm loss: 1.996535E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.477 | TFLOPs: 40.40 | 15: iteration 66610/ 125429 | consumed samples: 17052160 | consumed tokens: 34922823680 | elapsed time per iteration (s): 1.05 | learning rate: 1.026E-04 | global batch size: 256 | lm loss: 1.965654E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 244.354 | TFLOPs: 40.38 | 15: iteration 66620/ 125429 | consumed samples: 17054720 | consumed tokens: 34928066560 | elapsed time per iteration (s): 1.19 | learning rate: 1.025E-04 | global batch size: 256 | lm loss: 1.997474E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.117 | TFLOPs: 35.55 | 15: iteration 66630/ 125429 | consumed samples: 17057280 | consumed tokens: 34933309440 | elapsed time per iteration (s): 1.09 | learning rate: 1.025E-04 | global batch size: 256 | lm loss: 1.984282E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.615 | TFLOPs: 38.94 | 15: iteration 66640/ 125429 | consumed samples: 17059840 | consumed tokens: 34938552320 | elapsed time per iteration (s): 1.05 | learning rate: 1.025E-04 | global batch size: 256 | lm loss: 1.976700E+00 | grad norm: 0.918 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.553 | TFLOPs: 40.25 | 15: iteration 66650/ 125429 | consumed samples: 17062400 | consumed tokens: 34943795200 | elapsed time per iteration (s): 1.04 | learning rate: 1.025E-04 | global batch size: 256 | lm loss: 2.007379E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.414 | TFLOPs: 40.72 | 15: iteration 66660/ 125429 | consumed samples: 17064960 | consumed tokens: 34949038080 | elapsed time per iteration (s): 1.05 | learning rate: 1.025E-04 | global batch size: 256 | lm loss: 1.974505E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.780 | TFLOPs: 40.29 | 15: iteration 66670/ 125429 | consumed samples: 17067520 | consumed tokens: 34954280960 | elapsed time per iteration (s): 1.03 | learning rate: 1.024E-04 | global batch size: 256 | lm loss: 
1.973262E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.192 | TFLOPs: 41.02 | 15: iteration 66680/ 125429 | consumed samples: 17070080 | consumed tokens: 34959523840 | elapsed time per iteration (s): 1.04 | learning rate: 1.024E-04 | global batch size: 256 | lm loss: 2.026496E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.006 | TFLOPs: 40.49 | 15: iteration 66690/ 125429 | consumed samples: 17072640 | consumed tokens: 34964766720 | elapsed time per iteration (s): 1.05 | learning rate: 1.024E-04 | global batch size: 256 | lm loss: 1.987727E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.738 | TFLOPs: 40.44 | 15: iteration 66700/ 125429 | consumed samples: 17075200 | consumed tokens: 34970009600 | elapsed time per iteration (s): 1.06 | learning rate: 1.024E-04 | global batch size: 256 | lm loss: 1.974215E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.281 | TFLOPs: 40.04 | 15: iteration 66710/ 125429 | consumed samples: 17077760 | consumed tokens: 34975252480 | elapsed time per iteration (s): 1.05 | learning rate: 1.023E-04 | global batch size: 256 | lm loss: 1.980284E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.834 | TFLOPs: 40.13 | 15: iteration 66720/ 125429 | consumed samples: 17080320 | consumed tokens: 34980495360 | elapsed time per iteration (s): 1.06 | learning rate: 1.023E-04 | global batch size: 256 | lm loss: 1.976868E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.183 | TFLOPs: 39.86 | 15: iteration 66730/ 125429 | consumed samples: 17082880 | consumed tokens: 
34985738240 | elapsed time per iteration (s): 1.04 | learning rate: 1.023E-04 | global batch size: 256 | lm loss: 1.991711E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.589 | TFLOPs: 40.59 | 15: iteration 66740/ 125429 | consumed samples: 17085440 | consumed tokens: 34990981120 | elapsed time per iteration (s): 1.07 | learning rate: 1.023E-04 | global batch size: 256 | lm loss: 1.988756E+00 | grad norm: 0.280 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.752 | TFLOPs: 39.62 | 15: iteration 66750/ 125429 | consumed samples: 17088000 | consumed tokens: 34996224000 | elapsed time per iteration (s): 1.07 | learning rate: 1.022E-04 | global batch size: 256 | lm loss: 1.989357E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.172 | TFLOPs: 39.69 | 15: iteration 66760/ 125429 | consumed samples: 17090560 | consumed tokens: 35001466880 | elapsed time per iteration (s): 1.05 | learning rate: 1.022E-04 | global batch size: 256 | lm loss: 1.941971E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.598 | TFLOPs: 40.42 | 15: iteration 66770/ 125429 | consumed samples: 17093120 | consumed tokens: 35006709760 | elapsed time per iteration (s): 1.03 | learning rate: 1.022E-04 | global batch size: 256 | lm loss: 1.976746E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.948 | TFLOPs: 40.98 | 15: iteration 66780/ 125429 | consumed samples: 17095680 | consumed tokens: 35011952640 | elapsed time per iteration (s): 1.04 | learning rate: 1.022E-04 | global batch size: 256 | lm loss: 1.979951E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 246.366 | TFLOPs: 40.71 | 15: iteration 66790/ 125429 | consumed samples: 17098240 | consumed tokens: 35017195520 | elapsed time per iteration (s): 1.05 | learning rate: 1.022E-04 | global batch size: 256 | lm loss: 1.964605E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.450 | TFLOPs: 40.23 | 15: iteration 66800/ 125429 | consumed samples: 17100800 | consumed tokens: 35022438400 | elapsed time per iteration (s): 1.03 | learning rate: 1.021E-04 | global batch size: 256 | lm loss: 1.949257E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.967 | TFLOPs: 40.98 | 15: iteration 66810/ 125429 | consumed samples: 17103360 | consumed tokens: 35027681280 | elapsed time per iteration (s): 1.06 | learning rate: 1.021E-04 | global batch size: 256 | lm loss: 1.974651E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.907 | TFLOPs: 39.98 | 15: iteration 66820/ 125429 | consumed samples: 17105920 | consumed tokens: 35032924160 | elapsed time per iteration (s): 1.03 | learning rate: 1.021E-04 | global batch size: 256 | lm loss: 1.978974E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.056 | TFLOPs: 41.16 | 15: iteration 66830/ 125429 | consumed samples: 17108480 | consumed tokens: 35038167040 | elapsed time per iteration (s): 1.04 | learning rate: 1.021E-04 | global batch size: 256 | lm loss: 1.990589E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.405 | TFLOPs: 40.56 | 15: iteration 66840/ 125429 | consumed samples: 17111040 | consumed tokens: 35043409920 | elapsed time per iteration (s): 1.11 | learning rate: 1.020E-04 | global batch size: 256 | lm loss: 1.977561E+00 | grad 
norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.750 | TFLOPs: 37.97 |
15: iteration 66850/ 125429 | consumed samples: 17113600 | consumed tokens: 35048652800 | elapsed time per iteration (s): 1.03 | learning rate: 1.020E-04 | global batch size: 256 | lm loss: 1.968341E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.107 | TFLOPs: 41.17 |
15: iteration 66860/ 125429 | consumed samples: 17116160 | consumed tokens: 35053895680 | elapsed time per iteration (s): 1.04 | learning rate: 1.020E-04 | global batch size: 256 | lm loss: 1.963345E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.329 | TFLOPs: 40.71 |
15: iteration 66870/ 125429 | consumed samples: 17118720 | consumed tokens: 35059138560 | elapsed time per iteration (s): 1.03 | learning rate: 1.020E-04 | global batch size: 256 | lm loss: 1.980973E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.733 | TFLOPs: 41.27 |
15: iteration 66880/ 125429 | consumed samples: 17121280 | consumed tokens: 35064381440 | elapsed time per iteration (s): 1.07 | learning rate: 1.020E-04 | global batch size: 256 | lm loss: 1.962217E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.713 | TFLOPs: 39.45 |
15: iteration 66890/ 125429 | consumed samples: 17123840 | consumed tokens: 35069624320 | elapsed time per iteration (s): 1.02 | learning rate: 1.019E-04 | global batch size: 256 | lm loss: 1.963667E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.210 | TFLOPs: 41.35 |
15: iteration 66900/ 125429 | consumed samples: 17126400 | consumed tokens: 35074867200 | elapsed time per iteration (s): 1.03 | learning rate: 1.019E-04 | global batch size: 256 | lm loss: 2.023058E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.554 | TFLOPs: 40.91 |
15: iteration 66910/ 125429 | consumed samples: 17128960 | consumed tokens: 35080110080 | elapsed time per iteration (s): 1.03 | learning rate: 1.019E-04 | global batch size: 256 | lm loss: 1.989548E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.329 | TFLOPs: 41.20 |
15: iteration 66920/ 125429 | consumed samples: 17131520 | consumed tokens: 35085352960 | elapsed time per iteration (s): 1.05 | learning rate: 1.019E-04 | global batch size: 256 | lm loss: 1.983768E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.856 | TFLOPs: 40.13 |
15: iteration 66930/ 125429 | consumed samples: 17134080 | consumed tokens: 35090595840 | elapsed time per iteration (s): 1.07 | learning rate: 1.018E-04 | global batch size: 256 | lm loss: 1.974761E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.593 | TFLOPs: 39.43 |
15: iteration 66940/ 125429 | consumed samples: 17136640 | consumed tokens: 35095838720 | elapsed time per iteration (s): 1.04 | learning rate: 1.018E-04 | global batch size: 256 | lm loss: 1.984047E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.652 | TFLOPs: 40.76 |
15: iteration 66950/ 125429 | consumed samples: 17139200 | consumed tokens: 35101081600 | elapsed time per iteration (s): 1.05 | learning rate: 1.018E-04 | global batch size: 256 | lm loss: 2.023108E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.254 | TFLOPs: 40.20 |
15: iteration 66960/ 125429 | consumed samples: 17141760 | consumed tokens: 35106324480 | elapsed time per iteration (s): 1.02 | learning rate: 1.018E-04 | global batch size: 256 | lm loss: 1.991751E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.400 | TFLOPs: 41.55 |
15: iteration 66970/ 125429 | consumed samples: 17144320 | consumed tokens: 35111567360 | elapsed time per iteration (s): 1.04 | learning rate: 1.017E-04 | global batch size: 256 | lm loss: 1.960229E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.951 | TFLOPs: 40.65 |
15: iteration 66980/ 125429 | consumed samples: 17146880 | consumed tokens: 35116810240 | elapsed time per iteration (s): 1.02 | learning rate: 1.017E-04 | global batch size: 256 | lm loss: 1.990848E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.788 | TFLOPs: 41.44 |
15: iteration 66990/ 125429 | consumed samples: 17149440 | consumed tokens: 35122053120 | elapsed time per iteration (s): 1.05 | learning rate: 1.017E-04 | global batch size: 256 | lm loss: 2.016901E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.924 | TFLOPs: 40.31 |
15: iteration 67000/ 125429 | consumed samples: 17152000 | consumed tokens: 35127296000 | elapsed time per iteration (s): 1.03 | learning rate: 1.017E-04 | global batch size: 256 | lm loss: 1.988968E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.416 | TFLOPs: 41.05 |
15: -------------------------------------------------------------------------------------------
15: valid loss at iteration 67000 | lm loss value: 1.940704E+00 | lm loss PPL: 6.963648E+00 |
15: 
------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 67000 to checkpoints_1b5 0: [2022-11-26 15:55:09,048] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step67000 is begin to save! 0: [2022-11-26 15:55:09,057] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_01-model_00-model_states.pt... 0: [2022-11-26 15:55:09,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_01-model_00-model_states.pt. 0: [2022-11-26 15:55:09,322] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_03-model_00-model_states.pt... 0: [2022-11-26 15:55:09,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_03-model_00-model_states.pt. 0: [2022-11-26 15:55:09,426] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_04-model_00-model_states.pt... 0: [2022-11-26 15:55:09,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_04-model_00-model_states.pt. 0: [2022-11-26 15:55:09,541] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_05-model_00-model_states.pt... 0: [2022-11-26 15:55:09,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_05-model_00-model_states.pt. 0: [2022-11-26 15:55:09,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_06-model_00-model_states.pt... 0: [2022-11-26 15:55:09,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_06-model_00-model_states.pt. 0: [2022-11-26 15:55:09,773] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_07-model_00-model_states.pt... 
0: [2022-11-26 15:55:09,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_07-model_00-model_states.pt. 0: [2022-11-26 15:55:09,879] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_08-model_00-model_states.pt... 0: [2022-11-26 15:55:09,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_08-model_00-model_states.pt. 0: [2022-11-26 15:55:09,984] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_09-model_00-model_states.pt... 0: [2022-11-26 15:55:10,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_09-model_00-model_states.pt. 0: [2022-11-26 15:55:10,093] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_10-model_00-model_states.pt... 0: [2022-11-26 15:55:10,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_10-model_00-model_states.pt. 0: [2022-11-26 15:55:10,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_11-model_00-model_states.pt... 0: [2022-11-26 15:55:10,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_11-model_00-model_states.pt. 0: [2022-11-26 15:55:10,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_12-model_00-model_states.pt... 0: [2022-11-26 15:55:10,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_12-model_00-model_states.pt. 0: [2022-11-26 15:55:10,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_13-model_00-model_states.pt... 
0: [2022-11-26 15:55:10,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_13-model_00-model_states.pt. 0: [2022-11-26 15:55:10,527] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_14-model_00-model_states.pt... 0: [2022-11-26 15:55:10,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_14-model_00-model_states.pt. 0: [2022-11-26 15:55:10,628] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_15-model_00-model_states.pt... 0: [2022-11-26 15:55:10,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_15-model_00-model_states.pt. 0: [2022-11-26 15:55:10,746] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_16-model_00-model_states.pt... 0: [2022-11-26 15:55:10,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_16-model_00-model_states.pt. 0: [2022-11-26 15:55:10,850] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_17-model_00-model_states.pt... 0: [2022-11-26 15:55:10,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_17-model_00-model_states.pt. 0: [2022-11-26 15:55:10,958] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_18-model_00-model_states.pt... 0: [2022-11-26 15:55:11,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_18-model_00-model_states.pt. 0: [2022-11-26 15:55:11,065] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_19-model_00-model_states.pt... 
0: [2022-11-26 15:55:11,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_19-model_00-model_states.pt. 0: [2022-11-26 15:55:11,185] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_20-model_00-model_states.pt... 0: [2022-11-26 15:55:11,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_20-model_00-model_states.pt. 0: [2022-11-26 15:55:11,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_21-model_00-model_states.pt... 0: [2022-11-26 15:55:11,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_21-model_00-model_states.pt. 0: [2022-11-26 15:55:11,410] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_22-model_00-model_states.pt... 0: [2022-11-26 15:55:11,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_22-model_00-model_states.pt. 0: [2022-11-26 15:55:11,522] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_23-model_00-model_states.pt... 0: [2022-11-26 15:55:11,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_23-model_00-model_states.pt. 0: [2022-11-26 15:55:11,630] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_24-model_00-model_states.pt... 0: [2022-11-26 15:55:11,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_24-model_00-model_states.pt. 0: [2022-11-26 15:55:11,748] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_25-model_00-model_states.pt... 
0: [2022-11-26 15:55:11,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_25-model_00-model_states.pt. 0: [2022-11-26 15:55:11,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_26-model_00-model_states.pt... 0: [2022-11-26 15:55:11,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_26-model_00-model_states.pt. 0: [2022-11-26 15:55:11,968] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_27-model_00-model_states.pt... 0: [2022-11-26 15:55:12,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_27-model_00-model_states.pt. 0: [2022-11-26 15:55:12,078] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_28-model_00-model_states.pt... 0: [2022-11-26 15:55:12,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_28-model_00-model_states.pt. 0: [2022-11-26 15:55:12,199] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_29-model_00-model_states.pt... 0: [2022-11-26 15:55:12,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_29-model_00-model_states.pt. 0: [2022-11-26 15:55:12,307] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_30-model_00-model_states.pt... 0: [2022-11-26 15:55:12,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_30-model_00-model_states.pt. 0: [2022-11-26 15:55:12,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/layer_32-model_00-model_states.pt... 
0: [2022-11-26 15:55:12,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/layer_32-model_00-model_states.pt. 0: [2022-11-26 15:55:12,421] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step67000/mp_rank_00_model_states.pt 0: [2022-11-26 15:55:12,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/mp_rank_00_model_states.pt... 0: [2022-11-26 15:55:12,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/mp_rank_00_model_states.pt. 0: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
9: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
14: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
13: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
11: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
7: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
12: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 5: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
7: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 11: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 1: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 6: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 2: [2022-11-26 15:55:12,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step67000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 7: [2022-11-26 15:55:12,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 15:55:12,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 15:55:12,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 7: [2022-11-26 15:55:12,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:55:12,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 15:55:12,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 7: [2022-11-26 15:55:12,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:55:12,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 15:55:12,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 7: [2022-11-26 15:55:12,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:55:12,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 15:55:12,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 0: [2022-11-26 15:55:12,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 15:55:12,651] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 15:55:12,651] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 0: [2022-11-26 15:55:12,651] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:55:12,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:55:12,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:55:12,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 15:55:12,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 15:55:12,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 11: [2022-11-26 15:55:12,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 0: [2022-11-26 15:55:12,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:55:12,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 15:55:12,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
0: [2022-11-26 15:55:12,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:55:12,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 15:55:12,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 0: [2022-11-26 15:55:12,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:55:12,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 15:55:12,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 7: [2022-11-26 15:55:12,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:55:12,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 15:55:12,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 7: [2022-11-26 15:55:12,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:55:12,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 15:55:12,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
11: [2022-11-26 15:55:12,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:55:12,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 15:55:12,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 7: [2022-11-26 15:55:12,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:55:12,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 15:55:12,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 5: [2022-11-26 15:55:12,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:55:12,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 15:55:12,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 10: [2022-11-26 15:55:12,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:55:12,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 15:55:12,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
7: [2022-11-26 15:55:12,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 15:55:12,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 15:55:12,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 10: [2022-11-26 15:55:12,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:55:12,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 15:55:12,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 5: [2022-11-26 15:55:12,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:55:12,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 15:55:12,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 14: [2022-11-26 15:55:12,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:55:12,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 15:55:12,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 15:55:12,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 15:55:12,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 14: [2022-11-26 15:55:12,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 4: [2022-11-26 15:55:12,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:55:12,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:55:12,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 15:55:12,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 15:55:12,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 4: [2022-11-26 15:55:12,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 6: [2022-11-26 15:55:12,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 15:55:12,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 15:55:12,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 6: [2022-11-26 15:55:12,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:55:12,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 15:55:12,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 6: [2022-11-26 15:55:12,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:55:12,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:55:12,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 15:55:12,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 15:55:12,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 6: [2022-11-26 15:55:12,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 3: [2022-11-26 15:55:12,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 15:55:12,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 15:55:12,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 3: [2022-11-26 15:55:12,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:55:12,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 15:55:12,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 11: [2022-11-26 15:55:12,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:55:12,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:55:12,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 15:55:12,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 6: [2022-11-26 15:55:12,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:55:12,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 15:55:12,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
5: [2022-11-26 15:55:12,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:55:12,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 15:55:12,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 5: [2022-11-26 15:55:12,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:55:12,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 3: [2022-11-26 15:55:12,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:55:12,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 3: [2022-11-26 15:55:12,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 15:55:12,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 6: [2022-11-26 15:55:12,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 15:55:12,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 15:55:12,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
3: [2022-11-26 15:55:12,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:55:12,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 15:55:12,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 10: [2022-11-26 15:55:12,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:55:12,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 15:55:12,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 10: [2022-11-26 15:55:12,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:55:12,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 15:55:12,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 10: [2022-11-26 15:55:12,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:55:12,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 15:55:12,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
5: [2022-11-26 15:55:12,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:55:12,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 15:55:12,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 10: [2022-11-26 15:55:12,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:55:12,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 15:55:12,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 8: [2022-11-26 15:55:12,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:55:12,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:55:12,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 15:55:12,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 15:55:12,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 15:55:12,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 15:55:12,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 8: [2022-11-26 15:55:12,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 8: [2022-11-26 15:55:12,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 8: [2022-11-26 15:55:12,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:55:12,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 15:55:12,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 8: [2022-11-26 15:55:12,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:55:12,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 15:55:12,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 15:55:12,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 8: [2022-11-26 15:55:12,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 15:55:12,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 8: [2022-11-26 15:55:12,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:55:12,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 15:55:12,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 8: [2022-11-26 15:55:12,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 15:55:12,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 15:55:12,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 4: [2022-11-26 15:55:12,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:55:12,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 15:55:12,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
4: [2022-11-26 15:55:12,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:55:12,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 15:55:12,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 4: [2022-11-26 15:55:12,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:55:12,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 15:55:12,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 10: [2022-11-26 15:55:12,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:55:12,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:55:12,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 15:55:12,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 4: [2022-11-26 15:55:12,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 15:55:12,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
11: [2022-11-26 15:55:12,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 15:55:12,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 11: [2022-11-26 15:55:12,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:55:12,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 15:55:12,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 11: [2022-11-26 15:55:12,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:55:12,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 15:55:12,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 11: [2022-11-26 15:55:12,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 15:55:12,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 15:55:12,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 11: [2022-11-26 15:55:12,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 15:55:12,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 15:55:12,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 0: [2022-11-26 15:55:12,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:55:12,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:55:12,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 15:55:12,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 15:55:12,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 15:55:12,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 15:55:12,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 0: [2022-11-26 15:55:12,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 0: [2022-11-26 15:55:12,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 6: [2022-11-26 15:55:12,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 15:55:12,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 15:55:12,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 14: [2022-11-26 15:55:12,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:55:12,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 15:55:12,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 14: [2022-11-26 15:55:12,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:55:12,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 15:55:12,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 14: [2022-11-26 15:55:12,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:55:12,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 15:55:12,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 14: [2022-11-26 15:55:12,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-26 15:55:12,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 15:55:12,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 14: [2022-11-26 15:55:12,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:55:12,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 15:55:12,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 15:55:12,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 15:55:12,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 14: [2022-11-26 15:55:12,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 4: [2022-11-26 15:55:12,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 15:55:12,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 15:55:12,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 4: [2022-11-26 15:55:12,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 15:55:12,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 15:55:12,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 13: [2022-11-26 15:55:12,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:55:12,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 15:55:12,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 13: [2022-11-26 15:55:12,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:55:12,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 15:55:12,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 13: [2022-11-26 15:55:12,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:55:12,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 15:55:12,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 13: [2022-11-26 15:55:12,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 15:55:12,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 15:55:12,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 13: [2022-11-26 15:55:12,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:55:12,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 15:55:12,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 13: [2022-11-26 15:55:12,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:55:12,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 15:55:12,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 13: [2022-11-26 15:55:12,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 15:55:12,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 15:55:12,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 13: [2022-11-26 15:55:12,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 15:55:12,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 15:55:12,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 3: [2022-11-26 15:55:12,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:55:12,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:55:12,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 15:55:12,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 15:55:12,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 15:55:12,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 15:55:12,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 3: [2022-11-26 15:55:12,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 3: [2022-11-26 15:55:12,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 1: [2022-11-26 15:55:12,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 15:55:12,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:55:12,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:55:12,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:55:12,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 15:55:12,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 15:55:12,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 15:55:12,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 15:55:12,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 1: [2022-11-26 15:55:12,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 1: [2022-11-26 15:55:12,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 1: [2022-11-26 15:55:12,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 6: [2022-11-26 15:55:12,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 15:55:12,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 15:55:12,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 1: [2022-11-26 15:55:12,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:55:12,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:55:12,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 15:55:12,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 15:55:12,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 15:55:12,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 15:55:12,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 1: [2022-11-26 15:55:12,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 1: [2022-11-26 15:55:12,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 1: [2022-11-26 15:55:12,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 15:55:12,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 15:55:12,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 10: [2022-11-26 15:55:12,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 15:55:12,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 15:55:12,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 12: [2022-11-26 15:55:12,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:55:12,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:55:12,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:55:12,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:55:12,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:55:12,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
0: [2022-11-26 15:55:12,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 12: [2022-11-26 15:55:12,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:55:12,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 15:55:12,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 15:55:12,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 15:55:12,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 15:55:12,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 15:55:12,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 15:55:12,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 0: [2022-11-26 15:55:12,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 12: [2022-11-26 15:55:12,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
12: [2022-11-26 15:55:12,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 15:55:12,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 15:55:12,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 12: [2022-11-26 15:55:12,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 12: [2022-11-26 15:55:12,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 12: [2022-11-26 15:55:12,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 12: [2022-11-26 15:55:12,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 12: [2022-11-26 15:55:12,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 12: [2022-11-26 15:55:12,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 15: [2022-11-26 15:55:12,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:55:12,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:55:12,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:55:12,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 15:55:12,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:55:12,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 15:55:12,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 15:55:12,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 15:55:12,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 15:55:12,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 15:55:12,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 15: [2022-11-26 15:55:12,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 15: [2022-11-26 15:55:12,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 15: [2022-11-26 15:55:12,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 15: [2022-11-26 15:55:12,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 15: [2022-11-26 15:55:12,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
2: [2022-11-26 15:55:12,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:55:12,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:55:12,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:55:12,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:55:12,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:55:12,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 15:55:12,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 15:55:12,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 15:55:12,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 15:55:12,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 15:55:12,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
2: [2022-11-26 15:55:12,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 2: [2022-11-26 15:55:12,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 2: [2022-11-26 15:55:12,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 2: [2022-11-26 15:55:12,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 2: [2022-11-26 15:55:12,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:55:12,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 15:55:12,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 2: [2022-11-26 15:55:12,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:55:12,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 15:55:12,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 2: [2022-11-26 15:55:12,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 15:55:12,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 15:55:12,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
9: [2022-11-26 15:55:12,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:55:12,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:55:12,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:55:12,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:55:12,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:55:12,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:55:12,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 15:55:12,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 15:55:12,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 15:55:12,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 15:55:12,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 15:55:12,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 15:55:12,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 15:55:12,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 15:55:12,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 15:55:12,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 15:55:12,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 9: [2022-11-26 15:55:12,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 9: [2022-11-26 15:55:12,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 9: [2022-11-26 15:55:12,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
9: [2022-11-26 15:55:12,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 9: [2022-11-26 15:55:12,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 9: [2022-11-26 15:55:12,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 9: [2022-11-26 15:55:12,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 15: [2022-11-26 15:55:12,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 15:55:12,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 15: [2022-11-26 15:55:12,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:55:12,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 15:55:12,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 15: [2022-11-26 15:55:12,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 15:55:12,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 15:55:12,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 5: [2022-11-26 15:55:12,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 15:55:12,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 15:55:12,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 5: [2022-11-26 15:55:12,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:55:12,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 15:55:12,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 5: [2022-11-26 15:55:12,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 15:55:12,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step67000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 15:55:12,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step67000 is ready now! 
0: successfully saved checkpoint at iteration 67000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3836.74 15: iteration 67010/ 125429 | consumed samples: 17154560 | consumed tokens: 35132538880 | elapsed time per iteration (s): 1.47 | learning rate: 1.017E-04 | global batch size: 256 | lm loss: 1.989400E+00 | grad norm: 0.194 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 174.448 | TFLOPs: 28.83 | 15: iteration 67020/ 125429 | consumed samples: 17157120 | consumed tokens: 35137781760 | elapsed time per iteration (s): 1.04 | learning rate: 1.016E-04 | global batch size: 256 | lm loss: 1.977937E+00 | grad norm: 0.280 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.654 | TFLOPs: 40.60 | 15: iteration 67030/ 125429 | consumed samples: 17159680 | consumed tokens: 35143024640 | elapsed time per iteration (s): 1.04 | learning rate: 1.016E-04 | global batch size: 256 | lm loss: 1.989661E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.332 | TFLOPs: 40.71 | 15: iteration 67040/ 125429 | consumed samples: 17162240 | consumed tokens: 35148267520 | elapsed time per iteration (s): 1.09 | learning rate: 1.016E-04 | global batch size: 256 | lm loss: 1.990662E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.042 | TFLOPs: 38.68 | 15: iteration 67050/ 125429 | consumed samples: 17164800 | consumed tokens: 35153510400 | elapsed time per iteration (s): 1.03 | learning rate: 1.016E-04 | global batch size: 256 | lm loss: 1.959184E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.936 | TFLOPs: 41.14 | 15: iteration 67060/ 125429 | consumed samples: 17167360 | consumed tokens: 35158753280 | elapsed time per iteration (s): 1.04 | 
learning rate: 1.015E-04 | global batch size: 256 | lm loss: 1.977155E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.570 | TFLOPs: 40.58 | 15: iteration 67070/ 125429 | consumed samples: 17169920 | consumed tokens: 35163996160 | elapsed time per iteration (s): 1.03 | learning rate: 1.015E-04 | global batch size: 256 | lm loss: 1.976471E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.449 | TFLOPs: 40.89 | 15: iteration 67080/ 125429 | consumed samples: 17172480 | consumed tokens: 35169239040 | elapsed time per iteration (s): 1.04 | learning rate: 1.015E-04 | global batch size: 256 | lm loss: 1.970718E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.486 | TFLOPs: 40.73 | 15: iteration 67090/ 125429 | consumed samples: 17175040 | consumed tokens: 35174481920 | elapsed time per iteration (s): 1.04 | learning rate: 1.015E-04 | global batch size: 256 | lm loss: 1.981860E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.340 | TFLOPs: 40.87 | 15: iteration 67100/ 125429 | consumed samples: 17177600 | consumed tokens: 35179724800 | elapsed time per iteration (s): 1.03 | learning rate: 1.015E-04 | global batch size: 256 | lm loss: 2.007397E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.198 | TFLOPs: 41.02 | 15: iteration 67110/ 125429 | consumed samples: 17180160 | consumed tokens: 35184967680 | elapsed time per iteration (s): 1.06 | learning rate: 1.014E-04 | global batch size: 256 | lm loss: 1.943333E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.770 | TFLOPs: 39.95 | 15: iteration 67120/ 
125429 | consumed samples: 17182720 | consumed tokens: 35190210560 | elapsed time per iteration (s): 1.04 | learning rate: 1.014E-04 | global batch size: 256 | lm loss: 1.988321E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.064 | TFLOPs: 40.83 | 15: iteration 67130/ 125429 | consumed samples: 17185280 | consumed tokens: 35195453440 | elapsed time per iteration (s): 1.03 | learning rate: 1.014E-04 | global batch size: 256 | lm loss: 2.001103E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.339 | TFLOPs: 41.21 | 15: iteration 67140/ 125429 | consumed samples: 17187840 | consumed tokens: 35200696320 | elapsed time per iteration (s): 1.05 | learning rate: 1.014E-04 | global batch size: 256 | lm loss: 1.958835E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.852 | TFLOPs: 40.30 | 15: iteration 67150/ 125429 | consumed samples: 17190400 | consumed tokens: 35205939200 | elapsed time per iteration (s): 1.05 | learning rate: 1.013E-04 | global batch size: 256 | lm loss: 1.957974E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.666 | TFLOPs: 40.43 | 15: iteration 67160/ 125429 | consumed samples: 17192960 | consumed tokens: 35211182080 | elapsed time per iteration (s): 1.06 | learning rate: 1.013E-04 | global batch size: 256 | lm loss: 1.966037E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.561 | TFLOPs: 40.09 | 15: iteration 67170/ 125429 | consumed samples: 17195520 | consumed tokens: 35216424960 | elapsed time per iteration (s): 1.06 | learning rate: 1.013E-04 | global batch size: 256 | lm loss: 1.986839E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 242.020 | TFLOPs: 40.00 | 15: iteration 67180/ 125429 | consumed samples: 17198080 | consumed tokens: 35221667840 | elapsed time per iteration (s): 4.73 | learning rate: 1.013E-04 | global batch size: 256 | lm loss: 1.981690E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 54.093 | TFLOPs: 8.94 | 15: iteration 67190/ 125429 | consumed samples: 17200640 | consumed tokens: 35226910720 | elapsed time per iteration (s): 1.05 | learning rate: 1.013E-04 | global batch size: 256 | lm loss: 1.948285E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.337 | TFLOPs: 40.38 | 15: iteration 67200/ 125429 | consumed samples: 17203200 | consumed tokens: 35232153600 | elapsed time per iteration (s): 1.05 | learning rate: 1.012E-04 | global batch size: 256 | lm loss: 1.998526E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.315 | TFLOPs: 40.37 | 15: iteration 67210/ 125429 | consumed samples: 17205760 | consumed tokens: 35237396480 | elapsed time per iteration (s): 1.08 | learning rate: 1.012E-04 | global batch size: 256 | lm loss: 1.981690E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.797 | TFLOPs: 39.13 | 15: iteration 67220/ 125429 | consumed samples: 17208320 | consumed tokens: 35242639360 | elapsed time per iteration (s): 1.03 | learning rate: 1.012E-04 | global batch size: 256 | lm loss: 1.979004E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.252 | TFLOPs: 41.19 | 15: iteration 67230/ 125429 | consumed samples: 17210880 | consumed tokens: 35247882240 | elapsed time per iteration (s): 1.04 | learning rate: 1.012E-04 
| global batch size: 256 | lm loss: 2.008083E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.756 | TFLOPs: 40.61 | 15: iteration 67240/ 125429 | consumed samples: 17213440 | consumed tokens: 35253125120 | elapsed time per iteration (s): 1.05 | learning rate: 1.011E-04 | global batch size: 256 | lm loss: 1.962285E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.942 | TFLOPs: 40.31 | 15: iteration 67250/ 125429 | consumed samples: 17216000 | consumed tokens: 35258368000 | elapsed time per iteration (s): 1.08 | learning rate: 1.011E-04 | global batch size: 256 | lm loss: 1.982636E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.053 | TFLOPs: 39.17 | 15: iteration 67260/ 125429 | consumed samples: 17218560 | consumed tokens: 35263610880 | elapsed time per iteration (s): 1.05 | learning rate: 1.011E-04 | global batch size: 256 | lm loss: 1.986222E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.898 | TFLOPs: 40.14 | 15: iteration 67270/ 125429 | consumed samples: 17221120 | consumed tokens: 35268853760 | elapsed time per iteration (s): 1.04 | learning rate: 1.011E-04 | global batch size: 256 | lm loss: 1.973079E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.763 | TFLOPs: 40.78 | 15: iteration 67280/ 125429 | consumed samples: 17223680 | consumed tokens: 35274096640 | elapsed time per iteration (s): 1.06 | learning rate: 1.010E-04 | global batch size: 256 | lm loss: 1.954695E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.903 | TFLOPs: 39.81 | 15: iteration 67290/ 125429 | consumed samples: 
17226240 | consumed tokens: 35279339520 | elapsed time per iteration (s): 1.10 | learning rate: 1.010E-04 | global batch size: 256 | lm loss: 1.979036E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.005 | TFLOPs: 38.51 | 15: iteration 67300/ 125429 | consumed samples: 17228800 | consumed tokens: 35284582400 | elapsed time per iteration (s): 1.03 | learning rate: 1.010E-04 | global batch size: 256 | lm loss: 1.966195E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.534 | TFLOPs: 41.24 | 15: iteration 67310/ 125429 | consumed samples: 17231360 | consumed tokens: 35289825280 | elapsed time per iteration (s): 1.02 | learning rate: 1.010E-04 | global batch size: 256 | lm loss: 1.952415E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.567 | TFLOPs: 41.57 | 15: iteration 67320/ 125429 | consumed samples: 17233920 | consumed tokens: 35295068160 | elapsed time per iteration (s): 1.04 | learning rate: 1.010E-04 | global batch size: 256 | lm loss: 2.004033E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.919 | TFLOPs: 40.64 | 15: iteration 67330/ 125429 | consumed samples: 17236480 | consumed tokens: 35300311040 | elapsed time per iteration (s): 1.04 | learning rate: 1.009E-04 | global batch size: 256 | lm loss: 1.972196E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.213 | TFLOPs: 40.69 | 15: iteration 67340/ 125429 | consumed samples: 17239040 | consumed tokens: 35305553920 | elapsed time per iteration (s): 1.04 | learning rate: 1.009E-04 | global batch size: 256 | lm loss: 1.982121E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 246.357 | TFLOPs: 40.71 | 15: iteration 67350/ 125429 | consumed samples: 17241600 | consumed tokens: 35310796800 | elapsed time per iteration (s): 1.03 | learning rate: 1.009E-04 | global batch size: 256 | lm loss: 1.975564E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.347 | TFLOPs: 40.88 | 15: iteration 67360/ 125429 | consumed samples: 17244160 | consumed tokens: 35316039680 | elapsed time per iteration (s): 1.07 | learning rate: 1.009E-04 | global batch size: 256 | lm loss: 1.982578E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.019 | TFLOPs: 39.66 | 15: iteration 67370/ 125429 | consumed samples: 17246720 | consumed tokens: 35321282560 | elapsed time per iteration (s): 1.05 | learning rate: 1.008E-04 | global batch size: 256 | lm loss: 1.967295E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.214 | TFLOPs: 40.36 | 15: iteration 67380/ 125429 | consumed samples: 17249280 | consumed tokens: 35326525440 | elapsed time per iteration (s): 1.05 | learning rate: 1.008E-04 | global batch size: 256 | lm loss: 1.976128E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.374 | TFLOPs: 40.38 | 15: iteration 67390/ 125429 | consumed samples: 17251840 | consumed tokens: 35331768320 | elapsed time per iteration (s): 1.08 | learning rate: 1.008E-04 | global batch size: 256 | lm loss: 1.971521E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.818 | TFLOPs: 39.14 | 15: iteration 67400/ 125429 | consumed samples: 17254400 | consumed tokens: 35337011200 | elapsed time per iteration (s): 1.04 | learning rate: 1.008E-04 | global batch size: 256 | 
lm loss: 1.941940E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.253 | TFLOPs: 40.70 | 15: iteration 67410/ 125429 | consumed samples: 17256960 | consumed tokens: 35342254080 | elapsed time per iteration (s): 1.03 | learning rate: 1.008E-04 | global batch size: 256 | lm loss: 1.977563E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.044 | TFLOPs: 41.16 | 15: iteration 67420/ 125429 | consumed samples: 17259520 | consumed tokens: 35347496960 | elapsed time per iteration (s): 1.04 | learning rate: 1.007E-04 | global batch size: 256 | lm loss: 1.999712E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.649 | TFLOPs: 40.60 | 15: iteration 67430/ 125429 | consumed samples: 17262080 | consumed tokens: 35352739840 | elapsed time per iteration (s): 1.04 | learning rate: 1.007E-04 | global batch size: 256 | lm loss: 1.935299E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.056 | TFLOPs: 40.66 | 15: iteration 67440/ 125429 | consumed samples: 17264640 | consumed tokens: 35357982720 | elapsed time per iteration (s): 1.03 | learning rate: 1.007E-04 | global batch size: 256 | lm loss: 1.979547E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.497 | TFLOPs: 41.07 | 15: iteration 67450/ 125429 | consumed samples: 17267200 | consumed tokens: 35363225600 | elapsed time per iteration (s): 1.05 | learning rate: 1.007E-04 | global batch size: 256 | lm loss: 1.975553E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.662 | TFLOPs: 40.27 | 15: iteration 67460/ 125429 | consumed samples: 17269760 | consumed 
tokens: 35368468480 | elapsed time per iteration (s): 1.05 | learning rate: 1.006E-04 | global batch size: 256 | lm loss: 1.951439E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.991 | TFLOPs: 40.16 | 15: iteration 67470/ 125429 | consumed samples: 17272320 | consumed tokens: 35373711360 | elapsed time per iteration (s): 1.05 | learning rate: 1.006E-04 | global batch size: 256 | lm loss: 1.977682E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.925 | TFLOPs: 40.15 | 15: iteration 67480/ 125429 | consumed samples: 17274880 | consumed tokens: 35378954240 | elapsed time per iteration (s): 1.08 | learning rate: 1.006E-04 | global batch size: 256 | lm loss: 1.961929E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.965 | TFLOPs: 39.16 | 15: iteration 67490/ 125429 | consumed samples: 17277440 | consumed tokens: 35384197120 | elapsed time per iteration (s): 2.71 | learning rate: 1.006E-04 | global batch size: 256 | lm loss: 1.996404E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 94.318 | TFLOPs: 15.59 | 15: iteration 67500/ 125429 | consumed samples: 17280000 | consumed tokens: 35389440000 | elapsed time per iteration (s): 1.04 | learning rate: 1.005E-04 | global batch size: 256 | lm loss: 1.987125E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.687 | TFLOPs: 40.60 | 15: iteration 67510/ 125429 | consumed samples: 17282560 | consumed tokens: 35394682880 | elapsed time per iteration (s): 1.02 | learning rate: 1.005E-04 | global batch size: 256 | lm loss: 1.983779E+00 | grad norm: 0.290 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 250.129 | TFLOPs: 41.34 | 15: iteration 67520/ 125429 | consumed samples: 17285120 | consumed tokens: 35399925760 | elapsed time per iteration (s): 1.03 | learning rate: 1.005E-04 | global batch size: 256 | lm loss: 1.968478E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.028 | TFLOPs: 40.99 | 15: iteration 67530/ 125429 | consumed samples: 17287680 | consumed tokens: 35405168640 | elapsed time per iteration (s): 1.04 | learning rate: 1.005E-04 | global batch size: 256 | lm loss: 1.950369E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.795 | TFLOPs: 40.62 | 15: iteration 67540/ 125429 | consumed samples: 17290240 | consumed tokens: 35410411520 | elapsed time per iteration (s): 1.05 | learning rate: 1.005E-04 | global batch size: 256 | lm loss: 1.979893E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.865 | TFLOPs: 40.47 | 15: iteration 67550/ 125429 | consumed samples: 17292800 | consumed tokens: 35415654400 | elapsed time per iteration (s): 1.04 | learning rate: 1.004E-04 | global batch size: 256 | lm loss: 1.957423E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.907 | TFLOPs: 40.80 | 15: iteration 67560/ 125429 | consumed samples: 17295360 | consumed tokens: 35420897280 | elapsed time per iteration (s): 1.03 | learning rate: 1.004E-04 | global batch size: 256 | lm loss: 1.988877E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.771 | TFLOPs: 40.95 | 15: iteration 67570/ 125429 | consumed samples: 17297920 | consumed tokens: 35426140160 | elapsed time per iteration (s): 1.02 | learning rate: 1.004E-04 | global batch size: 256 | lm loss: 1.949092E+00 | 
grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.936 | TFLOPs: 41.30 | 15: iteration 67580/ 125429 | consumed samples: 17300480 | consumed tokens: 35431383040 | elapsed time per iteration (s): 1.03 | learning rate: 1.004E-04 | global batch size: 256 | lm loss: 2.003056E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.263 | TFLOPs: 41.19 | 15: iteration 67590/ 125429 | consumed samples: 17303040 | consumed tokens: 35436625920 | elapsed time per iteration (s): 1.07 | learning rate: 1.003E-04 | global batch size: 256 | lm loss: 1.978361E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.232 | TFLOPs: 39.70 | 15: iteration 67600/ 125429 | consumed samples: 17305600 | consumed tokens: 35441868800 | elapsed time per iteration (s): 1.04 | learning rate: 1.003E-04 | global batch size: 256 | lm loss: 1.956536E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.327 | TFLOPs: 40.54 | 15: iteration 67610/ 125429 | consumed samples: 17308160 | consumed tokens: 35447111680 | elapsed time per iteration (s): 1.03 | learning rate: 1.003E-04 | global batch size: 256 | lm loss: 1.959186E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.918 | TFLOPs: 41.14 | 15: iteration 67620/ 125429 | consumed samples: 17310720 | consumed tokens: 35452354560 | elapsed time per iteration (s): 1.06 | learning rate: 1.003E-04 | global batch size: 256 | lm loss: 1.989585E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.473 | TFLOPs: 40.07 | 15: iteration 67630/ 125429 | consumed samples: 17313280 | consumed tokens: 35457597440 | elapsed 
time per iteration (s): 1.07 | learning rate: 1.003E-04 | global batch size: 256 | lm loss: 1.964252E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.782 | TFLOPs: 39.63 | 15: iteration 67640/ 125429 | consumed samples: 17315840 | consumed tokens: 35462840320 | elapsed time per iteration (s): 1.03 | learning rate: 1.002E-04 | global batch size: 256 | lm loss: 1.947159E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.136 | TFLOPs: 41.01 | 15: iteration 67650/ 125429 | consumed samples: 17318400 | consumed tokens: 35468083200 | elapsed time per iteration (s): 2.03 | learning rate: 1.002E-04 | global batch size: 256 | lm loss: 1.989350E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 125.898 | TFLOPs: 20.81 | 15: iteration 67660/ 125429 | consumed samples: 17320960 | consumed tokens: 35473326080 | elapsed time per iteration (s): 1.10 | learning rate: 1.002E-04 | global batch size: 256 | lm loss: 1.988469E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.382 | TFLOPs: 38.40 | 15: iteration 67670/ 125429 | consumed samples: 17323520 | consumed tokens: 35478568960 | elapsed time per iteration (s): 1.06 | learning rate: 1.002E-04 | global batch size: 256 | lm loss: 1.969106E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.737 | TFLOPs: 39.95 | 15: iteration 67680/ 125429 | consumed samples: 17326080 | consumed tokens: 35483811840 | elapsed time per iteration (s): 1.05 | learning rate: 1.001E-04 | global batch size: 256 | lm loss: 1.960113E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.249 | TFLOPs: 
40.36 | 15: iteration 67690/ 125429 | consumed samples: 17328640 | consumed tokens: 35489054720 | elapsed time per iteration (s): 1.05 | learning rate: 1.001E-04 | global batch size: 256 | lm loss: 1.997038E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.046 | TFLOPs: 40.17 | 15: iteration 67700/ 125429 | consumed samples: 17331200 | consumed tokens: 35494297600 | elapsed time per iteration (s): 1.05 | learning rate: 1.001E-04 | global batch size: 256 | lm loss: 1.959120E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.742 | TFLOPs: 40.11 | 15: iteration 67710/ 125429 | consumed samples: 17333760 | consumed tokens: 35499540480 | elapsed time per iteration (s): 1.04 | learning rate: 1.001E-04 | global batch size: 256 | lm loss: 1.969698E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.986 | TFLOPs: 40.82 | 15: iteration 67720/ 125429 | consumed samples: 17336320 | consumed tokens: 35504783360 | elapsed time per iteration (s): 1.06 | learning rate: 1.001E-04 | global batch size: 256 | lm loss: 1.991043E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.466 | TFLOPs: 40.07 | 15: iteration 67730/ 125429 | consumed samples: 17338880 | consumed tokens: 35510026240 | elapsed time per iteration (s): 1.08 | learning rate: 1.000E-04 | global batch size: 256 | lm loss: 1.967138E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.078 | TFLOPs: 39.01 | 15: iteration 67740/ 125429 | consumed samples: 17341440 | consumed tokens: 35515269120 | elapsed time per iteration (s): 1.06 | learning rate: 1.000E-04 | global batch size: 256 | lm loss: 2.002830E+00 | grad norm: 0.137 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.912 | TFLOPs: 39.81 | 15: iteration 67750/ 125429 | consumed samples: 17344000 | consumed tokens: 35520512000 | elapsed time per iteration (s): 1.07 | learning rate: 9.998E-05 | global batch size: 256 | lm loss: 1.993584E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.629 | TFLOPs: 39.60 | 15: iteration 67760/ 125429 | consumed samples: 17346560 | consumed tokens: 35525754880 | elapsed time per iteration (s): 1.04 | learning rate: 9.996E-05 | global batch size: 256 | lm loss: 1.969197E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.122 | TFLOPs: 40.84 | 15: iteration 67770/ 125429 | consumed samples: 17349120 | consumed tokens: 35530997760 | elapsed time per iteration (s): 1.05 | learning rate: 9.994E-05 | global batch size: 256 | lm loss: 1.973076E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.641 | TFLOPs: 40.26 | 15: iteration 67780/ 125429 | consumed samples: 17351680 | consumed tokens: 35536240640 | elapsed time per iteration (s): 1.04 | learning rate: 9.992E-05 | global batch size: 256 | lm loss: 1.953512E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.734 | TFLOPs: 40.61 | 15: iteration 67790/ 125429 | consumed samples: 17354240 | consumed tokens: 35541483520 | elapsed time per iteration (s): 1.02 | learning rate: 9.989E-05 | global batch size: 256 | lm loss: 1.973267E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.879 | TFLOPs: 41.46 | 15: iteration 67800/ 125429 | consumed samples: 17356800 | consumed tokens: 35546726400 | elapsed time per iteration (s): 1.06 | 
learning rate: 9.987E-05 | global batch size: 256 | lm loss: 1.958661E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.419 | TFLOPs: 39.90 | 15: iteration 67810/ 125429 | consumed samples: 17359360 | consumed tokens: 35551969280 | elapsed time per iteration (s): 1.17 | learning rate: 9.985E-05 | global batch size: 256 | lm loss: 1.995322E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 219.606 | TFLOPs: 36.29 | 15: iteration 67820/ 125429 | consumed samples: 17361920 | consumed tokens: 35557212160 | elapsed time per iteration (s): 1.05 | learning rate: 9.982E-05 | global batch size: 256 | lm loss: 1.969742E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.831 | TFLOPs: 40.46 | 15: iteration 67830/ 125429 | consumed samples: 17364480 | consumed tokens: 35562455040 | elapsed time per iteration (s): 1.02 | learning rate: 9.980E-05 | global batch size: 256 | lm loss: 1.959967E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.684 | TFLOPs: 41.43 | 15: iteration 67840/ 125429 | consumed samples: 17367040 | consumed tokens: 35567697920 | elapsed time per iteration (s): 1.04 | learning rate: 9.978E-05 | global batch size: 256 | lm loss: 2.010472E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.007 | TFLOPs: 40.65 | 15: iteration 67850/ 125429 | consumed samples: 17369600 | consumed tokens: 35572940800 | elapsed time per iteration (s): 1.03 | learning rate: 9.976E-05 | global batch size: 256 | lm loss: 1.964854E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.469 | TFLOPs: 40.90 | 15: iteration 67860/ 
125429 | consumed samples: 17372160 | consumed tokens: 35578183680 | elapsed time per iteration (s): 1.02 | learning rate: 9.973E-05 | global batch size: 256 | lm loss: 1.966020E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.563 | TFLOPs: 41.57 | 15: iteration 67870/ 125429 | consumed samples: 17374720 | consumed tokens: 35583426560 | elapsed time per iteration (s): 1.05 | learning rate: 9.971E-05 | global batch size: 256 | lm loss: 1.980957E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.717 | TFLOPs: 40.28 | 15: iteration 67880/ 125429 | consumed samples: 17377280 | consumed tokens: 35588669440 | elapsed time per iteration (s): 1.05 | learning rate: 9.969E-05 | global batch size: 256 | lm loss: 1.991034E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.413 | TFLOPs: 40.39 | 15: iteration 67890/ 125429 | consumed samples: 17379840 | consumed tokens: 35593912320 | elapsed time per iteration (s): 1.06 | learning rate: 9.967E-05 | global batch size: 256 | lm loss: 1.986456E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.463 | TFLOPs: 40.07 | 15: iteration 67900/ 125429 | consumed samples: 17382400 | consumed tokens: 35599155200 | elapsed time per iteration (s): 1.03 | learning rate: 9.964E-05 | global batch size: 256 | lm loss: 1.968031E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.474 | TFLOPs: 41.23 | 15: iteration 67910/ 125429 | consumed samples: 17384960 | consumed tokens: 35604398080 | elapsed time per iteration (s): 1.02 | learning rate: 9.962E-05 | global batch size: 256 | lm loss: 1.956531E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 250.440 | TFLOPs: 41.39 | 15: iteration 67920/ 125429 | consumed samples: 17387520 | consumed tokens: 35609640960 | elapsed time per iteration (s): 1.05 | learning rate: 9.960E-05 | global batch size: 256 | lm loss: 1.969536E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.726 | TFLOPs: 40.28 | 15: iteration 67930/ 125429 | consumed samples: 17390080 | consumed tokens: 35614883840 | elapsed time per iteration (s): 1.03 | learning rate: 9.958E-05 | global batch size: 256 | lm loss: 1.983526E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.592 | TFLOPs: 40.92 | 15: iteration 67940/ 125429 | consumed samples: 17392640 | consumed tokens: 35620126720 | elapsed time per iteration (s): 1.05 | learning rate: 9.955E-05 | global batch size: 256 | lm loss: 1.967813E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.912 | TFLOPs: 40.31 | 15: iteration 67950/ 125429 | consumed samples: 17395200 | consumed tokens: 35625369600 | elapsed time per iteration (s): 1.02 | learning rate: 9.953E-05 | global batch size: 256 | lm loss: 1.963722E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.814 | TFLOPs: 41.45 | 15: iteration 67960/ 125429 | consumed samples: 17397760 | consumed tokens: 35630612480 | elapsed time per iteration (s): 1.02 | learning rate: 9.951E-05 | global batch size: 256 | lm loss: 1.943440E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.037 | TFLOPs: 41.49 | 15: iteration 67970/ 125429 | consumed samples: 17400320 | consumed tokens: 35635855360 | elapsed time per iteration (s): 1.04 | learning rate: 
9.949E-05 | global batch size: 256 | lm loss: 1.944985E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.999 | TFLOPs: 40.82 | 15: iteration 67980/ 125429 | consumed samples: 17402880 | consumed tokens: 35641098240 | elapsed time per iteration (s): 1.02 | learning rate: 9.946E-05 | global batch size: 256 | lm loss: 1.985618E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.526 | TFLOPs: 41.40 | 15: iteration 67990/ 125429 | consumed samples: 17405440 | consumed tokens: 35646341120 | elapsed time per iteration (s): 1.03 | learning rate: 9.944E-05 | global batch size: 256 | lm loss: 1.962758E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.354 | TFLOPs: 41.21 | 0: [2022-11-26 16:13:42,704] [INFO] [logging.py:68:log_dist] [Rank 0] step=68000, skipped=0, lr=[9.941762668909177e-05, 9.941762668909177e-05, 9.941762668909177e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 68000/ 125429 | consumed samples: 17408000 | consumed tokens: 35651584000 | elapsed time per iteration (s): 1.03 | learning rate: 9.942E-05 | global batch size: 256 | lm loss: 1.979575E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.073 | TFLOPs: 41.00 | 0: steps: 68000 loss: 1.9891 iter time (s): 1.080 samples/sec: 237.143 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 68000 | lm loss value: 1.828014E+00 | lm loss PPL: 6.221517E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 68000 to checkpoints_1b5 0: [2022-11-26 16:13:43,070] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step68000 is 
begin to save! 0: [2022-11-26 16:13:43,080] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_01-model_00-model_states.pt... 0: [2022-11-26 16:13:43,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_01-model_00-model_states.pt. 0: [2022-11-26 16:13:43,346] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_03-model_00-model_states.pt... 0: [2022-11-26 16:13:43,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_03-model_00-model_states.pt. 0: [2022-11-26 16:13:43,456] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_04-model_00-model_states.pt... 0: [2022-11-26 16:13:43,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_04-model_00-model_states.pt. 0: [2022-11-26 16:13:43,564] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_05-model_00-model_states.pt... 0: [2022-11-26 16:13:43,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_05-model_00-model_states.pt. 0: [2022-11-26 16:13:43,674] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_06-model_00-model_states.pt... 0: [2022-11-26 16:13:43,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_06-model_00-model_states.pt. 0: [2022-11-26 16:13:43,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_07-model_00-model_states.pt... 0: [2022-11-26 16:13:43,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 16:13:43,889] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_08-model_00-model_states.pt... 0: [2022-11-26 16:13:43,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_08-model_00-model_states.pt. 0: [2022-11-26 16:13:43,991] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_09-model_00-model_states.pt... 0: [2022-11-26 16:13:44,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_09-model_00-model_states.pt. 0: [2022-11-26 16:13:44,100] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_10-model_00-model_states.pt... 0: [2022-11-26 16:13:44,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_10-model_00-model_states.pt. 0: [2022-11-26 16:13:44,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_11-model_00-model_states.pt... 0: [2022-11-26 16:13:44,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_11-model_00-model_states.pt. 0: [2022-11-26 16:13:44,312] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_12-model_00-model_states.pt... 0: [2022-11-26 16:13:44,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_12-model_00-model_states.pt. 0: [2022-11-26 16:13:44,419] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_13-model_00-model_states.pt... 0: [2022-11-26 16:13:44,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 16:13:44,521] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_14-model_00-model_states.pt... 0: [2022-11-26 16:13:44,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_14-model_00-model_states.pt. 0: [2022-11-26 16:13:44,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_15-model_00-model_states.pt... 0: [2022-11-26 16:13:44,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_15-model_00-model_states.pt. 0: [2022-11-26 16:13:44,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_16-model_00-model_states.pt... 0: [2022-11-26 16:13:44,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_16-model_00-model_states.pt. 0: [2022-11-26 16:13:44,837] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_17-model_00-model_states.pt... 0: [2022-11-26 16:13:44,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_17-model_00-model_states.pt. 0: [2022-11-26 16:13:44,938] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_18-model_00-model_states.pt... 0: [2022-11-26 16:13:45,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_18-model_00-model_states.pt. 0: [2022-11-26 16:13:45,042] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_19-model_00-model_states.pt... 0: [2022-11-26 16:13:45,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 16:13:45,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_20-model_00-model_states.pt... 0: [2022-11-26 16:13:45,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_20-model_00-model_states.pt. 0: [2022-11-26 16:13:45,249] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_21-model_00-model_states.pt... 0: [2022-11-26 16:13:45,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_21-model_00-model_states.pt. 0: [2022-11-26 16:13:45,353] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_22-model_00-model_states.pt... 0: [2022-11-26 16:13:45,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_22-model_00-model_states.pt. 0: [2022-11-26 16:13:45,457] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_23-model_00-model_states.pt... 0: [2022-11-26 16:13:45,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_23-model_00-model_states.pt. 0: [2022-11-26 16:13:45,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_24-model_00-model_states.pt... 0: [2022-11-26 16:13:45,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_24-model_00-model_states.pt. 0: [2022-11-26 16:13:45,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_25-model_00-model_states.pt... 0: [2022-11-26 16:13:45,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 16:13:45,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_26-model_00-model_states.pt... 0: [2022-11-26 16:13:45,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_26-model_00-model_states.pt. 0: [2022-11-26 16:13:45,872] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_27-model_00-model_states.pt... 0: [2022-11-26 16:13:45,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_27-model_00-model_states.pt. 0: [2022-11-26 16:13:45,976] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_28-model_00-model_states.pt... 0: [2022-11-26 16:13:46,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_28-model_00-model_states.pt. 0: [2022-11-26 16:13:46,079] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_29-model_00-model_states.pt... 0: [2022-11-26 16:13:46,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_29-model_00-model_states.pt. 0: [2022-11-26 16:13:46,184] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_30-model_00-model_states.pt... 0: [2022-11-26 16:13:46,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_30-model_00-model_states.pt. 0: [2022-11-26 16:13:46,285] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/layer_32-model_00-model_states.pt... 0: [2022-11-26 16:13:46,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 16:13:46,293] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step68000/mp_rank_00_model_states.pt 0: [2022-11-26 16:13:46,293] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/mp_rank_00_model_states.pt... 0: [2022-11-26 16:13:46,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/mp_rank_00_model_states.pt. 0: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
5: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
14: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
12: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
8: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
9: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
7: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
2: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
8: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:13:46,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step68000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:13:46,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 16:13:46,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 16:13:46,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 0: [2022-11-26 16:13:46,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:13:46,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:13:46,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 16:13:46,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 11: [2022-11-26 16:13:46,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:13:46,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:13:46,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 16:13:46,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 7: [2022-11-26 16:13:46,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 16:13:46,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 16:13:46,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 2: [2022-11-26 16:13:46,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:13:46,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 16:13:46,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 2: [2022-11-26 16:13:46,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:13:46,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 16:13:46,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 5: [2022-11-26 16:13:46,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:13:46,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 16:13:46,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 10: [2022-11-26 16:13:46,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 16:13:46,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 16:13:46,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 7: [2022-11-26 16:13:46,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:13:46,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 5: [2022-11-26 16:13:46,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:13:46,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:13:46,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 5: [2022-11-26 16:13:46,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 16:13:46,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 16:13:46,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 5: [2022-11-26 16:13:46,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 2: [2022-11-26 16:13:46,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 16:13:46,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 16:13:46,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 10: [2022-11-26 16:13:46,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:13:46,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 16:13:46,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 7: [2022-11-26 16:13:46,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:13:46,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 16:13:46,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 0: [2022-11-26 16:13:46,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:13:46,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 16:13:46,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 7: [2022-11-26 16:13:46,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 16:13:46,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 16:13:46,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 5: [2022-11-26 16:13:46,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:13:46,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 16:13:46,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 11: [2022-11-26 16:13:46,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 16:13:46,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 11: [2022-11-26 16:13:46,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:13:46,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 16:13:46,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 11: [2022-11-26 16:13:46,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 16:13:46,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 16:13:46,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 9: [2022-11-26 16:13:46,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:13:46,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 16:13:46,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 12: [2022-11-26 16:13:46,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:13:46,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 16:13:46,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 0: [2022-11-26 16:13:46,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:13:46,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 16:13:46,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 1: [2022-11-26 16:13:46,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 16:13:46,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 16:13:46,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 12: [2022-11-26 16:13:46,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:13:46,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 16:13:46,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 0: [2022-11-26 16:13:46,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:13:46,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 16:13:46,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 8: [2022-11-26 16:13:46,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:13:46,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 16:13:46,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 16:13:46,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 16:13:46,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:13:46,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 8: [2022-11-26 16:13:46,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 8: [2022-11-26 16:13:46,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 16:13:46,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 0: [2022-11-26 16:13:46,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:13:46,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 16:13:46,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 0: [2022-11-26 16:13:46,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 16:13:46,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 16:13:46,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 14: [2022-11-26 16:13:46,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:13:46,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 16:13:46,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 14: [2022-11-26 16:13:46,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:13:46,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:13:46,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:13:46,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 16:13:46,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 16:13:46,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 14: [2022-11-26 16:13:46,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
14: [2022-11-26 16:13:46,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 16:13:46,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 2: [2022-11-26 16:13:46,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:13:46,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 16:13:46,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 2: [2022-11-26 16:13:46,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:13:46,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 16:13:46,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 5: [2022-11-26 16:13:46,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:13:46,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:13:46,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 16:13:46,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
2: [2022-11-26 16:13:46,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 16:13:46,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 14: [2022-11-26 16:13:46,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:13:46,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:13:46,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 16:13:46,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 16:13:46,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 14: [2022-11-26 16:13:46,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 10: [2022-11-26 16:13:46,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:13:46,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 16:13:46,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 2: [2022-11-26 16:13:46,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 16:13:46,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 16:13:46,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 2: [2022-11-26 16:13:46,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:13:46,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 16:13:46,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 5: [2022-11-26 16:13:46,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:13:46,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 16:13:46,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 9: [2022-11-26 16:13:46,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:13:46,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 16:13:46,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 5: [2022-11-26 16:13:46,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 16:13:46,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 16:13:46,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 10: [2022-11-26 16:13:46,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:13:46,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:13:46,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 9: [2022-11-26 16:13:46,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 10: [2022-11-26 16:13:46,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 9: [2022-11-26 16:13:46,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 12: [2022-11-26 16:13:46,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:13:46,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:13:46,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 16:13:46,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 9: [2022-11-26 16:13:46,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 16:13:46,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 12: [2022-11-26 16:13:46,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 16:13:46,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 12: [2022-11-26 16:13:46,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 7: [2022-11-26 16:13:46,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:13:46,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 16:13:46,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 3: [2022-11-26 16:13:46,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:13:46,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 16:13:46,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
9: [2022-11-26 16:13:46,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:13:46,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 16:13:46,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 15: [2022-11-26 16:13:46,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:13:46,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:13:46,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:13:46,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:13:46,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 16:13:46,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 16:13:46,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 16:13:46,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 3: [2022-11-26 16:13:46,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
10: [2022-11-26 16:13:46,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:13:46,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 10: [2022-11-26 16:13:46,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 16:13:46,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 3: [2022-11-26 16:13:46,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:13:46,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 16:13:46,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 9: [2022-11-26 16:13:46,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:13:46,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:13:46,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 16:13:46,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 16:13:46,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
9: [2022-11-26 16:13:46,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 7: [2022-11-26 16:13:46,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:13:46,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 16:13:46,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 10: [2022-11-26 16:13:46,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:13:46,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:13:46,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 16:13:46,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 16:13:46,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 10: [2022-11-26 16:13:46,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 12: [2022-11-26 16:13:46,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:13:46,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 16:13:46,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 16:13:46,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 16:13:46,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 12: [2022-11-26 16:13:46,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 9: [2022-11-26 16:13:46,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:13:46,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 16:13:46,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 12: [2022-11-26 16:13:46,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:13:46,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 16:13:46,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 1: [2022-11-26 16:13:46,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:13:46,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 16:13:46,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:13:46,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 16:13:46,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 5: [2022-11-26 16:13:46,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:13:46,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 5: [2022-11-26 16:13:46,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 7: [2022-11-26 16:13:46,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 5: [2022-11-26 16:13:46,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 11: [2022-11-26 16:13:46,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:13:46,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 16:13:46,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 11: [2022-11-26 16:13:46,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-26 16:13:46,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 16:13:46,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 11: [2022-11-26 16:13:46,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:13:46,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 16:13:46,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 10: [2022-11-26 16:13:46,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:13:46,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 16:13:46,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 15: [2022-11-26 16:13:46,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 16:13:46,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 15: [2022-11-26 16:13:46,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 16:13:46,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 16:13:46,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 15: [2022-11-26 16:13:46,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:13:46,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 16:13:46,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 15: [2022-11-26 16:13:46,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:13:46,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 16:13:46,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 15: [2022-11-26 16:13:46,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:13:46,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 16:13:46,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 15: [2022-11-26 16:13:46,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 16:13:46,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:13:46,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:13:46,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 16:13:46,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 16:13:46,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 16:13:46,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 15: [2022-11-26 16:13:46,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 15: [2022-11-26 16:13:46,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 3: [2022-11-26 16:13:46,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:13:46,544] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 16:13:46,544] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 3: [2022-11-26 16:13:46,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 16:13:46,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 16:13:46,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 1: [2022-11-26 16:13:46,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 16:13:46,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 1: [2022-11-26 16:13:46,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:13:46,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:13:46,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 16:13:46,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 14: [2022-11-26 16:13:46,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:13:46,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 16:13:46,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 6: [2022-11-26 16:13:46,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 16:13:46,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:13:46,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:13:46,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 16:13:46,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 16:13:46,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 16:13:46,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 6: [2022-11-26 16:13:46,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 6: [2022-11-26 16:13:46,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 6: [2022-11-26 16:13:46,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:13:46,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 16:13:46,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 0: [2022-11-26 16:13:46,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 16:13:46,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 16:13:46,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 6: [2022-11-26 16:13:46,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:13:46,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 16:13:46,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 6: [2022-11-26 16:13:46,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:13:46,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 16:13:46,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 6: [2022-11-26 16:13:46,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:13:46,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 16:13:46,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 6: [2022-11-26 16:13:46,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 16:13:46,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 16:13:46,583] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 8: [2022-11-26 16:13:46,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:13:46,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:13:46,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 16:13:46,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:13:46,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 8: [2022-11-26 16:13:46,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 16:13:46,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 8: [2022-11-26 16:13:46,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:13:46,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 11: [2022-11-26 16:13:46,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
8: [2022-11-26 16:13:46,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 11: [2022-11-26 16:13:46,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:13:46,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:13:46,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:13:46,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 8: [2022-11-26 16:13:46,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 16:13:46,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 11: [2022-11-26 16:13:46,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 8: [2022-11-26 16:13:46,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 8: [2022-11-26 16:13:46,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 11: [2022-11-26 16:13:46,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 11: [2022-11-26 16:13:46,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
13: [2022-11-26 16:13:46,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 16:13:46,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 13: [2022-11-26 16:13:46,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:13:46,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 16:13:46,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 13: [2022-11-26 16:13:46,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:13:46,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 16:13:46,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 13: [2022-11-26 16:13:46,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:13:46,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 16:13:46,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 13: [2022-11-26 16:13:46,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 16:13:46,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 16:13:46,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 13: [2022-11-26 16:13:46,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:13:46,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:13:46,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 16:13:46,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 16:13:46,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 13: [2022-11-26 16:13:46,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 13: [2022-11-26 16:13:46,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:13:46,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 16:13:46,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
1: [2022-11-26 16:13:46,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 16:13:46,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 1: [2022-11-26 16:13:46,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:13:46,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 16:13:46,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 1: [2022-11-26 16:13:46,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:13:46,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 16:13:46,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:13:46,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 1: [2022-11-26 16:13:46,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 16:13:46,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 1: [2022-11-26 16:13:46,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 16:13:46,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 16:13:46,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 1: [2022-11-26 16:13:46,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:13:46,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 16:13:46,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 0: [2022-11-26 16:13:46,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 16:13:46,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 4: [2022-11-26 16:13:46,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:13:46,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:13:46,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:13:46,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:13:46,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 16:13:46,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:13:46,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 16:13:46,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 16:13:46,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 16:13:46,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 16:13:46,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 16:13:46,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 16:13:46,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 4: [2022-11-26 16:13:46,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 4: [2022-11-26 16:13:46,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 4: [2022-11-26 16:13:46,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 4: [2022-11-26 16:13:46,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
4: [2022-11-26 16:13:46,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 4: [2022-11-26 16:13:46,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:13:46,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 16:13:46,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 4: [2022-11-26 16:13:46,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:13:46,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step68000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 16:13:46,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step68000 is ready now! 
0: successfully saved checkpoint at iteration 68000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3667.33
15: iteration 68010/ 125429 | consumed samples: 17410560 | consumed tokens: 35656826880 | elapsed time per iteration (s): 1.42 | learning rate: 9.940E-05 | global batch size: 256 | lm loss: 1.981812E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 179.898 | TFLOPs: 29.73 |
15: iteration 68020/ 125429 | consumed samples: 17413120 | consumed tokens: 35662069760 | elapsed time per iteration (s): 1.02 | learning rate: 9.937E-05 | global batch size: 256 | lm loss: 1.959875E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.813 | TFLOPs: 41.45 |
15: iteration 68030/ 125429 | consumed samples: 17415680 | consumed tokens: 35667312640 | elapsed time per iteration (s): 1.02 | learning rate: 9.935E-05 | global batch size: 256 | lm loss: 1.961786E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.857 | TFLOPs: 41.29 |
15: iteration 68040/ 125429 | consumed samples: 17418240 | consumed tokens: 35672555520 | elapsed time per iteration (s): 1.04 | learning rate: 9.933E-05 | global batch size: 256 | lm loss: 1.972096E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.027 | TFLOPs: 40.49 |
15: iteration 68050/ 125429 | consumed samples: 17420800 | consumed tokens: 35677798400 | elapsed time per iteration (s): 1.03 | learning rate: 9.930E-05 | global batch size: 256 | lm loss: 1.964317E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.960 | TFLOPs: 40.98 |
15: iteration 68060/ 125429 | consumed samples: 17423360 | consumed tokens: 35683041280 | elapsed time per iteration (s): 1.02 | learning rate: 9.928E-05 | global batch size: 256 | lm loss: 1.984533E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.632 | TFLOPs: 41.42 |
15: iteration 68070/ 125429 | consumed samples: 17425920 | consumed tokens: 35688284160 | elapsed time per iteration (s): 1.09 | learning rate: 9.926E-05 | global batch size: 256 | lm loss: 1.961737E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.600 | TFLOPs: 38.93 |
15: iteration 68080/ 125429 | consumed samples: 17428480 | consumed tokens: 35693527040 | elapsed time per iteration (s): 1.05 | learning rate: 9.924E-05 | global batch size: 256 | lm loss: 1.970419E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.150 | TFLOPs: 40.35 |
15: iteration 68090/ 125429 | consumed samples: 17431040 | consumed tokens: 35698769920 | elapsed time per iteration (s): 1.02 | learning rate: 9.921E-05 | global batch size: 256 | lm loss: 1.996026E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.260 | TFLOPs: 41.52 |
15: iteration 68100/ 125429 | consumed samples: 17433600 | consumed tokens: 35704012800 | elapsed time per iteration (s): 1.03 | learning rate: 9.919E-05 | global batch size: 256 | lm loss: 1.999413E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.658 | TFLOPs: 41.09 |
15: iteration 68110/ 125429 | consumed samples: 17436160 | consumed tokens: 35709255680 | elapsed time per iteration (s): 1.05 | learning rate: 9.917E-05 | global batch size: 256 | lm loss: 1.985073E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.421 | TFLOPs: 40.39 |
15: iteration 68120/ 125429 | consumed samples: 17438720 | consumed tokens: 35714498560 | elapsed time per iteration (s): 1.05 | learning rate: 9.915E-05 | global batch size: 256 | lm loss: 1.954883E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.114 | TFLOPs: 40.34 |
15: iteration 68130/ 125429 | consumed samples: 17441280 | consumed tokens: 35719741440 | elapsed time per iteration (s): 1.02 | learning rate: 9.912E-05 | global batch size: 256 | lm loss: 1.986103E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.194 | TFLOPs: 41.35 |
15: iteration 68140/ 125429 | consumed samples: 17443840 | consumed tokens: 35724984320 | elapsed time per iteration (s): 1.04 | learning rate: 9.910E-05 | global batch size: 256 | lm loss: 1.973853E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.323 | TFLOPs: 40.87 |
15: iteration 68150/ 125429 | consumed samples: 17446400 | consumed tokens: 35730227200 | elapsed time per iteration (s): 1.04 | learning rate: 9.908E-05 | global batch size: 256 | lm loss: 1.973947E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.656 | TFLOPs: 40.76 |
15: iteration 68160/ 125429 | consumed samples: 17448960 | consumed tokens: 35735470080 | elapsed time per iteration (s): 1.07 | learning rate: 9.906E-05 | global batch size: 256 | lm loss: 1.979600E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.644 | TFLOPs: 39.60 |
15: iteration 68170/ 125429 | consumed samples: 17451520 | consumed tokens: 35740712960 | elapsed time per iteration (s): 1.02 | learning rate: 9.903E-05 | global batch size: 256 | lm loss: 1.994493E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.062 | TFLOPs: 41.49 |
15: iteration 68180/ 125429 | consumed samples: 17454080 | consumed tokens: 35745955840 | elapsed time per iteration (s): 1.02 | learning rate: 9.901E-05 | global batch size: 256 | lm loss: 1.978160E+00 | grad norm: 0.209 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.696 | TFLOPs: 41.43 |
15: iteration 68190/ 125429 | consumed samples: 17456640 | consumed tokens: 35751198720 | elapsed time per iteration (s): 1.03 | learning rate: 9.899E-05 | global batch size: 256 | lm loss: 1.992892E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.524 | TFLOPs: 41.07 |
15: iteration 68200/ 125429 | consumed samples: 17459200 | consumed tokens: 35756441600 | elapsed time per iteration (s): 1.03 | learning rate: 9.897E-05 | global batch size: 256 | lm loss: 1.989585E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.541 | TFLOPs: 41.24 |
15: iteration 68210/ 125429 | consumed samples: 17461760 | consumed tokens: 35761684480 | elapsed time per iteration (s): 1.02 | learning rate: 9.894E-05 | global batch size: 256 | lm loss: 1.960606E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.526 | TFLOPs: 41.40 |
15: iteration 68220/ 125429 | consumed samples: 17464320 | consumed tokens: 35766927360 | elapsed time per iteration (s): 1.07 | learning rate: 9.892E-05 | global batch size: 256 | lm loss: 1.975666E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.349 | TFLOPs: 39.55 |
15: iteration 68230/ 125429 | consumed samples: 17466880 | consumed tokens: 35772170240 | elapsed time per iteration (s): 1.04 | learning rate: 9.890E-05 | global batch size: 256 | lm loss: 2.003804E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.989 | TFLOPs: 40.65 |
15: iteration 68240/ 125429 | consumed samples: 17469440 | consumed tokens: 35777413120 | elapsed time per iteration (s): 1.05 | learning rate: 9.888E-05 | global batch size: 256 | lm loss: 1.991702E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.032 | TFLOPs: 40.33 |
15: iteration 68250/ 125429 | consumed samples: 17472000 | consumed tokens: 35782656000 | elapsed time per iteration (s): 1.03 | learning rate: 9.885E-05 | global batch size: 256 | lm loss: 1.985600E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.442 | TFLOPs: 41.22 |
15: iteration 68260/ 125429 | consumed samples: 17474560 | consumed tokens: 35787898880 | elapsed time per iteration (s): 1.05 | learning rate: 9.883E-05 | global batch size: 256 | lm loss: 1.972840E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.893 | TFLOPs: 40.47 |
15: iteration 68270/ 125429 | consumed samples: 17477120 | consumed tokens: 35793141760 | elapsed time per iteration (s): 1.02 | learning rate: 9.881E-05 | global batch size: 256 | lm loss: 1.964532E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.111 | TFLOPs: 41.33 |
15: iteration 68280/ 125429 | consumed samples: 17479680 | consumed tokens: 35798384640 | elapsed time per iteration (s): 1.04 | learning rate: 9.878E-05 | global batch size: 256 | lm loss: 1.961982E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.260 | TFLOPs: 40.53 |
15: iteration 68290/ 125429 | consumed samples: 17482240 | consumed tokens: 35803627520 | elapsed time per iteration (s): 1.03 | learning rate: 9.876E-05 | global batch size: 256 | lm loss: 1.992097E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.059 | TFLOPs: 41.16 |
15: iteration 68300/ 125429 | consumed samples: 17484800 | consumed tokens: 35808870400 | elapsed time per iteration (s): 1.04 | learning rate: 9.874E-05 | global batch size: 256 | lm loss: 1.964024E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.514 | TFLOPs: 40.57 |
15: iteration 68310/ 125429 | consumed samples: 17487360 | consumed tokens: 35814113280 | elapsed time per iteration (s): 1.03 | learning rate: 9.872E-05 | global batch size: 256 | lm loss: 1.958496E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.387 | TFLOPs: 41.05 |
15: iteration 68320/ 125429 | consumed samples: 17489920 | consumed tokens: 35819356160 | elapsed time per iteration (s): 1.04 | learning rate: 9.869E-05 | global batch size: 256 | lm loss: 1.996288E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.299 | TFLOPs: 40.87 |
15: iteration 68330/ 125429 | consumed samples: 17492480 | consumed tokens: 35824599040 | elapsed time per iteration (s): 1.15 | learning rate: 9.867E-05 | global batch size: 256 | lm loss: 1.963683E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 223.514 | TFLOPs: 36.94 |
15: iteration 68340/ 125429 | consumed samples: 17495040 | consumed tokens: 35829841920 | elapsed time per iteration (s): 1.03 | learning rate: 9.865E-05 | global batch size: 256 | lm loss: 1.966666E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 |
number of nan iterations: 0 | samples per second: 249.407 | TFLOPs: 41.22 | 15: iteration 68350/ 125429 | consumed samples: 17497600 | consumed tokens: 35835084800 | elapsed time per iteration (s): 1.05 | learning rate: 9.863E-05 | global batch size: 256 | lm loss: 1.980352E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.782 | TFLOPs: 40.45 | 15: iteration 68360/ 125429 | consumed samples: 17500160 | consumed tokens: 35840327680 | elapsed time per iteration (s): 1.03 | learning rate: 9.860E-05 | global batch size: 256 | lm loss: 1.957731E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.263 | TFLOPs: 41.03 | 15: iteration 68370/ 125429 | consumed samples: 17502720 | consumed tokens: 35845570560 | elapsed time per iteration (s): 1.03 | learning rate: 9.858E-05 | global batch size: 256 | lm loss: 1.955181E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.462 | TFLOPs: 40.90 | 15: iteration 68380/ 125429 | consumed samples: 17505280 | consumed tokens: 35850813440 | elapsed time per iteration (s): 1.03 | learning rate: 9.856E-05 | global batch size: 256 | lm loss: 2.002038E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.016 | TFLOPs: 40.99 | 15: iteration 68390/ 125429 | consumed samples: 17507840 | consumed tokens: 35856056320 | elapsed time per iteration (s): 1.05 | learning rate: 9.854E-05 | global batch size: 256 | lm loss: 1.965804E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.635 | TFLOPs: 40.43 | 15: iteration 68400/ 125429 | consumed samples: 17510400 | consumed tokens: 35861299200 | elapsed time per iteration (s): 1.04 | learning rate: 9.851E-05 | global batch 
size: 256 | lm loss: 2.038939E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.472 | TFLOPs: 40.73 | 15: iteration 68410/ 125429 | consumed samples: 17512960 | consumed tokens: 35866542080 | elapsed time per iteration (s): 1.03 | learning rate: 9.849E-05 | global batch size: 256 | lm loss: 1.970308E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.743 | TFLOPs: 41.11 | 15: iteration 68420/ 125429 | consumed samples: 17515520 | consumed tokens: 35871784960 | elapsed time per iteration (s): 1.02 | learning rate: 9.847E-05 | global batch size: 256 | lm loss: 1.959937E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.348 | TFLOPs: 41.37 | 15: iteration 68430/ 125429 | consumed samples: 17518080 | consumed tokens: 35877027840 | elapsed time per iteration (s): 1.04 | learning rate: 9.845E-05 | global batch size: 256 | lm loss: 2.006958E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.284 | TFLOPs: 40.87 | 15: iteration 68440/ 125429 | consumed samples: 17520640 | consumed tokens: 35882270720 | elapsed time per iteration (s): 1.05 | learning rate: 9.842E-05 | global batch size: 256 | lm loss: 1.981287E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.662 | TFLOPs: 40.27 | 15: iteration 68450/ 125429 | consumed samples: 17523200 | consumed tokens: 35887513600 | elapsed time per iteration (s): 1.03 | learning rate: 9.840E-05 | global batch size: 256 | lm loss: 1.958716E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.661 | TFLOPs: 41.09 | 15: iteration 68460/ 125429 | consumed samples: 17525760 | 
consumed tokens: 35892756480 | elapsed time per iteration (s): 1.03 | learning rate: 9.838E-05 | global batch size: 256 | lm loss: 1.923654E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.101 | TFLOPs: 41.00 | 15: iteration 68470/ 125429 | consumed samples: 17528320 | consumed tokens: 35897999360 | elapsed time per iteration (s): 1.03 | learning rate: 9.836E-05 | global batch size: 256 | lm loss: 1.965078E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.760 | TFLOPs: 40.94 | 15: iteration 68480/ 125429 | consumed samples: 17530880 | consumed tokens: 35903242240 | elapsed time per iteration (s): 1.04 | learning rate: 9.833E-05 | global batch size: 256 | lm loss: 1.985159E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.220 | TFLOPs: 40.85 | 15: iteration 68490/ 125429 | consumed samples: 17533440 | consumed tokens: 35908485120 | elapsed time per iteration (s): 1.04 | learning rate: 9.831E-05 | global batch size: 256 | lm loss: 1.982905E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.738 | TFLOPs: 40.78 | 15: iteration 68500/ 125429 | consumed samples: 17536000 | consumed tokens: 35913728000 | elapsed time per iteration (s): 1.04 | learning rate: 9.829E-05 | global batch size: 256 | lm loss: 1.974638E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.358 | TFLOPs: 40.71 | 15: iteration 68510/ 125429 | consumed samples: 17538560 | consumed tokens: 35918970880 | elapsed time per iteration (s): 1.02 | learning rate: 9.827E-05 | global batch size: 256 | lm loss: 1.961065E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 250.962 | TFLOPs: 41.47 | 15: iteration 68520/ 125429 | consumed samples: 17541120 | consumed tokens: 35924213760 | elapsed time per iteration (s): 1.06 | learning rate: 9.824E-05 | global batch size: 256 | lm loss: 1.985465E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.944 | TFLOPs: 39.98 | 15: iteration 68530/ 125429 | consumed samples: 17543680 | consumed tokens: 35929456640 | elapsed time per iteration (s): 1.02 | learning rate: 9.822E-05 | global batch size: 256 | lm loss: 1.992385E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.356 | TFLOPs: 41.37 | 15: iteration 68540/ 125429 | consumed samples: 17546240 | consumed tokens: 35934699520 | elapsed time per iteration (s): 1.03 | learning rate: 9.820E-05 | global batch size: 256 | lm loss: 1.959346E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.814 | TFLOPs: 40.95 | 15: iteration 68550/ 125429 | consumed samples: 17548800 | consumed tokens: 35939942400 | elapsed time per iteration (s): 1.03 | learning rate: 9.818E-05 | global batch size: 256 | lm loss: 1.986925E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.536 | TFLOPs: 40.91 | 15: iteration 68560/ 125429 | consumed samples: 17551360 | consumed tokens: 35945185280 | elapsed time per iteration (s): 1.04 | learning rate: 9.815E-05 | global batch size: 256 | lm loss: 1.961880E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.138 | TFLOPs: 40.51 | 15: iteration 68570/ 125429 | consumed samples: 17553920 | consumed tokens: 35950428160 | elapsed time per iteration (s): 1.03 | learning rate: 9.813E-05 | global batch size: 256 | lm loss: 
1.973110E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.892 | TFLOPs: 40.97 | 15: iteration 68580/ 125429 | consumed samples: 17556480 | consumed tokens: 35955671040 | elapsed time per iteration (s): 1.02 | learning rate: 9.811E-05 | global batch size: 256 | lm loss: 1.994963E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.146 | TFLOPs: 41.34 | 15: iteration 68590/ 125429 | consumed samples: 17559040 | consumed tokens: 35960913920 | elapsed time per iteration (s): 1.03 | learning rate: 9.808E-05 | global batch size: 256 | lm loss: 1.990977E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.608 | TFLOPs: 40.92 | 15: iteration 68600/ 125429 | consumed samples: 17561600 | consumed tokens: 35966156800 | elapsed time per iteration (s): 1.03 | learning rate: 9.806E-05 | global batch size: 256 | lm loss: 1.963848E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.875 | TFLOPs: 41.13 | 15: iteration 68610/ 125429 | consumed samples: 17564160 | consumed tokens: 35971399680 | elapsed time per iteration (s): 1.03 | learning rate: 9.804E-05 | global batch size: 256 | lm loss: 2.014069E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.590 | TFLOPs: 40.92 | 15: iteration 68620/ 125429 | consumed samples: 17566720 | consumed tokens: 35976642560 | elapsed time per iteration (s): 1.03 | learning rate: 9.802E-05 | global batch size: 256 | lm loss: 1.981520E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.960 | TFLOPs: 41.14 | 15: iteration 68630/ 125429 | consumed samples: 17569280 | consumed tokens: 
35981885440 | elapsed time per iteration (s): 1.05 | learning rate: 9.799E-05 | global batch size: 256 | lm loss: 1.989512E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.287 | TFLOPs: 40.37 | 15: iteration 68640/ 125429 | consumed samples: 17571840 | consumed tokens: 35987128320 | elapsed time per iteration (s): 1.03 | learning rate: 9.797E-05 | global batch size: 256 | lm loss: 2.006573E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.730 | TFLOPs: 41.27 | 15: iteration 68650/ 125429 | consumed samples: 17574400 | consumed tokens: 35992371200 | elapsed time per iteration (s): 1.02 | learning rate: 9.795E-05 | global batch size: 256 | lm loss: 1.960676E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.815 | TFLOPs: 41.45 | 15: iteration 68660/ 125429 | consumed samples: 17576960 | consumed tokens: 35997614080 | elapsed time per iteration (s): 1.06 | learning rate: 9.793E-05 | global batch size: 256 | lm loss: 1.976450E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.422 | TFLOPs: 40.06 | 15: iteration 68670/ 125429 | consumed samples: 17579520 | consumed tokens: 36002856960 | elapsed time per iteration (s): 1.02 | learning rate: 9.790E-05 | global batch size: 256 | lm loss: 1.994946E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.970 | TFLOPs: 41.31 | 15: iteration 68680/ 125429 | consumed samples: 17582080 | consumed tokens: 36008099840 | elapsed time per iteration (s): 1.06 | learning rate: 9.788E-05 | global batch size: 256 | lm loss: 1.975668E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 240.749 | TFLOPs: 39.79 | 15: iteration 68690/ 125429 | consumed samples: 17584640 | consumed tokens: 36013342720 | elapsed time per iteration (s): 1.04 | learning rate: 9.786E-05 | global batch size: 256 | lm loss: 1.982216E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.321 | TFLOPs: 40.54 | 15: iteration 68700/ 125429 | consumed samples: 17587200 | consumed tokens: 36018585600 | elapsed time per iteration (s): 1.03 | learning rate: 9.784E-05 | global batch size: 256 | lm loss: 1.957680E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.540 | TFLOPs: 40.91 | 15: iteration 68710/ 125429 | consumed samples: 17589760 | consumed tokens: 36023828480 | elapsed time per iteration (s): 1.03 | learning rate: 9.781E-05 | global batch size: 256 | lm loss: 1.977959E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.192 | TFLOPs: 41.02 | 15: iteration 68720/ 125429 | consumed samples: 17592320 | consumed tokens: 36029071360 | elapsed time per iteration (s): 1.02 | learning rate: 9.779E-05 | global batch size: 256 | lm loss: 1.959140E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.915 | TFLOPs: 41.30 | 15: iteration 68730/ 125429 | consumed samples: 17594880 | consumed tokens: 36034314240 | elapsed time per iteration (s): 1.09 | learning rate: 9.777E-05 | global batch size: 256 | lm loss: 1.958563E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.397 | TFLOPs: 38.74 | 15: iteration 68740/ 125429 | consumed samples: 17597440 | consumed tokens: 36039557120 | elapsed time per iteration (s): 1.05 | learning rate: 9.775E-05 | global batch size: 256 | lm loss: 1.994643E+00 | grad 
norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.401 | TFLOPs: 40.39 | 15: iteration 68750/ 125429 | consumed samples: 17600000 | consumed tokens: 36044800000 | elapsed time per iteration (s): 1.02 | learning rate: 9.772E-05 | global batch size: 256 | lm loss: 1.993013E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.054 | TFLOPs: 41.49 | 15: iteration 68760/ 125429 | consumed samples: 17602560 | consumed tokens: 36050042880 | elapsed time per iteration (s): 1.03 | learning rate: 9.770E-05 | global batch size: 256 | lm loss: 1.955363E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.996 | TFLOPs: 41.15 | 15: iteration 68770/ 125429 | consumed samples: 17605120 | consumed tokens: 36055285760 | elapsed time per iteration (s): 1.03 | learning rate: 9.768E-05 | global batch size: 256 | lm loss: 1.935401E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.593 | TFLOPs: 41.25 | 15: iteration 68780/ 125429 | consumed samples: 17607680 | consumed tokens: 36060528640 | elapsed time per iteration (s): 1.03 | learning rate: 9.766E-05 | global batch size: 256 | lm loss: 1.986387E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.358 | TFLOPs: 41.04 | 15: iteration 68790/ 125429 | consumed samples: 17610240 | consumed tokens: 36065771520 | elapsed time per iteration (s): 1.04 | learning rate: 9.763E-05 | global batch size: 256 | lm loss: 1.979768E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.620 | TFLOPs: 40.59 | 15: iteration 68800/ 125429 | consumed samples: 17612800 | consumed tokens: 36071014400 | elapsed time 
per iteration (s): 1.03 | learning rate: 9.761E-05 | global batch size: 256 | lm loss: 1.953205E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.719 | TFLOPs: 41.10 | 15: iteration 68810/ 125429 | consumed samples: 17615360 | consumed tokens: 36076257280 | elapsed time per iteration (s): 1.02 | learning rate: 9.759E-05 | global batch size: 256 | lm loss: 1.981008E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.831 | TFLOPs: 41.29 | 15: iteration 68820/ 125429 | consumed samples: 17617920 | consumed tokens: 36081500160 | elapsed time per iteration (s): 1.02 | learning rate: 9.757E-05 | global batch size: 256 | lm loss: 1.957228E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.498 | TFLOPs: 41.56 | 15: iteration 68830/ 125429 | consumed samples: 17620480 | consumed tokens: 36086743040 | elapsed time per iteration (s): 1.03 | learning rate: 9.754E-05 | global batch size: 256 | lm loss: 1.982882E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.300 | TFLOPs: 41.03 | 15: iteration 68840/ 125429 | consumed samples: 17623040 | consumed tokens: 36091985920 | elapsed time per iteration (s): 1.03 | learning rate: 9.752E-05 | global batch size: 256 | lm loss: 1.944932E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.965 | TFLOPs: 41.14 | 15: iteration 68850/ 125429 | consumed samples: 17625600 | consumed tokens: 36097228800 | elapsed time per iteration (s): 1.03 | learning rate: 9.750E-05 | global batch size: 256 | lm loss: 1.991634E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.812 | TFLOPs: 
41.12 | 15: iteration 68860/ 125429 | consumed samples: 17628160 | consumed tokens: 36102471680 | elapsed time per iteration (s): 1.05 | learning rate: 9.748E-05 | global batch size: 256 | lm loss: 1.980338E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.894 | TFLOPs: 40.31 | 15: iteration 68870/ 125429 | consumed samples: 17630720 | consumed tokens: 36107714560 | elapsed time per iteration (s): 1.05 | learning rate: 9.745E-05 | global batch size: 256 | lm loss: 1.960788E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.550 | TFLOPs: 40.41 | 15: iteration 68880/ 125429 | consumed samples: 17633280 | consumed tokens: 36112957440 | elapsed time per iteration (s): 1.04 | learning rate: 9.743E-05 | global batch size: 256 | lm loss: 1.997139E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.500 | TFLOPs: 40.74 | 15: iteration 68890/ 125429 | consumed samples: 17635840 | consumed tokens: 36118200320 | elapsed time per iteration (s): 1.02 | learning rate: 9.741E-05 | global batch size: 256 | lm loss: 1.957313E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.849 | TFLOPs: 41.29 | 15: iteration 68900/ 125429 | consumed samples: 17638400 | consumed tokens: 36123443200 | elapsed time per iteration (s): 1.02 | learning rate: 9.739E-05 | global batch size: 256 | lm loss: 1.968731E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.863 | TFLOPs: 41.29 | 15: iteration 68910/ 125429 | consumed samples: 17640960 | consumed tokens: 36128686080 | elapsed time per iteration (s): 1.02 | learning rate: 9.736E-05 | global batch size: 256 | lm loss: 1.966617E+00 | grad norm: 0.136 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 252.040 | TFLOPs: 41.65 | 15: iteration 68920/ 125429 | consumed samples: 17643520 | consumed tokens: 36133928960 | elapsed time per iteration (s): 1.02 | learning rate: 9.734E-05 | global batch size: 256 | lm loss: 1.961521E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.265 | TFLOPs: 41.52 | 15: iteration 68930/ 125429 | consumed samples: 17646080 | consumed tokens: 36139171840 | elapsed time per iteration (s): 1.02 | learning rate: 9.732E-05 | global batch size: 256 | lm loss: 1.994655E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.852 | TFLOPs: 41.46 | 15: iteration 68940/ 125429 | consumed samples: 17648640 | consumed tokens: 36144414720 | elapsed time per iteration (s): 1.02 | learning rate: 9.730E-05 | global batch size: 256 | lm loss: 1.958802E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.888 | TFLOPs: 41.46 | 15: iteration 68950/ 125429 | consumed samples: 17651200 | consumed tokens: 36149657600 | elapsed time per iteration (s): 1.03 | learning rate: 9.727E-05 | global batch size: 256 | lm loss: 1.970321E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.961 | TFLOPs: 40.98 | 15: iteration 68960/ 125429 | consumed samples: 17653760 | consumed tokens: 36154900480 | elapsed time per iteration (s): 1.04 | learning rate: 9.725E-05 | global batch size: 256 | lm loss: 1.982445E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.071 | TFLOPs: 40.67 | 15: iteration 68970/ 125429 | consumed samples: 17656320 | consumed tokens: 36160143360 | elapsed time per iteration (s): 1.03 | 
learning rate: 9.723E-05 | global batch size: 256 | lm loss: 1.961106E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.285 | TFLOPs: 41.20 | 15: iteration 68980/ 125429 | consumed samples: 17658880 | consumed tokens: 36165386240 | elapsed time per iteration (s): 1.03 | learning rate: 9.721E-05 | global batch size: 256 | lm loss: 1.973409E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.884 | TFLOPs: 40.96 | 15: iteration 68990/ 125429 | consumed samples: 17661440 | consumed tokens: 36170629120 | elapsed time per iteration (s): 1.03 | learning rate: 9.718E-05 | global batch size: 256 | lm loss: 1.964050E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.106 | TFLOPs: 41.00 | 15: iteration 69000/ 125429 | consumed samples: 17664000 | consumed tokens: 36175872000 | elapsed time per iteration (s): 1.03 | learning rate: 9.716E-05 | global batch size: 256 | lm loss: 1.943994E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.755 | TFLOPs: 41.27 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 69000 | lm loss value: 1.912913E+00 | lm loss PPL: 6.772791E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 69000 to checkpoints_1b5 0: [2022-11-26 16:31:02,742] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step69000 is begin to save! 0: [2022-11-26 16:31:02,750] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 16:31:02,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_01-model_00-model_states.pt. 0: [2022-11-26 16:31:02,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_03-model_00-model_states.pt... 0: [2022-11-26 16:31:03,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_03-model_00-model_states.pt. 0: [2022-11-26 16:31:03,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_04-model_00-model_states.pt... 0: [2022-11-26 16:31:03,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_04-model_00-model_states.pt. 0: [2022-11-26 16:31:03,188] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_05-model_00-model_states.pt... 0: [2022-11-26 16:31:03,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_05-model_00-model_states.pt. 0: [2022-11-26 16:31:03,289] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_06-model_00-model_states.pt... 0: [2022-11-26 16:31:03,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_06-model_00-model_states.pt. 0: [2022-11-26 16:31:03,391] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_07-model_00-model_states.pt... 0: [2022-11-26 16:31:03,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_07-model_00-model_states.pt. 0: [2022-11-26 16:31:03,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 16:31:03,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_08-model_00-model_states.pt. 0: [2022-11-26 16:31:03,592] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_09-model_00-model_states.pt... 0: [2022-11-26 16:31:03,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_09-model_00-model_states.pt. 0: [2022-11-26 16:31:03,692] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_10-model_00-model_states.pt... 0: [2022-11-26 16:31:03,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_10-model_00-model_states.pt. 0: [2022-11-26 16:31:03,794] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_11-model_00-model_states.pt... 0: [2022-11-26 16:31:03,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_11-model_00-model_states.pt. 0: [2022-11-26 16:31:03,899] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_12-model_00-model_states.pt... 0: [2022-11-26 16:31:03,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_12-model_00-model_states.pt. 0: [2022-11-26 16:31:03,995] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_13-model_00-model_states.pt... 0: [2022-11-26 16:31:04,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_13-model_00-model_states.pt. 0: [2022-11-26 16:31:04,096] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 16:31:04,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_14-model_00-model_states.pt. 0: [2022-11-26 16:31:04,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_15-model_00-model_states.pt... 0: [2022-11-26 16:31:04,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_15-model_00-model_states.pt. 0: [2022-11-26 16:31:04,295] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_16-model_00-model_states.pt... 0: [2022-11-26 16:31:04,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_16-model_00-model_states.pt. 0: [2022-11-26 16:31:04,395] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_17-model_00-model_states.pt... 0: [2022-11-26 16:31:04,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_17-model_00-model_states.pt. 0: [2022-11-26 16:31:04,498] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_18-model_00-model_states.pt... 0: [2022-11-26 16:31:04,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_18-model_00-model_states.pt. 0: [2022-11-26 16:31:04,596] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_19-model_00-model_states.pt... 0: [2022-11-26 16:31:04,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_19-model_00-model_states.pt. 0: [2022-11-26 16:31:04,694] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 16:31:04,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_20-model_00-model_states.pt. 0: [2022-11-26 16:31:04,796] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_21-model_00-model_states.pt... 0: [2022-11-26 16:31:04,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_21-model_00-model_states.pt. 0: [2022-11-26 16:31:04,898] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_22-model_00-model_states.pt... 0: [2022-11-26 16:31:04,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_22-model_00-model_states.pt. 0: [2022-11-26 16:31:05,000] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_23-model_00-model_states.pt... 0: [2022-11-26 16:31:05,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_23-model_00-model_states.pt. 0: [2022-11-26 16:31:05,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_24-model_00-model_states.pt... 0: [2022-11-26 16:31:05,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_24-model_00-model_states.pt. 0: [2022-11-26 16:31:05,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_25-model_00-model_states.pt... 0: [2022-11-26 16:31:05,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_25-model_00-model_states.pt. 0: [2022-11-26 16:31:05,311] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 16:31:05,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_26-model_00-model_states.pt. 0: [2022-11-26 16:31:05,406] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_27-model_00-model_states.pt... 0: [2022-11-26 16:31:05,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_27-model_00-model_states.pt. 0: [2022-11-26 16:31:05,510] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_28-model_00-model_states.pt... 0: [2022-11-26 16:31:05,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_28-model_00-model_states.pt. 0: [2022-11-26 16:31:05,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_29-model_00-model_states.pt... 0: [2022-11-26 16:31:05,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_29-model_00-model_states.pt. 0: [2022-11-26 16:31:05,706] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_30-model_00-model_states.pt... 0: [2022-11-26 16:31:05,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_30-model_00-model_states.pt. 0: [2022-11-26 16:31:05,807] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/layer_32-model_00-model_states.pt... 0: [2022-11-26 16:31:05,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 16:31:05,812] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step69000/mp_rank_00_model_states.pt 0: [2022-11-26 16:31:05,812] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/mp_rank_00_model_states.pt... 0: [2022-11-26 16:31:05,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/mp_rank_00_model_states.pt. 0: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
12: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
0: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
5: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
9: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
13: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
3: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
8: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:31:05,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step69000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:31:06,010] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
5: [2022-11-26 16:31:06,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:31:06,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 16:31:06,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 3: [2022-11-26 16:31:06,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:31:06,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 16:31:06,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 10: [2022-11-26 16:31:06,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:31:06,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 16:31:06,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 2: [2022-11-26 16:31:06,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:31:06,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 16:31:06,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
0: [2022-11-26 16:31:06,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:31:06,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 16:31:06,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 15: [2022-11-26 16:31:06,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:31:06,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 16:31:06,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 8: [2022-11-26 16:31:06,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:31:06,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 16:31:06,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 8: [2022-11-26 16:31:06,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:31:06,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 16:31:06,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
12: [2022-11-26 16:31:06,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:31:06,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 16:31:06,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 1: [2022-11-26 16:31:06,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:31:06,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 16:31:06,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 1: [2022-11-26 16:31:06,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:31:06,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 16:31:06,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 9: [2022-11-26 16:31:06,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:31:06,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 16:31:06,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
15: [2022-11-26 16:31:06,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:31:06,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 16:31:06,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 0: [2022-11-26 16:31:06,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:31:06,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 3: [2022-11-26 16:31:06,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:31:06,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 3: [2022-11-26 16:31:06,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 16:31:06,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 2: [2022-11-26 16:31:06,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:31:06,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 16:31:06,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 16:31:06,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 16:31:06,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 2: [2022-11-26 16:31:06,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 0: [2022-11-26 16:31:06,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:31:06,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 16:31:06,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 0: [2022-11-26 16:31:06,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:31:06,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 16:31:06,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 1: [2022-11-26 16:31:06,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 16:31:06,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 16:31:06,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 11: [2022-11-26 16:31:06,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:31:06,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:31:06,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:31:06,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 16:31:06,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 6: [2022-11-26 16:31:06,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 16:31:06,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 4: [2022-11-26 16:31:06,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:31:06,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 16:31:06,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
4: [2022-11-26 16:31:06,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:31:06,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 16:31:06,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 4: [2022-11-26 16:31:06,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:31:06,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 16:31:06,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 4: [2022-11-26 16:31:06,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:31:06,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:31:06,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 16:31:06,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 16:31:06,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 4: [2022-11-26 16:31:06,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
15: [2022-11-26 16:31:06,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:31:06,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 16:31:06,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 4: [2022-11-26 16:31:06,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:31:06,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 16:31:06,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 2: [2022-11-26 16:31:06,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:31:06,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 16:31:06,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 9: [2022-11-26 16:31:06,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:31:06,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 16:31:06,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
3: [2022-11-26 16:31:06,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:31:06,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:31:06,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 0: [2022-11-26 16:31:06,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 3: [2022-11-26 16:31:06,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 0: [2022-11-26 16:31:06,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 4: [2022-11-26 16:31:06,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:31:06,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 16:31:06,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 13: [2022-11-26 16:31:06,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:31:06,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 16:31:06,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 16:31:06,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 8: [2022-11-26 16:31:06,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:31:06,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:31:06,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 5: [2022-11-26 16:31:06,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 8: [2022-11-26 16:31:06,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 5: [2022-11-26 16:31:06,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 2: [2022-11-26 16:31:06,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:31:06,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 16:31:06,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 2: [2022-11-26 16:31:06,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 16:31:06,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 16:31:06,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 5: [2022-11-26 16:31:06,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:31:06,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 16:31:06,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 9: [2022-11-26 16:31:06,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:31:06,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 16:31:06,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 12: [2022-11-26 16:31:06,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:31:06,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 16:31:06,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:31:06,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
12: [2022-11-26 16:31:06,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 16:31:06,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 7: [2022-11-26 16:31:06,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:31:06,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 7: [2022-11-26 16:31:06,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 11: [2022-11-26 16:31:06,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 11: [2022-11-26 16:31:06,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:31:06,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 16:31:06,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 7: [2022-11-26 16:31:06,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 10: [2022-11-26 16:31:06,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 16:31:06,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 16:31:06,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 14: [2022-11-26 16:31:06,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:31:06,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 16:31:06,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 14: [2022-11-26 16:31:06,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:31:06,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 16:31:06,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 14: [2022-11-26 16:31:06,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:31:06,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 16:31:06,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 12: [2022-11-26 16:31:06,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 16:31:06,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 16:31:06,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 7: [2022-11-26 16:31:06,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:31:06,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 16:31:06,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 9: [2022-11-26 16:31:06,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:31:06,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 16:31:06,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 7: [2022-11-26 16:31:06,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:31:06,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:31:06,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
7: [2022-11-26 16:31:06,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 16:31:06,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 3: [2022-11-26 16:31:06,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:31:06,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 16:31:06,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 7: [2022-11-26 16:31:06,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 7: [2022-11-26 16:31:06,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 3: [2022-11-26 16:31:06,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 3: [2022-11-26 16:31:06,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 15: [2022-11-26 16:31:06,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:31:06,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 16:31:06,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
1: [2022-11-26 16:31:06,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:31:06,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 16:31:06,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 8: [2022-11-26 16:31:06,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:31:06,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:31:06,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:31:06,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 8: [2022-11-26 16:31:06,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 11: [2022-11-26 16:31:06,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 8: [2022-11-26 16:31:06,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 0: [2022-11-26 16:31:06,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 8: [2022-11-26 16:31:06,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 16:31:06,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 0: [2022-11-26 16:31:06,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 1: [2022-11-26 16:31:06,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:31:06,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 1: [2022-11-26 16:31:06,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:31:06,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 16:31:06,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 16:31:06,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 1: [2022-11-26 16:31:06,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 12: [2022-11-26 16:31:06,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:31:06,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 16:31:06,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
6: [2022-11-26 16:31:06,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:31:06,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 16:31:06,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 6: [2022-11-26 16:31:06,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:31:06,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 16:31:06,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 8: [2022-11-26 16:31:06,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:31:06,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 16:31:06,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 0: [2022-11-26 16:31:06,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:31:06,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 16:31:06,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
11: [2022-11-26 16:31:06,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:31:06,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:31:06,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 16:31:06,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 9: [2022-11-26 16:31:06,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:31:06,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 16:31:06,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 9: [2022-11-26 16:31:06,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:31:06,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 16:31:06,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 15: [2022-11-26 16:31:06,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 16:31:06,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 6: [2022-11-26 16:31:06,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:31:06,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 16:31:06,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 15: [2022-11-26 16:31:06,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 9: [2022-11-26 16:31:06,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:31:06,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 16:31:06,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 2: [2022-11-26 16:31:06,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:31:06,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 16:31:06,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 3: [2022-11-26 16:31:06,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 16:31:06,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 10: [2022-11-26 16:31:06,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:31:06,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 10: [2022-11-26 16:31:06,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 16:31:06,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 10: [2022-11-26 16:31:06,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:31:06,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 16:31:06,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 10: [2022-11-26 16:31:06,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:31:06,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:31:06,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 16:31:06,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 16:31:06,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 16:31:06,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 16:31:06,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 15: [2022-11-26 16:31:06,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:31:06,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 10: [2022-11-26 16:31:06,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 15: [2022-11-26 16:31:06,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 16:31:06,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 14: [2022-11-26 16:31:06,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:31:06,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 16:31:06,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
15: [2022-11-26 16:31:06,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:31:06,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:31:06,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:31:06,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:31:06,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 14: [2022-11-26 16:31:06,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 5: [2022-11-26 16:31:06,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 16:31:06,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 16:31:06,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 5: [2022-11-26 16:31:06,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 15: [2022-11-26 16:31:06,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 14: [2022-11-26 16:31:06,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
3: [2022-11-26 16:31:06,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:31:06,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 16:31:06,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 14: [2022-11-26 16:31:06,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:31:06,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 16:31:06,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 14: [2022-11-26 16:31:06,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:31:06,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 16:31:06,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 8: [2022-11-26 16:31:06,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:31:06,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 16:31:06,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
0: [2022-11-26 16:31:06,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 16:31:06,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 2: [2022-11-26 16:31:06,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:31:06,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 16:31:06,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 11: [2022-11-26 16:31:06,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 16:31:06,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 11: [2022-11-26 16:31:06,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:31:06,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:31:06,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 16:31:06,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 16:31:06,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
11: [2022-11-26 16:31:06,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 11: [2022-11-26 16:31:06,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:31:06,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 16:31:06,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 11: [2022-11-26 16:31:06,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:31:06,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 16:31:06,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 13: [2022-11-26 16:31:06,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 16:31:06,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 13: [2022-11-26 16:31:06,023] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:31:06,023] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 16:31:06,023] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
13: [2022-11-26 16:31:06,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:31:06,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:31:06,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 16:31:06,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 16:31:06,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 13: [2022-11-26 16:31:06,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 13: [2022-11-26 16:31:06,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:31:06,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:31:06,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 16:31:06,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 16:31:06,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 13: [2022-11-26 16:31:06,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
6: [2022-11-26 16:31:06,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:31:06,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 16:31:06,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 4: [2022-11-26 16:31:06,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:31:06,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 16:31:06,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 7: [2022-11-26 16:31:06,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:31:06,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 16:31:06,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 7: [2022-11-26 16:31:06,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:31:06,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 16:31:06,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
14: [2022-11-26 16:31:06,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:31:06,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 16:31:06,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 15: [2022-11-26 16:31:06,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:31:06,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 12: [2022-11-26 16:31:06,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:31:06,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 12: [2022-11-26 16:31:06,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 16:31:06,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 10: [2022-11-26 16:31:06,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:31:06,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 16:31:06,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
5: [2022-11-26 16:31:06,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:31:06,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 16:31:06,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 5: [2022-11-26 16:31:06,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:31:06,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 16:31:06,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 5: [2022-11-26 16:31:06,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:31:06,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 1: [2022-11-26 16:31:06,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:31:06,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 16:31:06,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 5: [2022-11-26 16:31:06,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
6: [2022-11-26 16:31:06,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:31:06,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 16:31:06,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 6: [2022-11-26 16:31:06,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:31:06,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 16:31:06,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 12: [2022-11-26 16:31:06,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:31:06,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 16:31:06,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 1: [2022-11-26 16:31:06,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:31:06,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 16:31:06,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
7: [2022-11-26 16:31:06,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:31:06,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 16:31:06,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 7: [2022-11-26 16:31:06,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:31:06,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 16:31:06,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 9: [2022-11-26 16:31:06,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:31:06,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 16:31:06,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 3: [2022-11-26 16:31:06,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:31:06,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 16:31:06,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
0: successfully saved checkpoint at iteration 69000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3401.95 13: [2022-11-26 16:31:06,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:31:06,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:31:06,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 16:31:06,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step69000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 16:31:06,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 13: [2022-11-26 16:31:06,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step69000 is ready now! 
15: iteration 69010/ 125429 | consumed samples: 17666560 | consumed tokens: 36181114880 | elapsed time per iteration (s): 1.39 | learning rate: 9.714E-05 | global batch size: 256 | lm loss: 1.984290E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 183.661 | TFLOPs: 30.35 | 15: iteration 69020/ 125429 | consumed samples: 17669120 | consumed tokens: 36186357760 | elapsed time per iteration (s): 1.02 | learning rate: 9.711E-05 | global batch size: 256 | lm loss: 1.973149E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.742 | TFLOPs: 41.44 | 15: iteration 69030/ 125429 | consumed samples: 17671680 | consumed tokens: 36191600640 | elapsed time per iteration (s): 1.05 | learning rate: 9.709E-05 | global batch size: 256 | lm loss: 1.973461E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.911 | TFLOPs: 40.47 | 15: iteration 69040/ 125429 | consumed samples: 17674240 | consumed tokens: 36196843520 | elapsed time per iteration (s): 1.03 | learning rate: 9.707E-05 | global batch size: 256 | lm loss: 1.980279E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.577 | TFLOPs: 41.24 | 15: iteration 69050/ 125429 | consumed samples: 17676800 | consumed tokens: 36202086400 | elapsed time per iteration (s): 1.02 | learning rate: 9.705E-05 | global batch size: 256 | lm loss: 1.948164E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.289 | TFLOPs: 41.36 | 15: iteration 69060/ 125429 | consumed samples: 17679360 | consumed tokens: 36207329280 | elapsed time per iteration (s): 1.05 | learning rate: 9.702E-05 | global batch size: 256 | lm loss: 1.962415E+00 | grad norm: 0.130 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.425 | TFLOPs: 40.23 | 15: iteration 69070/ 125429 | consumed samples: 17681920 | consumed tokens: 36212572160 | elapsed time per iteration (s): 1.02 | learning rate: 9.700E-05 | global batch size: 256 | lm loss: 1.989008E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.662 | TFLOPs: 41.42 | 15: iteration 69080/ 125429 | consumed samples: 17684480 | consumed tokens: 36217815040 | elapsed time per iteration (s): 1.04 | learning rate: 9.698E-05 | global batch size: 256 | lm loss: 1.947206E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.517 | TFLOPs: 40.74 | 15: iteration 69090/ 125429 | consumed samples: 17687040 | consumed tokens: 36223057920 | elapsed time per iteration (s): 1.05 | learning rate: 9.696E-05 | global batch size: 256 | lm loss: 1.960248E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.570 | TFLOPs: 40.25 | 15: iteration 69100/ 125429 | consumed samples: 17689600 | consumed tokens: 36228300800 | elapsed time per iteration (s): 1.02 | learning rate: 9.693E-05 | global batch size: 256 | lm loss: 1.954804E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.015 | TFLOPs: 41.48 | 15: iteration 69110/ 125429 | consumed samples: 17692160 | consumed tokens: 36233543680 | elapsed time per iteration (s): 1.02 | learning rate: 9.691E-05 | global batch size: 256 | lm loss: 1.963180E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.383 | TFLOPs: 41.38 | 15: iteration 69120/ 125429 | consumed samples: 17694720 | consumed tokens: 36238786560 | elapsed time per iteration (s): 1.04 | 
learning rate: 9.689E-05 | global batch size: 256 | lm loss: 2.000733E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.764 | TFLOPs: 40.78 | 15: iteration 69130/ 125429 | consumed samples: 17697280 | consumed tokens: 36244029440 | elapsed time per iteration (s): 1.04 | learning rate: 9.687E-05 | global batch size: 256 | lm loss: 1.956840E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.245 | TFLOPs: 40.53 | 15: iteration 69140/ 125429 | consumed samples: 17699840 | consumed tokens: 36249272320 | elapsed time per iteration (s): 1.03 | learning rate: 9.684E-05 | global batch size: 256 | lm loss: 1.960554E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.703 | TFLOPs: 40.93 | 15: iteration 69150/ 125429 | consumed samples: 17702400 | consumed tokens: 36254515200 | elapsed time per iteration (s): 1.04 | learning rate: 9.682E-05 | global batch size: 256 | lm loss: 1.986300E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.016 | TFLOPs: 40.49 | 15: iteration 69160/ 125429 | consumed samples: 17704960 | consumed tokens: 36259758080 | elapsed time per iteration (s): 1.03 | learning rate: 9.680E-05 | global batch size: 256 | lm loss: 1.959057E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.782 | TFLOPs: 41.11 | 15: iteration 69170/ 125429 | consumed samples: 17707520 | consumed tokens: 36265000960 | elapsed time per iteration (s): 1.02 | learning rate: 9.678E-05 | global batch size: 256 | lm loss: 1.980818E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.458 | TFLOPs: 41.39 | 15: iteration 69180/ 
125429 | consumed samples: 17710080 | consumed tokens: 36270243840 | elapsed time per iteration (s): 1.03 | learning rate: 9.675E-05 | global batch size: 256 | lm loss: 1.971773E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.898 | TFLOPs: 41.13 | 15: iteration 69190/ 125429 | consumed samples: 17712640 | consumed tokens: 36275486720 | elapsed time per iteration (s): 1.07 | learning rate: 9.673E-05 | global batch size: 256 | lm loss: 1.977806E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.448 | TFLOPs: 39.57 | 15: iteration 69200/ 125429 | consumed samples: 17715200 | consumed tokens: 36280729600 | elapsed time per iteration (s): 1.05 | learning rate: 9.671E-05 | global batch size: 256 | lm loss: 1.963198E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.423 | TFLOPs: 40.23 | 15: iteration 69210/ 125429 | consumed samples: 17717760 | consumed tokens: 36285972480 | elapsed time per iteration (s): 1.02 | learning rate: 9.669E-05 | global batch size: 256 | lm loss: 1.991009E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.270 | TFLOPs: 41.52 | 15: iteration 69220/ 125429 | consumed samples: 17720320 | consumed tokens: 36291215360 | elapsed time per iteration (s): 1.04 | learning rate: 9.666E-05 | global batch size: 256 | lm loss: 1.956544E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.899 | TFLOPs: 40.64 | 15: iteration 69230/ 125429 | consumed samples: 17722880 | consumed tokens: 36296458240 | elapsed time per iteration (s): 1.03 | learning rate: 9.664E-05 | global batch size: 256 | lm loss: 1.976525E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 248.216 | TFLOPs: 41.02 | 15: iteration 69240/ 125429 | consumed samples: 17725440 | consumed tokens: 36301701120 | elapsed time per iteration (s): 1.05 | learning rate: 9.662E-05 | global batch size: 256 | lm loss: 1.935253E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.221 | TFLOPs: 40.36 | 15: iteration 69250/ 125429 | consumed samples: 17728000 | consumed tokens: 36306944000 | elapsed time per iteration (s): 1.03 | learning rate: 9.660E-05 | global batch size: 256 | lm loss: 1.963379E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.379 | TFLOPs: 41.05 | 15: iteration 69260/ 125429 | consumed samples: 17730560 | consumed tokens: 36312186880 | elapsed time per iteration (s): 1.04 | learning rate: 9.657E-05 | global batch size: 256 | lm loss: 1.980642E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.496 | TFLOPs: 40.57 | 15: iteration 69270/ 125429 | consumed samples: 17733120 | consumed tokens: 36317429760 | elapsed time per iteration (s): 1.02 | learning rate: 9.655E-05 | global batch size: 256 | lm loss: 1.957162E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.542 | TFLOPs: 41.57 | 15: iteration 69280/ 125429 | consumed samples: 17735680 | consumed tokens: 36322672640 | elapsed time per iteration (s): 1.02 | learning rate: 9.653E-05 | global batch size: 256 | lm loss: 1.931242E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.028 | TFLOPs: 41.48 | 15: iteration 69290/ 125429 | consumed samples: 17738240 | consumed tokens: 36327915520 | elapsed time per iteration (s): 1.04 | learning rate: 
9.651E-05 | global batch size: 256 | lm loss: 1.990299E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.638 | TFLOPs: 40.76 | 15: iteration 69300/ 125429 | consumed samples: 17740800 | consumed tokens: 36333158400 | elapsed time per iteration (s): 1.03 | learning rate: 9.648E-05 | global batch size: 256 | lm loss: 1.971360E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.640 | TFLOPs: 41.09 | 15: iteration 69310/ 125429 | consumed samples: 17743360 | consumed tokens: 36338401280 | elapsed time per iteration (s): 1.03 | learning rate: 9.646E-05 | global batch size: 256 | lm loss: 1.943097E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.031 | TFLOPs: 40.99 | 15: iteration 69320/ 125429 | consumed samples: 17745920 | consumed tokens: 36343644160 | elapsed time per iteration (s): 1.04 | learning rate: 9.644E-05 | global batch size: 256 | lm loss: 1.967859E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.725 | TFLOPs: 40.77 | 15: iteration 69330/ 125429 | consumed samples: 17748480 | consumed tokens: 36348887040 | elapsed time per iteration (s): 1.06 | learning rate: 9.642E-05 | global batch size: 256 | lm loss: 1.979359E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.799 | TFLOPs: 39.96 | 15: iteration 69340/ 125429 | consumed samples: 17751040 | consumed tokens: 36354129920 | elapsed time per iteration (s): 1.03 | learning rate: 9.639E-05 | global batch size: 256 | lm loss: 1.977809E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.667 | TFLOPs: 40.93 | 15: iteration 69350/ 125429 | 
consumed samples: 17753600 | consumed tokens: 36359372800 | elapsed time per iteration (s): 1.05 | learning rate: 9.637E-05 | global batch size: 256 | lm loss: 1.970050E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.913 | TFLOPs: 40.14 | 15: iteration 69360/ 125429 | consumed samples: 17756160 | consumed tokens: 36364615680 | elapsed time per iteration (s): 1.15 | learning rate: 9.635E-05 | global batch size: 256 | lm loss: 1.977806E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.865 | TFLOPs: 36.83 | 15: iteration 69370/ 125429 | consumed samples: 17758720 | consumed tokens: 36369858560 | elapsed time per iteration (s): 1.03 | learning rate: 9.633E-05 | global batch size: 256 | lm loss: 1.959389E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.789 | TFLOPs: 40.95 | 15: iteration 69380/ 125429 | consumed samples: 17761280 | consumed tokens: 36375101440 | elapsed time per iteration (s): 1.03 | learning rate: 9.630E-05 | global batch size: 256 | lm loss: 1.935228E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.478 | TFLOPs: 40.90 | 15: iteration 69390/ 125429 | consumed samples: 17763840 | consumed tokens: 36380344320 | elapsed time per iteration (s): 1.05 | learning rate: 9.628E-05 | global batch size: 256 | lm loss: 1.968802E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.966 | TFLOPs: 40.32 | 15: iteration 69400/ 125429 | consumed samples: 17766400 | consumed tokens: 36385587200 | elapsed time per iteration (s): 1.11 | learning rate: 9.626E-05 | global batch size: 256 | lm loss: 1.983696E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 230.739 | TFLOPs: 38.13 | 15: iteration 69410/ 125429 | consumed samples: 17768960 | consumed tokens: 36390830080 | elapsed time per iteration (s): 1.04 | learning rate: 9.624E-05 | global batch size: 256 | lm loss: 1.958138E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.775 | TFLOPs: 40.62 | 15: iteration 69420/ 125429 | consumed samples: 17771520 | consumed tokens: 36396072960 | elapsed time per iteration (s): 1.03 | learning rate: 9.621E-05 | global batch size: 256 | lm loss: 1.957874E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.523 | TFLOPs: 41.07 | 15: iteration 69430/ 125429 | consumed samples: 17774080 | consumed tokens: 36401315840 | elapsed time per iteration (s): 1.04 | learning rate: 9.619E-05 | global batch size: 256 | lm loss: 1.981679E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.417 | TFLOPs: 40.72 | 15: iteration 69440/ 125429 | consumed samples: 17776640 | consumed tokens: 36406558720 | elapsed time per iteration (s): 1.03 | learning rate: 9.617E-05 | global batch size: 256 | lm loss: 1.963576E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.471 | TFLOPs: 41.23 | 15: iteration 69450/ 125429 | consumed samples: 17779200 | consumed tokens: 36411801600 | elapsed time per iteration (s): 1.06 | learning rate: 9.615E-05 | global batch size: 256 | lm loss: 1.956869E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.821 | TFLOPs: 39.96 | 15: iteration 69460/ 125429 | consumed samples: 17781760 | consumed tokens: 36417044480 | elapsed time per iteration (s): 1.03 | learning rate: 9.612E-05 | global batch 
size: 256 | lm loss: 1.996676E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.404 | TFLOPs: 41.05 | 15: iteration 69470/ 125429 | consumed samples: 17784320 | consumed tokens: 36422287360 | elapsed time per iteration (s): 1.05 | learning rate: 9.610E-05 | global batch size: 256 | lm loss: 1.952398E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.334 | TFLOPs: 40.21 | 15: iteration 69480/ 125429 | consumed samples: 17786880 | consumed tokens: 36427530240 | elapsed time per iteration (s): 1.07 | learning rate: 9.608E-05 | global batch size: 256 | lm loss: 1.981738E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.849 | TFLOPs: 39.64 | 15: iteration 69490/ 125429 | consumed samples: 17789440 | consumed tokens: 36432773120 | elapsed time per iteration (s): 1.03 | learning rate: 9.606E-05 | global batch size: 256 | lm loss: 1.967197E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.896 | TFLOPs: 41.13 | 15: iteration 69500/ 125429 | consumed samples: 17792000 | consumed tokens: 36438016000 | elapsed time per iteration (s): 1.04 | learning rate: 9.603E-05 | global batch size: 256 | lm loss: 1.976804E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.207 | TFLOPs: 40.52 | 15: iteration 69510/ 125429 | consumed samples: 17794560 | consumed tokens: 36443258880 | elapsed time per iteration (s): 1.06 | learning rate: 9.601E-05 | global batch size: 256 | lm loss: 1.963678E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.933 | TFLOPs: 39.98 | 15: iteration 69520/ 125429 | consumed samples: 17797120 | 
consumed tokens: 36448501760 | elapsed time per iteration (s): 1.03 | learning rate: 9.599E-05 | global batch size: 256 | lm loss: 1.999635E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.347 | TFLOPs: 41.21 | 15: iteration 69530/ 125429 | consumed samples: 17799680 | consumed tokens: 36453744640 | elapsed time per iteration (s): 1.03 | learning rate: 9.597E-05 | global batch size: 256 | lm loss: 1.967056E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.735 | TFLOPs: 41.27 | 15: iteration 69540/ 125429 | consumed samples: 17802240 | consumed tokens: 36458987520 | elapsed time per iteration (s): 1.05 | learning rate: 9.594E-05 | global batch size: 256 | lm loss: 1.948627E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.531 | TFLOPs: 40.25 | 15: iteration 69550/ 125429 | consumed samples: 17804800 | consumed tokens: 36464230400 | elapsed time per iteration (s): 1.02 | learning rate: 9.592E-05 | global batch size: 256 | lm loss: 1.964878E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.084 | TFLOPs: 41.49 | 15: iteration 69560/ 125429 | consumed samples: 17807360 | consumed tokens: 36469473280 | elapsed time per iteration (s): 1.07 | learning rate: 9.590E-05 | global batch size: 256 | lm loss: 1.959011E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.739 | TFLOPs: 39.45 | 15: iteration 69570/ 125429 | consumed samples: 17809920 | consumed tokens: 36474716160 | elapsed time per iteration (s): 1.03 | learning rate: 9.588E-05 | global batch size: 256 | lm loss: 1.985783E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 248.573 | TFLOPs: 41.08 | 15: iteration 69580/ 125429 | consumed samples: 17812480 | consumed tokens: 36479959040 | elapsed time per iteration (s): 1.02 | learning rate: 9.585E-05 | global batch size: 256 | lm loss: 1.972482E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.805 | TFLOPs: 41.28 | 15: iteration 69590/ 125429 | consumed samples: 17815040 | consumed tokens: 36485201920 | elapsed time per iteration (s): 1.04 | learning rate: 9.583E-05 | global batch size: 256 | lm loss: 1.980968E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.359 | TFLOPs: 40.55 | 15: iteration 69600/ 125429 | consumed samples: 17817600 | consumed tokens: 36490444800 | elapsed time per iteration (s): 1.07 | learning rate: 9.581E-05 | global batch size: 256 | lm loss: 2.007813E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.770 | TFLOPs: 39.62 | 15: iteration 69610/ 125429 | consumed samples: 17820160 | consumed tokens: 36495687680 | elapsed time per iteration (s): 1.03 | learning rate: 9.579E-05 | global batch size: 256 | lm loss: 1.960289E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.453 | TFLOPs: 40.89 | 15: iteration 69620/ 125429 | consumed samples: 17822720 | consumed tokens: 36500930560 | elapsed time per iteration (s): 1.02 | learning rate: 9.576E-05 | global batch size: 256 | lm loss: 1.966963E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.150 | TFLOPs: 41.34 | 15: iteration 69630/ 125429 | consumed samples: 17825280 | consumed tokens: 36506173440 | elapsed time per iteration (s): 1.03 | learning rate: 9.574E-05 | global batch size: 256 | lm loss: 
1.938880E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.390 | TFLOPs: 41.05 | 15: iteration 69640/ 125429 | consumed samples: 17827840 | consumed tokens: 36511416320 | elapsed time per iteration (s): 1.04 | learning rate: 9.572E-05 | global batch size: 256 | lm loss: 1.988021E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.702 | TFLOPs: 40.77 | 15: iteration 69650/ 125429 | consumed samples: 17830400 | consumed tokens: 36516659200 | elapsed time per iteration (s): 1.02 | learning rate: 9.570E-05 | global batch size: 256 | lm loss: 1.948260E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.368 | TFLOPs: 41.54 | 15: iteration 69660/ 125429 | consumed samples: 17832960 | consumed tokens: 36521902080 | elapsed time per iteration (s): 1.03 | learning rate: 9.567E-05 | global batch size: 256 | lm loss: 1.971968E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.822 | TFLOPs: 40.95 | 15: iteration 69670/ 125429 | consumed samples: 17835520 | consumed tokens: 36527144960 | elapsed time per iteration (s): 1.05 | learning rate: 9.565E-05 | global batch size: 256 | lm loss: 1.964285E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.847 | TFLOPs: 40.30 | 15: iteration 69680/ 125429 | consumed samples: 17838080 | consumed tokens: 36532387840 | elapsed time per iteration (s): 1.08 | learning rate: 9.563E-05 | global batch size: 256 | lm loss: 1.968900E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.365 | TFLOPs: 39.06 | 15: iteration 69690/ 125429 | consumed samples: 17840640 | consumed tokens: 
36537630720 | elapsed time per iteration (s): 1.03 | learning rate: 9.561E-05 | global batch size: 256 | lm loss: 1.932431E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.637 | TFLOPs: 41.09 | 15: iteration 69700/ 125429 | consumed samples: 17843200 | consumed tokens: 36542873600 | elapsed time per iteration (s): 1.06 | learning rate: 9.558E-05 | global batch size: 256 | lm loss: 1.954693E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.423 | TFLOPs: 40.06 | 15: iteration 69710/ 125429 | consumed samples: 17845760 | consumed tokens: 36548116480 | elapsed time per iteration (s): 1.03 | learning rate: 9.556E-05 | global batch size: 256 | lm loss: 2.011370E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.054 | TFLOPs: 40.99 | 15: iteration 69720/ 125429 | consumed samples: 17848320 | consumed tokens: 36553359360 | elapsed time per iteration (s): 1.03 | learning rate: 9.554E-05 | global batch size: 256 | lm loss: 1.974512E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.704 | TFLOPs: 40.93 | 15: iteration 69730/ 125429 | consumed samples: 17850880 | consumed tokens: 36558602240 | elapsed time per iteration (s): 1.05 | learning rate: 9.552E-05 | global batch size: 256 | lm loss: 1.958459E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.964 | TFLOPs: 40.32 | 15: iteration 69740/ 125429 | consumed samples: 17853440 | consumed tokens: 36563845120 | elapsed time per iteration (s): 1.05 | learning rate: 9.549E-05 | global batch size: 256 | lm loss: 1.959798E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 244.230 | TFLOPs: 40.36 | 15: iteration 69750/ 125429 | consumed samples: 17856000 | consumed tokens: 36569088000 | elapsed time per iteration (s): 1.06 | learning rate: 9.547E-05 | global batch size: 256 | lm loss: 1.965935E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.626 | TFLOPs: 39.77 | 15: iteration 69760/ 125429 | consumed samples: 17858560 | consumed tokens: 36574330880 | elapsed time per iteration (s): 1.06 | learning rate: 9.545E-05 | global batch size: 256 | lm loss: 1.938660E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.751 | TFLOPs: 39.95 | 15: iteration 69770/ 125429 | consumed samples: 17861120 | consumed tokens: 36579573760 | elapsed time per iteration (s): 1.03 | learning rate: 9.543E-05 | global batch size: 256 | lm loss: 1.946122E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.167 | TFLOPs: 41.18 | 15: iteration 69780/ 125429 | consumed samples: 17863680 | consumed tokens: 36584816640 | elapsed time per iteration (s): 1.17 | learning rate: 9.540E-05 | global batch size: 256 | lm loss: 1.994859E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.887 | TFLOPs: 36.01 | 15: iteration 69790/ 125429 | consumed samples: 17866240 | consumed tokens: 36590059520 | elapsed time per iteration (s): 1.03 | learning rate: 9.538E-05 | global batch size: 256 | lm loss: 1.975582E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.593 | TFLOPs: 40.92 | 15: iteration 69800/ 125429 | consumed samples: 17868800 | consumed tokens: 36595302400 | elapsed time per iteration (s): 1.04 | learning rate: 9.536E-05 | global batch size: 256 | lm loss: 1.962014E+00 | grad 
norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.539 | TFLOPs: 40.58 | 15: iteration 69810/ 125429 | consumed samples: 17871360 | consumed tokens: 36600545280 | elapsed time per iteration (s): 1.03 | learning rate: 9.534E-05 | global batch size: 256 | lm loss: 1.950881E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.293 | TFLOPs: 41.20 | 15: iteration 69820/ 125429 | consumed samples: 17873920 | consumed tokens: 36605788160 | elapsed time per iteration (s): 1.03 | learning rate: 9.531E-05 | global batch size: 256 | lm loss: 1.930761E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.232 | TFLOPs: 41.02 | 15: iteration 69830/ 125429 | consumed samples: 17876480 | consumed tokens: 36611031040 | elapsed time per iteration (s): 1.02 | learning rate: 9.529E-05 | global batch size: 256 | lm loss: 1.959295E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.167 | TFLOPs: 41.34 | 15: iteration 69840/ 125429 | consumed samples: 17879040 | consumed tokens: 36616273920 | elapsed time per iteration (s): 1.02 | learning rate: 9.527E-05 | global batch size: 256 | lm loss: 1.952808E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.792 | TFLOPs: 41.28 | 15: iteration 69850/ 125429 | consumed samples: 17881600 | consumed tokens: 36621516800 | elapsed time per iteration (s): 1.04 | learning rate: 9.525E-05 | global batch size: 256 | lm loss: 1.962058E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.056 | TFLOPs: 40.66 | 15: iteration 69860/ 125429 | consumed samples: 17884160 | consumed tokens: 36626759680 | elapsed time 
per iteration (s): 1.03 | learning rate: 9.523E-05 | global batch size: 256 | lm loss: 1.975813E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.832 | TFLOPs: 40.96 | 15: iteration 69870/ 125429 | consumed samples: 17886720 | consumed tokens: 36632002560 | elapsed time per iteration (s): 1.07 | learning rate: 9.520E-05 | global batch size: 256 | lm loss: 1.967964E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.173 | TFLOPs: 39.36 | 15: iteration 69880/ 125429 | consumed samples: 17889280 | consumed tokens: 36637245440 | elapsed time per iteration (s): 1.05 | learning rate: 9.518E-05 | global batch size: 256 | lm loss: 1.952825E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.364 | TFLOPs: 40.22 | 15: iteration 69890/ 125429 | consumed samples: 17891840 | consumed tokens: 36642488320 | elapsed time per iteration (s): 1.04 | learning rate: 9.516E-05 | global batch size: 256 | lm loss: 1.956369E+00 | grad norm: 0.127 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.140 | TFLOPs: 40.68 | 15: iteration 69900/ 125429 | consumed samples: 17894400 | consumed tokens: 36647731200 | elapsed time per iteration (s): 1.15 | learning rate: 9.514E-05 | global batch size: 256 | lm loss: 1.977074E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.090 | TFLOPs: 36.70 | 15: iteration 69910/ 125429 | consumed samples: 17896960 | consumed tokens: 36652974080 | elapsed time per iteration (s): 1.04 | learning rate: 9.511E-05 | global batch size: 256 | lm loss: 1.956202E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.135 | TFLOPs: 
40.68 | 15: iteration 69920/ 125429 | consumed samples: 17899520 | consumed tokens: 36658216960 | elapsed time per iteration (s): 1.02 | learning rate: 9.509E-05 | global batch size: 256 | lm loss: 2.007762E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.642 | TFLOPs: 41.59 | 15: iteration 69930/ 125429 | consumed samples: 17902080 | consumed tokens: 36663459840 | elapsed time per iteration (s): 1.05 | learning rate: 9.507E-05 | global batch size: 256 | lm loss: 1.977727E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.926 | TFLOPs: 40.31 | 15: iteration 69940/ 125429 | consumed samples: 17904640 | consumed tokens: 36668702720 | elapsed time per iteration (s): 1.03 | learning rate: 9.505E-05 | global batch size: 256 | lm loss: 1.977481E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.470 | TFLOPs: 41.06 | 15: iteration 69950/ 125429 | consumed samples: 17907200 | consumed tokens: 36673945600 | elapsed time per iteration (s): 1.05 | learning rate: 9.502E-05 | global batch size: 256 | lm loss: 1.953616E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.226 | TFLOPs: 40.19 | 15: iteration 69960/ 125429 | consumed samples: 17909760 | consumed tokens: 36679188480 | elapsed time per iteration (s): 1.02 | learning rate: 9.500E-05 | global batch size: 256 | lm loss: 1.961902E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.395 | TFLOPs: 41.55 | 15: iteration 69970/ 125429 | consumed samples: 17912320 | consumed tokens: 36684431360 | elapsed time per iteration (s): 1.05 | learning rate: 9.498E-05 | global batch size: 256 | lm loss: 1.979944E+00 | grad norm: 0.137 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.954 | TFLOPs: 40.32 | 15: iteration 69980/ 125429 | consumed samples: 17914880 | consumed tokens: 36689674240 | elapsed time per iteration (s): 1.07 | learning rate: 9.496E-05 | global batch size: 256 | lm loss: 1.972325E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.224 | TFLOPs: 39.37 | 15: iteration 69990/ 125429 | consumed samples: 17917440 | consumed tokens: 36694917120 | elapsed time per iteration (s): 1.17 | learning rate: 9.493E-05 | global batch size: 256 | lm loss: 1.959281E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 219.450 | TFLOPs: 36.27 | 0: [2022-11-26 16:48:30,680] [INFO] [logging.py:68:log_dist] [Rank 0] step=70000, skipped=0, lr=[9.491073244441717e-05, 9.491073244441717e-05, 9.491073244441717e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 70000/ 125429 | consumed samples: 17920000 | consumed tokens: 36700160000 | elapsed time per iteration (s): 1.06 | learning rate: 9.491E-05 | global batch size: 256 | lm loss: 1.977896E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.037 | TFLOPs: 40.00 | 0: steps: 70000 loss: 1.9775 iter time (s): 1.037 samples/sec: 246.906 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 70000 | lm loss value: 1.967605E+00 | lm loss PPL: 7.153521E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 70000 to checkpoints_1b5 0: [2022-11-26 16:48:31,112] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step70000 is begin to save! 
0: [2022-11-26 16:48:31,121] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_01-model_00-model_states.pt... 0: [2022-11-26 16:48:31,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_01-model_00-model_states.pt. 0: [2022-11-26 16:48:31,400] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_03-model_00-model_states.pt... 0: [2022-11-26 16:48:31,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_03-model_00-model_states.pt. 0: [2022-11-26 16:48:31,502] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_04-model_00-model_states.pt... 0: [2022-11-26 16:48:31,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_04-model_00-model_states.pt. 0: [2022-11-26 16:48:31,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_05-model_00-model_states.pt... 0: [2022-11-26 16:48:31,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_05-model_00-model_states.pt. 0: [2022-11-26 16:48:31,721] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_06-model_00-model_states.pt... 0: [2022-11-26 16:48:31,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_06-model_00-model_states.pt. 0: [2022-11-26 16:48:31,827] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_07-model_00-model_states.pt... 0: [2022-11-26 16:48:31,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 16:48:31,939] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_08-model_00-model_states.pt... 0: [2022-11-26 16:48:32,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_08-model_00-model_states.pt. 0: [2022-11-26 16:48:32,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_09-model_00-model_states.pt... 0: [2022-11-26 16:48:32,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_09-model_00-model_states.pt. 0: [2022-11-26 16:48:32,150] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_10-model_00-model_states.pt... 0: [2022-11-26 16:48:32,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_10-model_00-model_states.pt. 0: [2022-11-26 16:48:32,252] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_11-model_00-model_states.pt... 0: [2022-11-26 16:48:32,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_11-model_00-model_states.pt. 0: [2022-11-26 16:48:32,361] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_12-model_00-model_states.pt... 0: [2022-11-26 16:48:32,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_12-model_00-model_states.pt. 0: [2022-11-26 16:48:32,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_13-model_00-model_states.pt... 0: [2022-11-26 16:48:32,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 16:48:32,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_14-model_00-model_states.pt... 0: [2022-11-26 16:48:32,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_14-model_00-model_states.pt. 0: [2022-11-26 16:48:32,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_15-model_00-model_states.pt... 0: [2022-11-26 16:48:32,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_15-model_00-model_states.pt. 0: [2022-11-26 16:48:32,796] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_16-model_00-model_states.pt... 0: [2022-11-26 16:48:32,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_16-model_00-model_states.pt. 0: [2022-11-26 16:48:32,910] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_17-model_00-model_states.pt... 0: [2022-11-26 16:48:33,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_17-model_00-model_states.pt. 0: [2022-11-26 16:48:33,020] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_18-model_00-model_states.pt... 0: [2022-11-26 16:48:33,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_18-model_00-model_states.pt. 0: [2022-11-26 16:48:33,128] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_19-model_00-model_states.pt... 0: [2022-11-26 16:48:33,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 16:48:33,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_20-model_00-model_states.pt... 0: [2022-11-26 16:48:33,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_20-model_00-model_states.pt. 0: [2022-11-26 16:48:33,351] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_21-model_00-model_states.pt... 0: [2022-11-26 16:48:33,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_21-model_00-model_states.pt. 0: [2022-11-26 16:48:33,457] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_22-model_00-model_states.pt... 0: [2022-11-26 16:48:33,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_22-model_00-model_states.pt. 0: [2022-11-26 16:48:33,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_23-model_00-model_states.pt... 0: [2022-11-26 16:48:33,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_23-model_00-model_states.pt. 0: [2022-11-26 16:48:33,679] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_24-model_00-model_states.pt... 0: [2022-11-26 16:48:33,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_24-model_00-model_states.pt. 0: [2022-11-26 16:48:33,791] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_25-model_00-model_states.pt... 0: [2022-11-26 16:48:33,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 16:48:33,898] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_26-model_00-model_states.pt... 0: [2022-11-26 16:48:34,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_26-model_00-model_states.pt. 0: [2022-11-26 16:48:34,002] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_27-model_00-model_states.pt... 0: [2022-11-26 16:48:34,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_27-model_00-model_states.pt. 0: [2022-11-26 16:48:34,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_28-model_00-model_states.pt... 0: [2022-11-26 16:48:34,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_28-model_00-model_states.pt. 0: [2022-11-26 16:48:34,225] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_29-model_00-model_states.pt... 0: [2022-11-26 16:48:34,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_29-model_00-model_states.pt. 0: [2022-11-26 16:48:34,335] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_30-model_00-model_states.pt... 0: [2022-11-26 16:48:34,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_30-model_00-model_states.pt. 0: [2022-11-26 16:48:34,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/layer_32-model_00-model_states.pt... 0: [2022-11-26 16:48:34,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 16:48:34,445] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step70000/mp_rank_00_model_states.pt 0: [2022-11-26 16:48:34,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/mp_rank_00_model_states.pt... 0: [2022-11-26 16:48:34,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/mp_rank_00_model_states.pt. 0: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
10: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
14: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
13: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
11: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
1: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
7: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
2: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
12: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 0: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
6: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 8: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 1: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 6: [2022-11-26 16:48:37,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step70000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 11: [2022-11-26 16:48:37,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:48:37,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 16:48:37,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 16:48:37,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 0: [2022-11-26 16:48:37,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:48:37,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:48:37,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 16:48:37,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 15: [2022-11-26 16:48:37,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:48:37,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:48:37,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 16:48:37,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 0: [2022-11-26 16:48:37,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 16:48:37,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 16:48:37,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 0: [2022-11-26 16:48:37,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:48:37,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 16:48:37,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 10: [2022-11-26 16:48:37,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:48:37,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:48:37,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 16:48:37,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 16:48:37,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 10: [2022-11-26 16:48:37,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 12: [2022-11-26 16:48:37,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 16:48:37,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 16:48:37,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:48:37,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 12: [2022-11-26 16:48:37,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 16:48:37,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 9: [2022-11-26 16:48:37,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:48:37,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 16:48:37,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 9: [2022-11-26 16:48:37,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:48:37,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 16:48:37,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 9: [2022-11-26 16:48:37,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 16:48:37,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 16:48:37,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 0: [2022-11-26 16:48:37,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:48:37,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 16:48:37,254] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 9: [2022-11-26 16:48:37,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:48:37,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 16:48:37,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 0: [2022-11-26 16:48:37,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:48:37,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 16:48:37,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 0: [2022-11-26 16:48:37,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 16:48:37,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 16:48:37,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 9: [2022-11-26 16:48:37,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:48:37,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 16:48:37,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 6: [2022-11-26 16:48:37,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:48:37,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:48:37,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 16:48:37,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 16:48:37,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 6: [2022-11-26 16:48:37,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 12: [2022-11-26 16:48:37,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 16:48:37,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 16:48:37,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 0: [2022-11-26 16:48:37,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:48:37,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 16:48:37,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 0: [2022-11-26 16:48:37,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 16:48:37,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 16:48:37,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 12: [2022-11-26 16:48:37,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:48:37,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 16:48:37,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 12: [2022-11-26 16:48:37,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 16:48:37,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:48:37,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 16:48:37,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 16:48:37,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 12: [2022-11-26 16:48:37,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 12: [2022-11-26 16:48:37,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 16:48:37,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 16:48:37,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 10: [2022-11-26 16:48:37,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:48:37,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 16:48:37,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:48:37,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
10: [2022-11-26 16:48:37,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:48:37,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 16:48:37,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 10: [2022-11-26 16:48:37,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 16:48:37,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 4: [2022-11-26 16:48:37,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:48:37,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 16:48:37,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 4: [2022-11-26 16:48:37,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:48:37,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 16:48:37,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 4: [2022-11-26 16:48:37,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 16:48:37,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 16:48:37,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 4: [2022-11-26 16:48:37,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:48:37,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 16:48:37,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 4: [2022-11-26 16:48:37,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:48:37,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 16:48:37,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 4: [2022-11-26 16:48:37,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:48:37,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 16:48:37,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 4: [2022-11-26 16:48:37,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 16:48:37,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 16:48:37,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 16:48:37,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 4: [2022-11-26 16:48:37,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 16:48:37,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 2: [2022-11-26 16:48:37,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:48:37,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 16:48:37,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 6: [2022-11-26 16:48:37,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:48:37,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 16:48:37,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 10: [2022-11-26 16:48:37,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 16:48:37,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 16:48:37,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 10: [2022-11-26 16:48:37,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 16:48:37,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 16:48:37,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 2: [2022-11-26 16:48:37,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:48:37,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 16:48:37,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 6: [2022-11-26 16:48:37,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:48:37,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:48:37,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 16:48:37,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 16:48:37,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 6: [2022-11-26 16:48:37,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 16:48:37,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 16:48:37,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 6: [2022-11-26 16:48:37,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 8: [2022-11-26 16:48:37,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:48:37,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 16:48:37,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 11: [2022-11-26 16:48:37,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:48:37,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-26 16:48:37,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 16:48:37,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 16:48:37,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 11: [2022-11-26 16:48:37,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 11: [2022-11-26 16:48:37,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:48:37,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 16:48:37,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 11: [2022-11-26 16:48:37,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:48:37,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 16:48:37,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 5: [2022-11-26 16:48:37,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:48:37,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 16:48:37,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:48:37,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:48:37,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:48:37,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:48:37,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 16:48:37,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:48:37,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 16:48:37,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 16:48:37,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 16:48:37,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
5: [2022-11-26 16:48:37,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 16:48:37,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 16:48:37,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 16:48:37,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 16:48:37,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 5: [2022-11-26 16:48:37,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 5: [2022-11-26 16:48:37,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 16:48:37,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 5: [2022-11-26 16:48:37,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 5: [2022-11-26 16:48:37,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 5: [2022-11-26 16:48:37,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 5: [2022-11-26 16:48:37,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 12: [2022-11-26 16:48:37,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 16:48:37,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 16:48:37,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 14: [2022-11-26 16:48:37,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:48:37,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 16:48:37,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 14: [2022-11-26 16:48:37,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:48:37,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 16:48:37,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 14: [2022-11-26 16:48:37,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:48:37,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 16:48:37,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 14: [2022-11-26 16:48:37,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 16:48:37,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 16:48:37,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 15: [2022-11-26 16:48:37,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 16:48:37,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 15: [2022-11-26 16:48:37,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:48:37,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 16:48:37,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 15: [2022-11-26 16:48:37,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:48:37,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 16:48:37,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 15: [2022-11-26 16:48:37,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-26 16:48:37,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 16:48:37,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 15: [2022-11-26 16:48:37,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:48:37,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 16:48:37,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 15: [2022-11-26 16:48:37,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:48:37,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 16:48:37,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 15: [2022-11-26 16:48:37,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 16:48:37,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 16:48:37,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 15: [2022-11-26 16:48:37,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 16:48:37,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 16:48:37,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 2: [2022-11-26 16:48:37,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:48:37,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 16:48:37,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 2: [2022-11-26 16:48:37,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:48:37,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 16:48:37,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 2: [2022-11-26 16:48:37,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:48:37,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 7: [2022-11-26 16:48:37,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:48:37,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 16:48:37,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:48:37,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:48:37,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 7: [2022-11-26 16:48:37,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 16:48:37,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 16:48:37,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 16:48:37,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 16:48:37,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 7: [2022-11-26 16:48:37,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 7: [2022-11-26 16:48:37,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 7: [2022-11-26 16:48:37,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 7: [2022-11-26 16:48:37,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 16:48:37,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 16:48:37,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 2: [2022-11-26 16:48:37,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:48:37,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 16:48:37,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 8: [2022-11-26 16:48:37,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 16:48:37,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 8: [2022-11-26 16:48:37,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:48:37,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:48:37,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 16:48:37,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 16:48:37,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
8: [2022-11-26 16:48:37,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 8: [2022-11-26 16:48:37,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:48:37,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 16:48:37,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 8: [2022-11-26 16:48:37,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:48:37,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 16:48:37,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 8: [2022-11-26 16:48:37,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:48:37,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 16:48:37,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 8: [2022-11-26 16:48:37,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 16:48:37,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 16:48:37,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 8: [2022-11-26 16:48:37,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 16:48:37,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 16:48:37,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 2: [2022-11-26 16:48:37,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:48:37,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 16:48:37,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 2: [2022-11-26 16:48:37,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 16:48:37,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 16:48:37,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 9: [2022-11-26 16:48:37,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 16:48:37,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 16:48:37,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 6: [2022-11-26 16:48:37,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 16:48:37,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 16:48:37,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 7: [2022-11-26 16:48:37,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:48:37,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 16:48:37,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 7: [2022-11-26 16:48:37,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 16:48:37,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 16:48:37,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 7: [2022-11-26 16:48:37,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 16:48:37,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 16:48:37,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 9: [2022-11-26 16:48:37,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 16:48:37,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 16:48:37,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 11: [2022-11-26 16:48:37,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:48:37,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 16:48:37,371] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 11: [2022-11-26 16:48:37,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 16:48:37,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 16:48:37,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 11: [2022-11-26 16:48:37,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 16:48:37,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 16:48:37,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 14: [2022-11-26 16:48:37,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:48:37,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 16:48:37,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 14: [2022-11-26 16:48:37,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:48:37,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 16:48:37,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 14: [2022-11-26 16:48:37,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 16:48:37,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 16:48:37,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 14: [2022-11-26 16:48:37,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 16:48:37,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 16:48:37,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 3: [2022-11-26 16:48:37,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:48:37,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:48:37,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:48:37,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:48:37,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:48:37,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:48:37,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 16:48:37,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 16:48:37,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 16:48:37,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 16:48:37,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 16:48:37,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 16:48:37,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 16:48:37,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 16:48:37,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 16:48:37,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 16:48:37,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 3: [2022-11-26 16:48:37,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 3: [2022-11-26 16:48:37,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 3: [2022-11-26 16:48:37,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
3: [2022-11-26 16:48:37,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 3: [2022-11-26 16:48:37,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 3: [2022-11-26 16:48:37,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 3: [2022-11-26 16:48:37,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 13: [2022-11-26 16:48:37,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:48:37,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:48:37,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:48:37,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:48:37,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:48:37,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:48:37,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 16:48:37,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 16:48:37,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 16:48:37,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 16:48:37,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 16:48:37,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 16:48:37,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 16:48:37,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 16:48:37,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 16:48:37,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 13: [2022-11-26 16:48:37,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 13: [2022-11-26 16:48:37,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 13: [2022-11-26 16:48:37,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
13: [2022-11-26 16:48:37,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 13: [2022-11-26 16:48:37,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 13: [2022-11-26 16:48:37,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 13: [2022-11-26 16:48:37,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 16:48:37,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 0: [2022-11-26 16:48:37,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 16:48:37,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 1: [2022-11-26 16:48:37,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:48:37,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:48:37,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:48:37,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:48:37,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 16:48:37,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:48:37,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:48:37,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 16:48:37,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 16:48:37,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 16:48:37,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 16:48:37,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 1: [2022-11-26 16:48:37,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 16:48:37,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 16:48:37,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 16:48:37,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
1: [2022-11-26 16:48:37,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 16:48:37,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step70000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 16:48:37,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 1: [2022-11-26 16:48:37,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 1: [2022-11-26 16:48:37,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 1: [2022-11-26 16:48:37,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 1: [2022-11-26 16:48:37,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 1: [2022-11-26 16:48:37,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step70000 is ready now! 
0: successfully saved checkpoint at iteration 70000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 6422.27 15: iteration 70010/ 125429 | consumed samples: 17922560 | consumed tokens: 36705402880 | elapsed time per iteration (s): 1.74 | learning rate: 9.489E-05 | global batch size: 256 | lm loss: 1.968786E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 147.152 | TFLOPs: 24.32 | 15: iteration 70020/ 125429 | consumed samples: 17925120 | consumed tokens: 36710645760 | elapsed time per iteration (s): 1.17 | learning rate: 9.487E-05 | global batch size: 256 | lm loss: 1.946914E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 218.014 | TFLOPs: 36.03 | 15: iteration 70030/ 125429 | consumed samples: 17927680 | consumed tokens: 36715888640 | elapsed time per iteration (s): 1.03 | learning rate: 9.484E-05 | global batch size: 256 | lm loss: 1.991701E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.638 | TFLOPs: 40.92 | 15: iteration 70040/ 125429 | consumed samples: 17930240 | consumed tokens: 36721131520 | elapsed time per iteration (s): 1.03 | learning rate: 9.482E-05 | global batch size: 256 | lm loss: 1.950553E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.786 | TFLOPs: 41.11 | 15: iteration 70050/ 125429 | consumed samples: 17932800 | consumed tokens: 36726374400 | elapsed time per iteration (s): 1.13 | learning rate: 9.480E-05 | global batch size: 256 | lm loss: 1.942437E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.723 | TFLOPs: 37.30 | 15: iteration 70060/ 125429 | consumed samples: 17935360 | consumed tokens: 36731617280 | elapsed time per iteration (s): 1.02 | 
learning rate: 9.478E-05 | global batch size: 256 | lm loss: 1.925093E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.264 | TFLOPs: 41.52 | 15: iteration 70070/ 125429 | consumed samples: 17937920 | consumed tokens: 36736860160 | elapsed time per iteration (s): 1.05 | learning rate: 9.475E-05 | global batch size: 256 | lm loss: 1.983955E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.751 | TFLOPs: 40.28 | 15: iteration 70080/ 125429 | consumed samples: 17940480 | consumed tokens: 36742103040 | elapsed time per iteration (s): 1.04 | learning rate: 9.473E-05 | global batch size: 256 | lm loss: 1.982500E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.020 | TFLOPs: 40.66 | 15: iteration 70090/ 125429 | consumed samples: 17943040 | consumed tokens: 36747345920 | elapsed time per iteration (s): 1.03 | learning rate: 9.471E-05 | global batch size: 256 | lm loss: 1.960722E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.898 | TFLOPs: 41.13 | 15: iteration 70100/ 125429 | consumed samples: 17945600 | consumed tokens: 36752588800 | elapsed time per iteration (s): 1.04 | learning rate: 9.469E-05 | global batch size: 256 | lm loss: 1.965911E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.056 | TFLOPs: 40.50 | 15: iteration 70110/ 125429 | consumed samples: 17948160 | consumed tokens: 36757831680 | elapsed time per iteration (s): 1.04 | learning rate: 9.466E-05 | global batch size: 256 | lm loss: 1.950396E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.131 | TFLOPs: 40.84 | 15: iteration 70120/ 
125429 | consumed samples: 17950720 | consumed tokens: 36763074560 | elapsed time per iteration (s): 1.04 | learning rate: 9.464E-05 | global batch size: 256 | lm loss: 1.954118E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.413 | TFLOPs: 40.72 | 15: iteration 70130/ 125429 | consumed samples: 17953280 | consumed tokens: 36768317440 | elapsed time per iteration (s): 1.03 | learning rate: 9.462E-05 | global batch size: 256 | lm loss: 1.994573E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.167 | TFLOPs: 41.18 | 15: iteration 70140/ 125429 | consumed samples: 17955840 | consumed tokens: 36773560320 | elapsed time per iteration (s): 1.04 | learning rate: 9.460E-05 | global batch size: 256 | lm loss: 1.964957E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.287 | TFLOPs: 40.87 | 15: iteration 70150/ 125429 | consumed samples: 17958400 | consumed tokens: 36778803200 | elapsed time per iteration (s): 1.02 | learning rate: 9.457E-05 | global batch size: 256 | lm loss: 1.980115E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.182 | TFLOPs: 41.34 | 15: iteration 70160/ 125429 | consumed samples: 17960960 | consumed tokens: 36784046080 | elapsed time per iteration (s): 1.04 | learning rate: 9.455E-05 | global batch size: 256 | lm loss: 1.975116E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.500 | TFLOPs: 40.74 | 15: iteration 70170/ 125429 | consumed samples: 17963520 | consumed tokens: 36789288960 | elapsed time per iteration (s): 1.05 | learning rate: 9.453E-05 | global batch size: 256 | lm loss: 1.977374E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 243.252 | TFLOPs: 40.20 | 15: iteration 70180/ 125429 | consumed samples: 17966080 | consumed tokens: 36794531840 | elapsed time per iteration (s): 1.16 | learning rate: 9.451E-05 | global batch size: 256 | lm loss: 1.978435E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.299 | TFLOPs: 36.57 | 15: iteration 70190/ 125429 | consumed samples: 17968640 | consumed tokens: 36799774720 | elapsed time per iteration (s): 1.22 | learning rate: 9.448E-05 | global batch size: 256 | lm loss: 1.973287E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 210.525 | TFLOPs: 34.79 | 15: iteration 70200/ 125429 | consumed samples: 17971200 | consumed tokens: 36805017600 | elapsed time per iteration (s): 1.02 | learning rate: 9.446E-05 | global batch size: 256 | lm loss: 1.966286E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.481 | TFLOPs: 41.39 | 15: iteration 70210/ 125429 | consumed samples: 17973760 | consumed tokens: 36810260480 | elapsed time per iteration (s): 1.05 | learning rate: 9.444E-05 | global batch size: 256 | lm loss: 1.922805E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.526 | TFLOPs: 40.41 | 15: iteration 70220/ 125429 | consumed samples: 17976320 | consumed tokens: 36815503360 | elapsed time per iteration (s): 1.03 | learning rate: 9.442E-05 | global batch size: 256 | lm loss: 1.956068E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.189 | TFLOPs: 41.18 | 15: iteration 70230/ 125429 | consumed samples: 17978880 | consumed tokens: 36820746240 | elapsed time per iteration (s): 1.04 | learning rate: 
9.439E-05 | global batch size: 256 | lm loss: 1.961995E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.594 | TFLOPs: 40.75 | 15: iteration 70240/ 125429 | consumed samples: 17981440 | consumed tokens: 36825989120 | elapsed time per iteration (s): 1.03 | learning rate: 9.437E-05 | global batch size: 256 | lm loss: 1.951341E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.280 | TFLOPs: 41.20 | 15: iteration 70250/ 125429 | consumed samples: 17984000 | consumed tokens: 36831232000 | elapsed time per iteration (s): 1.03 | learning rate: 9.435E-05 | global batch size: 256 | lm loss: 1.968768E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.318 | TFLOPs: 41.20 | 15: iteration 70260/ 125429 | consumed samples: 17986560 | consumed tokens: 36836474880 | elapsed time per iteration (s): 1.02 | learning rate: 9.433E-05 | global batch size: 256 | lm loss: 1.971165E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.280 | TFLOPs: 41.36 | 15: iteration 70270/ 125429 | consumed samples: 17989120 | consumed tokens: 36841717760 | elapsed time per iteration (s): 1.15 | learning rate: 9.431E-05 | global batch size: 256 | lm loss: 1.967654E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.521 | TFLOPs: 36.77 | 15: iteration 70280/ 125429 | consumed samples: 17991680 | consumed tokens: 36846960640 | elapsed time per iteration (s): 1.05 | learning rate: 9.428E-05 | global batch size: 256 | lm loss: 1.984256E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.363 | TFLOPs: 40.22 | 15: iteration 70290/ 125429 | 
consumed samples: 17994240 | consumed tokens: 36852203520 | elapsed time per iteration (s): 1.05 | learning rate: 9.426E-05 | global batch size: 256 | lm loss: 1.968320E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.911 | TFLOPs: 40.31 | 15: iteration 70300/ 125429 | consumed samples: 17996800 | consumed tokens: 36857446400 | elapsed time per iteration (s): 1.02 | learning rate: 9.424E-05 | global batch size: 256 | lm loss: 1.969005E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.247 | TFLOPs: 41.36 | 15: iteration 70310/ 125429 | consumed samples: 17999360 | consumed tokens: 36862689280 | elapsed time per iteration (s): 1.03 | learning rate: 9.422E-05 | global batch size: 256 | lm loss: 1.995834E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.736 | TFLOPs: 41.27 | 15: iteration 70320/ 125429 | consumed samples: 18001920 | consumed tokens: 36867932160 | elapsed time per iteration (s): 1.04 | learning rate: 9.419E-05 | global batch size: 256 | lm loss: 1.955644E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.269 | TFLOPs: 40.53 | 15: iteration 70330/ 125429 | consumed samples: 18004480 | consumed tokens: 36873175040 | elapsed time per iteration (s): 1.05 | learning rate: 9.417E-05 | global batch size: 256 | lm loss: 1.962350E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.419 | TFLOPs: 40.39 | 15: iteration 70340/ 125429 | consumed samples: 18007040 | consumed tokens: 36878417920 | elapsed time per iteration (s): 1.05 | learning rate: 9.415E-05 | global batch size: 256 | lm loss: 1.970083E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 242.816 | TFLOPs: 40.13 | 15: iteration 70350/ 125429 | consumed samples: 18009600 | consumed tokens: 36883660800 | elapsed time per iteration (s): 1.05 | learning rate: 9.413E-05 | global batch size: 256 | lm loss: 1.983626E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.109 | TFLOPs: 40.18 | 15: iteration 70360/ 125429 | consumed samples: 18012160 | consumed tokens: 36888903680 | elapsed time per iteration (s): 1.05 | learning rate: 9.410E-05 | global batch size: 256 | lm loss: 1.974775E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.322 | TFLOPs: 40.21 | 15: iteration 70370/ 125429 | consumed samples: 18014720 | consumed tokens: 36894146560 | elapsed time per iteration (s): 1.06 | learning rate: 9.408E-05 | global batch size: 256 | lm loss: 1.949173E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.447 | TFLOPs: 40.07 | 15: iteration 70380/ 125429 | consumed samples: 18017280 | consumed tokens: 36899389440 | elapsed time per iteration (s): 1.07 | learning rate: 9.406E-05 | global batch size: 256 | lm loss: 1.964370E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.792 | TFLOPs: 39.46 | 15: iteration 70390/ 125429 | consumed samples: 18019840 | consumed tokens: 36904632320 | elapsed time per iteration (s): 1.06 | learning rate: 9.404E-05 | global batch size: 256 | lm loss: 1.980212E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.442 | TFLOPs: 40.07 | 15: iteration 70400/ 125429 | consumed samples: 18022400 | consumed tokens: 36909875200 | elapsed time per iteration (s): 1.02 | learning rate: 9.401E-05 | global batch 
size: 256 | lm loss: 1.980667E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.058 | TFLOPs: 41.32 | 15: iteration 70410/ 125429 | consumed samples: 18024960 | consumed tokens: 36915118080 | elapsed time per iteration (s): 1.09 | learning rate: 9.399E-05 | global batch size: 256 | lm loss: 1.974711E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.089 | TFLOPs: 38.69 | 15: iteration 70420/ 125429 | consumed samples: 18027520 | consumed tokens: 36920360960 | elapsed time per iteration (s): 1.04 | learning rate: 9.397E-05 | global batch size: 256 | lm loss: 1.979573E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.509 | TFLOPs: 40.57 | 15: iteration 70430/ 125429 | consumed samples: 18030080 | consumed tokens: 36925603840 | elapsed time per iteration (s): 1.12 | learning rate: 9.395E-05 | global batch size: 256 | lm loss: 1.994698E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.234 | TFLOPs: 37.88 | 15: iteration 70440/ 125429 | consumed samples: 18032640 | consumed tokens: 36930846720 | elapsed time per iteration (s): 1.04 | learning rate: 9.392E-05 | global batch size: 256 | lm loss: 1.964603E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.394 | TFLOPs: 40.72 | 15: iteration 70450/ 125429 | consumed samples: 18035200 | consumed tokens: 36936089600 | elapsed time per iteration (s): 1.07 | learning rate: 9.390E-05 | global batch size: 256 | lm loss: 1.959592E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.074 | TFLOPs: 39.51 | 15: iteration 70460/ 125429 | consumed samples: 18037760 | 
consumed tokens: 36941332480 | elapsed time per iteration (s): 1.03 | learning rate: 9.388E-05 | global batch size: 256 | lm loss: 1.932856E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.661 | TFLOPs: 41.09 | 15: iteration 70470/ 125429 | consumed samples: 18040320 | consumed tokens: 36946575360 | elapsed time per iteration (s): 1.07 | learning rate: 9.386E-05 | global batch size: 256 | lm loss: 1.963031E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.218 | TFLOPs: 39.70 | 15: iteration 70480/ 125429 | consumed samples: 18042880 | consumed tokens: 36951818240 | elapsed time per iteration (s): 1.05 | learning rate: 9.383E-05 | global batch size: 256 | lm loss: 1.953209E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.732 | TFLOPs: 40.44 | 15: iteration 70490/ 125429 | consumed samples: 18045440 | consumed tokens: 36957061120 | elapsed time per iteration (s): 1.06 | learning rate: 9.381E-05 | global batch size: 256 | lm loss: 1.958918E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.267 | TFLOPs: 40.04 | 15: iteration 70500/ 125429 | consumed samples: 18048000 | consumed tokens: 36962304000 | elapsed time per iteration (s): 1.05 | learning rate: 9.379E-05 | global batch size: 256 | lm loss: 1.981341E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.940 | TFLOPs: 40.15 | 15: iteration 70510/ 125429 | consumed samples: 18050560 | consumed tokens: 36967546880 | elapsed time per iteration (s): 1.05 | learning rate: 9.377E-05 | global batch size: 256 | lm loss: 1.957716E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 244.857 | TFLOPs: 40.46 | 15: iteration 70520/ 125429 | consumed samples: 18053120 | consumed tokens: 36972789760 | elapsed time per iteration (s): 1.03 | learning rate: 9.374E-05 | global batch size: 256 | lm loss: 1.953738E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.440 | TFLOPs: 41.06 | 15: iteration 70530/ 125429 | consumed samples: 18055680 | consumed tokens: 36978032640 | elapsed time per iteration (s): 1.03 | learning rate: 9.372E-05 | global batch size: 256 | lm loss: 1.973823E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.000 | TFLOPs: 41.15 | 15: iteration 70540/ 125429 | consumed samples: 18058240 | consumed tokens: 36983275520 | elapsed time per iteration (s): 1.03 | learning rate: 9.370E-05 | global batch size: 256 | lm loss: 1.956478E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.510 | TFLOPs: 41.23 | 15: iteration 70550/ 125429 | consumed samples: 18060800 | consumed tokens: 36988518400 | elapsed time per iteration (s): 1.11 | learning rate: 9.368E-05 | global batch size: 256 | lm loss: 1.953517E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.658 | TFLOPs: 38.28 | 15: iteration 70560/ 125429 | consumed samples: 18063360 | consumed tokens: 36993761280 | elapsed time per iteration (s): 1.06 | learning rate: 9.366E-05 | global batch size: 256 | lm loss: 1.952192E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.149 | TFLOPs: 40.02 | 15: iteration 70570/ 125429 | consumed samples: 18065920 | consumed tokens: 36999004160 | elapsed time per iteration (s): 1.03 | learning rate: 9.363E-05 | global batch size: 256 | lm loss: 
1.956744E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.889 | TFLOPs: 41.13 | 15: iteration 70580/ 125429 | consumed samples: 18068480 | consumed tokens: 37004247040 | elapsed time per iteration (s): 1.03 | learning rate: 9.361E-05 | global batch size: 256 | lm loss: 1.981122E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.339 | TFLOPs: 41.04 | 15: iteration 70590/ 125429 | consumed samples: 18071040 | consumed tokens: 37009489920 | elapsed time per iteration (s): 1.06 | learning rate: 9.359E-05 | global batch size: 256 | lm loss: 1.947431E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.563 | TFLOPs: 39.92 | 15: iteration 70600/ 125429 | consumed samples: 18073600 | consumed tokens: 37014732800 | elapsed time per iteration (s): 1.03 | learning rate: 9.357E-05 | global batch size: 256 | lm loss: 1.981682E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.685 | TFLOPs: 41.26 | 15: iteration 70610/ 125429 | consumed samples: 18076160 | consumed tokens: 37019975680 | elapsed time per iteration (s): 1.11 | learning rate: 9.354E-05 | global batch size: 256 | lm loss: 1.973049E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.732 | TFLOPs: 37.97 | 15: iteration 70620/ 125429 | consumed samples: 18078720 | consumed tokens: 37025218560 | elapsed time per iteration (s): 1.07 | learning rate: 9.352E-05 | global batch size: 256 | lm loss: 1.963600E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.557 | TFLOPs: 39.42 | 15: iteration 70630/ 125429 | consumed samples: 18081280 | consumed tokens: 
37030461440 | elapsed time per iteration (s): 1.04 | learning rate: 9.350E-05 | global batch size: 256 | lm loss: 1.974986E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.234 | TFLOPs: 40.69 | 15: iteration 70640/ 125429 | consumed samples: 18083840 | consumed tokens: 37035704320 | elapsed time per iteration (s): 1.04 | learning rate: 9.348E-05 | global batch size: 256 | lm loss: 1.981590E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.086 | TFLOPs: 40.67 | 15: iteration 70650/ 125429 | consumed samples: 18086400 | consumed tokens: 37040947200 | elapsed time per iteration (s): 1.02 | learning rate: 9.345E-05 | global batch size: 256 | lm loss: 1.974638E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.458 | TFLOPs: 41.39 | 15: iteration 70660/ 125429 | consumed samples: 18088960 | consumed tokens: 37046190080 | elapsed time per iteration (s): 1.04 | learning rate: 9.343E-05 | global batch size: 256 | lm loss: 1.989309E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.602 | TFLOPs: 40.75 | 15: iteration 70670/ 125429 | consumed samples: 18091520 | consumed tokens: 37051432960 | elapsed time per iteration (s): 1.04 | learning rate: 9.341E-05 | global batch size: 256 | lm loss: 1.968202E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.181 | TFLOPs: 40.52 | 15: iteration 70680/ 125429 | consumed samples: 18094080 | consumed tokens: 37056675840 | elapsed time per iteration (s): 1.04 | learning rate: 9.339E-05 | global batch size: 256 | lm loss: 1.985307E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 246.153 | TFLOPs: 40.68 | 15: iteration 70690/ 125429 | consumed samples: 18096640 | consumed tokens: 37061918720 | elapsed time per iteration (s): 1.07 | learning rate: 9.336E-05 | global batch size: 256 | lm loss: 1.954297E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.401 | TFLOPs: 39.56 | 15: iteration 70700/ 125429 | consumed samples: 18099200 | consumed tokens: 37067161600 | elapsed time per iteration (s): 1.07 | learning rate: 9.334E-05 | global batch size: 256 | lm loss: 1.993049E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.120 | TFLOPs: 39.52 | 15: iteration 70710/ 125429 | consumed samples: 18101760 | consumed tokens: 37072404480 | elapsed time per iteration (s): 1.09 | learning rate: 9.332E-05 | global batch size: 256 | lm loss: 1.964734E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.545 | TFLOPs: 38.76 | 15: iteration 70720/ 125429 | consumed samples: 18104320 | consumed tokens: 37077647360 | elapsed time per iteration (s): 1.04 | learning rate: 9.330E-05 | global batch size: 256 | lm loss: 1.956668E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.705 | TFLOPs: 40.77 | 15: iteration 70730/ 125429 | consumed samples: 18106880 | consumed tokens: 37082890240 | elapsed time per iteration (s): 1.04 | learning rate: 9.327E-05 | global batch size: 256 | lm loss: 1.963256E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.200 | TFLOPs: 40.52 | 15: iteration 70740/ 125429 | consumed samples: 18109440 | consumed tokens: 37088133120 | elapsed time per iteration (s): 1.06 | learning rate: 9.325E-05 | global batch size: 256 | lm loss: 1.954530E+00 | grad 
norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.752 | TFLOPs: 39.95 | 15: iteration 70750/ 125429 | consumed samples: 18112000 | consumed tokens: 37093376000 | elapsed time per iteration (s): 1.08 | learning rate: 9.323E-05 | global batch size: 256 | lm loss: 1.935378E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.231 | TFLOPs: 39.20 | 15: iteration 70760/ 125429 | consumed samples: 18114560 | consumed tokens: 37098618880 | elapsed time per iteration (s): 1.04 | learning rate: 9.321E-05 | global batch size: 256 | lm loss: 1.948443E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.181 | TFLOPs: 40.52 | 15: iteration 70770/ 125429 | consumed samples: 18117120 | consumed tokens: 37103861760 | elapsed time per iteration (s): 1.07 | learning rate: 9.319E-05 | global batch size: 256 | lm loss: 1.955162E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.166 | TFLOPs: 39.52 | 15: iteration 70780/ 125429 | consumed samples: 18119680 | consumed tokens: 37109104640 | elapsed time per iteration (s): 1.03 | learning rate: 9.316E-05 | global batch size: 256 | lm loss: 1.974371E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.703 | TFLOPs: 40.93 | 15: iteration 70790/ 125429 | consumed samples: 18122240 | consumed tokens: 37114347520 | elapsed time per iteration (s): 1.05 | learning rate: 9.314E-05 | global batch size: 256 | lm loss: 1.963932E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.868 | TFLOPs: 40.47 | 15: iteration 70800/ 125429 | consumed samples: 18124800 | consumed tokens: 37119590400 | elapsed time 
per iteration (s): 1.05 | learning rate: 9.312E-05 | global batch size: 256 | lm loss: 1.958846E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.736 | TFLOPs: 40.11 | 15: iteration 70810/ 125429 | consumed samples: 18127360 | consumed tokens: 37124833280 | elapsed time per iteration (s): 1.04 | learning rate: 9.310E-05 | global batch size: 256 | lm loss: 1.980258E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.259 | TFLOPs: 40.86 | 15: iteration 70820/ 125429 | consumed samples: 18129920 | consumed tokens: 37130076160 | elapsed time per iteration (s): 1.04 | learning rate: 9.307E-05 | global batch size: 256 | lm loss: 2.000452E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.737 | TFLOPs: 40.61 | 15: iteration 70830/ 125429 | consumed samples: 18132480 | consumed tokens: 37135319040 | elapsed time per iteration (s): 1.04 | learning rate: 9.305E-05 | global batch size: 256 | lm loss: 1.985164E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.054 | TFLOPs: 40.66 | 15: iteration 70840/ 125429 | consumed samples: 18135040 | consumed tokens: 37140561920 | elapsed time per iteration (s): 1.06 | learning rate: 9.303E-05 | global batch size: 256 | lm loss: 1.957646E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.498 | TFLOPs: 39.91 | 15: iteration 70850/ 125429 | consumed samples: 18137600 | consumed tokens: 37145804800 | elapsed time per iteration (s): 1.04 | learning rate: 9.301E-05 | global batch size: 256 | lm loss: 1.969676E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.755 | TFLOPs: 
40.61 | 15: iteration 70860/ 125429 | consumed samples: 18140160 | consumed tokens: 37151047680 | elapsed time per iteration (s): 1.02 | learning rate: 9.298E-05 | global batch size: 256 | lm loss: 1.971861E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.041 | TFLOPs: 41.32 | 15: iteration 70870/ 125429 | consumed samples: 18142720 | consumed tokens: 37156290560 | elapsed time per iteration (s): 1.04 | learning rate: 9.296E-05 | global batch size: 256 | lm loss: 1.987286E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.665 | TFLOPs: 40.76 | 15: iteration 70880/ 125429 | consumed samples: 18145280 | consumed tokens: 37161533440 | elapsed time per iteration (s): 1.10 | learning rate: 9.294E-05 | global batch size: 256 | lm loss: 1.939825E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.491 | TFLOPs: 38.59 | 15: iteration 70890/ 125429 | consumed samples: 18147840 | consumed tokens: 37166776320 | elapsed time per iteration (s): 1.07 | learning rate: 9.292E-05 | global batch size: 256 | lm loss: 2.001227E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.535 | TFLOPs: 39.42 | 15: iteration 70900/ 125429 | consumed samples: 18150400 | consumed tokens: 37172019200 | elapsed time per iteration (s): 1.05 | learning rate: 9.289E-05 | global batch size: 256 | lm loss: 1.943764E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.308 | TFLOPs: 40.21 | 15: iteration 70910/ 125429 | consumed samples: 18152960 | consumed tokens: 37177262080 | elapsed time per iteration (s): 1.15 | learning rate: 9.287E-05 | global batch size: 256 | lm loss: 1.952081E+00 | grad norm: 0.132 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.764 | TFLOPs: 36.65 | 15: iteration 70920/ 125429 | consumed samples: 18155520 | consumed tokens: 37182504960 | elapsed time per iteration (s): 1.08 | learning rate: 9.285E-05 | global batch size: 256 | lm loss: 1.977546E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.873 | TFLOPs: 39.15 | 15: iteration 70930/ 125429 | consumed samples: 18158080 | consumed tokens: 37187747840 | elapsed time per iteration (s): 1.17 | learning rate: 9.283E-05 | global batch size: 256 | lm loss: 1.950698E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 218.465 | TFLOPs: 36.10 | 15: iteration 70940/ 125429 | consumed samples: 18160640 | consumed tokens: 37192990720 | elapsed time per iteration (s): 1.07 | learning rate: 9.281E-05 | global batch size: 256 | lm loss: 1.969331E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.891 | TFLOPs: 39.64 | 15: iteration 70950/ 125429 | consumed samples: 18163200 | consumed tokens: 37198233600 | elapsed time per iteration (s): 1.03 | learning rate: 9.278E-05 | global batch size: 256 | lm loss: 1.981845E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.722 | TFLOPs: 41.10 | 15: iteration 70960/ 125429 | consumed samples: 18165760 | consumed tokens: 37203476480 | elapsed time per iteration (s): 1.04 | learning rate: 9.276E-05 | global batch size: 256 | lm loss: 1.973463E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.067 | TFLOPs: 40.66 | 15: iteration 70970/ 125429 | consumed samples: 18168320 | consumed tokens: 37208719360 | elapsed time per iteration (s): 1.04 | 
learning rate: 9.274E-05 | global batch size: 256 | lm loss: 1.986730E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.810 | TFLOPs: 40.62 | 15: iteration 70980/ 125429 | consumed samples: 18170880 | consumed tokens: 37213962240 | elapsed time per iteration (s): 1.03 | learning rate: 9.272E-05 | global batch size: 256 | lm loss: 1.968526E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.752 | TFLOPs: 41.11 | 15: iteration 70990/ 125429 | consumed samples: 18173440 | consumed tokens: 37219205120 | elapsed time per iteration (s): 1.07 | learning rate: 9.269E-05 | global batch size: 256 | lm loss: 1.939260E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.628 | TFLOPs: 39.60 | 15: iteration 71000/ 125429 | consumed samples: 18176000 | consumed tokens: 37224448000 | elapsed time per iteration (s): 1.05 | learning rate: 9.267E-05 | global batch size: 256 | lm loss: 1.977943E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.309 | TFLOPs: 40.37 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 71000 | lm loss value: 1.923178E+00 | lm loss PPL: 6.842672E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 71000 to checkpoints_1b5 0: [2022-11-26 17:06:13,950] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step71000 is begin to save! 0: [2022-11-26 17:06:13,959] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 17:06:14,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_01-model_00-model_states.pt. 0: [2022-11-26 17:06:14,212] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_03-model_00-model_states.pt... 0: [2022-11-26 17:06:14,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_03-model_00-model_states.pt. 0: [2022-11-26 17:06:14,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_04-model_00-model_states.pt... 0: [2022-11-26 17:06:14,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_04-model_00-model_states.pt. 0: [2022-11-26 17:06:14,432] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_05-model_00-model_states.pt... 0: [2022-11-26 17:06:14,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_05-model_00-model_states.pt. 0: [2022-11-26 17:06:14,539] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_06-model_00-model_states.pt... 0: [2022-11-26 17:06:14,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_06-model_00-model_states.pt. 0: [2022-11-26 17:06:14,647] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_07-model_00-model_states.pt... 0: [2022-11-26 17:06:14,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_07-model_00-model_states.pt. 0: [2022-11-26 17:06:14,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 17:06:14,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_08-model_00-model_states.pt. 0: [2022-11-26 17:06:14,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_09-model_00-model_states.pt... 0: [2022-11-26 17:06:14,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_09-model_00-model_states.pt. 0: [2022-11-26 17:06:14,968] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_10-model_00-model_states.pt... 0: [2022-11-26 17:06:15,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_10-model_00-model_states.pt. 0: [2022-11-26 17:06:15,079] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_11-model_00-model_states.pt... 0: [2022-11-26 17:06:15,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_11-model_00-model_states.pt. 0: [2022-11-26 17:06:15,185] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_12-model_00-model_states.pt... 0: [2022-11-26 17:06:15,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_12-model_00-model_states.pt. 0: [2022-11-26 17:06:15,296] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_13-model_00-model_states.pt... 0: [2022-11-26 17:06:15,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_13-model_00-model_states.pt. 0: [2022-11-26 17:06:15,401] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 17:06:15,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_14-model_00-model_states.pt. 0: [2022-11-26 17:06:15,507] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_15-model_00-model_states.pt... 0: [2022-11-26 17:06:15,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_15-model_00-model_states.pt. 0: [2022-11-26 17:06:15,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_16-model_00-model_states.pt... 0: [2022-11-26 17:06:15,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_16-model_00-model_states.pt. 0: [2022-11-26 17:06:15,722] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_17-model_00-model_states.pt... 0: [2022-11-26 17:06:15,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_17-model_00-model_states.pt. 0: [2022-11-26 17:06:15,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_18-model_00-model_states.pt... 0: [2022-11-26 17:06:15,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_18-model_00-model_states.pt. 0: [2022-11-26 17:06:15,928] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_19-model_00-model_states.pt... 0: [2022-11-26 17:06:16,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_19-model_00-model_states.pt. 0: [2022-11-26 17:06:16,033] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 17:06:16,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_20-model_00-model_states.pt. 0: [2022-11-26 17:06:16,138] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_21-model_00-model_states.pt... 0: [2022-11-26 17:06:16,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_21-model_00-model_states.pt. 0: [2022-11-26 17:06:16,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_22-model_00-model_states.pt... 0: [2022-11-26 17:06:16,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_22-model_00-model_states.pt. 0: [2022-11-26 17:06:16,341] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_23-model_00-model_states.pt... 0: [2022-11-26 17:06:16,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_23-model_00-model_states.pt. 0: [2022-11-26 17:06:16,453] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_24-model_00-model_states.pt... 0: [2022-11-26 17:06:16,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_24-model_00-model_states.pt. 0: [2022-11-26 17:06:16,557] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_25-model_00-model_states.pt... 0: [2022-11-26 17:06:16,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_25-model_00-model_states.pt. 0: [2022-11-26 17:06:16,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 17:06:16,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_26-model_00-model_states.pt. 0: [2022-11-26 17:06:16,762] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_27-model_00-model_states.pt... 0: [2022-11-26 17:06:16,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_27-model_00-model_states.pt. 0: [2022-11-26 17:06:16,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_28-model_00-model_states.pt... 0: [2022-11-26 17:06:16,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_28-model_00-model_states.pt. 0: [2022-11-26 17:06:16,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_29-model_00-model_states.pt... 0: [2022-11-26 17:06:17,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_29-model_00-model_states.pt. 0: [2022-11-26 17:06:17,078] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_30-model_00-model_states.pt... 0: [2022-11-26 17:06:17,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_30-model_00-model_states.pt. 0: [2022-11-26 17:06:17,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/layer_32-model_00-model_states.pt... 0: [2022-11-26 17:06:17,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 17:06:17,188] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step71000/mp_rank_00_model_states.pt 0: [2022-11-26 17:06:17,188] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/mp_rank_00_model_states.pt... 0: [2022-11-26 17:06:17,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/mp_rank_00_model_states.pt. 0: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
10: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
8: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 
5: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
6: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
2: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
12: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
10: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
3: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:06:17,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step71000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:06:17,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:06:17,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
0: [2022-11-26 17:06:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:06:17,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 17:06:17,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 0: [2022-11-26 17:06:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:06:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:06:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:06:17,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 17:06:17,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 0: [2022-11-26 17:06:17,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 17:06:17,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 17:06:17,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 0: [2022-11-26 17:06:17,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
15: [2022-11-26 17:06:17,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:06:17,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 17:06:17,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 15: [2022-11-26 17:06:17,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:06:17,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 17:06:17,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 5: [2022-11-26 17:06:17,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:06:17,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 17:06:17,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 5: [2022-11-26 17:06:17,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:06:17,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 17:06:17,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
6: [2022-11-26 17:06:17,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:06:17,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 17:06:17,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 10: [2022-11-26 17:06:17,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:06:17,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 17:06:17,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 4: [2022-11-26 17:06:17,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:06:17,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 17:06:17,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 10: [2022-11-26 17:06:17,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:06:17,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 17:06:17,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
10: [2022-11-26 17:06:17,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:06:17,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 17:06:17,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 5: [2022-11-26 17:06:17,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:06:17,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 17:06:17,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 15: [2022-11-26 17:06:17,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:06:17,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 17:06:17,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 0: [2022-11-26 17:06:17,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:06:17,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 17:06:17,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
2: [2022-11-26 17:06:17,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:06:17,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 17:06:17,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 2: [2022-11-26 17:06:17,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:06:17,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:06:17,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 17:06:17,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 17:06:17,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 2: [2022-11-26 17:06:17,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 2: [2022-11-26 17:06:17,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:06:17,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 17:06:17,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
2: [2022-11-26 17:06:17,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:06:17,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 17:06:17,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 15: [2022-11-26 17:06:17,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:06:17,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 17:06:17,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 15: [2022-11-26 17:06:17,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:06:17,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 17:06:17,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 5: [2022-11-26 17:06:17,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:06:17,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 17:06:17,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
15: [2022-11-26 17:06:17,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:06:17,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 17:06:17,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 5: [2022-11-26 17:06:17,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:06:17,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 17:06:17,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 15: [2022-11-26 17:06:17,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:06:17,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:06:17,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 17:06:17,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 6: [2022-11-26 17:06:17,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:06:17,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 17:06:17,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:06:17,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 17:06:17,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 15: [2022-11-26 17:06:17,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 6: [2022-11-26 17:06:17,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 6: [2022-11-26 17:06:17,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 17:06:17,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 15: [2022-11-26 17:06:17,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 6: [2022-11-26 17:06:17,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 15: [2022-11-26 17:06:17,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:06:17,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 17:06:17,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
6: [2022-11-26 17:06:17,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:06:17,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 17:06:17,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 6: [2022-11-26 17:06:17,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:06:17,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 17:06:17,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 6: [2022-11-26 17:06:17,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:06:17,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 17:06:17,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 6: [2022-11-26 17:06:17,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:06:17,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 17:06:17,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
7: [2022-11-26 17:06:17,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:06:17,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:06:17,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:06:17,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 17:06:17,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 17:06:17,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 17:06:17,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 7: [2022-11-26 17:06:17,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 7: [2022-11-26 17:06:17,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 7: [2022-11-26 17:06:17,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:06:17,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 17:06:17,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
11: [2022-11-26 17:06:17,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 17:06:17,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 11: [2022-11-26 17:06:17,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:06:17,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 17:06:17,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 11: [2022-11-26 17:06:17,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:06:17,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 17:06:17,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 11: [2022-11-26 17:06:17,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:06:17,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 17:06:17,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 11: [2022-11-26 17:06:17,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 17:06:17,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 17:06:17,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 11: [2022-11-26 17:06:17,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:06:17,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 17:06:17,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 11: [2022-11-26 17:06:17,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:06:17,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 17:06:17,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 11: [2022-11-26 17:06:17,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:06:17,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 17:06:17,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 7: [2022-11-26 17:06:17,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 17:06:17,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 17:06:17,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 7: [2022-11-26 17:06:17,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:06:17,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:06:17,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:06:17,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 17:06:17,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 17:06:17,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 7: [2022-11-26 17:06:17,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 7: [2022-11-26 17:06:17,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 17:06:17,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 10: [2022-11-26 17:06:17,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 17:06:17,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 17:06:17,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 10: [2022-11-26 17:06:17,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:06:17,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 17:06:17,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 1: [2022-11-26 17:06:17,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:06:17,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:06:17,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 17:06:17,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 4: [2022-11-26 17:06:17,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:06:17,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 17:06:17,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
4: [2022-11-26 17:06:17,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:06:17,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 17:06:17,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 4: [2022-11-26 17:06:17,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:06:17,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 17:06:17,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 4: [2022-11-26 17:06:17,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:06:17,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 17:06:17,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 4: [2022-11-26 17:06:17,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:06:17,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 17:06:17,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
4: [2022-11-26 17:06:17,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:06:17,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 17:06:17,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 10: [2022-11-26 17:06:17,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:06:17,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 17:06:17,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 10: [2022-11-26 17:06:17,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:06:17,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 17:06:17,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 10: [2022-11-26 17:06:17,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:06:17,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 17:06:17,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
1: [2022-11-26 17:06:17,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 17:06:17,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 1: [2022-11-26 17:06:17,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:06:17,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 17:06:17,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 1: [2022-11-26 17:06:17,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:06:17,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 17:06:17,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 13: [2022-11-26 17:06:17,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:06:17,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 17:06:17,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 13: [2022-11-26 17:06:17,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 17:06:17,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 17:06:17,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 13: [2022-11-26 17:06:17,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:06:17,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 17:06:17,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 13: [2022-11-26 17:06:17,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:06:17,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 17:06:17,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 13: [2022-11-26 17:06:17,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:06:17,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 17:06:17,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 13: [2022-11-26 17:06:17,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 17:06:17,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 17:06:17,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 13: [2022-11-26 17:06:17,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:06:17,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 17:06:17,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 13: [2022-11-26 17:06:17,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:06:17,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 17:06:17,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 0: [2022-11-26 17:06:17,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:06:17,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 17:06:17,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 17:06:17,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 17:06:17,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 0: [2022-11-26 17:06:17,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 5: [2022-11-26 17:06:17,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:06:17,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 17:06:17,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 5: [2022-11-26 17:06:17,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:06:17,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 17:06:17,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 1: [2022-11-26 17:06:17,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:06:17,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 17:06:17,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 17:06:17,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 1: [2022-11-26 17:06:17,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:06:17,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:06:17,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:06:17,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 17:06:17,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 17:06:17,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 17:06:17,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 17:06:17,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:06:17,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 1: [2022-11-26 17:06:17,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
1: [2022-11-26 17:06:17,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 1: [2022-11-26 17:06:17,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 1: [2022-11-26 17:06:17,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 17:06:17,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 2: [2022-11-26 17:06:17,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:06:17,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 17:06:17,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 2: [2022-11-26 17:06:17,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:06:17,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 17:06:17,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 14: [2022-11-26 17:06:17,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:06:17,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 17:06:17,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:06:17,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:06:17,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:06:17,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:06:17,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:06:17,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 17:06:17,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 17:06:17,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 17:06:17,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 17:06:17,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 17:06:17,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 17:06:17,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 17:06:17,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 14: [2022-11-26 17:06:17,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 14: [2022-11-26 17:06:17,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 17:06:17,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 17:06:17,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 14: [2022-11-26 17:06:17,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
14: [2022-11-26 17:06:17,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 14: [2022-11-26 17:06:17,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 14: [2022-11-26 17:06:17,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 14: [2022-11-26 17:06:17,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 0: [2022-11-26 17:06:17,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 17:06:17,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 8: [2022-11-26 17:06:17,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:06:17,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 17:06:17,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:06:17,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 8: [2022-11-26 17:06:17,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 17:06:17,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 8: [2022-11-26 17:06:17,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 17:06:17,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 17:06:17,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 8: [2022-11-26 17:06:17,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:06:17,663] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 17:06:17,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 8: [2022-11-26 17:06:17,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:06:17,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 17:06:17,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 8: [2022-11-26 17:06:17,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:06:17,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 17:06:17,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 8: [2022-11-26 17:06:17,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 17:06:17,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 17:06:17,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 8: [2022-11-26 17:06:17,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:06:17,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 17:06:17,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 9: [2022-11-26 17:06:17,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:06:17,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:06:17,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:06:17,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:06:17,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 17:06:17,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 17:06:17,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 17:06:17,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 17:06:17,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 17:06:17,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 9: [2022-11-26 17:06:17,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 9: [2022-11-26 17:06:17,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 9: [2022-11-26 17:06:17,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 17:06:17,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 9: [2022-11-26 17:06:17,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 9: [2022-11-26 17:06:17,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:06:17,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:06:17,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 17:06:17,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 17:06:17,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 17:06:17,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 17:06:17,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 9: [2022-11-26 17:06:17,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 9: [2022-11-26 17:06:17,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 3: [2022-11-26 17:06:17,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:06:17,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:06:17,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:06:17,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:06:17,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 17:06:17,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 17:06:17,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 17:06:17,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 17:06:17,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 17:06:17,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 17:06:17,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 3: [2022-11-26 17:06:17,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 3: [2022-11-26 17:06:17,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 3: [2022-11-26 17:06:17,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 3: [2022-11-26 17:06:17,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 3: [2022-11-26 17:06:17,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:06:17,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 17:06:17,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 17:06:17,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 17:06:17,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 3: [2022-11-26 17:06:17,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 3: [2022-11-26 17:06:17,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:06:17,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 17:06:17,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 12: [2022-11-26 17:06:17,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:06:17,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:06:17,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:06:17,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:06:17,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 17:06:17,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:06:17,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:06:17,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:06:17,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 17:06:17,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 17:06:17,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 17:06:17,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 17:06:17,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 17:06:17,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 17:06:17,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 17:06:17,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step71000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: 
[2022-11-26 17:06:17,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 12: [2022-11-26 17:06:17,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 12: [2022-11-26 17:06:17,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 12: [2022-11-26 17:06:17,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 12: [2022-11-26 17:06:17,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 12: [2022-11-26 17:06:17,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 12: [2022-11-26 17:06:17,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 12: [2022-11-26 17:06:17,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step71000 is ready now! 
0: successfully saved checkpoint at iteration 71000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3841.32 15: iteration 71010/ 125429 | consumed samples: 18178560 | consumed tokens: 37229690880 | elapsed time per iteration (s): 1.44 | learning rate: 9.265E-05 | global batch size: 256 | lm loss: 1.946593E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.322 | TFLOPs: 29.30 | 15: iteration 71020/ 125429 | consumed samples: 18181120 | consumed tokens: 37234933760 | elapsed time per iteration (s): 1.05 | learning rate: 9.263E-05 | global batch size: 256 | lm loss: 1.964473E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.100 | TFLOPs: 40.17 | 15: iteration 71030/ 125429 | consumed samples: 18183680 | consumed tokens: 37240176640 | elapsed time per iteration (s): 1.02 | learning rate: 9.260E-05 | global batch size: 256 | lm loss: 1.970742E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.635 | TFLOPs: 41.42 | 15: iteration 71040/ 125429 | consumed samples: 18186240 | consumed tokens: 37245419520 | elapsed time per iteration (s): 1.05 | learning rate: 9.258E-05 | global batch size: 256 | lm loss: 1.984081E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.412 | TFLOPs: 40.23 | 15: iteration 71050/ 125429 | consumed samples: 18188800 | consumed tokens: 37250662400 | elapsed time per iteration (s): 1.07 | learning rate: 9.256E-05 | global batch size: 256 | lm loss: 1.955503E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.548 | TFLOPs: 39.59 | 15: iteration 71060/ 125429 | consumed samples: 18191360 | consumed tokens: 37255905280 | elapsed time per iteration (s): 1.05 | 
learning rate: 9.254E-05 | global batch size: 256 | lm loss: 1.935204E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.610 | TFLOPs: 40.42 | 15: iteration 71070/ 125429 | consumed samples: 18193920 | consumed tokens: 37261148160 | elapsed time per iteration (s): 1.08 | learning rate: 9.251E-05 | global batch size: 256 | lm loss: 1.974694E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.700 | TFLOPs: 39.12 | 15: iteration 71080/ 125429 | consumed samples: 18196480 | consumed tokens: 37266391040 | elapsed time per iteration (s): 1.04 | learning rate: 9.249E-05 | global batch size: 256 | lm loss: 1.974226E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.793 | TFLOPs: 40.62 | 15: iteration 71090/ 125429 | consumed samples: 18199040 | consumed tokens: 37271633920 | elapsed time per iteration (s): 1.07 | learning rate: 9.247E-05 | global batch size: 256 | lm loss: 1.963564E+00 | grad norm: 0.197 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.273 | TFLOPs: 39.54 | 15: iteration 71100/ 125429 | consumed samples: 18201600 | consumed tokens: 37276876800 | elapsed time per iteration (s): 1.04 | learning rate: 9.245E-05 | global batch size: 256 | lm loss: 1.968999E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.130 | TFLOPs: 40.51 | 15: iteration 71110/ 125429 | consumed samples: 18204160 | consumed tokens: 37282119680 | elapsed time per iteration (s): 1.03 | learning rate: 9.243E-05 | global batch size: 256 | lm loss: 1.973536E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.692 | TFLOPs: 40.93 | 15: iteration 71120/ 
125429 | consumed samples: 18206720 | consumed tokens: 37287362560 | elapsed time per iteration (s): 1.03 | learning rate: 9.240E-05 | global batch size: 256 | lm loss: 1.962182E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.362 | TFLOPs: 40.88 | 15: iteration 71130/ 125429 | consumed samples: 18209280 | consumed tokens: 37292605440 | elapsed time per iteration (s): 1.05 | learning rate: 9.238E-05 | global batch size: 256 | lm loss: 1.953217E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.991 | TFLOPs: 40.32 | 15: iteration 71140/ 125429 | consumed samples: 18211840 | consumed tokens: 37297848320 | elapsed time per iteration (s): 1.03 | learning rate: 9.236E-05 | global batch size: 256 | lm loss: 1.953235E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.714 | TFLOPs: 41.27 | 15: iteration 71150/ 125429 | consumed samples: 18214400 | consumed tokens: 37303091200 | elapsed time per iteration (s): 1.03 | learning rate: 9.234E-05 | global batch size: 256 | lm loss: 1.967145E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.438 | TFLOPs: 41.22 | 15: iteration 71160/ 125429 | consumed samples: 18216960 | consumed tokens: 37308334080 | elapsed time per iteration (s): 1.04 | learning rate: 9.231E-05 | global batch size: 256 | lm loss: 1.976188E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.083 | TFLOPs: 40.67 | 15: iteration 71170/ 125429 | consumed samples: 18219520 | consumed tokens: 37313576960 | elapsed time per iteration (s): 1.03 | learning rate: 9.229E-05 | global batch size: 256 | lm loss: 1.950392E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 249.026 | TFLOPs: 41.15 | 15: iteration 71180/ 125429 | consumed samples: 18222080 | consumed tokens: 37318819840 | elapsed time per iteration (s): 1.07 | learning rate: 9.227E-05 | global batch size: 256 | lm loss: 2.003196E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.767 | TFLOPs: 39.62 | 15: iteration 71190/ 125429 | consumed samples: 18224640 | consumed tokens: 37324062720 | elapsed time per iteration (s): 1.08 | learning rate: 9.225E-05 | global batch size: 256 | lm loss: 1.968837E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.129 | TFLOPs: 39.19 | 15: iteration 71200/ 125429 | consumed samples: 18227200 | consumed tokens: 37329305600 | elapsed time per iteration (s): 1.08 | learning rate: 9.222E-05 | global batch size: 256 | lm loss: 1.966278E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.509 | TFLOPs: 39.08 | 15: iteration 71210/ 125429 | consumed samples: 18229760 | consumed tokens: 37334548480 | elapsed time per iteration (s): 1.03 | learning rate: 9.220E-05 | global batch size: 256 | lm loss: 1.970953E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.132 | TFLOPs: 41.17 | 15: iteration 71220/ 125429 | consumed samples: 18232320 | consumed tokens: 37339791360 | elapsed time per iteration (s): 1.04 | learning rate: 9.218E-05 | global batch size: 256 | lm loss: 1.956958E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.546 | TFLOPs: 40.74 | 15: iteration 71230/ 125429 | consumed samples: 18234880 | consumed tokens: 37345034240 | elapsed time per iteration (s): 1.03 | learning rate: 
9.216E-05 | global batch size: 256 | lm loss: 1.987242E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.879 | TFLOPs: 41.13 | 15: iteration 71240/ 125429 | consumed samples: 18237440 | consumed tokens: 37350277120 | elapsed time per iteration (s): 1.02 | learning rate: 9.214E-05 | global batch size: 256 | lm loss: 1.955222E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.309 | TFLOPs: 41.53 | 15: iteration 71250/ 125429 | consumed samples: 18240000 | consumed tokens: 37355520000 | elapsed time per iteration (s): 1.02 | learning rate: 9.211E-05 | global batch size: 256 | lm loss: 1.972195E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.871 | TFLOPs: 41.46 | 15: iteration 71260/ 125429 | consumed samples: 18242560 | consumed tokens: 37360762880 | elapsed time per iteration (s): 1.08 | learning rate: 9.209E-05 | global batch size: 256 | lm loss: 1.956342E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.797 | TFLOPs: 39.30 | 15: iteration 71270/ 125429 | consumed samples: 18245120 | consumed tokens: 37366005760 | elapsed time per iteration (s): 1.04 | learning rate: 9.207E-05 | global batch size: 256 | lm loss: 1.957608E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.214 | TFLOPs: 40.85 | 15: iteration 71280/ 125429 | consumed samples: 18247680 | consumed tokens: 37371248640 | elapsed time per iteration (s): 1.04 | learning rate: 9.205E-05 | global batch size: 256 | lm loss: 1.932445E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.240 | TFLOPs: 40.86 | 15: iteration 71290/ 125429 | 
consumed samples: 18250240 | consumed tokens: 37376491520 | elapsed time per iteration (s): 1.06 | learning rate: 9.202E-05 | global batch size: 256 | lm loss: 1.962321E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.995 | TFLOPs: 39.99 | 15: iteration 71300/ 125429 | consumed samples: 18252800 | consumed tokens: 37381734400 | elapsed time per iteration (s): 1.03 | learning rate: 9.200E-05 | global batch size: 256 | lm loss: 1.976218E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.721 | TFLOPs: 41.10 | 15: iteration 71310/ 125429 | consumed samples: 18255360 | consumed tokens: 37386977280 | elapsed time per iteration (s): 1.07 | learning rate: 9.198E-05 | global batch size: 256 | lm loss: 1.966227E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.132 | TFLOPs: 39.68 | 15: iteration 71320/ 125429 | consumed samples: 18257920 | consumed tokens: 37392220160 | elapsed time per iteration (s): 1.06 | learning rate: 9.196E-05 | global batch size: 256 | lm loss: 1.968468E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.625 | TFLOPs: 39.93 | 15: iteration 71330/ 125429 | consumed samples: 18260480 | consumed tokens: 37397463040 | elapsed time per iteration (s): 1.06 | learning rate: 9.193E-05 | global batch size: 256 | lm loss: 1.981334E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.342 | TFLOPs: 40.05 | 15: iteration 71340/ 125429 | consumed samples: 18263040 | consumed tokens: 37402705920 | elapsed time per iteration (s): 1.05 | learning rate: 9.191E-05 | global batch size: 256 | lm loss: 1.969272E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 244.271 | TFLOPs: 40.37 | 15: iteration 71350/ 125429 | consumed samples: 18265600 | consumed tokens: 37407948800 | elapsed time per iteration (s): 1.05 | learning rate: 9.189E-05 | global batch size: 256 | lm loss: 1.967081E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.951 | TFLOPs: 40.48 | 15: iteration 71360/ 125429 | consumed samples: 18268160 | consumed tokens: 37413191680 | elapsed time per iteration (s): 1.07 | learning rate: 9.187E-05 | global batch size: 256 | lm loss: 1.956408E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.242 | TFLOPs: 39.70 | 15: iteration 71370/ 125429 | consumed samples: 18270720 | consumed tokens: 37418434560 | elapsed time per iteration (s): 1.08 | learning rate: 9.185E-05 | global batch size: 256 | lm loss: 1.959372E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.979 | TFLOPs: 39.16 | 15: iteration 71380/ 125429 | consumed samples: 18273280 | consumed tokens: 37423677440 | elapsed time per iteration (s): 1.02 | learning rate: 9.182E-05 | global batch size: 256 | lm loss: 1.953249E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.256 | TFLOPs: 41.36 | 15: iteration 71390/ 125429 | consumed samples: 18275840 | consumed tokens: 37428920320 | elapsed time per iteration (s): 1.10 | learning rate: 9.180E-05 | global batch size: 256 | lm loss: 1.954010E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.668 | TFLOPs: 38.62 | 15: iteration 71400/ 125429 | consumed samples: 18278400 | consumed tokens: 37434163200 | elapsed time per iteration (s): 1.07 | learning rate: 9.178E-05 | global batch 
size: 256 | lm loss: 1.977511E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.250 | TFLOPs: 39.54 | 15: iteration 71410/ 125429 | consumed samples: 18280960 | consumed tokens: 37439406080 | elapsed time per iteration (s): 1.09 | learning rate: 9.176E-05 | global batch size: 256 | lm loss: 1.930255E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.692 | TFLOPs: 38.95 | 15: iteration 71420/ 125429 | consumed samples: 18283520 | consumed tokens: 37444648960 | elapsed time per iteration (s): 1.04 | learning rate: 9.173E-05 | global batch size: 256 | lm loss: 1.957666E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.153 | TFLOPs: 40.68 | 15: iteration 71430/ 125429 | consumed samples: 18286080 | consumed tokens: 37449891840 | elapsed time per iteration (s): 1.07 | learning rate: 9.171E-05 | global batch size: 256 | lm loss: 1.968155E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.814 | TFLOPs: 39.47 | 15: iteration 71440/ 125429 | consumed samples: 18288640 | consumed tokens: 37455134720 | elapsed time per iteration (s): 1.03 | learning rate: 9.169E-05 | global batch size: 256 | lm loss: 1.941977E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.726 | TFLOPs: 41.27 | 15: iteration 71450/ 125429 | consumed samples: 18291200 | consumed tokens: 37460377600 | elapsed time per iteration (s): 1.06 | learning rate: 9.167E-05 | global batch size: 256 | lm loss: 1.952446E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.229 | TFLOPs: 39.86 | 15: iteration 71460/ 125429 | consumed samples: 18293760 | 
consumed tokens: 37465620480 | elapsed time per iteration (s): 1.04 | learning rate: 9.164E-05 | global batch size: 256 | lm loss: 1.957857E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.199 | TFLOPs: 40.52 | 15: iteration 71470/ 125429 | consumed samples: 18296320 | consumed tokens: 37470863360 | elapsed time per iteration (s): 1.04 | learning rate: 9.162E-05 | global batch size: 256 | lm loss: 1.971885E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.406 | TFLOPs: 40.72 | 15: iteration 71480/ 125429 | consumed samples: 18298880 | consumed tokens: 37476106240 | elapsed time per iteration (s): 1.05 | learning rate: 9.160E-05 | global batch size: 256 | lm loss: 1.965044E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.175 | TFLOPs: 40.35 | 15: iteration 71490/ 125429 | consumed samples: 18301440 | consumed tokens: 37481349120 | elapsed time per iteration (s): 1.03 | learning rate: 9.158E-05 | global batch size: 256 | lm loss: 1.968613E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.299 | TFLOPs: 41.03 | 15: iteration 71500/ 125429 | consumed samples: 18304000 | consumed tokens: 37486592000 | elapsed time per iteration (s): 1.04 | learning rate: 9.156E-05 | global batch size: 256 | lm loss: 1.968608E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.992 | TFLOPs: 40.49 | 15: iteration 71510/ 125429 | consumed samples: 18306560 | consumed tokens: 37491834880 | elapsed time per iteration (s): 1.05 | learning rate: 9.153E-05 | global batch size: 256 | lm loss: 1.949697E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 244.435 | TFLOPs: 40.39 | 15: iteration 71520/ 125429 | consumed samples: 18309120 | consumed tokens: 37497077760 | elapsed time per iteration (s): 1.04 | learning rate: 9.151E-05 | global batch size: 256 | lm loss: 1.937965E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.252 | TFLOPs: 40.53 | 15: iteration 71530/ 125429 | consumed samples: 18311680 | consumed tokens: 37502320640 | elapsed time per iteration (s): 1.03 | learning rate: 9.149E-05 | global batch size: 256 | lm loss: 1.946089E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.614 | TFLOPs: 40.92 | 15: iteration 71540/ 125429 | consumed samples: 18314240 | consumed tokens: 37507563520 | elapsed time per iteration (s): 1.07 | learning rate: 9.147E-05 | global batch size: 256 | lm loss: 1.958251E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.138 | TFLOPs: 39.68 | 15: iteration 71550/ 125429 | consumed samples: 18316800 | consumed tokens: 37512806400 | elapsed time per iteration (s): 1.04 | learning rate: 9.144E-05 | global batch size: 256 | lm loss: 1.946101E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.939 | TFLOPs: 40.81 | 15: iteration 71560/ 125429 | consumed samples: 18319360 | consumed tokens: 37518049280 | elapsed time per iteration (s): 1.04 | learning rate: 9.142E-05 | global batch size: 256 | lm loss: 1.964999E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.732 | TFLOPs: 40.77 | 15: iteration 71570/ 125429 | consumed samples: 18321920 | consumed tokens: 37523292160 | elapsed time per iteration (s): 1.05 | learning rate: 9.140E-05 | global batch size: 256 | lm loss: 
1.966736E+00 | grad norm: 0.130 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.047 | TFLOPs: 40.17 | 15: iteration 71580/ 125429 | consumed samples: 18324480 | consumed tokens: 37528535040 | elapsed time per iteration (s): 1.04 | learning rate: 9.138E-05 | global batch size: 256 | lm loss: 1.968366E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.120 | TFLOPs: 40.84 | 15: iteration 71590/ 125429 | consumed samples: 18327040 | consumed tokens: 37533777920 | elapsed time per iteration (s): 1.09 | learning rate: 9.135E-05 | global batch size: 256 | lm loss: 1.965176E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.792 | TFLOPs: 38.80 | 15: iteration 71600/ 125429 | consumed samples: 18329600 | consumed tokens: 37539020800 | elapsed time per iteration (s): 1.05 | learning rate: 9.133E-05 | global batch size: 256 | lm loss: 1.973380E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.306 | TFLOPs: 40.37 | 15: iteration 71610/ 125429 | consumed samples: 18332160 | consumed tokens: 37544263680 | elapsed time per iteration (s): 1.05 | learning rate: 9.131E-05 | global batch size: 256 | lm loss: 1.947832E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.109 | TFLOPs: 40.34 | 15: iteration 71620/ 125429 | consumed samples: 18334720 | consumed tokens: 37549506560 | elapsed time per iteration (s): 1.05 | learning rate: 9.129E-05 | global batch size: 256 | lm loss: 1.956580E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.511 | TFLOPs: 40.24 | 15: iteration 71630/ 125429 | consumed samples: 18337280 | consumed tokens: 
37554749440 | elapsed time per iteration (s): 1.09 | learning rate: 9.127E-05 | global batch size: 256 | lm loss: 2.000918E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.508 | TFLOPs: 38.92 | 15: iteration 71640/ 125429 | consumed samples: 18339840 | consumed tokens: 37559992320 | elapsed time per iteration (s): 1.05 | learning rate: 9.124E-05 | global batch size: 256 | lm loss: 1.961017E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.724 | TFLOPs: 40.44 | 15: iteration 71650/ 125429 | consumed samples: 18342400 | consumed tokens: 37565235200 | elapsed time per iteration (s): 1.09 | learning rate: 9.122E-05 | global batch size: 256 | lm loss: 1.981939E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.521 | TFLOPs: 38.76 | 15: iteration 71660/ 125429 | consumed samples: 18344960 | consumed tokens: 37570478080 | elapsed time per iteration (s): 1.03 | learning rate: 9.120E-05 | global batch size: 256 | lm loss: 1.973771E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.285 | TFLOPs: 41.03 | 15: iteration 71670/ 125429 | consumed samples: 18347520 | consumed tokens: 37575720960 | elapsed time per iteration (s): 1.07 | learning rate: 9.118E-05 | global batch size: 256 | lm loss: 1.966642E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.985 | TFLOPs: 39.49 | 15: iteration 71680/ 125429 | consumed samples: 18350080 | consumed tokens: 37580963840 | elapsed time per iteration (s): 1.05 | learning rate: 9.115E-05 | global batch size: 256 | lm loss: 1.971301E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 244.735 | TFLOPs: 40.44 | 15: iteration 71690/ 125429 | consumed samples: 18352640 | consumed tokens: 37586206720 | elapsed time per iteration (s): 1.06 | learning rate: 9.113E-05 | global batch size: 256 | lm loss: 1.965246E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.625 | TFLOPs: 39.77 | 15: iteration 71700/ 125429 | consumed samples: 18355200 | consumed tokens: 37591449600 | elapsed time per iteration (s): 1.06 | learning rate: 9.111E-05 | global batch size: 256 | lm loss: 1.952127E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.921 | TFLOPs: 39.98 | 15: iteration 71710/ 125429 | consumed samples: 18357760 | consumed tokens: 37596692480 | elapsed time per iteration (s): 1.05 | learning rate: 9.109E-05 | global batch size: 256 | lm loss: 1.979830E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.440 | TFLOPs: 40.40 | 15: iteration 71720/ 125429 | consumed samples: 18360320 | consumed tokens: 37601935360 | elapsed time per iteration (s): 1.03 | learning rate: 9.107E-05 | global batch size: 256 | lm loss: 1.961004E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.200 | TFLOPs: 41.02 | 15: iteration 71730/ 125429 | consumed samples: 18362880 | consumed tokens: 37607178240 | elapsed time per iteration (s): 1.03 | learning rate: 9.104E-05 | global batch size: 256 | lm loss: 1.990188E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.463 | TFLOPs: 41.06 | 15: iteration 71740/ 125429 | consumed samples: 18365440 | consumed tokens: 37612421120 | elapsed time per iteration (s): 1.04 | learning rate: 9.102E-05 | global batch size: 256 | lm loss: 1.952921E+00 | grad 
norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.120 | TFLOPs: 40.67 | 15: iteration 71750/ 125429 | consumed samples: 18368000 | consumed tokens: 37617664000 | elapsed time per iteration (s): 1.13 | learning rate: 9.100E-05 | global batch size: 256 | lm loss: 1.943957E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.216 | TFLOPs: 37.55 | 15: iteration 71760/ 125429 | consumed samples: 18370560 | consumed tokens: 37622906880 | elapsed time per iteration (s): 1.09 | learning rate: 9.098E-05 | global batch size: 256 | lm loss: 1.943881E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.815 | TFLOPs: 38.64 | 15: iteration 71770/ 125429 | consumed samples: 18373120 | consumed tokens: 37628149760 | elapsed time per iteration (s): 1.05 | learning rate: 9.095E-05 | global batch size: 256 | lm loss: 1.971315E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.115 | TFLOPs: 40.34 | 15: iteration 71780/ 125429 | consumed samples: 18375680 | consumed tokens: 37633392640 | elapsed time per iteration (s): 1.07 | learning rate: 9.093E-05 | global batch size: 256 | lm loss: 1.982458E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.237 | TFLOPs: 39.37 | 15: iteration 71790/ 125429 | consumed samples: 18378240 | consumed tokens: 37638635520 | elapsed time per iteration (s): 1.03 | learning rate: 9.091E-05 | global batch size: 256 | lm loss: 1.943440E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.430 | TFLOPs: 40.89 | 15: iteration 71800/ 125429 | consumed samples: 18380800 | consumed tokens: 37643878400 | elapsed time 
per iteration (s): 1.04 | learning rate: 9.089E-05 | global batch size: 256 | lm loss: 1.959229E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.886 | TFLOPs: 40.63 | 15: iteration 71810/ 125429 | consumed samples: 18383360 | consumed tokens: 37649121280 | elapsed time per iteration (s): 1.06 | learning rate: 9.086E-05 | global batch size: 256 | lm loss: 1.939840E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.259 | TFLOPs: 40.04 | 15: iteration 71820/ 125429 | consumed samples: 18385920 | consumed tokens: 37654364160 | elapsed time per iteration (s): 1.03 | learning rate: 9.084E-05 | global batch size: 256 | lm loss: 1.939727E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.089 | TFLOPs: 41.00 | 15: iteration 71830/ 125429 | consumed samples: 18388480 | consumed tokens: 37659607040 | elapsed time per iteration (s): 1.04 | learning rate: 9.082E-05 | global batch size: 256 | lm loss: 1.957056E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.203 | TFLOPs: 40.85 | 15: iteration 71840/ 125429 | consumed samples: 18391040 | consumed tokens: 37664849920 | elapsed time per iteration (s): 1.08 | learning rate: 9.080E-05 | global batch size: 256 | lm loss: 1.978091E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.297 | TFLOPs: 39.22 | 15: iteration 71850/ 125429 | consumed samples: 18393600 | consumed tokens: 37670092800 | elapsed time per iteration (s): 1.04 | learning rate: 9.078E-05 | global batch size: 256 | lm loss: 1.978113E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.906 | TFLOPs: 
40.64 | 15: iteration 71860/ 125429 | consumed samples: 18396160 | consumed tokens: 37675335680 | elapsed time per iteration (s): 1.04 | learning rate: 9.075E-05 | global batch size: 256 | lm loss: 1.961989E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.056 | TFLOPs: 40.66 | 15: iteration 71870/ 125429 | consumed samples: 18398720 | consumed tokens: 37680578560 | elapsed time per iteration (s): 1.03 | learning rate: 9.073E-05 | global batch size: 256 | lm loss: 1.980816E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.647 | TFLOPs: 40.93 | 15: iteration 71880/ 125429 | consumed samples: 18401280 | consumed tokens: 37685821440 | elapsed time per iteration (s): 1.07 | learning rate: 9.071E-05 | global batch size: 256 | lm loss: 1.945619E+00 | grad norm: 0.128 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.983 | TFLOPs: 39.66 | 15: iteration 71890/ 125429 | consumed samples: 18403840 | consumed tokens: 37691064320 | elapsed time per iteration (s): 1.03 | learning rate: 9.069E-05 | global batch size: 256 | lm loss: 1.967514E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.167 | TFLOPs: 41.01 | 15: iteration 71900/ 125429 | consumed samples: 18406400 | consumed tokens: 37696307200 | elapsed time per iteration (s): 1.02 | learning rate: 9.066E-05 | global batch size: 256 | lm loss: 1.955127E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.956 | TFLOPs: 41.31 | 15: iteration 71910/ 125429 | consumed samples: 18408960 | consumed tokens: 37701550080 | elapsed time per iteration (s): 1.05 | learning rate: 9.064E-05 | global batch size: 256 | lm loss: 1.991438E+00 | grad norm: 0.138 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.192 | TFLOPs: 40.19 | 15: iteration 71920/ 125429 | consumed samples: 18411520 | consumed tokens: 37706792960 | elapsed time per iteration (s): 1.03 | learning rate: 9.062E-05 | global batch size: 256 | lm loss: 1.977237E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.096 | TFLOPs: 41.00 | 15: iteration 71930/ 125429 | consumed samples: 18414080 | consumed tokens: 37712035840 | elapsed time per iteration (s): 1.04 | learning rate: 9.060E-05 | global batch size: 256 | lm loss: 1.944495E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.133 | TFLOPs: 40.51 | 15: iteration 71940/ 125429 | consumed samples: 18416640 | consumed tokens: 37717278720 | elapsed time per iteration (s): 1.03 | learning rate: 9.058E-05 | global batch size: 256 | lm loss: 1.973170E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.396 | TFLOPs: 41.05 | 15: iteration 71950/ 125429 | consumed samples: 18419200 | consumed tokens: 37722521600 | elapsed time per iteration (s): 1.05 | learning rate: 9.055E-05 | global batch size: 256 | lm loss: 1.965011E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.508 | TFLOPs: 40.24 | 15: iteration 71960/ 125429 | consumed samples: 18421760 | consumed tokens: 37727764480 | elapsed time per iteration (s): 1.04 | learning rate: 9.053E-05 | global batch size: 256 | lm loss: 1.971589E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.274 | TFLOPs: 40.86 | 15: iteration 71970/ 125429 | consumed samples: 18424320 | consumed tokens: 37733007360 | elapsed time per iteration (s): 1.02 | 
learning rate: 9.051E-05 | global batch size: 256 | lm loss: 2.004194E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.277 | TFLOPs: 41.53 | 15: iteration 71980/ 125429 | consumed samples: 18426880 | consumed tokens: 37738250240 | elapsed time per iteration (s): 1.02 | learning rate: 9.049E-05 | global batch size: 256 | lm loss: 1.971539E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.063 | TFLOPs: 41.32 | 15: iteration 71990/ 125429 | consumed samples: 18429440 | consumed tokens: 37743493120 | elapsed time per iteration (s): 1.04 | learning rate: 9.046E-05 | global batch size: 256 | lm loss: 1.955430E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.753 | TFLOPs: 40.61 | 0: [2022-11-26 17:23:47,084] [INFO] [logging.py:68:log_dist] [Rank 0] step=72000, skipped=0, lr=[9.044246320112761e-05, 9.044246320112761e-05, 9.044246320112761e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 72000/ 125429 | consumed samples: 18432000 | consumed tokens: 37748736000 | elapsed time per iteration (s): 1.06 | learning rate: 9.044E-05 | global batch size: 256 | lm loss: 1.956240E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.895 | TFLOPs: 39.81 | 0: steps: 72000 loss: 1.9044 iter time (s): 1.049 samples/sec: 243.931 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 72000 | lm loss value: 1.916359E+00 | lm loss PPL: 6.796169E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 72000 to checkpoints_1b5 0: [2022-11-26 17:23:47,431] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step72000 is begin to save! 0: [2022-11-26 17:23:47,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_01-model_00-model_states.pt... 0: [2022-11-26 17:23:47,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_01-model_00-model_states.pt. 0: [2022-11-26 17:23:47,703] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_03-model_00-model_states.pt... 0: [2022-11-26 17:23:47,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_03-model_00-model_states.pt. 0: [2022-11-26 17:23:47,812] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_04-model_00-model_states.pt... 0: [2022-11-26 17:23:47,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_04-model_00-model_states.pt. 0: [2022-11-26 17:23:47,921] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_05-model_00-model_states.pt... 0: [2022-11-26 17:23:48,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_05-model_00-model_states.pt. 0: [2022-11-26 17:23:48,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_06-model_00-model_states.pt... 0: [2022-11-26 17:23:48,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_06-model_00-model_states.pt. 0: [2022-11-26 17:23:48,139] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_07-model_00-model_states.pt... 0: [2022-11-26 17:23:48,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 17:23:48,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_08-model_00-model_states.pt... 0: [2022-11-26 17:23:48,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_08-model_00-model_states.pt. 0: [2022-11-26 17:23:48,354] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_09-model_00-model_states.pt... 0: [2022-11-26 17:23:48,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_09-model_00-model_states.pt. 0: [2022-11-26 17:23:48,458] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_10-model_00-model_states.pt... 0: [2022-11-26 17:23:48,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_10-model_00-model_states.pt. 0: [2022-11-26 17:23:48,573] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_11-model_00-model_states.pt... 0: [2022-11-26 17:23:48,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_11-model_00-model_states.pt. 0: [2022-11-26 17:23:48,675] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_12-model_00-model_states.pt... 0: [2022-11-26 17:23:48,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_12-model_00-model_states.pt. 0: [2022-11-26 17:23:48,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_13-model_00-model_states.pt... 0: [2022-11-26 17:23:48,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 17:23:48,888] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_14-model_00-model_states.pt... 0: [2022-11-26 17:23:48,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_14-model_00-model_states.pt. 0: [2022-11-26 17:23:48,993] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_15-model_00-model_states.pt... 0: [2022-11-26 17:23:49,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_15-model_00-model_states.pt. 0: [2022-11-26 17:23:49,100] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_16-model_00-model_states.pt... 0: [2022-11-26 17:23:49,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_16-model_00-model_states.pt. 0: [2022-11-26 17:23:49,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_17-model_00-model_states.pt... 0: [2022-11-26 17:23:49,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_17-model_00-model_states.pt. 0: [2022-11-26 17:23:49,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_18-model_00-model_states.pt... 0: [2022-11-26 17:23:49,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_18-model_00-model_states.pt. 0: [2022-11-26 17:23:49,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_19-model_00-model_states.pt... 0: [2022-11-26 17:23:49,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 17:23:49,521] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_20-model_00-model_states.pt... 0: [2022-11-26 17:23:49,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_20-model_00-model_states.pt. 0: [2022-11-26 17:23:49,624] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_21-model_00-model_states.pt... 0: [2022-11-26 17:23:49,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_21-model_00-model_states.pt. 0: [2022-11-26 17:23:49,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_22-model_00-model_states.pt... 0: [2022-11-26 17:23:49,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_22-model_00-model_states.pt. 0: [2022-11-26 17:23:49,835] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_23-model_00-model_states.pt... 0: [2022-11-26 17:23:49,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_23-model_00-model_states.pt. 0: [2022-11-26 17:23:49,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_24-model_00-model_states.pt... 0: [2022-11-26 17:23:50,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_24-model_00-model_states.pt. 0: [2022-11-26 17:23:50,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_25-model_00-model_states.pt... 0: [2022-11-26 17:23:50,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 17:23:50,148] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_26-model_00-model_states.pt... 0: [2022-11-26 17:23:50,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_26-model_00-model_states.pt. 0: [2022-11-26 17:23:50,253] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_27-model_00-model_states.pt... 0: [2022-11-26 17:23:50,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_27-model_00-model_states.pt. 0: [2022-11-26 17:23:50,356] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_28-model_00-model_states.pt... 0: [2022-11-26 17:23:50,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_28-model_00-model_states.pt. 0: [2022-11-26 17:23:50,460] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_29-model_00-model_states.pt... 0: [2022-11-26 17:23:50,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_29-model_00-model_states.pt. 0: [2022-11-26 17:23:50,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_30-model_00-model_states.pt... 0: [2022-11-26 17:23:50,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_30-model_00-model_states.pt. 0: [2022-11-26 17:23:50,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/layer_32-model_00-model_states.pt... 0: [2022-11-26 17:23:50,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 17:23:50,673] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step72000/mp_rank_00_model_states.pt 0: [2022-11-26 17:23:50,673] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/mp_rank_00_model_states.pt... 0: [2022-11-26 17:23:50,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/mp_rank_00_model_states.pt. 0: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
10: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
1: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
5: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
14: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
6: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
13: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
3: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
8: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
2: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:23:50,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step72000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:23:50,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 17:23:50,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 17:23:50,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 0: [2022-11-26 17:23:50,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:23:50,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:23:50,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 17:23:50,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 3: [2022-11-26 17:23:50,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:23:50,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 17:23:50,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:23:50,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 3: [2022-11-26 17:23:50,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 17:23:50,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
12: [2022-11-26 17:23:50,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:23:50,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:23:50,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 17:23:50,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 17:23:50,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 12: [2022-11-26 17:23:50,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 10: [2022-11-26 17:23:50,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:23:50,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 17:23:50,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 14: [2022-11-26 17:23:50,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:23:50,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 17:23:50,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
12: [2022-11-26 17:23:50,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:23:50,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 17:23:50,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 15: [2022-11-26 17:23:50,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:23:50,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 17:23:50,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 5: [2022-11-26 17:23:50,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:23:50,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 17:23:50,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 9: [2022-11-26 17:23:50,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:23:50,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 17:23:50,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
15: [2022-11-26 17:23:50,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:23:50,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:23:50,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 17:23:50,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:23:50,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 5: [2022-11-26 17:23:50,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 15: [2022-11-26 17:23:50,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 5: [2022-11-26 17:23:50,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 15: [2022-11-26 17:23:50,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 7: [2022-11-26 17:23:50,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:23:50,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 17:23:50,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
0: [2022-11-26 17:23:50,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:23:50,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 17:23:50,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 5: [2022-11-26 17:23:50,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:23:50,888] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 17:23:50,888] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 5: [2022-11-26 17:23:50,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:23:50,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 17:23:50,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 10: [2022-11-26 17:23:50,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:23:50,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 1: [2022-11-26 17:23:50,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
10: [2022-11-26 17:23:50,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 1: [2022-11-26 17:23:50,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 17:23:50,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 7: [2022-11-26 17:23:50,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:23:50,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:23:50,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 3: [2022-11-26 17:23:50,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 7: [2022-11-26 17:23:50,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 3: [2022-11-26 17:23:50,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 14: [2022-11-26 17:23:50,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:23:50,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 15: [2022-11-26 17:23:50,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
14: [2022-11-26 17:23:50,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 14: [2022-11-26 17:23:50,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:23:50,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:23:50,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 17:23:50,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 17:23:50,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 14: [2022-11-26 17:23:50,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 14: [2022-11-26 17:23:50,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:23:50,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 17:23:50,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 2: [2022-11-26 17:23:50,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 17:23:50,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 17:23:50,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 2: [2022-11-26 17:23:50,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:23:50,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 17:23:50,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 2: [2022-11-26 17:23:50,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:23:50,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 9: [2022-11-26 17:23:50,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:23:50,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 9: [2022-11-26 17:23:50,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 17:23:50,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 5: [2022-11-26 17:23:50,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 17:23:50,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 17:23:50,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 3: [2022-11-26 17:23:50,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:23:50,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 17:23:50,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 12: [2022-11-26 17:23:50,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:23:50,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 17:23:50,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 2: [2022-11-26 17:23:50,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:23:50,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
2: [2022-11-26 17:23:50,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 12: [2022-11-26 17:23:50,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 2: [2022-11-26 17:23:50,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 12: [2022-11-26 17:23:50,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 10: [2022-11-26 17:23:50,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:23:50,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 17:23:50,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 9: [2022-11-26 17:23:50,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:23:50,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:23:50,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 10: [2022-11-26 17:23:50,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 9: [2022-11-26 17:23:50,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
10: [2022-11-26 17:23:50,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 2: [2022-11-26 17:23:50,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:23:50,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 17:23:50,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 5: [2022-11-26 17:23:50,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:23:50,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 17:23:50,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 3: [2022-11-26 17:23:50,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:23:50,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 17:23:50,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 1: [2022-11-26 17:23:50,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:23:50,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 17:23:50,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 17:23:50,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 17:23:50,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 1: [2022-11-26 17:23:50,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 3: [2022-11-26 17:23:50,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:23:50,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 17:23:50,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 12: [2022-11-26 17:23:50,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:23:50,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 17:23:50,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 1: [2022-11-26 17:23:50,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:23:50,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
10: [2022-11-26 17:23:50,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:23:50,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 17:23:50,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 10: [2022-11-26 17:23:50,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:23:50,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 17:23:50,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 1: [2022-11-26 17:23:50,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 9: [2022-11-26 17:23:50,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:23:50,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 17:23:50,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 9: [2022-11-26 17:23:50,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 1: [2022-11-26 17:23:50,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
9: [2022-11-26 17:23:50,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 1: [2022-11-26 17:23:50,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:23:50,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 17:23:50,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 10: [2022-11-26 17:23:50,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:23:50,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 17:23:50,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 1: [2022-11-26 17:23:50,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:23:50,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:23:50,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 17:23:50,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 17:23:50,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
1: [2022-11-26 17:23:50,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 11: [2022-11-26 17:23:50,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:23:50,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:23:50,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 17:23:50,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 7: [2022-11-26 17:23:50,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:23:50,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 17:23:50,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 7: [2022-11-26 17:23:50,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:23:50,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 17:23:50,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:23:50,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
7: [2022-11-26 17:23:50,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 17:23:50,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 0: [2022-11-26 17:23:50,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:23:50,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:23:50,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 17:23:50,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:23:50,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 17:23:50,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 0: [2022-11-26 17:23:50,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 17:23:50,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 0: [2022-11-26 17:23:50,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 12: [2022-11-26 17:23:50,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 17:23:50,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 17:23:50,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 7: [2022-11-26 17:23:50,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:23:50,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 17:23:50,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 3: [2022-11-26 17:23:50,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:23:50,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:23:50,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:23:50,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 8: [2022-11-26 17:23:50,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
13: [2022-11-26 17:23:50,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 0: [2022-11-26 17:23:50,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 13: [2022-11-26 17:23:50,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 3: [2022-11-26 17:23:50,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 13: [2022-11-26 17:23:50,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:23:50,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 13: [2022-11-26 17:23:50,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 17:23:50,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 13: [2022-11-26 17:23:50,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:23:50,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 17:23:50,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 17:23:50,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 17:23:50,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 13: [2022-11-26 17:23:50,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 8: [2022-11-26 17:23:50,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 13: [2022-11-26 17:23:50,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:23:50,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 13: [2022-11-26 17:23:50,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 8: [2022-11-26 17:23:50,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:23:50,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 8: [2022-11-26 17:23:50,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 13: [2022-11-26 17:23:50,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 17:23:50,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:23:50,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 13: [2022-11-26 17:23:50,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 17:23:50,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 8: [2022-11-26 17:23:50,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:23:50,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 13: [2022-11-26 17:23:50,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 8: [2022-11-26 17:23:50,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 17:23:50,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 8: [2022-11-26 17:23:50,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:23:50,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 17:23:50,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
4: [2022-11-26 17:23:50,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:23:50,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:23:50,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 17:23:50,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 17:23:50,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 4: [2022-11-26 17:23:50,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 5: [2022-11-26 17:23:50,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:23:50,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 17:23:50,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 7: [2022-11-26 17:23:50,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:23:50,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 17:23:50,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
14: [2022-11-26 17:23:50,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:23:50,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 17:23:50,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 14: [2022-11-26 17:23:50,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:23:50,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 17:23:50,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 14: [2022-11-26 17:23:50,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:23:50,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 17:23:50,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 4: [2022-11-26 17:23:50,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:23:50,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 17:23:50,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 17:23:50,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 17:23:50,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 4: [2022-11-26 17:23:50,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 8: [2022-11-26 17:23:50,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:23:50,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 17:23:50,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 8: [2022-11-26 17:23:50,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:23:50,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 17:23:50,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 15: [2022-11-26 17:23:50,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 17:23:50,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 17:23:50,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 17:23:50,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 15: [2022-11-26 17:23:50,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 15: [2022-11-26 17:23:50,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:23:50,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 17:23:50,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 15: [2022-11-26 17:23:50,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:23:50,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 17:23:50,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 2: [2022-11-26 17:23:50,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 17:23:50,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 17:23:50,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 5: [2022-11-26 17:23:50,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:23:50,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 17:23:50,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 12: [2022-11-26 17:23:50,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:23:50,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 17:23:50,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 9: [2022-11-26 17:23:50,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:23:50,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 17:23:50,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 9: [2022-11-26 17:23:50,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 17:23:50,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:23:50,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 17:23:50,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 17:23:50,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 9: [2022-11-26 17:23:50,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 9: [2022-11-26 17:23:50,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:23:50,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 17:23:50,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 11: [2022-11-26 17:23:50,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 17:23:50,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 11: [2022-11-26 17:23:50,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-26 17:23:50,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 17:23:50,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 11: [2022-11-26 17:23:50,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:23:50,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 17:23:50,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 11: [2022-11-26 17:23:50,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:23:50,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 17:23:50,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 11: [2022-11-26 17:23:50,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:23:50,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 17:23:50,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 11: [2022-11-26 17:23:50,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 17:23:50,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:23:50,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:23:50,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 17:23:50,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 17:23:50,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 17:23:50,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 11: [2022-11-26 17:23:50,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 11: [2022-11-26 17:23:50,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 13: [2022-11-26 17:23:50,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:23:50,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
13: [2022-11-26 17:23:50,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 8: [2022-11-26 17:23:50,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 13: [2022-11-26 17:23:50,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 8: [2022-11-26 17:23:50,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 2: [2022-11-26 17:23:50,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:23:50,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 17:23:50,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 10: [2022-11-26 17:23:50,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:23:50,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 17:23:50,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 15: [2022-11-26 17:23:50,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 17:23:50,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 17:23:50,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 0: [2022-11-26 17:23:50,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:23:50,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:23:50,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 17:23:50,990] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 17:23:50,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 0: [2022-11-26 17:23:50,990] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 4: [2022-11-26 17:23:50,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:23:50,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 17:23:50,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:23:50,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
4: [2022-11-26 17:23:50,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 17:23:50,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 4: [2022-11-26 17:23:50,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:23:50,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 17:23:50,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 4: [2022-11-26 17:23:50,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:23:50,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 17:23:50,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 0: [2022-11-26 17:23:51,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 17:23:51,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 2: [2022-11-26 17:23:51,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 17:23:51,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 17:23:51,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 6: [2022-11-26 17:23:51,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:23:51,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:23:51,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:23:51,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:23:51,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:23:51,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 17:23:51,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 17:23:51,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:23:51,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 17:23:51,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 17:23:51,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 17:23:51,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:23:51,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 6: [2022-11-26 17:23:51,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 6: [2022-11-26 17:23:51,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 17:23:51,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 6: [2022-11-26 17:23:51,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 17:23:51,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 17:23:51,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 6: [2022-11-26 17:23:51,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step72000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 17:23:51,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 
6: [2022-11-26 17:23:51,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 6: [2022-11-26 17:23:51,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 6: [2022-11-26 17:23:51,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step72000 is ready now! 0: successfully saved checkpoint at iteration 72000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3767.06 15: iteration 72010/ 125429 | consumed samples: 18434560 | consumed tokens: 37753978880 | elapsed time per iteration (s): 1.57 | learning rate: 9.042E-05 | global batch size: 256 | lm loss: 1.949203E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 163.479 | TFLOPs: 27.02 | 15: iteration 72020/ 125429 | consumed samples: 18437120 | consumed tokens: 37759221760 | elapsed time per iteration (s): 1.06 | learning rate: 9.040E-05 | global batch size: 256 | lm loss: 1.983754E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.455 | TFLOPs: 40.07 | 15: iteration 72030/ 125429 | consumed samples: 18439680 | consumed tokens: 37764464640 | elapsed time per iteration (s): 1.20 | learning rate: 9.038E-05 | global batch size: 256 | lm loss: 1.953254E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 212.775 | TFLOPs: 35.16 | 15: iteration 72040/ 125429 | consumed samples: 18442240 | consumed tokens: 37769707520 | elapsed time per iteration (s): 1.02 | learning rate: 9.035E-05 | global batch size: 256 | lm loss: 1.984693E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.804 | TFLOPs: 41.45 | 15: iteration 72050/ 125429 | consumed samples: 18444800 | consumed tokens: 37774950400 | elapsed time per 
iteration (s): 1.08 | learning rate: 9.033E-05 | global batch size: 256 | lm loss: 1.938253E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.276 | TFLOPs: 39.21 | 15: iteration 72060/ 125429 | consumed samples: 18447360 | consumed tokens: 37780193280 | elapsed time per iteration (s): 1.02 | learning rate: 9.031E-05 | global batch size: 256 | lm loss: 1.973456E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.363 | TFLOPs: 41.54 | 15: iteration 72070/ 125429 | consumed samples: 18449920 | consumed tokens: 37785436160 | elapsed time per iteration (s): 1.09 | learning rate: 9.029E-05 | global batch size: 256 | lm loss: 1.956202E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.603 | TFLOPs: 38.94 | 15: iteration 72080/ 125429 | consumed samples: 18452480 | consumed tokens: 37790679040 | elapsed time per iteration (s): 1.04 | learning rate: 9.026E-05 | global batch size: 256 | lm loss: 1.974990E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.197 | TFLOPs: 40.85 | 15: iteration 72090/ 125429 | consumed samples: 18455040 | consumed tokens: 37795921920 | elapsed time per iteration (s): 1.05 | learning rate: 9.024E-05 | global batch size: 256 | lm loss: 1.975843E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.036 | TFLOPs: 40.16 | 15: iteration 72100/ 125429 | consumed samples: 18457600 | consumed tokens: 37801164800 | elapsed time per iteration (s): 1.04 | learning rate: 9.022E-05 | global batch size: 256 | lm loss: 1.975105E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.302 | TFLOPs: 40.54 | 
15: iteration 72110/ 125429 | consumed samples: 18460160 | consumed tokens: 37806407680 | elapsed time per iteration (s): 1.04 | learning rate: 9.020E-05 | global batch size: 256 | lm loss: 1.952455E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.908 | TFLOPs: 40.64 | 15: iteration 72120/ 125429 | consumed samples: 18462720 | consumed tokens: 37811650560 | elapsed time per iteration (s): 1.05 | learning rate: 9.018E-05 | global batch size: 256 | lm loss: 1.940788E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.814 | TFLOPs: 40.46 | 15: iteration 72130/ 125429 | consumed samples: 18465280 | consumed tokens: 37816893440 | elapsed time per iteration (s): 1.07 | learning rate: 9.015E-05 | global batch size: 256 | lm loss: 1.952937E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.379 | TFLOPs: 39.56 | 15: iteration 72140/ 125429 | consumed samples: 18467840 | consumed tokens: 37822136320 | elapsed time per iteration (s): 1.04 | learning rate: 9.013E-05 | global batch size: 256 | lm loss: 1.959535E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.262 | TFLOPs: 40.86 | 15: iteration 72150/ 125429 | consumed samples: 18470400 | consumed tokens: 37827379200 | elapsed time per iteration (s): 1.03 | learning rate: 9.011E-05 | global batch size: 256 | lm loss: 1.931271E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.107 | TFLOPs: 41.17 | 15: iteration 72160/ 125429 | consumed samples: 18472960 | consumed tokens: 37832622080 | elapsed time per iteration (s): 1.10 | learning rate: 9.009E-05 | global batch size: 256 | lm loss: 1.978551E+00 | grad norm: 0.140 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.346 | TFLOPs: 38.40 | 15: iteration 72170/ 125429 | consumed samples: 18475520 | consumed tokens: 37837864960 | elapsed time per iteration (s): 1.03 | learning rate: 9.006E-05 | global batch size: 256 | lm loss: 1.965570E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.258 | TFLOPs: 41.19 | 15: iteration 72180/ 125429 | consumed samples: 18478080 | consumed tokens: 37843107840 | elapsed time per iteration (s): 1.02 | learning rate: 9.004E-05 | global batch size: 256 | lm loss: 1.964884E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.557 | TFLOPs: 41.41 | 15: iteration 72190/ 125429 | consumed samples: 18480640 | consumed tokens: 37848350720 | elapsed time per iteration (s): 1.04 | learning rate: 9.002E-05 | global batch size: 256 | lm loss: 1.956102E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.068 | TFLOPs: 40.66 | 15: iteration 72200/ 125429 | consumed samples: 18483200 | consumed tokens: 37853593600 | elapsed time per iteration (s): 1.07 | learning rate: 9.000E-05 | global batch size: 256 | lm loss: 1.944901E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.830 | TFLOPs: 39.63 | 15: iteration 72210/ 125429 | consumed samples: 18485760 | consumed tokens: 37858836480 | elapsed time per iteration (s): 1.05 | learning rate: 8.998E-05 | global batch size: 256 | lm loss: 1.929015E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.844 | TFLOPs: 40.30 | 15: iteration 72220/ 125429 | consumed samples: 18488320 | consumed tokens: 37864079360 | elapsed time per iteration (s): 1.08 | 
learning rate: 8.995E-05 | global batch size: 256 | lm loss: 1.957492E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.580 | TFLOPs: 39.10 | 15: iteration 72230/ 125429 | consumed samples: 18490880 | consumed tokens: 37869322240 | elapsed time per iteration (s): 1.03 | learning rate: 8.993E-05 | global batch size: 256 | lm loss: 1.966594E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.249 | TFLOPs: 41.03 | 15: iteration 72240/ 125429 | consumed samples: 18493440 | consumed tokens: 37874565120 | elapsed time per iteration (s): 1.03 | learning rate: 8.991E-05 | global batch size: 256 | lm loss: 1.997121E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.001 | TFLOPs: 41.15 | 15: iteration 72250/ 125429 | consumed samples: 18496000 | consumed tokens: 37879808000 | elapsed time per iteration (s): 1.03 | learning rate: 8.989E-05 | global batch size: 256 | lm loss: 1.946553E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.387 | TFLOPs: 41.05 | 15: iteration 72260/ 125429 | consumed samples: 18498560 | consumed tokens: 37885050880 | elapsed time per iteration (s): 1.05 | learning rate: 8.987E-05 | global batch size: 256 | lm loss: 1.947297E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.545 | TFLOPs: 40.41 | 15: iteration 72270/ 125429 | consumed samples: 18501120 | consumed tokens: 37890293760 | elapsed time per iteration (s): 1.07 | learning rate: 8.984E-05 | global batch size: 256 | lm loss: 1.975628E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.832 | TFLOPs: 39.63 | 15: iteration 72280/ 
125429 | consumed samples: 18503680 | consumed tokens: 37895536640 | elapsed time per iteration (s): 1.07 | learning rate: 8.982E-05 | global batch size: 256 | lm loss: 1.970417E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.586 | TFLOPs: 39.43 | 15: iteration 72290/ 125429 | consumed samples: 18506240 | consumed tokens: 37900779520 | elapsed time per iteration (s): 1.04 | learning rate: 8.980E-05 | global batch size: 256 | lm loss: 1.930011E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.338 | TFLOPs: 40.71 | 15: iteration 72300/ 125429 | consumed samples: 18508800 | consumed tokens: 37906022400 | elapsed time per iteration (s): 1.02 | learning rate: 8.978E-05 | global batch size: 256 | lm loss: 1.985149E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.858 | TFLOPs: 41.29 | 15: iteration 72310/ 125429 | consumed samples: 18511360 | consumed tokens: 37911265280 | elapsed time per iteration (s): 1.08 | learning rate: 8.975E-05 | global batch size: 256 | lm loss: 1.963200E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.628 | TFLOPs: 39.27 | 15: iteration 72320/ 125429 | consumed samples: 18513920 | consumed tokens: 37916508160 | elapsed time per iteration (s): 1.03 | learning rate: 8.973E-05 | global batch size: 256 | lm loss: 1.936825E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.244 | TFLOPs: 41.19 | 15: iteration 72330/ 125429 | consumed samples: 18516480 | consumed tokens: 37921751040 | elapsed time per iteration (s): 1.09 | learning rate: 8.971E-05 | global batch size: 256 | lm loss: 1.937311E+00 | grad norm: 0.129 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 235.154 | TFLOPs: 38.86 | 15: iteration 72340/ 125429 | consumed samples: 18519040 | consumed tokens: 37926993920 | elapsed time per iteration (s): 1.04 | learning rate: 8.969E-05 | global batch size: 256 | lm loss: 1.956523E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.031 | TFLOPs: 40.49 | 15: iteration 72350/ 125429 | consumed samples: 18521600 | consumed tokens: 37932236800 | elapsed time per iteration (s): 1.04 | learning rate: 8.967E-05 | global batch size: 256 | lm loss: 1.974375E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.913 | TFLOPs: 40.80 | 15: iteration 72360/ 125429 | consumed samples: 18524160 | consumed tokens: 37937479680 | elapsed time per iteration (s): 1.04 | learning rate: 8.964E-05 | global batch size: 256 | lm loss: 1.981890E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.262 | TFLOPs: 40.70 | 15: iteration 72370/ 125429 | consumed samples: 18526720 | consumed tokens: 37942722560 | elapsed time per iteration (s): 1.02 | learning rate: 8.962E-05 | global batch size: 256 | lm loss: 1.971420E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.759 | TFLOPs: 41.27 | 15: iteration 72380/ 125429 | consumed samples: 18529280 | consumed tokens: 37947965440 | elapsed time per iteration (s): 1.04 | learning rate: 8.960E-05 | global batch size: 256 | lm loss: 1.953041E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.432 | TFLOPs: 40.72 | 15: iteration 72390/ 125429 | consumed samples: 18531840 | consumed tokens: 37953208320 | elapsed time per iteration (s): 1.06 | learning rate: 
8.958E-05 | global batch size: 256 | lm loss: 1.964171E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.251 | TFLOPs: 40.03 | 15: iteration 72400/ 125429 | consumed samples: 18534400 | consumed tokens: 37958451200 | elapsed time per iteration (s): 1.03 | learning rate: 8.955E-05 | global batch size: 256 | lm loss: 1.982766E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.630 | TFLOPs: 41.25 | 15: iteration 72410/ 125429 | consumed samples: 18536960 | consumed tokens: 37963694080 | elapsed time per iteration (s): 1.03 | learning rate: 8.953E-05 | global batch size: 256 | lm loss: 1.975655E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.491 | TFLOPs: 40.90 | 15: iteration 72420/ 125429 | consumed samples: 18539520 | consumed tokens: 37968936960 | elapsed time per iteration (s): 1.04 | learning rate: 8.951E-05 | global batch size: 256 | lm loss: 1.960347E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.541 | TFLOPs: 40.58 | 15: iteration 72430/ 125429 | consumed samples: 18542080 | consumed tokens: 37974179840 | elapsed time per iteration (s): 1.02 | learning rate: 8.949E-05 | global batch size: 256 | lm loss: 1.965715E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.887 | TFLOPs: 41.46 | 15: iteration 72440/ 125429 | consumed samples: 18544640 | consumed tokens: 37979422720 | elapsed time per iteration (s): 1.08 | learning rate: 8.947E-05 | global batch size: 256 | lm loss: 1.952808E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.024 | TFLOPs: 39.17 | 15: iteration 72450/ 125429 | 
consumed samples: 18547200 | consumed tokens: 37984665600 | elapsed time per iteration (s): 1.06 | learning rate: 8.944E-05 | global batch size: 256 | lm loss: 1.938151E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.021 | TFLOPs: 40.00 | 15: iteration 72460/ 125429 | consumed samples: 18549760 | consumed tokens: 37989908480 | elapsed time per iteration (s): 1.02 | learning rate: 8.942E-05 | global batch size: 256 | lm loss: 1.987586E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.840 | TFLOPs: 41.29 | 15: iteration 72470/ 125429 | consumed samples: 18552320 | consumed tokens: 37995151360 | elapsed time per iteration (s): 1.03 | learning rate: 8.940E-05 | global batch size: 256 | lm loss: 1.960933E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.568 | TFLOPs: 41.24 | 15: iteration 72480/ 125429 | consumed samples: 18554880 | consumed tokens: 38000394240 | elapsed time per iteration (s): 1.04 | learning rate: 8.938E-05 | global batch size: 256 | lm loss: 1.977748E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.016 | TFLOPs: 40.49 | 15: iteration 72490/ 125429 | consumed samples: 18557440 | consumed tokens: 38005637120 | elapsed time per iteration (s): 1.05 | learning rate: 8.935E-05 | global batch size: 256 | lm loss: 1.942358E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.114 | TFLOPs: 40.34 | 15: iteration 72500/ 125429 | consumed samples: 18560000 | consumed tokens: 38010880000 | elapsed time per iteration (s): 1.03 | learning rate: 8.933E-05 | global batch size: 256 | lm loss: 1.937607E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 248.390 | TFLOPs: 41.05 | 15: iteration 72510/ 125429 | consumed samples: 18562560 | consumed tokens: 38016122880 | elapsed time per iteration (s): 1.05 | learning rate: 8.931E-05 | global batch size: 256 | lm loss: 1.957182E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.616 | TFLOPs: 40.26 | 15: iteration 72520/ 125429 | consumed samples: 18565120 | consumed tokens: 38021365760 | elapsed time per iteration (s): 1.06 | learning rate: 8.929E-05 | global batch size: 256 | lm loss: 1.941375E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.548 | TFLOPs: 39.75 | 15: iteration 72530/ 125429 | consumed samples: 18567680 | consumed tokens: 38026608640 | elapsed time per iteration (s): 1.07 | learning rate: 8.927E-05 | global batch size: 256 | lm loss: 1.972708E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.113 | TFLOPs: 39.68 | 15: iteration 72540/ 125429 | consumed samples: 18570240 | consumed tokens: 38031851520 | elapsed time per iteration (s): 1.08 | learning rate: 8.924E-05 | global batch size: 256 | lm loss: 1.957808E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.306 | TFLOPs: 39.05 | 15: iteration 72550/ 125429 | consumed samples: 18572800 | consumed tokens: 38037094400 | elapsed time per iteration (s): 1.05 | learning rate: 8.922E-05 | global batch size: 256 | lm loss: 1.938613E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.913 | TFLOPs: 40.14 | 15: iteration 72560/ 125429 | consumed samples: 18575360 | consumed tokens: 38042337280 | elapsed time per iteration (s): 1.05 | learning rate: 8.920E-05 | global batch 
size: 256 | lm loss: 1.970147E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.249 | TFLOPs: 40.20 | 15: iteration 72570/ 125429 | consumed samples: 18577920 | consumed tokens: 38047580160 | elapsed time per iteration (s): 1.03 | learning rate: 8.918E-05 | global batch size: 256 | lm loss: 1.972377E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.362 | TFLOPs: 40.88 | 15: iteration 72580/ 125429 | consumed samples: 18580480 | consumed tokens: 38052823040 | elapsed time per iteration (s): 1.07 | learning rate: 8.916E-05 | global batch size: 256 | lm loss: 1.964234E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.942 | TFLOPs: 39.49 | 15: iteration 72590/ 125429 | consumed samples: 18583040 | consumed tokens: 38058065920 | elapsed time per iteration (s): 1.04 | learning rate: 8.913E-05 | global batch size: 256 | lm loss: 1.958185E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.019 | TFLOPs: 40.82 | 15: iteration 72600/ 125429 | consumed samples: 18585600 | consumed tokens: 38063308800 | elapsed time per iteration (s): 1.03 | learning rate: 8.911E-05 | global batch size: 256 | lm loss: 1.957171E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.728 | TFLOPs: 40.94 | 15: iteration 72610/ 125429 | consumed samples: 18588160 | consumed tokens: 38068551680 | elapsed time per iteration (s): 1.05 | learning rate: 8.909E-05 | global batch size: 256 | lm loss: 1.935777E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.728 | TFLOPs: 40.44 | 15: iteration 72620/ 125429 | consumed samples: 18590720 | 
consumed tokens: 38073794560 | elapsed time per iteration (s): 1.05 | learning rate: 8.907E-05 | global batch size: 256 | lm loss: 1.966459E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.411 | TFLOPs: 40.39 | 15: iteration 72630/ 125429 | consumed samples: 18593280 | consumed tokens: 38079037440 | elapsed time per iteration (s): 1.03 | learning rate: 8.904E-05 | global batch size: 256 | lm loss: 1.948946E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.617 | TFLOPs: 40.92 | 15: iteration 72640/ 125429 | consumed samples: 18595840 | consumed tokens: 38084280320 | elapsed time per iteration (s): 1.02 | learning rate: 8.902E-05 | global batch size: 256 | lm loss: 1.966662E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.614 | TFLOPs: 41.58 | 15: iteration 72650/ 125429 | consumed samples: 18598400 | consumed tokens: 38089523200 | elapsed time per iteration (s): 1.03 | learning rate: 8.900E-05 | global batch size: 256 | lm loss: 1.964984E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.607 | TFLOPs: 40.92 | 15: iteration 72660/ 125429 | consumed samples: 18600960 | consumed tokens: 38094766080 | elapsed time per iteration (s): 1.03 | learning rate: 8.898E-05 | global batch size: 256 | lm loss: 1.948543E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.382 | TFLOPs: 40.88 | 15: iteration 72670/ 125429 | consumed samples: 18603520 | consumed tokens: 38100008960 | elapsed time per iteration (s): 1.04 | learning rate: 8.896E-05 | global batch size: 256 | lm loss: 1.949920E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 246.856 | TFLOPs: 40.79 | 15: iteration 72680/ 125429 | consumed samples: 18606080 | consumed tokens: 38105251840 | elapsed time per iteration (s): 1.02 | learning rate: 8.893E-05 | global batch size: 256 | lm loss: 1.969542E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.448 | TFLOPs: 41.39 | 15: iteration 72690/ 125429 | consumed samples: 18608640 | consumed tokens: 38110494720 | elapsed time per iteration (s): 1.04 | learning rate: 8.891E-05 | global batch size: 256 | lm loss: 1.967649E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.658 | TFLOPs: 40.60 | 15: iteration 72700/ 125429 | consumed samples: 18611200 | consumed tokens: 38115737600 | elapsed time per iteration (s): 1.03 | learning rate: 8.889E-05 | global batch size: 256 | lm loss: 1.949025E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.650 | TFLOPs: 41.26 | 15: iteration 72710/ 125429 | consumed samples: 18613760 | consumed tokens: 38120980480 | elapsed time per iteration (s): 1.03 | learning rate: 8.887E-05 | global batch size: 256 | lm loss: 1.926871E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.009 | TFLOPs: 40.99 | 15: iteration 72720/ 125429 | consumed samples: 18616320 | consumed tokens: 38126223360 | elapsed time per iteration (s): 1.06 | learning rate: 8.885E-05 | global batch size: 256 | lm loss: 1.967519E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.732 | TFLOPs: 39.95 | 15: iteration 72730/ 125429 | consumed samples: 18618880 | consumed tokens: 38131466240 | elapsed time per iteration (s): 1.03 | learning rate: 8.882E-05 | global batch size: 256 | lm loss: 
1.995716E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.423 | TFLOPs: 40.89 | 15: iteration 72740/ 125429 | consumed samples: 18621440 | consumed tokens: 38136709120 | elapsed time per iteration (s): 1.08 | learning rate: 8.880E-05 | global batch size: 256 | lm loss: 1.982881E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.561 | TFLOPs: 39.26 | 15: iteration 72750/ 125429 | consumed samples: 18624000 | consumed tokens: 38141952000 | elapsed time per iteration (s): 1.04 | learning rate: 8.878E-05 | global batch size: 256 | lm loss: 1.942487E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.283 | TFLOPs: 40.87 | 15: iteration 72760/ 125429 | consumed samples: 18626560 | consumed tokens: 38147194880 | elapsed time per iteration (s): 1.04 | learning rate: 8.876E-05 | global batch size: 256 | lm loss: 1.981109E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.150 | TFLOPs: 40.84 | 15: iteration 72770/ 125429 | consumed samples: 18629120 | consumed tokens: 38152437760 | elapsed time per iteration (s): 1.04 | learning rate: 8.873E-05 | global batch size: 256 | lm loss: 1.945781E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.282 | TFLOPs: 40.87 | 15: iteration 72780/ 125429 | consumed samples: 18631680 | consumed tokens: 38157680640 | elapsed time per iteration (s): 1.05 | learning rate: 8.871E-05 | global batch size: 256 | lm loss: 1.988702E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.533 | TFLOPs: 40.41 | 15: iteration 72790/ 125429 | consumed samples: 18634240 | consumed tokens: 
38162923520 | elapsed time per iteration (s): 1.05 | learning rate: 8.869E-05 | global batch size: 256 | lm loss: 1.972740E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.963 | TFLOPs: 40.32 | 15: iteration 72800/ 125429 | consumed samples: 18636800 | consumed tokens: 38168166400 | elapsed time per iteration (s): 1.03 | learning rate: 8.867E-05 | global batch size: 256 | lm loss: 1.951383E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.510 | TFLOPs: 41.07 | 15: iteration 72810/ 125429 | consumed samples: 18639360 | consumed tokens: 38173409280 | elapsed time per iteration (s): 1.04 | learning rate: 8.865E-05 | global batch size: 256 | lm loss: 1.989587E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.342 | TFLOPs: 40.88 | 15: iteration 72820/ 125429 | consumed samples: 18641920 | consumed tokens: 38178652160 | elapsed time per iteration (s): 1.02 | learning rate: 8.862E-05 | global batch size: 256 | lm loss: 1.959031E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.901 | TFLOPs: 41.30 | 15: iteration 72830/ 125429 | consumed samples: 18644480 | consumed tokens: 38183895040 | elapsed time per iteration (s): 1.06 | learning rate: 8.860E-05 | global batch size: 256 | lm loss: 1.951767E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.460 | TFLOPs: 40.07 | 15: iteration 72840/ 125429 | consumed samples: 18647040 | consumed tokens: 38189137920 | elapsed time per iteration (s): 1.05 | learning rate: 8.858E-05 | global batch size: 256 | lm loss: 1.966010E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 243.374 | TFLOPs: 40.22 | 15: iteration 72850/ 125429 | consumed samples: 18649600 | consumed tokens: 38194380800 | elapsed time per iteration (s): 1.11 | learning rate: 8.856E-05 | global batch size: 256 | lm loss: 1.978286E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.835 | TFLOPs: 37.98 | 15: iteration 72860/ 125429 | consumed samples: 18652160 | consumed tokens: 38199623680 | elapsed time per iteration (s): 1.04 | learning rate: 8.854E-05 | global batch size: 256 | lm loss: 1.959853E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.097 | TFLOPs: 40.83 | 15: iteration 72870/ 125429 | consumed samples: 18654720 | consumed tokens: 38204866560 | elapsed time per iteration (s): 1.07 | learning rate: 8.851E-05 | global batch size: 256 | lm loss: 1.957962E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.343 | TFLOPs: 39.55 | 15: iteration 72880/ 125429 | consumed samples: 18657280 | consumed tokens: 38210109440 | elapsed time per iteration (s): 1.04 | learning rate: 8.849E-05 | global batch size: 256 | lm loss: 1.967337E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.517 | TFLOPs: 40.74 | 15: iteration 72890/ 125429 | consumed samples: 18659840 | consumed tokens: 38215352320 | elapsed time per iteration (s): 1.04 | learning rate: 8.847E-05 | global batch size: 256 | lm loss: 1.968239E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.590 | TFLOPs: 40.59 | 15: iteration 72900/ 125429 | consumed samples: 18662400 | consumed tokens: 38220595200 | elapsed time per iteration (s): 1.05 | learning rate: 8.845E-05 | global batch size: 256 | lm loss: 1.932886E+00 | grad 
norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.096 | TFLOPs: 40.17 | 15: iteration 72910/ 125429 | consumed samples: 18664960 | consumed tokens: 38225838080 | elapsed time per iteration (s): 1.03 | learning rate: 8.843E-05 | global batch size: 256 | lm loss: 1.943760E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.561 | TFLOPs: 41.08 | 15: iteration 72920/ 125429 | consumed samples: 18667520 | consumed tokens: 38231080960 | elapsed time per iteration (s): 1.04 | learning rate: 8.840E-05 | global batch size: 256 | lm loss: 1.951915E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.840 | TFLOPs: 40.79 | 15: iteration 72930/ 125429 | consumed samples: 18670080 | consumed tokens: 38236323840 | elapsed time per iteration (s): 1.03 | learning rate: 8.838E-05 | global batch size: 256 | lm loss: 1.952069E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.924 | TFLOPs: 41.14 | 15: iteration 72940/ 125429 | consumed samples: 18672640 | consumed tokens: 38241566720 | elapsed time per iteration (s): 1.08 | learning rate: 8.836E-05 | global batch size: 256 | lm loss: 1.926850E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.716 | TFLOPs: 39.28 | 15: iteration 72950/ 125429 | consumed samples: 18675200 | consumed tokens: 38246809600 | elapsed time per iteration (s): 1.06 | learning rate: 8.834E-05 | global batch size: 256 | lm loss: 1.988667E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.946 | TFLOPs: 39.82 | 15: iteration 72960/ 125429 | consumed samples: 18677760 | consumed tokens: 38252052480 | elapsed time 
per iteration (s): 1.09 | learning rate: 8.831E-05 | global batch size: 256 | lm loss: 1.965397E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.229 | TFLOPs: 38.71 | 15: iteration 72970/ 125429 | consumed samples: 18680320 | consumed tokens: 38257295360 | elapsed time per iteration (s): 1.02 | learning rate: 8.829E-05 | global batch size: 256 | lm loss: 1.947233E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.855 | TFLOPs: 41.46 | 15: iteration 72980/ 125429 | consumed samples: 18682880 | consumed tokens: 38262538240 | elapsed time per iteration (s): 1.04 | learning rate: 8.827E-05 | global batch size: 256 | lm loss: 1.974140E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.047 | TFLOPs: 40.66 | 15: iteration 72990/ 125429 | consumed samples: 18685440 | consumed tokens: 38267781120 | elapsed time per iteration (s): 1.04 | learning rate: 8.825E-05 | global batch size: 256 | lm loss: 1.989211E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.780 | TFLOPs: 40.62 | 15: iteration 73000/ 125429 | consumed samples: 18688000 | consumed tokens: 38273024000 | elapsed time per iteration (s): 1.04 | learning rate: 8.823E-05 | global batch size: 256 | lm loss: 1.959798E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.083 | TFLOPs: 40.83 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 73000 | lm loss value: 1.870638E+00 | lm loss PPL: 6.492437E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 73000 to checkpoints_1b5 0: 
[2022-11-26 17:41:20,395] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step73000 is begin to save! 0: [2022-11-26 17:41:20,403] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_01-model_00-model_states.pt... 0: [2022-11-26 17:41:20,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_01-model_00-model_states.pt. 0: [2022-11-26 17:41:20,640] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_03-model_00-model_states.pt... 0: [2022-11-26 17:41:20,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_03-model_00-model_states.pt. 0: [2022-11-26 17:41:20,743] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_04-model_00-model_states.pt... 0: [2022-11-26 17:41:20,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_04-model_00-model_states.pt. 0: [2022-11-26 17:41:20,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_05-model_00-model_states.pt... 0: [2022-11-26 17:41:20,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_05-model_00-model_states.pt. 0: [2022-11-26 17:41:20,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_06-model_00-model_states.pt... 0: [2022-11-26 17:41:21,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_06-model_00-model_states.pt. 0: [2022-11-26 17:41:21,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_07-model_00-model_states.pt... 0: [2022-11-26 17:41:21,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 17:41:21,182] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_08-model_00-model_states.pt... 0: [2022-11-26 17:41:21,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_08-model_00-model_states.pt. 0: [2022-11-26 17:41:21,290] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_09-model_00-model_states.pt... 0: [2022-11-26 17:41:21,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_09-model_00-model_states.pt. 0: [2022-11-26 17:41:21,400] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_10-model_00-model_states.pt... 0: [2022-11-26 17:41:21,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_10-model_00-model_states.pt. 0: [2022-11-26 17:41:21,511] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_11-model_00-model_states.pt... 0: [2022-11-26 17:41:21,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_11-model_00-model_states.pt. 0: [2022-11-26 17:41:21,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_12-model_00-model_states.pt... 0: [2022-11-26 17:41:21,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_12-model_00-model_states.pt. 0: [2022-11-26 17:41:21,732] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_13-model_00-model_states.pt... 0: [2022-11-26 17:41:21,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 17:41:21,844] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_14-model_00-model_states.pt... 0: [2022-11-26 17:41:21,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_14-model_00-model_states.pt. 0: [2022-11-26 17:41:21,954] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_15-model_00-model_states.pt... 0: [2022-11-26 17:41:22,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_15-model_00-model_states.pt. 0: [2022-11-26 17:41:22,065] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_16-model_00-model_states.pt... 0: [2022-11-26 17:41:22,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_16-model_00-model_states.pt. 0: [2022-11-26 17:41:22,172] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_17-model_00-model_states.pt... 0: [2022-11-26 17:41:22,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_17-model_00-model_states.pt. 0: [2022-11-26 17:41:22,277] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_18-model_00-model_states.pt... 0: [2022-11-26 17:41:22,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_18-model_00-model_states.pt. 0: [2022-11-26 17:41:22,384] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_19-model_00-model_states.pt... 0: [2022-11-26 17:41:22,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 17:41:22,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_20-model_00-model_states.pt... 0: [2022-11-26 17:41:22,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_20-model_00-model_states.pt. 0: [2022-11-26 17:41:22,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_21-model_00-model_states.pt... 0: [2022-11-26 17:41:22,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_21-model_00-model_states.pt. 0: [2022-11-26 17:41:22,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_22-model_00-model_states.pt... 0: [2022-11-26 17:41:22,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_22-model_00-model_states.pt. 0: [2022-11-26 17:41:22,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_23-model_00-model_states.pt... 0: [2022-11-26 17:41:22,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_23-model_00-model_states.pt. 0: [2022-11-26 17:41:22,935] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_24-model_00-model_states.pt... 0: [2022-11-26 17:41:23,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_24-model_00-model_states.pt. 0: [2022-11-26 17:41:23,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_25-model_00-model_states.pt... 0: [2022-11-26 17:41:23,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 17:41:23,149] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_26-model_00-model_states.pt... 0: [2022-11-26 17:41:23,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_26-model_00-model_states.pt. 0: [2022-11-26 17:41:23,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_27-model_00-model_states.pt... 0: [2022-11-26 17:41:23,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_27-model_00-model_states.pt. 0: [2022-11-26 17:41:23,358] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_28-model_00-model_states.pt... 0: [2022-11-26 17:41:23,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_28-model_00-model_states.pt. 0: [2022-11-26 17:41:23,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_29-model_00-model_states.pt... 0: [2022-11-26 17:41:23,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_29-model_00-model_states.pt. 0: [2022-11-26 17:41:23,572] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_30-model_00-model_states.pt... 0: [2022-11-26 17:41:23,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_30-model_00-model_states.pt. 0: [2022-11-26 17:41:23,676] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/layer_32-model_00-model_states.pt... 0: [2022-11-26 17:41:23,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 17:41:23,683] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step73000/mp_rank_00_model_states.pt 0: [2022-11-26 17:41:23,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/mp_rank_00_model_states.pt... 0: [2022-11-26 17:41:23,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/mp_rank_00_model_states.pt. 0: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
10: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
15: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 
12: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
8: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
7: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
3: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
10: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:41:23,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step73000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:41:23,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 17:41:23,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 17:41:23,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 5: [2022-11-26 17:41:23,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:41:23,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 17:41:23,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 1: [2022-11-26 17:41:23,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:41:23,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 17:41:23,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 10: [2022-11-26 17:41:23,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:41:23,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 17:41:23,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 17:41:23,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 17:41:23,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 10: [2022-11-26 17:41:23,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 13: [2022-11-26 17:41:23,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:41:23,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 17:41:23,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 9: [2022-11-26 17:41:23,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:41:23,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 17:41:23,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 3: [2022-11-26 17:41:23,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 17:41:23,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 17:41:23,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 14: [2022-11-26 17:41:23,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:41:23,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 17:41:23,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 6: [2022-11-26 17:41:23,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:41:23,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 17:41:23,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 6: [2022-11-26 17:41:23,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:41:23,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 17:41:23,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 11: [2022-11-26 17:41:23,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
7: [2022-11-26 17:41:23,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:41:23,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 17:41:23,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 4: [2022-11-26 17:41:23,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:41:23,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:41:23,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:41:23,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 17:41:23,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 17:41:23,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 17:41:23,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 17:41:23,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 17:41:23,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 4: [2022-11-26 17:41:23,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 4: [2022-11-26 17:41:23,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 4: [2022-11-26 17:41:23,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 7: [2022-11-26 17:41:23,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:41:23,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 17:41:23,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 14: [2022-11-26 17:41:23,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 17:41:23,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 17:41:23,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 13: [2022-11-26 17:41:23,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:41:23,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:41:23,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 17:41:23,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 8: [2022-11-26 17:41:23,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:41:23,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 17:41:23,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 2: [2022-11-26 17:41:23,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:41:23,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 17:41:23,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
14: [2022-11-26 17:41:23,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:41:23,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 17:41:23,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 14: [2022-11-26 17:41:23,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:41:23,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:41:23,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 14: [2022-11-26 17:41:23,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 2: [2022-11-26 17:41:23,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 14: [2022-11-26 17:41:23,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 6: [2022-11-26 17:41:23,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:41:23,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 17:41:23,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 6: [2022-11-26 17:41:23,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 12: [2022-11-26 17:41:23,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 6: [2022-11-26 17:41:23,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 10: [2022-11-26 17:41:23,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:41:23,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:41:23,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:41:23,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 17:41:23,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 10: [2022-11-26 17:41:23,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 17:41:23,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 17:41:23,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
10: [2022-11-26 17:41:23,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 12: [2022-11-26 17:41:23,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:41:23,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 17:41:23,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 10: [2022-11-26 17:41:23,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:41:23,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 17:41:23,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 12: [2022-11-26 17:41:23,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:41:23,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 17:41:23,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 12: [2022-11-26 17:41:23,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 17:41:23,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 17:41:23,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 6: [2022-11-26 17:41:23,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:41:23,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 17:41:23,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 7: [2022-11-26 17:41:23,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:41:23,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 17:41:23,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 3: [2022-11-26 17:41:23,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:41:23,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 17:41:23,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 17:41:23,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 17:41:23,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 3: [2022-11-26 17:41:23,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 10: [2022-11-26 17:41:23,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:41:23,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 17:41:23,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 2: [2022-11-26 17:41:23,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:41:23,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 17:41:23,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 0: [2022-11-26 17:41:23,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:41:23,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 17:41:23,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 17:41:23,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 9: [2022-11-26 17:41:23,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:41:23,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:41:23,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 9: [2022-11-26 17:41:23,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 12: [2022-11-26 17:41:23,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 9: [2022-11-26 17:41:23,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 11: [2022-11-26 17:41:23,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:41:23,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 17:41:23,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
11: [2022-11-26 17:41:23,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 17:41:23,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 11: [2022-11-26 17:41:23,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:41:23,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 0: [2022-11-26 17:41:23,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:41:23,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 11: [2022-11-26 17:41:23,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:41:23,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 17:41:23,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 0: [2022-11-26 17:41:23,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 11: [2022-11-26 17:41:23,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 17:41:23,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 0: [2022-11-26 17:41:23,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 11: [2022-11-26 17:41:23,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 11: [2022-11-26 17:41:23,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:41:23,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 17:41:23,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 2: [2022-11-26 17:41:23,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:41:23,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 17:41:23,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 2: [2022-11-26 17:41:23,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:41:23,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 17:41:23,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
5: [2022-11-26 17:41:23,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:41:23,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:41:23,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 17:41:23,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 17:41:23,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 5: [2022-11-26 17:41:23,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 1: [2022-11-26 17:41:23,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:41:23,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:41:23,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:41:23,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 17:41:23,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
1: [2022-11-26 17:41:23,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 9: [2022-11-26 17:41:23,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:41:23,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:41:23,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 9: [2022-11-26 17:41:23,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 1: [2022-11-26 17:41:23,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 1: [2022-11-26 17:41:23,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 9: [2022-11-26 17:41:23,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 17:41:23,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 9: [2022-11-26 17:41:23,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 9: [2022-11-26 17:41:23,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 17:41:23,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 17:41:23,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 14: [2022-11-26 17:41:23,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:41:23,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:41:23,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:41:23,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 6: [2022-11-26 17:41:23,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 9: [2022-11-26 17:41:23,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:41:23,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 17:41:23,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 6: [2022-11-26 17:41:23,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 14: [2022-11-26 17:41:23,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
9: [2022-11-26 17:41:23,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 17:41:23,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 9: [2022-11-26 17:41:23,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:41:23,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 17:41:23,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 9: [2022-11-26 17:41:23,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:41:23,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:41:23,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 14: [2022-11-26 17:41:23,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 15: [2022-11-26 17:41:23,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:41:23,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:41:23,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
14: [2022-11-26 17:41:23,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 15: [2022-11-26 17:41:23,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 17:41:23,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 17:41:23,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 15: [2022-11-26 17:41:23,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 0: [2022-11-26 17:41:23,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:41:23,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 17:41:23,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 0: [2022-11-26 17:41:23,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:41:23,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 17:41:23,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 1: [2022-11-26 17:41:23,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 17:41:23,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 14: [2022-11-26 17:41:23,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:41:23,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 14: [2022-11-26 17:41:23,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 17:41:23,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 7: [2022-11-26 17:41:23,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:41:23,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 17:41:23,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 13: [2022-11-26 17:41:23,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 5: [2022-11-26 17:41:23,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:41:23,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 13: [2022-11-26 17:41:23,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 17:41:23,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 17:41:23,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 5: [2022-11-26 17:41:23,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 13: [2022-11-26 17:41:23,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:41:23,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 4: [2022-11-26 17:41:23,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:41:23,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 17:41:23,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 13: [2022-11-26 17:41:23,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:41:23,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 17:41:23,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 17:41:23,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 4: [2022-11-26 17:41:23,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 13: [2022-11-26 17:41:23,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 13: [2022-11-26 17:41:23,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 4: [2022-11-26 17:41:23,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 10: [2022-11-26 17:41:23,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:41:23,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:41:23,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 5: [2022-11-26 17:41:23,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 10: [2022-11-26 17:41:23,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 5: [2022-11-26 17:41:23,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
3: [2022-11-26 17:41:23,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:41:23,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:41:23,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 17:41:23,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 5: [2022-11-26 17:41:23,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:41:23,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 3: [2022-11-26 17:41:23,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 5: [2022-11-26 17:41:23,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 17:41:23,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 6: [2022-11-26 17:41:23,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:41:23,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 17:41:23,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
0: [2022-11-26 17:41:23,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:41:23,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 17:41:23,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 0: [2022-11-26 17:41:23,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:41:23,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 17:41:23,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 4: [2022-11-26 17:41:23,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:41:23,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 17:41:23,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 4: [2022-11-26 17:41:23,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:41:23,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 17:41:23,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
15: [2022-11-26 17:41:23,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:41:23,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:41:23,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 17:41:23,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 17:41:23,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 15: [2022-11-26 17:41:23,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 15: [2022-11-26 17:41:23,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:41:23,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 2: [2022-11-26 17:41:23,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:41:23,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 2: [2022-11-26 17:41:23,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 17:41:23,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
5: [2022-11-26 17:41:23,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:41:23,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 17:41:23,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 5: [2022-11-26 17:41:23,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:41:23,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 17:41:23,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 6: [2022-11-26 17:41:23,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:41:23,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 17:41:23,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 2: [2022-11-26 17:41:23,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:41:23,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 17:41:23,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
3: [2022-11-26 17:41:23,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:41:23,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 17:41:23,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 3: [2022-11-26 17:41:23,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:41:23,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 17:41:23,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:41:23,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 3: [2022-11-26 17:41:23,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 17:41:23,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 13: [2022-11-26 17:41:23,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:41:23,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:41:23,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 17:41:23,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:41:23,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 17:41:23,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 17:41:23,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 17:41:23,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 8: [2022-11-26 17:41:23,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 8: [2022-11-26 17:41:23,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 8: [2022-11-26 17:41:23,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:41:23,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 17:41:23,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 8: [2022-11-26 17:41:23,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 17:41:23,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 17:41:23,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 8: [2022-11-26 17:41:23,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:41:23,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 17:41:23,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 12: [2022-11-26 17:41:23,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:41:23,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 17:41:23,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 1: [2022-11-26 17:41:23,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:41:23,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 17:41:23,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 1: [2022-11-26 17:41:23,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 17:41:23,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 17:41:23,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 1: [2022-11-26 17:41:23,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:41:23,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 17:41:23,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 0: [2022-11-26 17:41:23,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:41:23,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:41:23,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 17:41:23,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 17:41:23,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 0: [2022-11-26 17:41:23,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 7: [2022-11-26 17:41:23,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 17:41:23,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 17:41:23,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 7: [2022-11-26 17:41:23,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:41:23,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 17:41:23,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 4: [2022-11-26 17:41:23,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:41:23,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 17:41:23,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 7: [2022-11-26 17:41:23,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:41:23,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 17:41:23,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 7: [2022-11-26 17:41:23,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 17:41:23,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 17:41:23,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 13: [2022-11-26 17:41:23,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 17:41:23,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 12: [2022-11-26 17:41:23,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:41:23,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 17:41:23,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 13: [2022-11-26 17:41:23,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:41:23,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 17:41:23,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 11: [2022-11-26 17:41:23,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 17:41:23,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 17:41:23,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 11: [2022-11-26 17:41:23,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:41:23,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 17:41:23,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 12: [2022-11-26 17:41:23,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:41:23,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 17:41:23,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 15: [2022-11-26 17:41:23,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:41:23,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 17:41:23,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 15: [2022-11-26 17:41:23,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 17:41:23,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 17:41:23,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 15: [2022-11-26 17:41:23,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:41:23,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 17:41:23,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 0: [2022-11-26 17:41:23,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step73000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 17:41:23,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step73000 is ready now! 
0: successfully saved checkpoint at iteration 73000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3642.14 15: iteration 73010/ 125429 | consumed samples: 18690560 | consumed tokens: 38278266880 | elapsed time per iteration (s): 1.43 | learning rate: 8.820E-05 | global batch size: 256 | lm loss: 1.971033E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 179.470 | TFLOPs: 29.66 | 15: iteration 73020/ 125429 | consumed samples: 18693120 | consumed tokens: 38283509760 | elapsed time per iteration (s): 1.02 | learning rate: 8.818E-05 | global batch size: 256 | lm loss: 1.973846E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.898 | TFLOPs: 41.30 | 15: iteration 73030/ 125429 | consumed samples: 18695680 | consumed tokens: 38288752640 | elapsed time per iteration (s): 1.03 | learning rate: 8.816E-05 | global batch size: 256 | lm loss: 1.951608E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.906 | TFLOPs: 40.97 | 15: iteration 73040/ 125429 | consumed samples: 18698240 | consumed tokens: 38293995520 | elapsed time per iteration (s): 1.03 | learning rate: 8.814E-05 | global batch size: 256 | lm loss: 1.970672E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.414 | TFLOPs: 41.05 | 15: iteration 73050/ 125429 | consumed samples: 18700800 | consumed tokens: 38299238400 | elapsed time per iteration (s): 1.03 | learning rate: 8.812E-05 | global batch size: 256 | lm loss: 1.972597E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.860 | TFLOPs: 40.96 | 15: iteration 73060/ 125429 | consumed samples: 18703360 | consumed tokens: 38304481280 | elapsed time per iteration (s): 1.02 | 
learning rate: 8.809E-05 | global batch size: 256 | lm loss: 1.965492E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.358 | TFLOPs: 41.54 | 15: iteration 73070/ 125429 | consumed samples: 18705920 | consumed tokens: 38309724160 | elapsed time per iteration (s): 1.06 | learning rate: 8.807E-05 | global batch size: 256 | lm loss: 1.958391E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.787 | TFLOPs: 39.79 | 15: iteration 73080/ 125429 | consumed samples: 18708480 | consumed tokens: 38314967040 | elapsed time per iteration (s): 1.04 | learning rate: 8.805E-05 | global batch size: 256 | lm loss: 1.947143E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.217 | TFLOPs: 40.69 | 15: iteration 73090/ 125429 | consumed samples: 18711040 | consumed tokens: 38320209920 | elapsed time per iteration (s): 1.06 | learning rate: 8.803E-05 | global batch size: 256 | lm loss: 1.962051E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.164 | TFLOPs: 39.85 | 15: iteration 73100/ 125429 | consumed samples: 18713600 | consumed tokens: 38325452800 | elapsed time per iteration (s): 1.05 | learning rate: 8.801E-05 | global batch size: 256 | lm loss: 1.978340E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.948 | TFLOPs: 40.31 | 15: iteration 73110/ 125429 | consumed samples: 18716160 | consumed tokens: 38330695680 | elapsed time per iteration (s): 1.04 | learning rate: 8.798E-05 | global batch size: 256 | lm loss: 1.946791E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.066 | TFLOPs: 40.83 | 15: iteration 73120/ 
125429 | consumed samples: 18718720 | consumed tokens: 38335938560 | elapsed time per iteration (s): 1.03 | learning rate: 8.796E-05 | global batch size: 256 | lm loss: 1.982990E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.264 | TFLOPs: 41.19 | 15: iteration 73130/ 125429 | consumed samples: 18721280 | consumed tokens: 38341181440 | elapsed time per iteration (s): 1.03 | learning rate: 8.794E-05 | global batch size: 256 | lm loss: 1.973384E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.624 | TFLOPs: 41.09 | 15: iteration 73140/ 125429 | consumed samples: 18723840 | consumed tokens: 38346424320 | elapsed time per iteration (s): 1.02 | learning rate: 8.792E-05 | global batch size: 256 | lm loss: 1.952766E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.528 | TFLOPs: 41.57 | 15: iteration 73150/ 125429 | consumed samples: 18726400 | consumed tokens: 38351667200 | elapsed time per iteration (s): 1.04 | learning rate: 8.790E-05 | global batch size: 256 | lm loss: 1.953633E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.546 | TFLOPs: 40.74 | 15: iteration 73160/ 125429 | consumed samples: 18728960 | consumed tokens: 38356910080 | elapsed time per iteration (s): 1.10 | learning rate: 8.787E-05 | global batch size: 256 | lm loss: 1.940275E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.627 | TFLOPs: 38.61 | 15: iteration 73170/ 125429 | consumed samples: 18731520 | consumed tokens: 38362152960 | elapsed time per iteration (s): 1.04 | learning rate: 8.785E-05 | global batch size: 256 | lm loss: 1.945752E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 245.416 | TFLOPs: 40.56 | 15: iteration 73180/ 125429 | consumed samples: 18734080 | consumed tokens: 38367395840 | elapsed time per iteration (s): 1.06 | learning rate: 8.783E-05 | global batch size: 256 | lm loss: 1.966872E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.268 | TFLOPs: 40.04 | 15: iteration 73190/ 125429 | consumed samples: 18736640 | consumed tokens: 38372638720 | elapsed time per iteration (s): 1.04 | learning rate: 8.781E-05 | global batch size: 256 | lm loss: 1.942052E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.183 | TFLOPs: 40.85 | 15: iteration 73200/ 125429 | consumed samples: 18739200 | consumed tokens: 38377881600 | elapsed time per iteration (s): 1.04 | learning rate: 8.778E-05 | global batch size: 256 | lm loss: 1.977624E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.401 | TFLOPs: 40.55 | 15: iteration 73210/ 125429 | consumed samples: 18741760 | consumed tokens: 38383124480 | elapsed time per iteration (s): 1.04 | learning rate: 8.776E-05 | global batch size: 256 | lm loss: 1.987097E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.727 | TFLOPs: 40.77 | 15: iteration 73220/ 125429 | consumed samples: 18744320 | consumed tokens: 38388367360 | elapsed time per iteration (s): 1.03 | learning rate: 8.774E-05 | global batch size: 256 | lm loss: 1.929561E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.678 | TFLOPs: 41.26 | 15: iteration 73230/ 125429 | consumed samples: 18746880 | consumed tokens: 38393610240 | elapsed time per iteration (s): 1.06 | learning rate: 
8.772E-05 | global batch size: 256 | lm loss: 1.957758E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.086 | TFLOPs: 40.01 | 15: iteration 73240/ 125429 | consumed samples: 18749440 | consumed tokens: 38398853120 | elapsed time per iteration (s): 1.05 | learning rate: 8.770E-05 | global batch size: 256 | lm loss: 1.959931E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.169 | TFLOPs: 40.19 | 15: iteration 73250/ 125429 | consumed samples: 18752000 | consumed tokens: 38404096000 | elapsed time per iteration (s): 1.03 | learning rate: 8.767E-05 | global batch size: 256 | lm loss: 1.990606E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.454 | TFLOPs: 41.22 | 15: iteration 73260/ 125429 | consumed samples: 18754560 | consumed tokens: 38409338880 | elapsed time per iteration (s): 1.05 | learning rate: 8.765E-05 | global batch size: 256 | lm loss: 1.968777E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.240 | TFLOPs: 40.36 | 15: iteration 73270/ 125429 | consumed samples: 18757120 | consumed tokens: 38414581760 | elapsed time per iteration (s): 1.05 | learning rate: 8.763E-05 | global batch size: 256 | lm loss: 1.945375E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.238 | TFLOPs: 40.20 | 15: iteration 73280/ 125429 | consumed samples: 18759680 | consumed tokens: 38419824640 | elapsed time per iteration (s): 1.07 | learning rate: 8.761E-05 | global batch size: 256 | lm loss: 1.960260E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.313 | TFLOPs: 39.71 | 15: iteration 73290/ 125429 | 
consumed samples: 18762240 | consumed tokens: 38425067520 | elapsed time per iteration (s): 1.03 | learning rate: 8.759E-05 | global batch size: 256 | lm loss: 1.966366E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.691 | TFLOPs: 41.10 | 15: iteration 73300/ 125429 | consumed samples: 18764800 | consumed tokens: 38430310400 | elapsed time per iteration (s): 1.06 | learning rate: 8.756E-05 | global batch size: 256 | lm loss: 1.958598E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.318 | TFLOPs: 40.04 | 15: iteration 73310/ 125429 | consumed samples: 18767360 | consumed tokens: 38435553280 | elapsed time per iteration (s): 1.49 | learning rate: 8.754E-05 | global batch size: 256 | lm loss: 1.933103E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 171.283 | TFLOPs: 28.31 | 15: iteration 73320/ 125429 | consumed samples: 18769920 | consumed tokens: 38440796160 | elapsed time per iteration (s): 1.05 | learning rate: 8.752E-05 | global batch size: 256 | lm loss: 1.960742E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.796 | TFLOPs: 40.29 | 15: iteration 73330/ 125429 | consumed samples: 18772480 | consumed tokens: 38446039040 | elapsed time per iteration (s): 1.04 | learning rate: 8.750E-05 | global batch size: 256 | lm loss: 1.958933E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.240 | TFLOPs: 40.69 | 15: iteration 73340/ 125429 | consumed samples: 18775040 | consumed tokens: 38451281920 | elapsed time per iteration (s): 1.07 | learning rate: 8.748E-05 | global batch size: 256 | lm loss: 1.965086E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 238.759 | TFLOPs: 39.46 | 15: iteration 73350/ 125429 | consumed samples: 18777600 | consumed tokens: 38456524800 | elapsed time per iteration (s): 1.05 | learning rate: 8.745E-05 | global batch size: 256 | lm loss: 1.967838E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.055 | TFLOPs: 40.33 | 15: iteration 73360/ 125429 | consumed samples: 18780160 | consumed tokens: 38461767680 | elapsed time per iteration (s): 1.15 | learning rate: 8.743E-05 | global batch size: 256 | lm loss: 1.944917E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.997 | TFLOPs: 36.69 | 15: iteration 73370/ 125429 | consumed samples: 18782720 | consumed tokens: 38467010560 | elapsed time per iteration (s): 1.04 | learning rate: 8.741E-05 | global batch size: 256 | lm loss: 1.948220E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.720 | TFLOPs: 40.61 | 15: iteration 73380/ 125429 | consumed samples: 18785280 | consumed tokens: 38472253440 | elapsed time per iteration (s): 1.03 | learning rate: 8.739E-05 | global batch size: 256 | lm loss: 1.956416E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.386 | TFLOPs: 41.21 | 15: iteration 73390/ 125429 | consumed samples: 18787840 | consumed tokens: 38477496320 | elapsed time per iteration (s): 1.05 | learning rate: 8.737E-05 | global batch size: 256 | lm loss: 1.945591E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.976 | TFLOPs: 40.48 | 15: iteration 73400/ 125429 | consumed samples: 18790400 | consumed tokens: 38482739200 | elapsed time per iteration (s): 1.04 | learning rate: 8.734E-05 | global batch 
size: 256 | lm loss: 1.958552E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.935 | TFLOPs: 40.64 | 15: iteration 73410/ 125429 | consumed samples: 18792960 | consumed tokens: 38487982080 | elapsed time per iteration (s): 1.03 | learning rate: 8.732E-05 | global batch size: 256 | lm loss: 1.955687E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.069 | TFLOPs: 41.00 | 15: iteration 73420/ 125429 | consumed samples: 18795520 | consumed tokens: 38493224960 | elapsed time per iteration (s): 1.03 | learning rate: 8.730E-05 | global batch size: 256 | lm loss: 1.960942E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.986 | TFLOPs: 41.15 | 15: iteration 73430/ 125429 | consumed samples: 18798080 | consumed tokens: 38498467840 | elapsed time per iteration (s): 1.03 | learning rate: 8.728E-05 | global batch size: 256 | lm loss: 1.975868E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.429 | TFLOPs: 41.05 | 15: iteration 73440/ 125429 | consumed samples: 18800640 | consumed tokens: 38503710720 | elapsed time per iteration (s): 1.04 | learning rate: 8.726E-05 | global batch size: 256 | lm loss: 1.954136E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.817 | TFLOPs: 40.79 | 15: iteration 73450/ 125429 | consumed samples: 18803200 | consumed tokens: 38508953600 | elapsed time per iteration (s): 1.03 | learning rate: 8.723E-05 | global batch size: 256 | lm loss: 1.936432E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.474 | TFLOPs: 41.06 | 15: iteration 73460/ 125429 | consumed samples: 18805760 | 
consumed tokens: 38514196480 | elapsed time per iteration (s): 1.06 | learning rate: 8.721E-05 | global batch size: 256 | lm loss: 1.931749E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.010 | TFLOPs: 39.99 | 15: iteration 73470/ 125429 | consumed samples: 18808320 | consumed tokens: 38519439360 | elapsed time per iteration (s): 1.18 | learning rate: 8.719E-05 | global batch size: 256 | lm loss: 1.952365E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.503 | TFLOPs: 35.94 | 15: iteration 73480/ 125429 | consumed samples: 18810880 | consumed tokens: 38524682240 | elapsed time per iteration (s): 1.05 | learning rate: 8.717E-05 | global batch size: 256 | lm loss: 1.945307E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.214 | TFLOPs: 40.19 | 15: iteration 73490/ 125429 | consumed samples: 18813440 | consumed tokens: 38529925120 | elapsed time per iteration (s): 1.06 | learning rate: 8.715E-05 | global batch size: 256 | lm loss: 1.953698E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.303 | TFLOPs: 39.88 | 15: iteration 73500/ 125429 | consumed samples: 18816000 | consumed tokens: 38535168000 | elapsed time per iteration (s): 1.04 | learning rate: 8.712E-05 | global batch size: 256 | lm loss: 1.969202E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.829 | TFLOPs: 40.79 | 15: iteration 73510/ 125429 | consumed samples: 18818560 | consumed tokens: 38540410880 | elapsed time per iteration (s): 1.08 | learning rate: 8.710E-05 | global batch size: 256 | lm loss: 1.966078E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 236.225 | TFLOPs: 39.04 | 15: iteration 73520/ 125429 | consumed samples: 18821120 | consumed tokens: 38545653760 | elapsed time per iteration (s): 1.05 | learning rate: 8.708E-05 | global batch size: 256 | lm loss: 1.982051E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.324 | TFLOPs: 40.21 | 15: iteration 73530/ 125429 | consumed samples: 18823680 | consumed tokens: 38550896640 | elapsed time per iteration (s): 1.04 | learning rate: 8.706E-05 | global batch size: 256 | lm loss: 1.961708E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.177 | TFLOPs: 40.68 | 15: iteration 73540/ 125429 | consumed samples: 18826240 | consumed tokens: 38556139520 | elapsed time per iteration (s): 1.05 | learning rate: 8.704E-05 | global batch size: 256 | lm loss: 1.968944E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.801 | TFLOPs: 40.29 | 15: iteration 73550/ 125429 | consumed samples: 18828800 | consumed tokens: 38561382400 | elapsed time per iteration (s): 1.03 | learning rate: 8.701E-05 | global batch size: 256 | lm loss: 2.004187E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.379 | TFLOPs: 40.88 | 15: iteration 73560/ 125429 | consumed samples: 18831360 | consumed tokens: 38566625280 | elapsed time per iteration (s): 1.05 | learning rate: 8.699E-05 | global batch size: 256 | lm loss: 1.969033E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.873 | TFLOPs: 40.47 | 15: iteration 73570/ 125429 | consumed samples: 18833920 | consumed tokens: 38571868160 | elapsed time per iteration (s): 1.03 | learning rate: 8.697E-05 | global batch size: 256 | lm loss: 
1.924428E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.741 | TFLOPs: 41.27 | 15: iteration 73580/ 125429 | consumed samples: 18836480 | consumed tokens: 38577111040 | elapsed time per iteration (s): 1.05 | learning rate: 8.695E-05 | global batch size: 256 | lm loss: 1.943469E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.112 | TFLOPs: 40.34 | 15: iteration 73590/ 125429 | consumed samples: 18839040 | consumed tokens: 38582353920 | elapsed time per iteration (s): 1.17 | learning rate: 8.693E-05 | global batch size: 256 | lm loss: 1.995920E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 219.605 | TFLOPs: 36.29 | 15: iteration 73600/ 125429 | consumed samples: 18841600 | consumed tokens: 38587596800 | elapsed time per iteration (s): 1.03 | learning rate: 8.690E-05 | global batch size: 256 | lm loss: 1.949413E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.449 | TFLOPs: 41.06 | 15: iteration 73610/ 125429 | consumed samples: 18844160 | consumed tokens: 38592839680 | elapsed time per iteration (s): 1.06 | learning rate: 8.688E-05 | global batch size: 256 | lm loss: 1.967870E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.770 | TFLOPs: 39.95 | 15: iteration 73620/ 125429 | consumed samples: 18846720 | consumed tokens: 38598082560 | elapsed time per iteration (s): 1.03 | learning rate: 8.686E-05 | global batch size: 256 | lm loss: 1.935125E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.563 | TFLOPs: 41.24 | 15: iteration 73630/ 125429 | consumed samples: 18849280 | consumed tokens: 
38603325440 | elapsed time per iteration (s): 1.07 | learning rate: 8.684E-05 | global batch size: 256 | lm loss: 1.966565E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.867 | TFLOPs: 39.64 | 15: iteration 73640/ 125429 | consumed samples: 18851840 | consumed tokens: 38608568320 | elapsed time per iteration (s): 1.08 | learning rate: 8.682E-05 | global batch size: 256 | lm loss: 1.965627E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.730 | TFLOPs: 39.29 | 15: iteration 73650/ 125429 | consumed samples: 18854400 | consumed tokens: 38613811200 | elapsed time per iteration (s): 1.03 | learning rate: 8.679E-05 | global batch size: 256 | lm loss: 1.943635E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.731 | TFLOPs: 41.10 | 15: iteration 73660/ 125429 | consumed samples: 18856960 | consumed tokens: 38619054080 | elapsed time per iteration (s): 1.05 | learning rate: 8.677E-05 | global batch size: 256 | lm loss: 1.999543E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.473 | TFLOPs: 40.40 | 15: iteration 73670/ 125429 | consumed samples: 18859520 | consumed tokens: 38624296960 | elapsed time per iteration (s): 1.05 | learning rate: 8.675E-05 | global batch size: 256 | lm loss: 1.948072E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.693 | TFLOPs: 40.44 | 15: iteration 73680/ 125429 | consumed samples: 18862080 | consumed tokens: 38629539840 | elapsed time per iteration (s): 1.15 | learning rate: 8.673E-05 | global batch size: 256 | lm loss: 1.968542E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 223.011 | TFLOPs: 36.85 | 15: iteration 73690/ 125429 | consumed samples: 18864640 | consumed tokens: 38634782720 | elapsed time per iteration (s): 1.18 | learning rate: 8.671E-05 | global batch size: 256 | lm loss: 1.961245E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.816 | TFLOPs: 35.83 | 15: iteration 73700/ 125429 | consumed samples: 18867200 | consumed tokens: 38640025600 | elapsed time per iteration (s): 1.19 | learning rate: 8.668E-05 | global batch size: 256 | lm loss: 1.955136E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.178 | TFLOPs: 35.56 | 15: iteration 73710/ 125429 | consumed samples: 18869760 | consumed tokens: 38645268480 | elapsed time per iteration (s): 1.17 | learning rate: 8.666E-05 | global batch size: 256 | lm loss: 1.962446E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 218.473 | TFLOPs: 36.10 | 15: iteration 73720/ 125429 | consumed samples: 18872320 | consumed tokens: 38650511360 | elapsed time per iteration (s): 1.04 | learning rate: 8.664E-05 | global batch size: 256 | lm loss: 1.973739E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.476 | TFLOPs: 40.73 | 15: iteration 73730/ 125429 | consumed samples: 18874880 | consumed tokens: 38655754240 | elapsed time per iteration (s): 1.11 | learning rate: 8.662E-05 | global batch size: 256 | lm loss: 1.960312E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.263 | TFLOPs: 38.22 | 15: iteration 73740/ 125429 | consumed samples: 18877440 | consumed tokens: 38660997120 | elapsed time per iteration (s): 1.04 | learning rate: 8.660E-05 | global batch size: 256 | lm loss: 1.984008E+00 | grad 
norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.680 | TFLOPs: 40.60 | 15: iteration 73750/ 125429 | consumed samples: 18880000 | consumed tokens: 38666240000 | elapsed time per iteration (s): 1.05 | learning rate: 8.657E-05 | global batch size: 256 | lm loss: 1.964726E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.029 | TFLOPs: 40.16 | 15: iteration 73760/ 125429 | consumed samples: 18882560 | consumed tokens: 38671482880 | elapsed time per iteration (s): 1.04 | learning rate: 8.655E-05 | global batch size: 256 | lm loss: 1.944873E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.106 | TFLOPs: 40.67 | 15: iteration 73770/ 125429 | consumed samples: 18885120 | consumed tokens: 38676725760 | elapsed time per iteration (s): 1.04 | learning rate: 8.653E-05 | global batch size: 256 | lm loss: 1.957254E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.893 | TFLOPs: 40.80 | 15: iteration 73780/ 125429 | consumed samples: 18887680 | consumed tokens: 38681968640 | elapsed time per iteration (s): 1.05 | learning rate: 8.651E-05 | global batch size: 256 | lm loss: 1.946396E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.166 | TFLOPs: 40.35 | 15: iteration 73790/ 125429 | consumed samples: 18890240 | consumed tokens: 38687211520 | elapsed time per iteration (s): 1.07 | learning rate: 8.649E-05 | global batch size: 256 | lm loss: 1.980037E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.484 | TFLOPs: 39.58 | 15: iteration 73800/ 125429 | consumed samples: 18892800 | consumed tokens: 38692454400 | elapsed time 
per iteration (s): 1.09 | learning rate: 8.646E-05 | global batch size: 256 | lm loss: 1.928704E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.700 | TFLOPs: 38.95 | 15: iteration 73810/ 125429 | consumed samples: 18895360 | consumed tokens: 38697697280 | elapsed time per iteration (s): 1.07 | learning rate: 8.644E-05 | global batch size: 256 | lm loss: 1.950332E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.971 | TFLOPs: 39.49 | 15: iteration 73820/ 125429 | consumed samples: 18897920 | consumed tokens: 38702940160 | elapsed time per iteration (s): 1.07 | learning rate: 8.642E-05 | global batch size: 256 | lm loss: 1.949729E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.596 | TFLOPs: 39.43 | 15: iteration 73830/ 125429 | consumed samples: 18900480 | consumed tokens: 38708183040 | elapsed time per iteration (s): 1.02 | learning rate: 8.640E-05 | global batch size: 256 | lm loss: 1.975887E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.328 | TFLOPs: 41.37 | 15: iteration 73840/ 125429 | consumed samples: 18903040 | consumed tokens: 38713425920 | elapsed time per iteration (s): 1.03 | learning rate: 8.638E-05 | global batch size: 256 | lm loss: 1.977987E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.415 | TFLOPs: 40.89 | 15: iteration 73850/ 125429 | consumed samples: 18905600 | consumed tokens: 38718668800 | elapsed time per iteration (s): 1.03 | learning rate: 8.635E-05 | global batch size: 256 | lm loss: 1.955423E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.806 | TFLOPs: 
41.12 | 15: iteration 73860/ 125429 | consumed samples: 18908160 | consumed tokens: 38723911680 | elapsed time per iteration (s): 1.10 | learning rate: 8.633E-05 | global batch size: 256 | lm loss: 1.971765E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.769 | TFLOPs: 38.47 | 15: iteration 73870/ 125429 | consumed samples: 18910720 | consumed tokens: 38729154560 | elapsed time per iteration (s): 1.18 | learning rate: 8.631E-05 | global batch size: 256 | lm loss: 1.967034E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.778 | TFLOPs: 35.82 | 15: iteration 73880/ 125429 | consumed samples: 18913280 | consumed tokens: 38734397440 | elapsed time per iteration (s): 1.06 | learning rate: 8.629E-05 | global batch size: 256 | lm loss: 1.965470E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.466 | TFLOPs: 39.90 | 15: iteration 73890/ 125429 | consumed samples: 18915840 | consumed tokens: 38739640320 | elapsed time per iteration (s): 1.05 | learning rate: 8.627E-05 | global batch size: 256 | lm loss: 1.962129E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.227 | TFLOPs: 40.36 | 15: iteration 73900/ 125429 | consumed samples: 18918400 | consumed tokens: 38744883200 | elapsed time per iteration (s): 1.05 | learning rate: 8.624E-05 | global batch size: 256 | lm loss: 1.953699E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.812 | TFLOPs: 40.46 | 15: iteration 73910/ 125429 | consumed samples: 18920960 | consumed tokens: 38750126080 | elapsed time per iteration (s): 1.04 | learning rate: 8.622E-05 | global batch size: 256 | lm loss: 1.940336E+00 | grad norm: 0.140 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.476 | TFLOPs: 40.57 | 15: iteration 73920/ 125429 | consumed samples: 18923520 | consumed tokens: 38755368960 | elapsed time per iteration (s): 1.04 | learning rate: 8.620E-05 | global batch size: 256 | lm loss: 1.989504E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.087 | TFLOPs: 40.83 | 15: iteration 73930/ 125429 | consumed samples: 18926080 | consumed tokens: 38760611840 | elapsed time per iteration (s): 1.05 | learning rate: 8.618E-05 | global batch size: 256 | lm loss: 1.940271E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.939 | TFLOPs: 40.15 | 15: iteration 73940/ 125429 | consumed samples: 18928640 | consumed tokens: 38765854720 | elapsed time per iteration (s): 1.05 | learning rate: 8.616E-05 | global batch size: 256 | lm loss: 1.945774E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.775 | TFLOPs: 40.29 | 15: iteration 73950/ 125429 | consumed samples: 18931200 | consumed tokens: 38771097600 | elapsed time per iteration (s): 1.07 | learning rate: 8.613E-05 | global batch size: 256 | lm loss: 1.983471E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.913 | TFLOPs: 39.48 | 15: iteration 73960/ 125429 | consumed samples: 18933760 | consumed tokens: 38776340480 | elapsed time per iteration (s): 1.06 | learning rate: 8.611E-05 | global batch size: 256 | lm loss: 1.919650E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.449 | TFLOPs: 39.74 | 15: iteration 73970/ 125429 | consumed samples: 18936320 | consumed tokens: 38781583360 | elapsed time per iteration (s): 1.04 | 
learning rate: 8.609E-05 | global batch size: 256 | lm loss: 1.975273E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.963 | TFLOPs: 40.65 | 15: iteration 73980/ 125429 | consumed samples: 18938880 | consumed tokens: 38786826240 | elapsed time per iteration (s): 1.33 | learning rate: 8.607E-05 | global batch size: 256 | lm loss: 1.977828E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 192.743 | TFLOPs: 31.85 | 15: iteration 73990/ 125429 | consumed samples: 18941440 | consumed tokens: 38792069120 | elapsed time per iteration (s): 1.03 | learning rate: 8.605E-05 | global batch size: 256 | lm loss: 1.954138E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.281 | TFLOPs: 41.20 | 0: [2022-11-26 17:59:07,799] [INFO] [logging.py:68:log_dist] [Rank 0] step=74000, skipped=0, lr=[8.60242566850819e-05, 8.60242566850819e-05, 8.60242566850819e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 74000/ 125429 | consumed samples: 18944000 | consumed tokens: 38797312000 | elapsed time per iteration (s): 1.05 | learning rate: 8.602E-05 | global batch size: 256 | lm loss: 1.959010E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.132 | TFLOPs: 40.34 | 0: steps: 74000 loss: 1.9188 iter time (s): 1.052 samples/sec: 243.248 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 74000 | lm loss value: 1.855560E+00 | lm loss PPL: 6.395277E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 74000 to checkpoints_1b5 0: [2022-11-26 17:59:08,192] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step74000 is begin to save! 0: [2022-11-26 17:59:08,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_01-model_00-model_states.pt... 0: [2022-11-26 17:59:08,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_01-model_00-model_states.pt. 0: [2022-11-26 17:59:08,506] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_03-model_00-model_states.pt... 0: [2022-11-26 17:59:08,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_03-model_00-model_states.pt. 0: [2022-11-26 17:59:08,618] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_04-model_00-model_states.pt... 0: [2022-11-26 17:59:08,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_04-model_00-model_states.pt. 0: [2022-11-26 17:59:08,738] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_05-model_00-model_states.pt... 0: [2022-11-26 17:59:08,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_05-model_00-model_states.pt. 0: [2022-11-26 17:59:08,856] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_06-model_00-model_states.pt... 0: [2022-11-26 17:59:08,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_06-model_00-model_states.pt. 0: [2022-11-26 17:59:08,968] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_07-model_00-model_states.pt... 0: [2022-11-26 17:59:09,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 17:59:09,085] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_08-model_00-model_states.pt... 0: [2022-11-26 17:59:09,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_08-model_00-model_states.pt. 0: [2022-11-26 17:59:09,203] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_09-model_00-model_states.pt... 0: [2022-11-26 17:59:09,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_09-model_00-model_states.pt. 0: [2022-11-26 17:59:09,321] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_10-model_00-model_states.pt... 0: [2022-11-26 17:59:09,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_10-model_00-model_states.pt. 0: [2022-11-26 17:59:09,437] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_11-model_00-model_states.pt... 0: [2022-11-26 17:59:09,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_11-model_00-model_states.pt. 0: [2022-11-26 17:59:09,557] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_12-model_00-model_states.pt... 0: [2022-11-26 17:59:09,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_12-model_00-model_states.pt. 0: [2022-11-26 17:59:09,675] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_13-model_00-model_states.pt... 0: [2022-11-26 17:59:09,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 17:59:09,794] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_14-model_00-model_states.pt... 0: [2022-11-26 17:59:09,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_14-model_00-model_states.pt. 0: [2022-11-26 17:59:09,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_15-model_00-model_states.pt... 0: [2022-11-26 17:59:10,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_15-model_00-model_states.pt. 0: [2022-11-26 17:59:10,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_16-model_00-model_states.pt... 0: [2022-11-26 17:59:10,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_16-model_00-model_states.pt. 0: [2022-11-26 17:59:10,148] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_17-model_00-model_states.pt... 0: [2022-11-26 17:59:10,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_17-model_00-model_states.pt. 0: [2022-11-26 17:59:10,263] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_18-model_00-model_states.pt... 0: [2022-11-26 17:59:10,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_18-model_00-model_states.pt. 0: [2022-11-26 17:59:10,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_19-model_00-model_states.pt... 0: [2022-11-26 17:59:10,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 17:59:10,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_20-model_00-model_states.pt... 0: [2022-11-26 17:59:10,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_20-model_00-model_states.pt. 0: [2022-11-26 17:59:10,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_21-model_00-model_states.pt... 0: [2022-11-26 17:59:10,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_21-model_00-model_states.pt. 0: [2022-11-26 17:59:10,718] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_22-model_00-model_states.pt... 0: [2022-11-26 17:59:10,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_22-model_00-model_states.pt. 0: [2022-11-26 17:59:10,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_23-model_00-model_states.pt... 0: [2022-11-26 17:59:10,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_23-model_00-model_states.pt. 0: [2022-11-26 17:59:10,935] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_24-model_00-model_states.pt... 0: [2022-11-26 17:59:11,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_24-model_00-model_states.pt. 0: [2022-11-26 17:59:11,041] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_25-model_00-model_states.pt... 0: [2022-11-26 17:59:11,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 17:59:11,147] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_26-model_00-model_states.pt... 0: [2022-11-26 17:59:11,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_26-model_00-model_states.pt. 0: [2022-11-26 17:59:11,250] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_27-model_00-model_states.pt... 0: [2022-11-26 17:59:11,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_27-model_00-model_states.pt. 0: [2022-11-26 17:59:11,358] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_28-model_00-model_states.pt... 0: [2022-11-26 17:59:11,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_28-model_00-model_states.pt. 0: [2022-11-26 17:59:11,464] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_29-model_00-model_states.pt... 0: [2022-11-26 17:59:11,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_29-model_00-model_states.pt. 0: [2022-11-26 17:59:11,572] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_30-model_00-model_states.pt... 0: [2022-11-26 17:59:11,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_30-model_00-model_states.pt. 0: [2022-11-26 17:59:11,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/layer_32-model_00-model_states.pt... 0: [2022-11-26 17:59:11,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 17:59:11,688] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step74000/mp_rank_00_model_states.pt 0: [2022-11-26 17:59:11,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/mp_rank_00_model_states.pt... 0: [2022-11-26 17:59:11,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/mp_rank_00_model_states.pt. 0: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
8: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
9: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
7: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 13: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
13: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
11: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
5: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 2: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
2: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 1: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
5: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 3: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 11: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 5: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 15: [2022-11-26 17:59:11,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step74000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 0: [2022-11-26 17:59:12,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 17:59:12,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 17:59:12,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 11: [2022-11-26 17:59:12,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:59:12,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:59:12,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 17:59:12,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 7: [2022-11-26 17:59:12,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:59:12,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 17:59:12,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 2: [2022-11-26 17:59:12,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:59:12,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 17:59:12,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
5: [2022-11-26 17:59:12,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:59:12,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:59:12,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 0: [2022-11-26 17:59:12,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 5: [2022-11-26 17:59:12,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 0: [2022-11-26 17:59:12,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 5: [2022-11-26 17:59:12,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:59:12,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 17:59:12,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 3: [2022-11-26 17:59:12,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:59:12,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 17:59:12,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
0: [2022-11-26 17:59:12,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:59:12,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:59:12,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 17:59:12,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 4: [2022-11-26 17:59:12,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:59:12,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 17:59:12,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 4: [2022-11-26 17:59:12,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:59:12,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 17:59:12,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 4: [2022-11-26 17:59:12,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 17:59:12,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 17:59:12,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 4: [2022-11-26 17:59:12,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:59:12,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 17:59:12,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 4: [2022-11-26 17:59:12,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:59:12,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 17:59:12,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 4: [2022-11-26 17:59:12,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:59:12,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 17:59:12,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 3: [2022-11-26 17:59:12,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-26 17:59:12,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 17:59:12,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 3: [2022-11-26 17:59:12,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:59:12,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 17:59:12,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 8: [2022-11-26 17:59:12,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:59:12,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:59:12,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:59:12,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
8: [2022-11-26 17:59:12,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 17:59:12,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 7: [2022-11-26 17:59:12,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 8: [2022-11-26 17:59:12,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 17:59:12,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 8: [2022-11-26 17:59:12,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 8: [2022-11-26 17:59:12,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 8: [2022-11-26 17:59:12,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:59:12,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 8: [2022-11-26 17:59:12,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 17:59:12,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 10: [2022-11-26 17:59:12,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 17:59:12,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:59:12,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 17:59:12,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 10: [2022-11-26 17:59:12,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 17:59:12,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 5: [2022-11-26 17:59:12,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:59:12,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 17:59:12,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 2: [2022-11-26 17:59:12,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:59:12,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 17:59:12,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 7: [2022-11-26 17:59:12,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 17:59:12,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 2: [2022-11-26 17:59:12,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:59:12,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 2: [2022-11-26 17:59:12,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 17:59:12,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 2: [2022-11-26 17:59:12,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:59:12,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 17:59:12,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 7: [2022-11-26 17:59:12,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:59:12,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 17:59:12,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 5: [2022-11-26 17:59:12,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 17:59:12,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 17:59:12,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 5: [2022-11-26 17:59:12,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:59:12,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 17:59:12,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 12: [2022-11-26 17:59:12,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:59:12,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 11: [2022-11-26 17:59:12,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 12: [2022-11-26 17:59:12,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 11: [2022-11-26 17:59:12,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 12: [2022-11-26 17:59:12,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 17:59:12,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 17:59:12,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 12: [2022-11-26 17:59:12,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:59:12,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 17:59:12,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 5: [2022-11-26 17:59:12,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:59:12,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 12: [2022-11-26 17:59:12,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:59:12,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 12: [2022-11-26 17:59:12,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 17:59:12,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 12: [2022-11-26 17:59:12,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 17:59:12,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 17:59:12,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 12: [2022-11-26 17:59:12,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:59:12,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:59:12,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 10: [2022-11-26 17:59:12,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 17:59:12,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:59:12,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 10: [2022-11-26 17:59:12,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 10: [2022-11-26 17:59:12,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 17:59:12,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 12: [2022-11-26 17:59:12,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 17:59:12,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 10: [2022-11-26 17:59:12,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 12: [2022-11-26 17:59:12,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 10: [2022-11-26 17:59:12,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 17:59:12,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 8: [2022-11-26 17:59:12,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:59:12,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 17:59:12,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 10: [2022-11-26 17:59:12,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:59:12,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 17:59:12,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 0: [2022-11-26 17:59:12,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 17:59:12,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:59:12,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:59:12,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 17:59:12,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 17:59:12,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 17:59:12,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 0: [2022-11-26 17:59:12,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 0: [2022-11-26 17:59:12,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 8: [2022-11-26 17:59:12,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:59:12,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 17:59:12,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 2: [2022-11-26 17:59:12,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 17:59:12,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 17:59:12,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 3: [2022-11-26 17:59:12,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:59:12,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:59:12,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:59:12,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 17:59:12,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:59:12,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 6: [2022-11-26 17:59:12,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 3: [2022-11-26 17:59:12,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 17:59:12,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 17:59:12,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
3: [2022-11-26 17:59:12,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 9: [2022-11-26 17:59:12,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:59:12,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:59:12,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 6: [2022-11-26 17:59:12,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:59:12,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:59:12,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:59:12,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:59:12,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 17:59:12,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 17:59:12,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 17:59:12,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
9: [2022-11-26 17:59:12,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 17:59:12,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 17:59:12,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 6: [2022-11-26 17:59:12,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 9: [2022-11-26 17:59:12,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 9: [2022-11-26 17:59:12,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 9: [2022-11-26 17:59:12,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 6: [2022-11-26 17:59:12,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 5: [2022-11-26 17:59:12,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:59:12,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 17:59:12,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 12: [2022-11-26 17:59:12,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 17:59:12,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 17:59:12,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 3: [2022-11-26 17:59:12,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:59:12,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 17:59:12,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 7: [2022-11-26 17:59:12,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:59:12,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 17:59:12,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 9: [2022-11-26 17:59:12,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 17:59:12,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 17:59:12,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 9: [2022-11-26 17:59:12,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 17:59:12,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 17:59:12,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 6: [2022-11-26 17:59:12,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:59:12,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 17:59:12,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 7: [2022-11-26 17:59:12,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:59:12,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 17:59:12,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 10: [2022-11-26 17:59:12,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 17:59:12,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 17:59:12,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 17:59:12,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 17:59:12,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 10: [2022-11-26 17:59:12,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 11: [2022-11-26 17:59:12,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:59:12,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 17:59:12,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 11: [2022-11-26 17:59:12,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:59:12,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 17:59:12,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 11: [2022-11-26 17:59:12,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 17:59:12,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 17:59:12,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 11: [2022-11-26 17:59:12,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:59:12,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 17:59:12,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 6: [2022-11-26 17:59:12,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:59:12,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:59:12,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:59:12,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 17:59:12,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 17:59:12,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
6: [2022-11-26 17:59:12,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 17:59:12,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 17:59:12,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 6: [2022-11-26 17:59:12,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 6: [2022-11-26 17:59:12,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 17:59:12,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 11: [2022-11-26 17:59:12,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:59:12,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:59:12,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:59:12,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:59:12,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
2: [2022-11-26 17:59:12,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 8: [2022-11-26 17:59:12,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 13: [2022-11-26 17:59:12,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 8: [2022-11-26 17:59:12,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 13: [2022-11-26 17:59:12,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 17:59:12,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 17:59:12,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 17:59:12,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 13: [2022-11-26 17:59:12,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 13: [2022-11-26 17:59:12,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 2: [2022-11-26 17:59:12,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 0: [2022-11-26 17:59:12,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 17:59:12,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 17:59:12,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 17:59:12,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 17:59:12,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 0: [2022-11-26 17:59:12,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 13: [2022-11-26 17:59:12,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:59:12,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 17:59:12,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:59:12,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 13: [2022-11-26 17:59:12,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 17:59:12,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 13: [2022-11-26 17:59:12,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 17:59:12,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 17:59:12,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 7: [2022-11-26 17:59:12,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:59:12,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:59:12,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 13: [2022-11-26 17:59:12,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 7: [2022-11-26 17:59:12,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 13: [2022-11-26 17:59:12,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 2: [2022-11-26 17:59:12,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:59:12,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 8: [2022-11-26 17:59:12,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 2: [2022-11-26 17:59:12,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
8: [2022-11-26 17:59:12,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 17:59:12,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 7: [2022-11-26 17:59:12,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 17:59:12,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 17:59:12,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 13: [2022-11-26 17:59:12,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 17:59:12,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 17:59:12,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 3: [2022-11-26 17:59:12,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 17:59:12,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 17:59:12,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 3: [2022-11-26 17:59:12,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 17:59:12,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 17:59:12,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 5: [2022-11-26 17:59:12,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 17:59:12,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 17:59:12,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 11: [2022-11-26 17:59:12,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 17:59:12,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 11: [2022-11-26 17:59:12,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 17:59:12,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 17:59:12,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 11: [2022-11-26 17:59:12,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 17:59:12,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 17:59:12,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 4: [2022-11-26 17:59:12,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:59:12,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 17:59:12,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 4: [2022-11-26 17:59:12,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 17:59:12,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 17:59:12,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 0: [2022-11-26 17:59:12,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 17:59:12,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 1: [2022-11-26 17:59:12,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-26 17:59:12,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 17:59:12,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:59:12,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:59:12,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:59:12,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:59:12,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 1: [2022-11-26 17:59:12,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 17:59:12,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 17:59:12,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 17:59:12,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 17:59:12,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 1: [2022-11-26 17:59:12,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
1: [2022-11-26 17:59:12,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 1: [2022-11-26 17:59:12,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 1: [2022-11-26 17:59:12,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:59:12,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:59:12,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 17:59:12,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 17:59:12,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 17:59:12,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 1: [2022-11-26 17:59:12,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 17:59:12,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 1: [2022-11-26 17:59:12,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 14: [2022-11-26 17:59:12,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-26 17:59:12,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:59:12,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 17:59:12,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:59:12,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 14: [2022-11-26 17:59:12,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 17:59:12,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 17:59:12,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 14: [2022-11-26 17:59:12,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:59:12,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 14: [2022-11-26 17:59:12,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 17:59:12,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 14: [2022-11-26 17:59:12,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 17:59:12,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:59:12,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:59:12,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 17:59:12,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 17:59:12,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 17:59:12,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 17:59:12,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 14: [2022-11-26 17:59:12,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 14: [2022-11-26 17:59:12,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 17:59:12,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 14: [2022-11-26 17:59:12,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 15: [2022-11-26 17:59:12,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 17:59:12,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:59:12,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:59:12,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:59:12,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 17:59:12,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:59:12,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 17:59:12,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 17:59:12,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 15: [2022-11-26 17:59:12,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 17:59:12,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 15: [2022-11-26 17:59:12,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 17:59:12,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
15: [2022-11-26 17:59:12,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 15: [2022-11-26 17:59:12,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 15: [2022-11-26 17:59:12,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:59:12,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:59:12,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 17:59:12,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 17:59:12,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 17:59:12,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 15: [2022-11-26 17:59:12,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 15: [2022-11-26 17:59:12,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step74000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 17:59:12,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step74000 is ready now! 
0: successfully saved checkpoint at iteration 74000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 4243.73 15: iteration 74010/ 125429 | consumed samples: 18946560 | consumed tokens: 38802554880 | elapsed time per iteration (s): 1.51 | learning rate: 8.600E-05 | global batch size: 256 | lm loss: 1.964395E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 169.504 | TFLOPs: 28.01 | 15: iteration 74020/ 125429 | consumed samples: 18949120 | consumed tokens: 38807797760 | elapsed time per iteration (s): 1.04 | learning rate: 8.598E-05 | global batch size: 256 | lm loss: 1.926641E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.323 | TFLOPs: 40.87 | 15: iteration 74030/ 125429 | consumed samples: 18951680 | consumed tokens: 38813040640 | elapsed time per iteration (s): 1.15 | learning rate: 8.596E-05 | global batch size: 256 | lm loss: 1.952202E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 223.369 | TFLOPs: 36.91 | 15: iteration 74040/ 125429 | consumed samples: 18954240 | consumed tokens: 38818283520 | elapsed time per iteration (s): 1.03 | learning rate: 8.594E-05 | global batch size: 256 | lm loss: 1.949786E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.719 | TFLOPs: 41.10 | 15: iteration 74050/ 125429 | consumed samples: 18956800 | consumed tokens: 38823526400 | elapsed time per iteration (s): 1.04 | learning rate: 8.591E-05 | global batch size: 256 | lm loss: 1.956504E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.480 | TFLOPs: 40.73 | 15: iteration 74060/ 125429 | consumed samples: 18959360 | consumed tokens: 38828769280 | elapsed time per iteration (s): 1.03 | 
learning rate: 8.589E-05 | global batch size: 256 | lm loss: 1.959565E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.299 | TFLOPs: 41.03 | 15: iteration 74070/ 125429 | consumed samples: 18961920 | consumed tokens: 38834012160 | elapsed time per iteration (s): 1.03 | learning rate: 8.587E-05 | global batch size: 256 | lm loss: 1.956090E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.654 | TFLOPs: 41.09 | 15: iteration 74080/ 125429 | consumed samples: 18964480 | consumed tokens: 38839255040 | elapsed time per iteration (s): 1.04 | learning rate: 8.585E-05 | global batch size: 256 | lm loss: 1.932970E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.547 | TFLOPs: 40.74 | 15: iteration 74090/ 125429 | consumed samples: 18967040 | consumed tokens: 38844497920 | elapsed time per iteration (s): 1.03 | learning rate: 8.583E-05 | global batch size: 256 | lm loss: 1.976655E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.690 | TFLOPs: 40.93 | 15: iteration 74100/ 125429 | consumed samples: 18969600 | consumed tokens: 38849740800 | elapsed time per iteration (s): 1.02 | learning rate: 8.580E-05 | global batch size: 256 | lm loss: 1.981037E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.165 | TFLOPs: 41.34 | 15: iteration 74110/ 125429 | consumed samples: 18972160 | consumed tokens: 38854983680 | elapsed time per iteration (s): 1.04 | learning rate: 8.578E-05 | global batch size: 256 | lm loss: 1.935683E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.170 | TFLOPs: 40.85 | 15: iteration 74120/ 
125429 | consumed samples: 18974720 | consumed tokens: 38860226560 | elapsed time per iteration (s): 1.03 | learning rate: 8.576E-05 | global batch size: 256 | lm loss: 1.949688E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.724 | TFLOPs: 41.10 | 15: iteration 74130/ 125429 | consumed samples: 18977280 | consumed tokens: 38865469440 | elapsed time per iteration (s): 1.03 | learning rate: 8.574E-05 | global batch size: 256 | lm loss: 1.953262E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.795 | TFLOPs: 40.95 | 15: iteration 74140/ 125429 | consumed samples: 18979840 | consumed tokens: 38870712320 | elapsed time per iteration (s): 1.07 | learning rate: 8.572E-05 | global batch size: 256 | lm loss: 1.937900E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.242 | TFLOPs: 39.70 | 15: iteration 74150/ 125429 | consumed samples: 18982400 | consumed tokens: 38875955200 | elapsed time per iteration (s): 1.09 | learning rate: 8.570E-05 | global batch size: 256 | lm loss: 1.964035E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.943 | TFLOPs: 38.83 | 15: iteration 74160/ 125429 | consumed samples: 18984960 | consumed tokens: 38881198080 | elapsed time per iteration (s): 1.06 | learning rate: 8.567E-05 | global batch size: 256 | lm loss: 1.960522E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.133 | TFLOPs: 40.01 | 15: iteration 74170/ 125429 | consumed samples: 18987520 | consumed tokens: 38886440960 | elapsed time per iteration (s): 1.03 | learning rate: 8.565E-05 | global batch size: 256 | lm loss: 1.966754E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 247.928 | TFLOPs: 40.97 | 15: iteration 74180/ 125429 | consumed samples: 18990080 | consumed tokens: 38891683840 | elapsed time per iteration (s): 1.06 | learning rate: 8.563E-05 | global batch size: 256 | lm loss: 1.964453E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.221 | TFLOPs: 40.03 | 15: iteration 74190/ 125429 | consumed samples: 18992640 | consumed tokens: 38896926720 | elapsed time per iteration (s): 1.03 | learning rate: 8.561E-05 | global batch size: 256 | lm loss: 1.969291E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.586 | TFLOPs: 41.25 | 15: iteration 74200/ 125429 | consumed samples: 18995200 | consumed tokens: 38902169600 | elapsed time per iteration (s): 1.03 | learning rate: 8.559E-05 | global batch size: 256 | lm loss: 1.944804E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.811 | TFLOPs: 41.12 | 15: iteration 74210/ 125429 | consumed samples: 18997760 | consumed tokens: 38907412480 | elapsed time per iteration (s): 1.06 | learning rate: 8.556E-05 | global batch size: 256 | lm loss: 1.965210E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.476 | TFLOPs: 39.91 | 15: iteration 74220/ 125429 | consumed samples: 19000320 | consumed tokens: 38912655360 | elapsed time per iteration (s): 1.03 | learning rate: 8.554E-05 | global batch size: 256 | lm loss: 1.983202E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.105 | TFLOPs: 41.00 | 15: iteration 74230/ 125429 | consumed samples: 19002880 | consumed tokens: 38917898240 | elapsed time per iteration (s): 1.06 | learning rate: 
8.552E-05 | global batch size: 256 | lm loss: 1.951173E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.694 | TFLOPs: 39.78 | 15: iteration 74240/ 125429 | consumed samples: 19005440 | consumed tokens: 38923141120 | elapsed time per iteration (s): 1.02 | learning rate: 8.550E-05 | global batch size: 256 | lm loss: 1.965700E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.846 | TFLOPs: 41.29 | 15: iteration 74250/ 125429 | consumed samples: 19008000 | consumed tokens: 38928384000 | elapsed time per iteration (s): 1.05 | learning rate: 8.548E-05 | global batch size: 256 | lm loss: 1.926445E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.963 | TFLOPs: 40.15 | 15: iteration 74260/ 125429 | consumed samples: 19010560 | consumed tokens: 38933626880 | elapsed time per iteration (s): 1.05 | learning rate: 8.545E-05 | global batch size: 256 | lm loss: 1.979734E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.306 | TFLOPs: 40.37 | 15: iteration 74270/ 125429 | consumed samples: 19013120 | consumed tokens: 38938869760 | elapsed time per iteration (s): 1.06 | learning rate: 8.543E-05 | global batch size: 256 | lm loss: 1.986426E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.535 | TFLOPs: 40.08 | 15: iteration 74280/ 125429 | consumed samples: 19015680 | consumed tokens: 38944112640 | elapsed time per iteration (s): 1.09 | learning rate: 8.541E-05 | global batch size: 256 | lm loss: 1.951408E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.593 | TFLOPs: 38.93 | 15: iteration 74290/ 125429 | 
consumed samples: 19018240 | consumed tokens: 38949355520 | elapsed time per iteration (s): 1.02 | learning rate: 8.539E-05 | global batch size: 256 | lm loss: 1.942870E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.800 | TFLOPs: 41.45 | 15: iteration 74300/ 125429 | consumed samples: 19020800 | consumed tokens: 38954598400 | elapsed time per iteration (s): 1.06 | learning rate: 8.537E-05 | global batch size: 256 | lm loss: 1.976881E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.386 | TFLOPs: 39.73 | 15: iteration 74310/ 125429 | consumed samples: 19023360 | consumed tokens: 38959841280 | elapsed time per iteration (s): 1.12 | learning rate: 8.534E-05 | global batch size: 256 | lm loss: 1.943906E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.175 | TFLOPs: 37.71 | 15: iteration 74320/ 125429 | consumed samples: 19025920 | consumed tokens: 38965084160 | elapsed time per iteration (s): 1.22 | learning rate: 8.532E-05 | global batch size: 256 | lm loss: 1.957486E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 209.635 | TFLOPs: 34.64 | 15: iteration 74330/ 125429 | consumed samples: 19028480 | consumed tokens: 38970327040 | elapsed time per iteration (s): 1.03 | learning rate: 8.530E-05 | global batch size: 256 | lm loss: 1.957268E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.691 | TFLOPs: 41.26 | 15: iteration 74340/ 125429 | consumed samples: 19031040 | consumed tokens: 38975569920 | elapsed time per iteration (s): 1.05 | learning rate: 8.528E-05 | global batch size: 256 | lm loss: 1.932114E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 244.468 | TFLOPs: 40.40 | 15: iteration 74350/ 125429 | consumed samples: 19033600 | consumed tokens: 38980812800 | elapsed time per iteration (s): 1.03 | learning rate: 8.526E-05 | global batch size: 256 | lm loss: 1.979932E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.919 | TFLOPs: 41.14 | 15: iteration 74360/ 125429 | consumed samples: 19036160 | consumed tokens: 38986055680 | elapsed time per iteration (s): 1.04 | learning rate: 8.524E-05 | global batch size: 256 | lm loss: 1.955065E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.733 | TFLOPs: 40.61 | 15: iteration 74370/ 125429 | consumed samples: 19038720 | consumed tokens: 38991298560 | elapsed time per iteration (s): 1.05 | learning rate: 8.521E-05 | global batch size: 256 | lm loss: 1.970111E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.684 | TFLOPs: 40.44 | 15: iteration 74380/ 125429 | consumed samples: 19041280 | consumed tokens: 38996541440 | elapsed time per iteration (s): 1.03 | learning rate: 8.519E-05 | global batch size: 256 | lm loss: 1.918989E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.411 | TFLOPs: 40.89 | 15: iteration 74390/ 125429 | consumed samples: 19043840 | consumed tokens: 39001784320 | elapsed time per iteration (s): 1.05 | learning rate: 8.517E-05 | global batch size: 256 | lm loss: 1.955906E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.772 | TFLOPs: 40.12 | 15: iteration 74400/ 125429 | consumed samples: 19046400 | consumed tokens: 39007027200 | elapsed time per iteration (s): 1.05 | learning rate: 8.515E-05 | global batch 
size: 256 | lm loss: 1.972131E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.576 | TFLOPs: 40.42 | 15: iteration 74410/ 125429 | consumed samples: 19048960 | consumed tokens: 39012270080 | elapsed time per iteration (s): 1.04 | learning rate: 8.513E-05 | global batch size: 256 | lm loss: 1.953260E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.036 | TFLOPs: 40.66 | 15: iteration 74420/ 125429 | consumed samples: 19051520 | consumed tokens: 39017512960 | elapsed time per iteration (s): 1.02 | learning rate: 8.510E-05 | global batch size: 256 | lm loss: 1.949043E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.524 | TFLOPs: 41.57 | 15: iteration 74430/ 125429 | consumed samples: 19054080 | consumed tokens: 39022755840 | elapsed time per iteration (s): 1.02 | learning rate: 8.508E-05 | global batch size: 256 | lm loss: 1.964623E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.300 | TFLOPs: 41.53 | 15: iteration 74440/ 125429 | consumed samples: 19056640 | consumed tokens: 39027998720 | elapsed time per iteration (s): 1.05 | learning rate: 8.506E-05 | global batch size: 256 | lm loss: 1.959238E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.439 | TFLOPs: 40.23 | 15: iteration 74450/ 125429 | consumed samples: 19059200 | consumed tokens: 39033241600 | elapsed time per iteration (s): 1.04 | learning rate: 8.504E-05 | global batch size: 256 | lm loss: 1.961384E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.986 | TFLOPs: 40.82 | 15: iteration 74460/ 125429 | consumed samples: 19061760 | 
consumed tokens: 39038484480 | elapsed time per iteration (s): 1.06 | learning rate: 8.502E-05 | global batch size: 256 | lm loss: 1.936164E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.638 | TFLOPs: 40.10 | 15: iteration 74470/ 125429 | consumed samples: 19064320 | consumed tokens: 39043727360 | elapsed time per iteration (s): 1.07 | learning rate: 8.499E-05 | global batch size: 256 | lm loss: 1.971888E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.533 | TFLOPs: 39.58 | 15: iteration 74480/ 125429 | consumed samples: 19066880 | consumed tokens: 39048970240 | elapsed time per iteration (s): 1.04 | learning rate: 8.497E-05 | global batch size: 256 | lm loss: 1.963814E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.985 | TFLOPs: 40.82 | 15: iteration 74490/ 125429 | consumed samples: 19069440 | consumed tokens: 39054213120 | elapsed time per iteration (s): 1.04 | learning rate: 8.495E-05 | global batch size: 256 | lm loss: 1.956823E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.415 | TFLOPs: 40.72 | 15: iteration 74500/ 125429 | consumed samples: 19072000 | consumed tokens: 39059456000 | elapsed time per iteration (s): 1.04 | learning rate: 8.493E-05 | global batch size: 256 | lm loss: 1.974814E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.207 | TFLOPs: 40.85 | 15: iteration 74510/ 125429 | consumed samples: 19074560 | consumed tokens: 39064698880 | elapsed time per iteration (s): 1.05 | learning rate: 8.491E-05 | global batch size: 256 | lm loss: 1.960892E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 244.566 | TFLOPs: 40.42 | 15: iteration 74520/ 125429 | consumed samples: 19077120 | consumed tokens: 39069941760 | elapsed time per iteration (s): 1.05 | learning rate: 8.489E-05 | global batch size: 256 | lm loss: 1.983649E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.444 | TFLOPs: 40.23 | 15: iteration 74530/ 125429 | consumed samples: 19079680 | consumed tokens: 39075184640 | elapsed time per iteration (s): 1.04 | learning rate: 8.486E-05 | global batch size: 256 | lm loss: 1.960240E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.003 | TFLOPs: 40.65 | 15: iteration 74540/ 125429 | consumed samples: 19082240 | consumed tokens: 39080427520 | elapsed time per iteration (s): 1.06 | learning rate: 8.484E-05 | global batch size: 256 | lm loss: 1.949660E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.067 | TFLOPs: 40.00 | 15: iteration 74550/ 125429 | consumed samples: 19084800 | consumed tokens: 39085670400 | elapsed time per iteration (s): 1.04 | learning rate: 8.482E-05 | global batch size: 256 | lm loss: 1.949663E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.010 | TFLOPs: 40.66 | 15: iteration 74560/ 125429 | consumed samples: 19087360 | consumed tokens: 39090913280 | elapsed time per iteration (s): 1.06 | learning rate: 8.480E-05 | global batch size: 256 | lm loss: 1.961999E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.069 | TFLOPs: 39.84 | 15: iteration 74570/ 125429 | consumed samples: 19089920 | consumed tokens: 39096156160 | elapsed time per iteration (s): 1.04 | learning rate: 8.478E-05 | global batch size: 256 | lm loss: 
1.945113E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.039 | TFLOPs: 40.66 | 15: iteration 74580/ 125429 | consumed samples: 19092480 | consumed tokens: 39101399040 | elapsed time per iteration (s): 1.02 | learning rate: 8.475E-05 | global batch size: 256 | lm loss: 1.953577E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.784 | TFLOPs: 41.44 | 15: iteration 74590/ 125429 | consumed samples: 19095040 | consumed tokens: 39106641920 | elapsed time per iteration (s): 1.04 | learning rate: 8.473E-05 | global batch size: 256 | lm loss: 1.953914E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.820 | TFLOPs: 40.62 | 15: iteration 74600/ 125429 | consumed samples: 19097600 | consumed tokens: 39111884800 | elapsed time per iteration (s): 1.04 | learning rate: 8.471E-05 | global batch size: 256 | lm loss: 1.954965E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.858 | TFLOPs: 40.63 | 15: iteration 74610/ 125429 | consumed samples: 19100160 | consumed tokens: 39117127680 | elapsed time per iteration (s): 1.04 | learning rate: 8.469E-05 | global batch size: 256 | lm loss: 1.965515E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.979 | TFLOPs: 40.48 | 15: iteration 74620/ 125429 | consumed samples: 19102720 | consumed tokens: 39122370560 | elapsed time per iteration (s): 1.05 | learning rate: 8.467E-05 | global batch size: 256 | lm loss: 1.950260E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.148 | TFLOPs: 40.35 | 15: iteration 74630/ 125429 | consumed samples: 19105280 | consumed tokens: 
39127613440 | elapsed time per iteration (s): 1.04 | learning rate: 8.464E-05 | global batch size: 256 | lm loss: 1.935943E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.084 | TFLOPs: 40.83 | 15: iteration 74640/ 125429 | consumed samples: 19107840 | consumed tokens: 39132856320 | elapsed time per iteration (s): 1.05 | learning rate: 8.462E-05 | global batch size: 256 | lm loss: 1.933888E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.248 | TFLOPs: 40.20 | 15: iteration 74650/ 125429 | consumed samples: 19110400 | consumed tokens: 39138099200 | elapsed time per iteration (s): 1.03 | learning rate: 8.460E-05 | global batch size: 256 | lm loss: 1.969436E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.091 | TFLOPs: 41.00 | 15: iteration 74660/ 125429 | consumed samples: 19112960 | consumed tokens: 39143342080 | elapsed time per iteration (s): 1.04 | learning rate: 8.458E-05 | global batch size: 256 | lm loss: 1.931233E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.356 | TFLOPs: 40.71 | 15: iteration 74670/ 125429 | consumed samples: 19115520 | consumed tokens: 39148584960 | elapsed time per iteration (s): 1.05 | learning rate: 8.456E-05 | global batch size: 256 | lm loss: 1.985575E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.135 | TFLOPs: 40.35 | 15: iteration 74680/ 125429 | consumed samples: 19118080 | consumed tokens: 39153827840 | elapsed time per iteration (s): 1.06 | learning rate: 8.454E-05 | global batch size: 256 | lm loss: 1.944931E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 241.512 | TFLOPs: 39.91 | 15: iteration 74690/ 125429 | consumed samples: 19120640 | consumed tokens: 39159070720 | elapsed time per iteration (s): 1.04 | learning rate: 8.451E-05 | global batch size: 256 | lm loss: 1.946689E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.023 | TFLOPs: 40.49 | 15: iteration 74700/ 125429 | consumed samples: 19123200 | consumed tokens: 39164313600 | elapsed time per iteration (s): 1.03 | learning rate: 8.449E-05 | global batch size: 256 | lm loss: 1.957898E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.351 | TFLOPs: 41.04 | 15: iteration 74710/ 125429 | consumed samples: 19125760 | consumed tokens: 39169556480 | elapsed time per iteration (s): 1.03 | learning rate: 8.447E-05 | global batch size: 256 | lm loss: 1.968027E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.393 | TFLOPs: 41.21 | 15: iteration 74720/ 125429 | consumed samples: 19128320 | consumed tokens: 39174799360 | elapsed time per iteration (s): 1.09 | learning rate: 8.445E-05 | global batch size: 256 | lm loss: 1.981125E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.984 | TFLOPs: 38.83 | 15: iteration 74730/ 125429 | consumed samples: 19130880 | consumed tokens: 39180042240 | elapsed time per iteration (s): 1.07 | learning rate: 8.443E-05 | global batch size: 256 | lm loss: 1.936533E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.502 | TFLOPs: 39.41 | 15: iteration 74740/ 125429 | consumed samples: 19133440 | consumed tokens: 39185285120 | elapsed time per iteration (s): 1.05 | learning rate: 8.440E-05 | global batch size: 256 | lm loss: 1.939713E+00 | grad 
norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.857 | TFLOPs: 40.46 | 15: iteration 74750/ 125429 | consumed samples: 19136000 | consumed tokens: 39190528000 | elapsed time per iteration (s): 1.04 | learning rate: 8.438E-05 | global batch size: 256 | lm loss: 1.942743E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.131 | TFLOPs: 40.84 | 15: iteration 74760/ 125429 | consumed samples: 19138560 | consumed tokens: 39195770880 | elapsed time per iteration (s): 1.04 | learning rate: 8.436E-05 | global batch size: 256 | lm loss: 1.963013E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.787 | TFLOPs: 40.62 | 15: iteration 74770/ 125429 | consumed samples: 19141120 | consumed tokens: 39201013760 | elapsed time per iteration (s): 1.05 | learning rate: 8.434E-05 | global batch size: 256 | lm loss: 1.953440E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.252 | TFLOPs: 40.20 | 15: iteration 74780/ 125429 | consumed samples: 19143680 | consumed tokens: 39206256640 | elapsed time per iteration (s): 1.04 | learning rate: 8.432E-05 | global batch size: 256 | lm loss: 1.946943E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.046 | TFLOPs: 40.83 | 15: iteration 74790/ 125429 | consumed samples: 19146240 | consumed tokens: 39211499520 | elapsed time per iteration (s): 1.04 | learning rate: 8.430E-05 | global batch size: 256 | lm loss: 1.958753E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.410 | TFLOPs: 40.72 | 15: iteration 74800/ 125429 | consumed samples: 19148800 | consumed tokens: 39216742400 | elapsed time 
per iteration (s): 1.04 | learning rate: 8.427E-05 | global batch size: 256 | lm loss: 1.940547E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.255 | TFLOPs: 40.53 | 15: iteration 74810/ 125429 | consumed samples: 19151360 | consumed tokens: 39221985280 | elapsed time per iteration (s): 1.03 | learning rate: 8.425E-05 | global batch size: 256 | lm loss: 1.955921E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.243 | TFLOPs: 41.02 | 15: iteration 74820/ 125429 | consumed samples: 19153920 | consumed tokens: 39227228160 | elapsed time per iteration (s): 1.04 | learning rate: 8.423E-05 | global batch size: 256 | lm loss: 1.952396E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.079 | TFLOPs: 40.50 | 15: iteration 74830/ 125429 | consumed samples: 19156480 | consumed tokens: 39232471040 | elapsed time per iteration (s): 1.03 | learning rate: 8.421E-05 | global batch size: 256 | lm loss: 1.967768E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.036 | TFLOPs: 41.16 | 15: iteration 74840/ 125429 | consumed samples: 19159040 | consumed tokens: 39237713920 | elapsed time per iteration (s): 1.06 | learning rate: 8.419E-05 | global batch size: 256 | lm loss: 1.971402E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.185 | TFLOPs: 40.02 | 15: iteration 74850/ 125429 | consumed samples: 19161600 | consumed tokens: 39242956800 | elapsed time per iteration (s): 1.03 | learning rate: 8.416E-05 | global batch size: 256 | lm loss: 1.945827E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.914 | TFLOPs: 
40.97 | 15: iteration 74860/ 125429 | consumed samples: 19164160 | consumed tokens: 39248199680 | elapsed time per iteration (s): 1.06 | learning rate: 8.414E-05 | global batch size: 256 | lm loss: 1.934552E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.746 | TFLOPs: 39.79 | 15: iteration 74870/ 125429 | consumed samples: 19166720 | consumed tokens: 39253442560 | elapsed time per iteration (s): 1.06 | learning rate: 8.412E-05 | global batch size: 256 | lm loss: 1.939924E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.901 | TFLOPs: 39.81 | 15: iteration 74880/ 125429 | consumed samples: 19169280 | consumed tokens: 39258685440 | elapsed time per iteration (s): 1.03 | learning rate: 8.410E-05 | global batch size: 256 | lm loss: 1.957495E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.674 | TFLOPs: 41.10 | 15: iteration 74890/ 125429 | consumed samples: 19171840 | consumed tokens: 39263928320 | elapsed time per iteration (s): 1.04 | learning rate: 8.408E-05 | global batch size: 256 | lm loss: 1.958631E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.024 | TFLOPs: 40.82 | 15: iteration 74900/ 125429 | consumed samples: 19174400 | consumed tokens: 39269171200 | elapsed time per iteration (s): 1.03 | learning rate: 8.406E-05 | global batch size: 256 | lm loss: 1.967091E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.995 | TFLOPs: 41.15 | 15: iteration 74910/ 125429 | consumed samples: 19176960 | consumed tokens: 39274414080 | elapsed time per iteration (s): 1.05 | learning rate: 8.403E-05 | global batch size: 256 | lm loss: 1.982777E+00 | grad norm: 0.140 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.982 | TFLOPs: 40.32 | 15: iteration 74920/ 125429 | consumed samples: 19179520 | consumed tokens: 39279656960 | elapsed time per iteration (s): 1.04 | learning rate: 8.401E-05 | global batch size: 256 | lm loss: 2.005957E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.721 | TFLOPs: 40.61 | 15: iteration 74930/ 125429 | consumed samples: 19182080 | consumed tokens: 39284899840 | elapsed time per iteration (s): 1.06 | learning rate: 8.399E-05 | global batch size: 256 | lm loss: 1.952950E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.739 | TFLOPs: 39.95 | 15: iteration 74940/ 125429 | consumed samples: 19184640 | consumed tokens: 39290142720 | elapsed time per iteration (s): 1.06 | learning rate: 8.397E-05 | global batch size: 256 | lm loss: 1.982955E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.506 | TFLOPs: 39.91 | 15: iteration 74950/ 125429 | consumed samples: 19187200 | consumed tokens: 39295385600 | elapsed time per iteration (s): 1.05 | learning rate: 8.395E-05 | global batch size: 256 | lm loss: 1.929362E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.750 | TFLOPs: 40.12 | 15: iteration 74960/ 125429 | consumed samples: 19189760 | consumed tokens: 39300628480 | elapsed time per iteration (s): 1.06 | learning rate: 8.392E-05 | global batch size: 256 | lm loss: 1.961897E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.274 | TFLOPs: 40.04 | 15: iteration 74970/ 125429 | consumed samples: 19192320 | consumed tokens: 39305871360 | elapsed time per iteration (s): 1.02 | 
learning rate: 8.390E-05 | global batch size: 256 | lm loss: 1.991563E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.841 | TFLOPs: 41.29 |
15: iteration 74980/ 125429 | consumed samples: 19194880 | consumed tokens: 39311114240 | elapsed time per iteration (s): 1.04 | learning rate: 8.388E-05 | global batch size: 256 | lm loss: 1.994635E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.133 | TFLOPs: 40.68 |
15: iteration 74990/ 125429 | consumed samples: 19197440 | consumed tokens: 39316357120 | elapsed time per iteration (s): 1.05 | learning rate: 8.386E-05 | global batch size: 256 | lm loss: 1.981968E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.338 | TFLOPs: 40.38 |
15: iteration 75000/ 125429 | consumed samples: 19200000 | consumed tokens: 39321600000 | elapsed time per iteration (s): 1.03 | learning rate: 8.384E-05 | global batch size: 256 | lm loss: 1.976214E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.603 | TFLOPs: 40.92 |
15: -------------------------------------------------------------------------------------------
15: valid loss at iteration 75000 | lm loss value: 2.043582E+00 | lm loss PPL: 7.718206E+00 |
15: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 75000 to checkpoints_1b5
0: [2022-11-26 18:16:40,092] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step75000 is begin to save!
0: [2022-11-26 18:16:40,101] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_01-model_00-model_states.pt...
0: [2022-11-26 18:16:40,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_01-model_00-model_states.pt. 0: [2022-11-26 18:16:40,339] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_03-model_00-model_states.pt... 0: [2022-11-26 18:16:40,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_03-model_00-model_states.pt. 0: [2022-11-26 18:16:40,442] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_04-model_00-model_states.pt... 0: [2022-11-26 18:16:40,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_04-model_00-model_states.pt. 0: [2022-11-26 18:16:40,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_05-model_00-model_states.pt... 0: [2022-11-26 18:16:40,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_05-model_00-model_states.pt. 0: [2022-11-26 18:16:40,654] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_06-model_00-model_states.pt... 0: [2022-11-26 18:16:40,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_06-model_00-model_states.pt. 0: [2022-11-26 18:16:40,760] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_07-model_00-model_states.pt... 0: [2022-11-26 18:16:40,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_07-model_00-model_states.pt. 0: [2022-11-26 18:16:40,861] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 18:16:40,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_08-model_00-model_states.pt. 0: [2022-11-26 18:16:40,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_09-model_00-model_states.pt... 0: [2022-11-26 18:16:41,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_09-model_00-model_states.pt. 0: [2022-11-26 18:16:41,069] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_10-model_00-model_states.pt... 0: [2022-11-26 18:16:41,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_10-model_00-model_states.pt. 0: [2022-11-26 18:16:41,168] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_11-model_00-model_states.pt... 0: [2022-11-26 18:16:41,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_11-model_00-model_states.pt. 0: [2022-11-26 18:16:41,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_12-model_00-model_states.pt... 0: [2022-11-26 18:16:41,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_12-model_00-model_states.pt. 0: [2022-11-26 18:16:41,376] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_13-model_00-model_states.pt... 0: [2022-11-26 18:16:41,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_13-model_00-model_states.pt. 0: [2022-11-26 18:16:41,479] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 18:16:41,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_14-model_00-model_states.pt. 0: [2022-11-26 18:16:41,581] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_15-model_00-model_states.pt... 0: [2022-11-26 18:16:41,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_15-model_00-model_states.pt. 0: [2022-11-26 18:16:41,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_16-model_00-model_states.pt... 0: [2022-11-26 18:16:41,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_16-model_00-model_states.pt. 0: [2022-11-26 18:16:41,788] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_17-model_00-model_states.pt... 0: [2022-11-26 18:16:41,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_17-model_00-model_states.pt. 0: [2022-11-26 18:16:41,892] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_18-model_00-model_states.pt... 0: [2022-11-26 18:16:42,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_18-model_00-model_states.pt. 0: [2022-11-26 18:16:42,000] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_19-model_00-model_states.pt... 0: [2022-11-26 18:16:42,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_19-model_00-model_states.pt. 0: [2022-11-26 18:16:42,100] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 18:16:42,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_20-model_00-model_states.pt. 0: [2022-11-26 18:16:42,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_21-model_00-model_states.pt... 0: [2022-11-26 18:16:42,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_21-model_00-model_states.pt. 0: [2022-11-26 18:16:42,312] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_22-model_00-model_states.pt... 0: [2022-11-26 18:16:42,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_22-model_00-model_states.pt. 0: [2022-11-26 18:16:42,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_23-model_00-model_states.pt... 0: [2022-11-26 18:16:42,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_23-model_00-model_states.pt. 0: [2022-11-26 18:16:42,513] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_24-model_00-model_states.pt... 0: [2022-11-26 18:16:42,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_24-model_00-model_states.pt. 0: [2022-11-26 18:16:42,619] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_25-model_00-model_states.pt... 0: [2022-11-26 18:16:42,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_25-model_00-model_states.pt. 0: [2022-11-26 18:16:42,722] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 18:16:42,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_26-model_00-model_states.pt. 0: [2022-11-26 18:16:42,821] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_27-model_00-model_states.pt... 0: [2022-11-26 18:16:42,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_27-model_00-model_states.pt. 0: [2022-11-26 18:16:42,928] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_28-model_00-model_states.pt... 0: [2022-11-26 18:16:43,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_28-model_00-model_states.pt. 0: [2022-11-26 18:16:43,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_29-model_00-model_states.pt... 0: [2022-11-26 18:16:43,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_29-model_00-model_states.pt. 0: [2022-11-26 18:16:43,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_30-model_00-model_states.pt... 0: [2022-11-26 18:16:43,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_30-model_00-model_states.pt. 0: [2022-11-26 18:16:43,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/layer_32-model_00-model_states.pt... 0: [2022-11-26 18:16:43,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 18:16:43,238] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step75000/mp_rank_00_model_states.pt 0: [2022-11-26 18:16:43,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/mp_rank_00_model_states.pt... 0: [2022-11-26 18:16:43,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/mp_rank_00_model_states.pt. 0: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
0: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
15: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
14: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
7: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
13: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
12: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
5: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
13: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
15: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:16:43,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step75000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:16:43,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:16:43,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
7: [2022-11-26 18:16:43,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:16:43,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 18:16:43,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:16:43,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 7: [2022-11-26 18:16:43,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 18:16:43,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 7: [2022-11-26 18:16:43,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:16:43,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 18:16:43,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 7: [2022-11-26 18:16:43,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:16:43,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 18:16:43,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 18:16:43,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 18:16:43,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 7: [2022-11-26 18:16:43,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 7: [2022-11-26 18:16:43,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:16:43,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 18:16:43,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 15: [2022-11-26 18:16:43,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:16:43,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:16:43,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 18:16:43,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 18:16:43,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
15: [2022-11-26 18:16:43,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 7: [2022-11-26 18:16:43,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:16:43,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 18:16:43,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 11: [2022-11-26 18:16:43,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 18:16:43,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 11: [2022-11-26 18:16:43,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:16:43,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 18:16:43,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 11: [2022-11-26 18:16:43,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:16:43,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 18:16:43,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
11: [2022-11-26 18:16:43,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:16:43,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 18:16:43,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 11: [2022-11-26 18:16:43,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:16:43,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 18:16:43,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 11: [2022-11-26 18:16:43,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:16:43,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 18:16:43,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 4: [2022-11-26 18:16:43,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:16:43,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 18:16:43,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
10: [2022-11-26 18:16:43,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:16:43,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 18:16:43,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 0: [2022-11-26 18:16:43,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:16:43,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:16:43,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:16:43,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 18:16:43,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 18:16:43,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 18:16:43,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 0: [2022-11-26 18:16:43,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 0: [2022-11-26 18:16:43,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
0: [2022-11-26 18:16:43,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:16:43,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 18:16:43,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 0: [2022-11-26 18:16:43,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:16:43,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:16:43,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 18:16:43,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 18:16:43,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 0: [2022-11-26 18:16:43,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 0: [2022-11-26 18:16:43,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:16:43,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 18:16:43,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
7: [2022-11-26 18:16:43,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:16:43,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 18:16:43,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 15: [2022-11-26 18:16:43,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:16:43,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 18:16:43,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 8: [2022-11-26 18:16:43,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:16:43,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 18:16:43,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 10: [2022-11-26 18:16:43,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:16:43,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 18:16:43,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
9: [2022-11-26 18:16:43,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:16:43,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 18:16:43,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 9: [2022-11-26 18:16:43,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:16:43,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 18:16:43,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 14: [2022-11-26 18:16:43,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:16:43,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 18:16:43,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 14: [2022-11-26 18:16:43,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:16:43,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 18:16:43,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
9: [2022-11-26 18:16:43,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:16:43,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 18:16:43,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 1: [2022-11-26 18:16:43,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:16:43,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:16:43,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 18:16:43,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 18:16:43,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 1: [2022-11-26 18:16:43,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 9: [2022-11-26 18:16:43,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:16:43,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 18:16:43,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
11: [2022-11-26 18:16:43,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:16:43,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 18:16:43,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 11: [2022-11-26 18:16:43,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:16:43,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 18:16:43,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 8: [2022-11-26 18:16:43,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:16:43,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 18:16:43,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 8: [2022-11-26 18:16:43,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:16:43,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 18:16:43,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
10: [2022-11-26 18:16:43,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:16:43,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 18:16:43,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 10: [2022-11-26 18:16:43,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:16:43,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 18:16:43,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 15: [2022-11-26 18:16:43,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:16:43,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 18:16:43,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 14: [2022-11-26 18:16:43,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:16:43,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 18:16:43,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
9: [2022-11-26 18:16:43,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:16:43,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:16:43,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:16:43,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:16:43,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 18:16:43,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 18:16:43,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 18:16:43,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 18:16:43,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 9: [2022-11-26 18:16:43,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 9: [2022-11-26 18:16:43,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 9: [2022-11-26 18:16:43,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
15: [2022-11-26 18:16:43,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:16:43,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:16:43,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 18:16:43,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 18:16:43,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 15: [2022-11-26 18:16:43,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 15: [2022-11-26 18:16:43,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:16:43,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 18:16:43,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 14: [2022-11-26 18:16:43,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:16:43,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 18:16:43,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 18:16:43,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:16:43,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 14: [2022-11-26 18:16:43,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 18:16:43,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 18:16:43,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 14: [2022-11-26 18:16:43,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 14: [2022-11-26 18:16:43,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:16:43,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 18:16:43,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 14: [2022-11-26 18:16:43,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:16:43,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 18:16:43,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
1: [2022-11-26 18:16:43,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:16:43,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:16:43,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:16:43,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 18:16:43,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 1: [2022-11-26 18:16:43,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 18:16:43,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 18:16:43,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 1: [2022-11-26 18:16:43,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 8: [2022-11-26 18:16:43,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:16:43,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-26 18:16:43,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 18:16:43,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 18:16:43,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 8: [2022-11-26 18:16:43,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 8: [2022-11-26 18:16:43,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:16:43,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 18:16:43,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 4: [2022-11-26 18:16:43,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:16:43,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 18:16:43,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 4: [2022-11-26 18:16:43,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 18:16:43,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 18:16:43,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:16:43,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 4: [2022-11-26 18:16:43,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 18:16:43,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 4: [2022-11-26 18:16:43,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:16:43,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 18:16:43,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 4: [2022-11-26 18:16:43,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:16:43,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 18:16:43,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 4: [2022-11-26 18:16:43,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 18:16:43,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:16:43,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 18:16:43,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 18:16:43,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 4: [2022-11-26 18:16:43,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 15: [2022-11-26 18:16:43,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:16:43,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 18:16:43,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 10: [2022-11-26 18:16:43,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:16:43,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 18:16:43,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 8: [2022-11-26 18:16:43,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-26 18:16:43,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 18:16:43,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 8: [2022-11-26 18:16:43,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:16:43,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 18:16:43,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 10: [2022-11-26 18:16:43,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:16:43,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 18:16:43,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 10: [2022-11-26 18:16:43,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:16:43,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 18:16:43,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 18:16:43,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 18:16:43,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 10: [2022-11-26 18:16:43,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 2: [2022-11-26 18:16:43,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:16:43,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:16:43,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:16:43,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 18:16:43,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 18:16:43,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 2: [2022-11-26 18:16:43,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
2: [2022-11-26 18:16:43,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 18:16:43,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 2: [2022-11-26 18:16:43,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:16:43,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 18:16:43,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 2: [2022-11-26 18:16:43,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:16:43,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 18:16:43,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 2: [2022-11-26 18:16:43,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:16:43,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 18:16:43,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 2: [2022-11-26 18:16:43,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 18:16:43,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 18:16:43,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 1: [2022-11-26 18:16:43,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:16:43,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:16:43,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 18:16:43,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 3: [2022-11-26 18:16:43,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:16:43,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 18:16:43,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 3: [2022-11-26 18:16:43,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:16:43,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:16:43,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 18:16:43,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:16:43,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:16:43,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 18:16:43,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 18:16:43,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 18:16:43,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 18:16:43,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 18:16:43,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 3: [2022-11-26 18:16:43,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 3: [2022-11-26 18:16:43,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 3: [2022-11-26 18:16:43,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 3: [2022-11-26 18:16:43,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
3: [2022-11-26 18:16:43,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:16:43,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 18:16:43,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 2: [2022-11-26 18:16:43,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:16:43,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 18:16:43,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 0: [2022-11-26 18:16:43,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 18:16:43,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 1: [2022-11-26 18:16:43,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 18:16:43,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 1: [2022-11-26 18:16:43,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-26 18:16:43,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 18:16:43,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 1: [2022-11-26 18:16:43,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:16:43,559] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 18:16:43,559] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 5: [2022-11-26 18:16:43,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:16:43,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:16:43,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:16:43,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:16:43,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 18:16:43,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 18:16:43,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 18:16:43,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 18:16:43,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 18:16:43,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 18:16:43,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 5: [2022-11-26 18:16:43,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 5: [2022-11-26 18:16:43,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 5: [2022-11-26 18:16:43,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 5: [2022-11-26 18:16:43,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 5: [2022-11-26 18:16:43,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 18:16:43,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 18:16:43,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 5: [2022-11-26 18:16:43,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:16:43,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 18:16:43,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 5: [2022-11-26 18:16:43,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:16:43,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 18:16:43,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 13: [2022-11-26 18:16:43,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:16:43,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:16:43,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:16:43,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 18:16:43,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 18:16:43,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 18:16:43,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 18:16:43,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 13: [2022-11-26 18:16:43,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 18:16:43,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 13: [2022-11-26 18:16:43,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 13: [2022-11-26 18:16:43,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 13: [2022-11-26 18:16:43,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:16:43,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:16:43,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:16:43,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 18:16:43,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 18:16:43,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 18:16:43,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 13: [2022-11-26 18:16:43,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 13: [2022-11-26 18:16:43,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 18:16:43,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 18:16:43,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 13: [2022-11-26 18:16:43,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 12: [2022-11-26 18:16:43,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:16:43,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:16:43,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:16:43,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 18:16:43,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:16:43,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 18:16:43,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 18:16:43,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 18:16:43,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 18:16:43,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 12: [2022-11-26 18:16:43,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 18:16:43,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 12: [2022-11-26 18:16:43,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 12: [2022-11-26 18:16:43,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 12: [2022-11-26 18:16:43,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 12: [2022-11-26 18:16:43,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 18:16:43,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:16:43,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:16:43,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 18:16:43,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 18:16:43,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 18:16:43,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 12: [2022-11-26 18:16:43,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 12: [2022-11-26 18:16:43,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 6: [2022-11-26 18:16:43,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:16:43,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 18:16:43,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:16:43,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
6: [2022-11-26 18:16:43,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 18:16:43,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:16:43,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 18:16:43,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 6: [2022-11-26 18:16:43,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 6: [2022-11-26 18:16:43,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:16:43,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 18:16:43,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 6: [2022-11-26 18:16:43,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:16:43,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 18:16:43,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 18:16:43,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 18:16:43,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 6: [2022-11-26 18:16:43,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 6: [2022-11-26 18:16:43,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:16:43,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:16:43,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 18:16:43,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step75000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 18:16:43,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 6: [2022-11-26 18:16:43,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step75000 is ready now! 
0: successfully saved checkpoint at iteration 75000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3728.04 15: iteration 75010/ 125429 | consumed samples: 19202560 | consumed tokens: 39326842880 | elapsed time per iteration (s): 1.46 | learning rate: 8.382E-05 | global batch size: 256 | lm loss: 1.943416E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 175.354 | TFLOPs: 28.98 | 15: iteration 75020/ 125429 | consumed samples: 19205120 | consumed tokens: 39332085760 | elapsed time per iteration (s): 1.05 | learning rate: 8.379E-05 | global batch size: 256 | lm loss: 1.909155E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.751 | TFLOPs: 40.28 | 15: iteration 75030/ 125429 | consumed samples: 19207680 | consumed tokens: 39337328640 | elapsed time per iteration (s): 1.03 | learning rate: 8.377E-05 | global batch size: 256 | lm loss: 1.950337E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.737 | TFLOPs: 41.27 | 15: iteration 75040/ 125429 | consumed samples: 19210240 | consumed tokens: 39342571520 | elapsed time per iteration (s): 1.03 | learning rate: 8.375E-05 | global batch size: 256 | lm loss: 1.958820E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.998 | TFLOPs: 41.15 | 15: iteration 75050/ 125429 | consumed samples: 19212800 | consumed tokens: 39347814400 | elapsed time per iteration (s): 1.06 | learning rate: 8.373E-05 | global batch size: 256 | lm loss: 1.945820E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.385 | TFLOPs: 40.06 | 15: iteration 75060/ 125429 | consumed samples: 19215360 | consumed tokens: 39353057280 | elapsed time per iteration (s): 1.03 | 
learning rate: 8.371E-05 | global batch size: 256 | lm loss: 1.938731E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.913 | TFLOPs: 40.97 | 15: iteration 75070/ 125429 | consumed samples: 19217920 | consumed tokens: 39358300160 | elapsed time per iteration (s): 1.04 | learning rate: 8.369E-05 | global batch size: 256 | lm loss: 1.961144E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.247 | TFLOPs: 40.86 | 15: iteration 75080/ 125429 | consumed samples: 19220480 | consumed tokens: 39363543040 | elapsed time per iteration (s): 1.07 | learning rate: 8.366E-05 | global batch size: 256 | lm loss: 1.935384E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.072 | TFLOPs: 39.67 | 15: iteration 75090/ 125429 | consumed samples: 19223040 | consumed tokens: 39368785920 | elapsed time per iteration (s): 1.09 | learning rate: 8.364E-05 | global batch size: 256 | lm loss: 1.933847E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.185 | TFLOPs: 38.87 | 15: iteration 75100/ 125429 | consumed samples: 19225600 | consumed tokens: 39374028800 | elapsed time per iteration (s): 1.05 | learning rate: 8.362E-05 | global batch size: 256 | lm loss: 1.942296E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.279 | TFLOPs: 40.37 | 15: iteration 75110/ 125429 | consumed samples: 19228160 | consumed tokens: 39379271680 | elapsed time per iteration (s): 1.04 | learning rate: 8.360E-05 | global batch size: 256 | lm loss: 1.944369E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.250 | TFLOPs: 40.53 | 15: iteration 75120/ 
125429 | consumed samples: 19230720 | consumed tokens: 39384514560 | elapsed time per iteration (s): 1.04 | learning rate: 8.358E-05 | global batch size: 256 | lm loss: 1.928687E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.202 | TFLOPs: 40.52 | 15: iteration 75130/ 125429 | consumed samples: 19233280 | consumed tokens: 39389757440 | elapsed time per iteration (s): 1.05 | learning rate: 8.355E-05 | global batch size: 256 | lm loss: 1.965894E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.853 | TFLOPs: 40.30 | 15: iteration 75140/ 125429 | consumed samples: 19235840 | consumed tokens: 39395000320 | elapsed time per iteration (s): 1.04 | learning rate: 8.353E-05 | global batch size: 256 | lm loss: 1.950634E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.917 | TFLOPs: 40.80 | 15: iteration 75150/ 125429 | consumed samples: 19238400 | consumed tokens: 39400243200 | elapsed time per iteration (s): 1.05 | learning rate: 8.351E-05 | global batch size: 256 | lm loss: 1.965085E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.683 | TFLOPs: 40.27 | 15: iteration 75160/ 125429 | consumed samples: 19240960 | consumed tokens: 39405486080 | elapsed time per iteration (s): 1.02 | learning rate: 8.349E-05 | global batch size: 256 | lm loss: 1.957133E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.788 | TFLOPs: 41.44 | 15: iteration 75170/ 125429 | consumed samples: 19243520 | consumed tokens: 39410728960 | elapsed time per iteration (s): 1.03 | learning rate: 8.347E-05 | global batch size: 256 | lm loss: 1.947320E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 247.745 | TFLOPs: 40.94 | 15: iteration 75180/ 125429 | consumed samples: 19246080 | consumed tokens: 39415971840 | elapsed time per iteration (s): 1.04 | learning rate: 8.345E-05 | global batch size: 256 | lm loss: 1.977646E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.350 | TFLOPs: 40.71 | 15: iteration 75190/ 125429 | consumed samples: 19248640 | consumed tokens: 39421214720 | elapsed time per iteration (s): 1.02 | learning rate: 8.342E-05 | global batch size: 256 | lm loss: 1.938881E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.805 | TFLOPs: 41.28 | 15: iteration 75200/ 125429 | consumed samples: 19251200 | consumed tokens: 39426457600 | elapsed time per iteration (s): 1.08 | learning rate: 8.340E-05 | global batch size: 256 | lm loss: 1.971684E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.557 | TFLOPs: 39.09 | 15: iteration 75210/ 125429 | consumed samples: 19253760 | consumed tokens: 39431700480 | elapsed time per iteration (s): 1.05 | learning rate: 8.338E-05 | global batch size: 256 | lm loss: 1.984546E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.305 | TFLOPs: 40.37 | 15: iteration 75220/ 125429 | consumed samples: 19256320 | consumed tokens: 39436943360 | elapsed time per iteration (s): 1.04 | learning rate: 8.336E-05 | global batch size: 256 | lm loss: 1.983709E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.566 | TFLOPs: 40.58 | 15: iteration 75230/ 125429 | consumed samples: 19258880 | consumed tokens: 39442186240 | elapsed time per iteration (s): 1.03 | learning rate: 
8.334E-05 | global batch size: 256 | lm loss: 1.971929E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.500 | TFLOPs: 41.07 | 15: iteration 75240/ 125429 | consumed samples: 19261440 | consumed tokens: 39447429120 | elapsed time per iteration (s): 1.06 | learning rate: 8.332E-05 | global batch size: 256 | lm loss: 1.975102E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.550 | TFLOPs: 40.08 | 15: iteration 75250/ 125429 | consumed samples: 19264000 | consumed tokens: 39452672000 | elapsed time per iteration (s): 1.05 | learning rate: 8.329E-05 | global batch size: 256 | lm loss: 1.955750E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.147 | TFLOPs: 40.35 | 15: iteration 75260/ 125429 | consumed samples: 19266560 | consumed tokens: 39457914880 | elapsed time per iteration (s): 1.03 | learning rate: 8.327E-05 | global batch size: 256 | lm loss: 1.941615E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.324 | TFLOPs: 41.20 | 15: iteration 75270/ 125429 | consumed samples: 19269120 | consumed tokens: 39463157760 | elapsed time per iteration (s): 1.04 | learning rate: 8.325E-05 | global batch size: 256 | lm loss: 1.955163E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.004 | TFLOPs: 40.49 | 15: iteration 75280/ 125429 | consumed samples: 19271680 | consumed tokens: 39468400640 | elapsed time per iteration (s): 1.03 | learning rate: 8.323E-05 | global batch size: 256 | lm loss: 1.962604E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.762 | TFLOPs: 40.94 | 15: iteration 75290/ 125429 | 
consumed samples: 19274240 | consumed tokens: 39473643520 | elapsed time per iteration (s): 1.05 | learning rate: 8.321E-05 | global batch size: 256 | lm loss: 1.972443E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.801 | TFLOPs: 40.12 | 15: iteration 75300/ 125429 | consumed samples: 19276800 | consumed tokens: 39478886400 | elapsed time per iteration (s): 1.06 | learning rate: 8.318E-05 | global batch size: 256 | lm loss: 1.985597E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.544 | TFLOPs: 40.08 | 15: iteration 75310/ 125429 | consumed samples: 19279360 | consumed tokens: 39484129280 | elapsed time per iteration (s): 1.02 | learning rate: 8.316E-05 | global batch size: 256 | lm loss: 1.925259E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.419 | TFLOPs: 41.38 | 15: iteration 75320/ 125429 | consumed samples: 19281920 | consumed tokens: 39489372160 | elapsed time per iteration (s): 1.05 | learning rate: 8.314E-05 | global batch size: 256 | lm loss: 1.929471E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.060 | TFLOPs: 40.33 | 15: iteration 75330/ 125429 | consumed samples: 19284480 | consumed tokens: 39494615040 | elapsed time per iteration (s): 1.03 | learning rate: 8.312E-05 | global batch size: 256 | lm loss: 1.946627E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.898 | TFLOPs: 40.97 | 15: iteration 75340/ 125429 | consumed samples: 19287040 | consumed tokens: 39499857920 | elapsed time per iteration (s): 1.03 | learning rate: 8.310E-05 | global batch size: 256 | lm loss: 1.929946E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 248.795 | TFLOPs: 41.12 | 15: iteration 75350/ 125429 | consumed samples: 19289600 | consumed tokens: 39505100800 | elapsed time per iteration (s): 1.03 | learning rate: 8.308E-05 | global batch size: 256 | lm loss: 1.955659E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.759 | TFLOPs: 40.94 | 15: iteration 75360/ 125429 | consumed samples: 19292160 | consumed tokens: 39510343680 | elapsed time per iteration (s): 1.02 | learning rate: 8.305E-05 | global batch size: 256 | lm loss: 1.918511E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.477 | TFLOPs: 41.39 | 15: iteration 75370/ 125429 | consumed samples: 19294720 | consumed tokens: 39515586560 | elapsed time per iteration (s): 1.03 | learning rate: 8.303E-05 | global batch size: 256 | lm loss: 1.947364E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.496 | TFLOPs: 40.90 | 15: iteration 75380/ 125429 | consumed samples: 19297280 | consumed tokens: 39520829440 | elapsed time per iteration (s): 1.04 | learning rate: 8.301E-05 | global batch size: 256 | lm loss: 1.945854E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.834 | TFLOPs: 40.79 | 15: iteration 75390/ 125429 | consumed samples: 19299840 | consumed tokens: 39526072320 | elapsed time per iteration (s): 1.02 | learning rate: 8.299E-05 | global batch size: 256 | lm loss: 1.961376E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.402 | TFLOPs: 41.38 | 15: iteration 75400/ 125429 | consumed samples: 19302400 | consumed tokens: 39531315200 | elapsed time per iteration (s): 1.04 | learning rate: 8.297E-05 | global batch 
size: 256 | lm loss: 1.964561E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.887 | TFLOPs: 40.80 | 15: iteration 75410/ 125429 | consumed samples: 19304960 | consumed tokens: 39536558080 | elapsed time per iteration (s): 1.03 | learning rate: 8.295E-05 | global batch size: 256 | lm loss: 1.962014E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.631 | TFLOPs: 41.25 | 15: iteration 75420/ 125429 | consumed samples: 19307520 | consumed tokens: 39541800960 | elapsed time per iteration (s): 1.03 | learning rate: 8.292E-05 | global batch size: 256 | lm loss: 1.958470E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.944 | TFLOPs: 41.14 | 15: iteration 75430/ 125429 | consumed samples: 19310080 | consumed tokens: 39547043840 | elapsed time per iteration (s): 1.03 | learning rate: 8.290E-05 | global batch size: 256 | lm loss: 1.938298E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.729 | TFLOPs: 41.10 | 15: iteration 75440/ 125429 | consumed samples: 19312640 | consumed tokens: 39552286720 | elapsed time per iteration (s): 1.04 | learning rate: 8.288E-05 | global batch size: 256 | lm loss: 1.960139E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.914 | TFLOPs: 40.80 | 15: iteration 75450/ 125429 | consumed samples: 19315200 | consumed tokens: 39557529600 | elapsed time per iteration (s): 1.03 | learning rate: 8.286E-05 | global batch size: 256 | lm loss: 1.957790E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.660 | TFLOPs: 40.93 | 15: iteration 75460/ 125429 | consumed samples: 19317760 | 
consumed tokens: 39562772480 | elapsed time per iteration (s): 1.04 | learning rate: 8.284E-05 | global batch size: 256 | lm loss: 1.943458E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.160 | TFLOPs: 40.68 | 15: iteration 75470/ 125429 | consumed samples: 19320320 | consumed tokens: 39568015360 | elapsed time per iteration (s): 1.03 | learning rate: 8.282E-05 | global batch size: 256 | lm loss: 1.951649E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.084 | TFLOPs: 41.00 | 15: iteration 75480/ 125429 | consumed samples: 19322880 | consumed tokens: 39573258240 | elapsed time per iteration (s): 1.06 | learning rate: 8.279E-05 | global batch size: 256 | lm loss: 1.923269E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.977 | TFLOPs: 39.99 | 15: iteration 75490/ 125429 | consumed samples: 19325440 | consumed tokens: 39578501120 | elapsed time per iteration (s): 1.04 | learning rate: 8.277E-05 | global batch size: 256 | lm loss: 1.960239E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.372 | TFLOPs: 40.55 | 15: iteration 75500/ 125429 | consumed samples: 19328000 | consumed tokens: 39583744000 | elapsed time per iteration (s): 1.03 | learning rate: 8.275E-05 | global batch size: 256 | lm loss: 1.960342E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.390 | TFLOPs: 41.05 | 15: iteration 75510/ 125429 | consumed samples: 19330560 | consumed tokens: 39588986880 | elapsed time per iteration (s): 1.04 | learning rate: 8.273E-05 | global batch size: 256 | lm loss: 1.950086E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 247.219 | TFLOPs: 40.85 | 15: iteration 75520/ 125429 | consumed samples: 19333120 | consumed tokens: 39594229760 | elapsed time per iteration (s): 1.04 | learning rate: 8.271E-05 | global batch size: 256 | lm loss: 1.952092E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.305 | TFLOPs: 40.87 | 15: iteration 75530/ 125429 | consumed samples: 19335680 | consumed tokens: 39599472640 | elapsed time per iteration (s): 1.04 | learning rate: 8.269E-05 | global batch size: 256 | lm loss: 1.945654E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.844 | TFLOPs: 40.79 | 15: iteration 75540/ 125429 | consumed samples: 19338240 | consumed tokens: 39604715520 | elapsed time per iteration (s): 1.08 | learning rate: 8.266E-05 | global batch size: 256 | lm loss: 1.936649E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.596 | TFLOPs: 39.26 | 15: iteration 75550/ 125429 | consumed samples: 19340800 | consumed tokens: 39609958400 | elapsed time per iteration (s): 1.03 | learning rate: 8.264E-05 | global batch size: 256 | lm loss: 1.950344E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.864 | TFLOPs: 41.13 | 15: iteration 75560/ 125429 | consumed samples: 19343360 | consumed tokens: 39615201280 | elapsed time per iteration (s): 1.05 | learning rate: 8.262E-05 | global batch size: 256 | lm loss: 1.954379E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.929 | TFLOPs: 40.15 | 15: iteration 75570/ 125429 | consumed samples: 19345920 | consumed tokens: 39620444160 | elapsed time per iteration (s): 1.04 | learning rate: 8.260E-05 | global batch size: 256 | lm loss: 
1.982782E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.287 | TFLOPs: 40.70 | 15: iteration 75580/ 125429 | consumed samples: 19348480 | consumed tokens: 39625687040 | elapsed time per iteration (s): 1.03 | learning rate: 8.258E-05 | global batch size: 256 | lm loss: 1.954033E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.420 | TFLOPs: 40.89 | 15: iteration 75590/ 125429 | consumed samples: 19351040 | consumed tokens: 39630929920 | elapsed time per iteration (s): 1.02 | learning rate: 8.256E-05 | global batch size: 256 | lm loss: 1.943976E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.116 | TFLOPs: 41.33 | 15: iteration 75600/ 125429 | consumed samples: 19353600 | consumed tokens: 39636172800 | elapsed time per iteration (s): 1.02 | learning rate: 8.253E-05 | global batch size: 256 | lm loss: 1.964423E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.835 | TFLOPs: 41.45 | 15: iteration 75610/ 125429 | consumed samples: 19356160 | consumed tokens: 39641415680 | elapsed time per iteration (s): 1.06 | learning rate: 8.251E-05 | global batch size: 256 | lm loss: 1.935428E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.680 | TFLOPs: 39.94 | 15: iteration 75620/ 125429 | consumed samples: 19358720 | consumed tokens: 39646658560 | elapsed time per iteration (s): 1.03 | learning rate: 8.249E-05 | global batch size: 256 | lm loss: 1.948080E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.365 | TFLOPs: 41.04 | 15: iteration 75630/ 125429 | consumed samples: 19361280 | consumed tokens: 
39651901440 | elapsed time per iteration (s): 1.05 | learning rate: 8.247E-05 | global batch size: 256 | lm loss: 1.981058E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.950 | TFLOPs: 40.31 | 15: iteration 75640/ 125429 | consumed samples: 19363840 | consumed tokens: 39657144320 | elapsed time per iteration (s): 1.03 | learning rate: 8.245E-05 | global batch size: 256 | lm loss: 1.957547E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.502 | TFLOPs: 41.23 | 15: iteration 75650/ 125429 | consumed samples: 19366400 | consumed tokens: 39662387200 | elapsed time per iteration (s): 1.03 | learning rate: 8.242E-05 | global batch size: 256 | lm loss: 1.955461E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.042 | TFLOPs: 40.99 | 15: iteration 75660/ 125429 | consumed samples: 19368960 | consumed tokens: 39667630080 | elapsed time per iteration (s): 1.03 | learning rate: 8.240E-05 | global batch size: 256 | lm loss: 1.952463E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.715 | TFLOPs: 41.10 | 15: iteration 75670/ 125429 | consumed samples: 19371520 | consumed tokens: 39672872960 | elapsed time per iteration (s): 1.07 | learning rate: 8.238E-05 | global batch size: 256 | lm loss: 1.946802E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.966 | TFLOPs: 39.66 | 15: iteration 75680/ 125429 | consumed samples: 19374080 | consumed tokens: 39678115840 | elapsed time per iteration (s): 1.04 | learning rate: 8.236E-05 | global batch size: 256 | lm loss: 1.949503E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 245.219 | TFLOPs: 40.52 | 15: iteration 75690/ 125429 | consumed samples: 19376640 | consumed tokens: 39683358720 | elapsed time per iteration (s): 1.02 | learning rate: 8.234E-05 | global batch size: 256 | lm loss: 1.939089E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.780 | TFLOPs: 41.28 | 15: iteration 75700/ 125429 | consumed samples: 19379200 | consumed tokens: 39688601600 | elapsed time per iteration (s): 1.07 | learning rate: 8.232E-05 | global batch size: 256 | lm loss: 1.953093E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.946 | TFLOPs: 39.49 | 15: iteration 75710/ 125429 | consumed samples: 19381760 | consumed tokens: 39693844480 | elapsed time per iteration (s): 1.03 | learning rate: 8.229E-05 | global batch size: 256 | lm loss: 1.931833E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.223 | TFLOPs: 41.19 | 15: iteration 75720/ 125429 | consumed samples: 19384320 | consumed tokens: 39699087360 | elapsed time per iteration (s): 1.04 | learning rate: 8.227E-05 | global batch size: 256 | lm loss: 1.937464E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.278 | TFLOPs: 40.86 | 15: iteration 75730/ 125429 | consumed samples: 19386880 | consumed tokens: 39704330240 | elapsed time per iteration (s): 1.05 | learning rate: 8.225E-05 | global batch size: 256 | lm loss: 1.958224E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.048 | TFLOPs: 40.17 | 15: iteration 75740/ 125429 | consumed samples: 19389440 | consumed tokens: 39709573120 | elapsed time per iteration (s): 1.08 | learning rate: 8.223E-05 | global batch size: 256 | lm loss: 1.926576E+00 | grad 
norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.131 | TFLOPs: 39.02 | 15: iteration 75750/ 125429 | consumed samples: 19392000 | consumed tokens: 39714816000 | elapsed time per iteration (s): 1.07 | learning rate: 8.221E-05 | global batch size: 256 | lm loss: 1.962414E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.664 | TFLOPs: 39.61 | 15: iteration 75760/ 125429 | consumed samples: 19394560 | consumed tokens: 39720058880 | elapsed time per iteration (s): 1.03 | learning rate: 8.219E-05 | global batch size: 256 | lm loss: 1.943484E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.056 | TFLOPs: 41.16 | 15: iteration 75770/ 125429 | consumed samples: 19397120 | consumed tokens: 39725301760 | elapsed time per iteration (s): 1.06 | learning rate: 8.216E-05 | global batch size: 256 | lm loss: 1.980166E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.982 | TFLOPs: 39.82 | 15: iteration 75780/ 125429 | consumed samples: 19399680 | consumed tokens: 39730544640 | elapsed time per iteration (s): 1.05 | learning rate: 8.214E-05 | global batch size: 256 | lm loss: 1.956748E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.628 | TFLOPs: 40.26 | 15: iteration 75790/ 125429 | consumed samples: 19402240 | consumed tokens: 39735787520 | elapsed time per iteration (s): 1.06 | learning rate: 8.212E-05 | global batch size: 256 | lm loss: 1.931280E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.175 | TFLOPs: 39.86 | 15: iteration 75800/ 125429 | consumed samples: 19404800 | consumed tokens: 39741030400 | elapsed time 
per iteration (s): 1.08 | learning rate: 8.210E-05 | global batch size: 256 | lm loss: 1.941435E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.882 | TFLOPs: 39.31 | 15: iteration 75810/ 125429 | consumed samples: 19407360 | consumed tokens: 39746273280 | elapsed time per iteration (s): 1.06 | learning rate: 8.208E-05 | global batch size: 256 | lm loss: 1.937358E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.099 | TFLOPs: 39.84 | 15: iteration 75820/ 125429 | consumed samples: 19409920 | consumed tokens: 39751516160 | elapsed time per iteration (s): 1.07 | learning rate: 8.206E-05 | global batch size: 256 | lm loss: 1.962962E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.731 | TFLOPs: 39.45 | 15: iteration 75830/ 125429 | consumed samples: 19412480 | consumed tokens: 39756759040 | elapsed time per iteration (s): 1.05 | learning rate: 8.204E-05 | global batch size: 256 | lm loss: 1.965361E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.970 | TFLOPs: 40.48 | 15: iteration 75840/ 125429 | consumed samples: 19415040 | consumed tokens: 39762001920 | elapsed time per iteration (s): 1.04 | learning rate: 8.201E-05 | global batch size: 256 | lm loss: 1.963353E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.352 | TFLOPs: 40.71 | 15: iteration 75850/ 125429 | consumed samples: 19417600 | consumed tokens: 39767244800 | elapsed time per iteration (s): 1.07 | learning rate: 8.199E-05 | global batch size: 256 | lm loss: 1.940814E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.101 | TFLOPs: 
39.68 | 15: iteration 75860/ 125429 | consumed samples: 19420160 | consumed tokens: 39772487680 | elapsed time per iteration (s): 1.03 | learning rate: 8.197E-05 | global batch size: 256 | lm loss: 1.966582E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.346 | TFLOPs: 41.21 | 15: iteration 75870/ 125429 | consumed samples: 19422720 | consumed tokens: 39777730560 | elapsed time per iteration (s): 1.03 | learning rate: 8.195E-05 | global batch size: 256 | lm loss: 1.956171E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.079 | TFLOPs: 41.00 | 15: iteration 75880/ 125429 | consumed samples: 19425280 | consumed tokens: 39782973440 | elapsed time per iteration (s): 1.04 | learning rate: 8.193E-05 | global batch size: 256 | lm loss: 1.958403E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.546 | TFLOPs: 40.58 | 15: iteration 75890/ 125429 | consumed samples: 19427840 | consumed tokens: 39788216320 | elapsed time per iteration (s): 1.03 | learning rate: 8.191E-05 | global batch size: 256 | lm loss: 1.968567E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.033 | TFLOPs: 41.15 | 15: iteration 75900/ 125429 | consumed samples: 19430400 | consumed tokens: 39793459200 | elapsed time per iteration (s): 1.03 | learning rate: 8.188E-05 | global batch size: 256 | lm loss: 1.927948E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.745 | TFLOPs: 41.27 | 15: iteration 75910/ 125429 | consumed samples: 19432960 | consumed tokens: 39798702080 | elapsed time per iteration (s): 1.05 | learning rate: 8.186E-05 | global batch size: 256 | lm loss: 1.960605E+00 | grad norm: 0.144 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.765 | TFLOPs: 40.28 | 15: iteration 75920/ 125429 | consumed samples: 19435520 | consumed tokens: 39803944960 | elapsed time per iteration (s): 1.05 | learning rate: 8.184E-05 | global batch size: 256 | lm loss: 1.943475E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.537 | TFLOPs: 40.41 | 15: iteration 75930/ 125429 | consumed samples: 19438080 | consumed tokens: 39809187840 | elapsed time per iteration (s): 1.03 | learning rate: 8.182E-05 | global batch size: 256 | lm loss: 1.937741E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.286 | TFLOPs: 41.03 | 15: iteration 75940/ 125429 | consumed samples: 19440640 | consumed tokens: 39814430720 | elapsed time per iteration (s): 1.05 | learning rate: 8.180E-05 | global batch size: 256 | lm loss: 1.920694E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.851 | TFLOPs: 40.46 | 15: iteration 75950/ 125429 | consumed samples: 19443200 | consumed tokens: 39819673600 | elapsed time per iteration (s): 1.07 | learning rate: 8.178E-05 | global batch size: 256 | lm loss: 1.942978E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.387 | TFLOPs: 39.40 | 15: iteration 75960/ 125429 | consumed samples: 19445760 | consumed tokens: 39824916480 | elapsed time per iteration (s): 1.05 | learning rate: 8.175E-05 | global batch size: 256 | lm loss: 1.968281E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.069 | TFLOPs: 40.17 | 15: iteration 75970/ 125429 | consumed samples: 19448320 | consumed tokens: 39830159360 | elapsed time per iteration (s): 1.04 | 
learning rate: 8.173E-05 | global batch size: 256 | lm loss: 1.969512E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.641 | TFLOPs: 40.59 | 15: iteration 75980/ 125429 | consumed samples: 19450880 | consumed tokens: 39835402240 | elapsed time per iteration (s): 1.03 | learning rate: 8.171E-05 | global batch size: 256 | lm loss: 1.978332E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.986 | TFLOPs: 41.15 | 15: iteration 75990/ 125429 | consumed samples: 19453440 | consumed tokens: 39840645120 | elapsed time per iteration (s): 1.05 | learning rate: 8.169E-05 | global batch size: 256 | lm loss: 1.946033E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.410 | TFLOPs: 40.39 | 0: [2022-11-26 18:34:06,792] [INFO] [logging.py:68:log_dist] [Rank 0] step=76000, skipped=0, lr=[8.166742247324737e-05, 8.166742247324737e-05, 8.166742247324737e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 0: steps: 76000 loss: 1.8575 iter time (s): 1.042 samples/sec: 245.620 15: iteration 76000/ 125429 | consumed samples: 19456000 | consumed tokens: 39845888000 | elapsed time per iteration (s): 1.05 | learning rate: 8.167E-05 | global batch size: 256 | lm loss: 1.954420E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.385 | TFLOPs: 40.22 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 76000 | lm loss value: 1.834803E+00 | lm loss PPL: 6.263900E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 76000 to checkpoints_1b5 0: [2022-11-26 18:34:07,147] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step76000 is begin to save! 0: [2022-11-26 18:34:07,158] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_01-model_00-model_states.pt... 0: [2022-11-26 18:34:07,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_01-model_00-model_states.pt. 0: [2022-11-26 18:34:07,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_03-model_00-model_states.pt... 0: [2022-11-26 18:34:07,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_03-model_00-model_states.pt. 0: [2022-11-26 18:34:07,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_04-model_00-model_states.pt... 0: [2022-11-26 18:34:07,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_04-model_00-model_states.pt. 0: [2022-11-26 18:34:07,651] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_05-model_00-model_states.pt... 0: [2022-11-26 18:34:07,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_05-model_00-model_states.pt. 0: [2022-11-26 18:34:07,766] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_06-model_00-model_states.pt... 0: [2022-11-26 18:34:07,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_06-model_00-model_states.pt. 0: [2022-11-26 18:34:07,879] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_07-model_00-model_states.pt... 0: [2022-11-26 18:34:07,994] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 18:34:07,994] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_08-model_00-model_states.pt... 0: [2022-11-26 18:34:08,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_08-model_00-model_states.pt. 0: [2022-11-26 18:34:08,109] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_09-model_00-model_states.pt... 0: [2022-11-26 18:34:08,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_09-model_00-model_states.pt. 0: [2022-11-26 18:34:08,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_10-model_00-model_states.pt... 0: [2022-11-26 18:34:08,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_10-model_00-model_states.pt. 0: [2022-11-26 18:34:08,337] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_11-model_00-model_states.pt... 0: [2022-11-26 18:34:08,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_11-model_00-model_states.pt. 0: [2022-11-26 18:34:08,452] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_12-model_00-model_states.pt... 0: [2022-11-26 18:34:08,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_12-model_00-model_states.pt. 0: [2022-11-26 18:34:08,567] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_13-model_00-model_states.pt... 0: [2022-11-26 18:34:08,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 18:34:08,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_14-model_00-model_states.pt... 0: [2022-11-26 18:34:08,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_14-model_00-model_states.pt. 0: [2022-11-26 18:34:08,797] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_15-model_00-model_states.pt... 0: [2022-11-26 18:34:08,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_15-model_00-model_states.pt. 0: [2022-11-26 18:34:08,912] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_16-model_00-model_states.pt... 0: [2022-11-26 18:34:09,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_16-model_00-model_states.pt. 0: [2022-11-26 18:34:09,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_17-model_00-model_states.pt... 0: [2022-11-26 18:34:09,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_17-model_00-model_states.pt. 0: [2022-11-26 18:34:09,144] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_18-model_00-model_states.pt... 0: [2022-11-26 18:34:09,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_18-model_00-model_states.pt. 0: [2022-11-26 18:34:09,259] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_19-model_00-model_states.pt... 0: [2022-11-26 18:34:09,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 18:34:09,371] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_20-model_00-model_states.pt... 0: [2022-11-26 18:34:09,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_20-model_00-model_states.pt. 0: [2022-11-26 18:34:09,490] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_21-model_00-model_states.pt... 0: [2022-11-26 18:34:09,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_21-model_00-model_states.pt. 0: [2022-11-26 18:34:09,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_22-model_00-model_states.pt... 0: [2022-11-26 18:34:09,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_22-model_00-model_states.pt. 0: [2022-11-26 18:34:09,712] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_23-model_00-model_states.pt... 0: [2022-11-26 18:34:09,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_23-model_00-model_states.pt. 0: [2022-11-26 18:34:09,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_24-model_00-model_states.pt... 0: [2022-11-26 18:34:09,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_24-model_00-model_states.pt. 0: [2022-11-26 18:34:09,938] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_25-model_00-model_states.pt... 0: [2022-11-26 18:34:10,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 18:34:10,049] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_26-model_00-model_states.pt... 0: [2022-11-26 18:34:10,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_26-model_00-model_states.pt. 0: [2022-11-26 18:34:10,160] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_27-model_00-model_states.pt... 0: [2022-11-26 18:34:10,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_27-model_00-model_states.pt. 0: [2022-11-26 18:34:10,271] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_28-model_00-model_states.pt... 0: [2022-11-26 18:34:10,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_28-model_00-model_states.pt. 0: [2022-11-26 18:34:10,382] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_29-model_00-model_states.pt... 0: [2022-11-26 18:34:10,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_29-model_00-model_states.pt. 0: [2022-11-26 18:34:10,496] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_30-model_00-model_states.pt... 0: [2022-11-26 18:34:10,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_30-model_00-model_states.pt. 0: [2022-11-26 18:34:10,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/layer_32-model_00-model_states.pt... 0: [2022-11-26 18:34:10,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 18:34:10,613] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step76000/mp_rank_00_model_states.pt 0: [2022-11-26 18:34:10,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/mp_rank_00_model_states.pt... 0: [2022-11-26 18:34:10,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/mp_rank_00_model_states.pt. 0: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
5: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
12: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
11: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
10: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
14: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
3: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:34:10,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:34:10,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step76000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:34:10,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 18:34:10,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 18:34:10,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 15: [2022-11-26 18:34:10,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:34:10,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:34:10,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:34:10,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 18:34:10,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 18:34:10,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:34:10,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:34:10,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:34:10,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 2: [2022-11-26 18:34:10,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
5: [2022-11-26 18:34:10,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 2: [2022-11-26 18:34:10,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 18:34:10,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 8: [2022-11-26 18:34:10,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 18:34:10,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 18:34:10,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:34:10,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 18:34:10,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 18:34:10,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:34:10,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 18:34:10,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 8: [2022-11-26 18:34:10,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 18:34:10,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 18:34:10,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 3: [2022-11-26 18:34:10,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:34:10,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 18:34:10,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 14: [2022-11-26 18:34:10,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:34:10,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:34:10,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:34:10,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 18:34:10,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 18:34:10,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
14: [2022-11-26 18:34:10,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 18:34:10,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 14: [2022-11-26 18:34:10,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 6: [2022-11-26 18:34:10,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:34:10,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 18:34:10,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 12: [2022-11-26 18:34:10,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:34:10,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 18:34:10,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 18:34:10,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:34:10,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 18:34:10,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
3: [2022-11-26 18:34:10,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:34:10,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 18:34:10,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 10: [2022-11-26 18:34:10,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:34:10,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 18:34:10,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 5: [2022-11-26 18:34:10,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:34:10,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 18:34:10,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 5: [2022-11-26 18:34:10,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:34:10,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 18:34:10,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
10: [2022-11-26 18:34:10,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:34:10,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 3: [2022-11-26 18:34:10,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:34:10,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:34:10,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:34:10,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 2: [2022-11-26 18:34:10,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 18:34:10,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 3: [2022-11-26 18:34:10,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 2: [2022-11-26 18:34:10,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 2: [2022-11-26 18:34:10,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 3: [2022-11-26 18:34:10,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
15: [2022-11-26 18:34:10,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 18:34:10,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 15: [2022-11-26 18:34:10,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:34:10,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 18:34:10,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:34:10,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 10: [2022-11-26 18:34:10,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:34:10,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 18:34:10,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 10: [2022-11-26 18:34:10,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 18:34:10,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 15: [2022-11-26 18:34:10,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 18:34:10,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 18:34:10,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 14: [2022-11-26 18:34:10,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:34:10,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:34:10,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 18:34:10,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 18:34:10,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 14: [2022-11-26 18:34:10,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 5: [2022-11-26 18:34:10,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:34:10,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 18:34:10,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 3: [2022-11-26 18:34:10,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-26 18:34:10,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 18:34:10,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 6: [2022-11-26 18:34:10,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:34:10,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:34:10,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 18:34:10,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 6: [2022-11-26 18:34:10,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 18:34:10,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 13: [2022-11-26 18:34:10,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:34:10,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 18:34:10,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 13: [2022-11-26 18:34:10,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 18:34:10,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 18:34:10,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 13: [2022-11-26 18:34:10,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:34:10,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 18:34:10,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 13: [2022-11-26 18:34:10,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:34:10,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 18:34:10,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 7: [2022-11-26 18:34:10,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:34:10,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 18:34:10,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
1: [2022-11-26 18:34:10,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 18:34:10,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 1: [2022-11-26 18:34:10,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:34:10,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:34:10,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 18:34:10,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 18:34:10,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 1: [2022-11-26 18:34:10,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 15: [2022-11-26 18:34:10,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:34:10,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 18:34:10,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 7: [2022-11-26 18:34:10,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 18:34:10,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 18:34:10,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 1: [2022-11-26 18:34:10,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:34:10,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:34:10,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 18:34:10,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 18:34:10,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 1: [2022-11-26 18:34:10,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 1: [2022-11-26 18:34:10,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:34:10,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:34:10,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 18:34:10,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
12: [2022-11-26 18:34:10,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:34:10,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 18:34:10,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 7: [2022-11-26 18:34:10,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:34:10,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 18:34:10,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 7: [2022-11-26 18:34:10,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:34:10,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 18:34:10,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 13: [2022-11-26 18:34:10,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:34:10,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
13: [2022-11-26 18:34:10,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 8: [2022-11-26 18:34:10,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:34:10,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 13: [2022-11-26 18:34:10,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 8: [2022-11-26 18:34:10,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 8: [2022-11-26 18:34:10,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 13: [2022-11-26 18:34:10,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:34:10,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 13: [2022-11-26 18:34:10,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 8: [2022-11-26 18:34:10,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:34:10,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
8: [2022-11-26 18:34:10,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 18:34:10,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 8: [2022-11-26 18:34:10,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:34:10,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 18:34:10,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 18:34:10,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:34:10,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 18:34:10,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 2: [2022-11-26 18:34:10,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:34:10,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 18:34:10,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 18:34:10,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 18:34:10,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 18:34:10,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 10: [2022-11-26 18:34:10,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:34:10,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 18:34:10,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 10: [2022-11-26 18:34:10,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:34:10,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 18:34:10,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 3: [2022-11-26 18:34:10,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:34:10,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 18:34:10,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 10: [2022-11-26 18:34:10,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 18:34:10,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:34:10,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 18:34:10,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 18:34:10,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 10: [2022-11-26 18:34:10,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 5: [2022-11-26 18:34:10,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:34:10,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 18:34:10,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 7: [2022-11-26 18:34:10,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:34:10,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 18:34:10,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 7: [2022-11-26 18:34:10,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 18:34:10,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 18:34:10,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 15: [2022-11-26 18:34:10,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:34:10,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 18:34:10,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 6: [2022-11-26 18:34:10,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:34:10,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 18:34:10,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 12: [2022-11-26 18:34:10,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:34:10,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 18:34:10,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 18:34:10,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 18:34:10,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 18:34:10,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 2: [2022-11-26 18:34:10,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:34:10,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 18:34:10,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 11: [2022-11-26 18:34:10,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:34:10,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:34:10,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:34:10,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:34:10,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-26 18:34:10,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 18:34:10,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 18:34:10,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 18:34:10,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 18:34:10,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 18:34:10,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 11: [2022-11-26 18:34:10,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 11: [2022-11-26 18:34:10,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 11: [2022-11-26 18:34:10,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 11: [2022-11-26 18:34:10,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 2: [2022-11-26 18:34:10,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 18:34:10,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 18:34:10,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 2: [2022-11-26 18:34:10,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:34:10,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 18:34:10,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 12: [2022-11-26 18:34:10,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:34:10,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 18:34:10,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 14: [2022-11-26 18:34:10,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:34:10,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 18:34:10,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 4: [2022-11-26 18:34:10,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 18:34:10,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:34:10,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 18:34:10,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 18:34:10,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 4: [2022-11-26 18:34:10,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 4: [2022-11-26 18:34:10,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:34:10,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:34:10,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:34:10,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 18:34:10,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 18:34:10,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 4: [2022-11-26 18:34:10,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
4: [2022-11-26 18:34:10,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 18:34:10,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 3: [2022-11-26 18:34:10,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:34:10,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 18:34:10,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 1: [2022-11-26 18:34:10,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 18:34:10,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 1: [2022-11-26 18:34:10,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:34:10,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 18:34:10,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 6: [2022-11-26 18:34:10,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 18:34:10,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 18:34:10,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 5: [2022-11-26 18:34:10,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:34:10,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 18:34:10,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 6: [2022-11-26 18:34:10,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:34:10,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 18:34:10,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 1: [2022-11-26 18:34:10,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:34:10,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 18:34:10,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 12: [2022-11-26 18:34:10,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 18:34:10,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 18:34:10,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 18:34:10,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:34:10,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 18:34:10,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 4: [2022-11-26 18:34:10,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:34:10,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 18:34:10,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 9: [2022-11-26 18:34:10,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:34:10,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:34:10,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 18:34:10,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 18:34:10,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 18:34:10,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 18:34:10,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 9: [2022-11-26 18:34:10,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 9: [2022-11-26 18:34:10,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 0: [2022-11-26 18:34:10,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 18:34:10,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 9: [2022-11-26 18:34:10,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:34:10,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 18:34:10,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 9: [2022-11-26 18:34:10,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 18:34:10,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 18:34:10,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 9: [2022-11-26 18:34:10,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:34:10,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 18:34:10,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 11: [2022-11-26 18:34:10,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:34:10,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 18:34:10,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 13: [2022-11-26 18:34:10,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:34:10,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 18:34:10,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 7: [2022-11-26 18:34:10,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 18:34:10,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 18:34:10,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 15: [2022-11-26 18:34:10,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:34:10,999] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 18:34:10,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 6: [2022-11-26 18:34:11,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:34:11,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 18:34:11,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 15: [2022-11-26 18:34:11,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:34:11,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 18:34:11,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 5: [2022-11-26 18:34:11,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 18:34:11,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 18:34:11,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 3: [2022-11-26 18:34:11,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:34:11,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 18:34:11,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 14: [2022-11-26 18:34:11,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:34:11,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 18:34:11,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 8: [2022-11-26 18:34:11,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:34:11,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 18:34:11,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 4: [2022-11-26 18:34:11,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 18:34:11,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 18:34:11,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 12: [2022-11-26 18:34:11,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:34:11,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 18:34:11,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 9: [2022-11-26 18:34:11,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:34:11,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 18:34:11,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 2: [2022-11-26 18:34:11,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:34:11,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 18:34:11,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 11: [2022-11-26 18:34:11,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-26 18:34:11,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 18:34:11,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 10: [2022-11-26 18:34:11,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:34:11,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 18:34:11,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 5: [2022-11-26 18:34:11,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:34:11,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 18:34:11,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 4: [2022-11-26 18:34:11,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:34:11,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 18:34:11,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 12: [2022-11-26 18:34:11,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 18:34:11,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 18:34:11,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 13: [2022-11-26 18:34:11,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:34:11,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 18:34:11,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 14: [2022-11-26 18:34:11,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:34:11,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 18:34:11,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 11: [2022-11-26 18:34:11,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:34:11,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 18:34:11,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 9: [2022-11-26 18:34:11,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 18:34:11,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 18:34:11,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 8: [2022-11-26 18:34:11,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:34:11,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 18:34:11,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 6: [2022-11-26 18:34:11,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:34:11,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step76000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 18:34:11,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step76000 is ready now! 
0: successfully saved checkpoint at iteration 76000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 4021.82 15: iteration 76010/ 125429 | consumed samples: 19458560 | consumed tokens: 39851130880 | elapsed time per iteration (s): 1.49 | learning rate: 8.165E-05 | global batch size: 256 | lm loss: 1.934675E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 172.246 | TFLOPs: 28.46 | 15: iteration 76020/ 125429 | consumed samples: 19461120 | consumed tokens: 39856373760 | elapsed time per iteration (s): 1.06 | learning rate: 8.162E-05 | global batch size: 256 | lm loss: 1.948660E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.518 | TFLOPs: 39.75 | 15: iteration 76030/ 125429 | consumed samples: 19463680 | consumed tokens: 39861616640 | elapsed time per iteration (s): 1.03 | learning rate: 8.160E-05 | global batch size: 256 | lm loss: 1.957331E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.777 | TFLOPs: 41.11 | 15: iteration 76040/ 125429 | consumed samples: 19466240 | consumed tokens: 39866859520 | elapsed time per iteration (s): 1.03 | learning rate: 8.158E-05 | global batch size: 256 | lm loss: 1.943788E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.878 | TFLOPs: 40.96 | 15: iteration 76050/ 125429 | consumed samples: 19468800 | consumed tokens: 39872102400 | elapsed time per iteration (s): 1.06 | learning rate: 8.156E-05 | global batch size: 256 | lm loss: 1.972663E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.646 | TFLOPs: 39.77 | 15: iteration 76060/ 125429 | consumed samples: 19471360 | consumed tokens: 39877345280 | elapsed time per iteration (s): 1.02 | 
learning rate: 8.154E-05 | global batch size: 256 | lm loss: 1.962427E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.299 | TFLOPs: 41.36 | 15: iteration 76070/ 125429 | consumed samples: 19473920 | consumed tokens: 39882588160 | elapsed time per iteration (s): 1.04 | learning rate: 8.152E-05 | global batch size: 256 | lm loss: 1.939663E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.843 | TFLOPs: 40.63 | 15: iteration 76080/ 125429 | consumed samples: 19476480 | consumed tokens: 39887831040 | elapsed time per iteration (s): 1.02 | learning rate: 8.149E-05 | global batch size: 256 | lm loss: 1.967839E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.761 | TFLOPs: 41.27 | 15: iteration 76090/ 125429 | consumed samples: 19479040 | consumed tokens: 39893073920 | elapsed time per iteration (s): 1.04 | learning rate: 8.147E-05 | global batch size: 256 | lm loss: 1.944038E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.264 | TFLOPs: 40.70 | 15: iteration 76100/ 125429 | consumed samples: 19481600 | consumed tokens: 39898316800 | elapsed time per iteration (s): 1.03 | learning rate: 8.145E-05 | global batch size: 256 | lm loss: 1.974880E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.458 | TFLOPs: 40.89 | 15: iteration 76110/ 125429 | consumed samples: 19484160 | consumed tokens: 39903559680 | elapsed time per iteration (s): 1.03 | learning rate: 8.143E-05 | global batch size: 256 | lm loss: 1.959022E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.999 | TFLOPs: 41.15 | 15: iteration 76120/ 
125429 | consumed samples: 19486720 | consumed tokens: 39908802560 | elapsed time per iteration (s): 1.05 | learning rate: 8.141E-05 | global batch size: 256 | lm loss: 1.936266E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.374 | TFLOPs: 40.22 | 15: iteration 76130/ 125429 | consumed samples: 19489280 | consumed tokens: 39914045440 | elapsed time per iteration (s): 1.05 | learning rate: 8.139E-05 | global batch size: 256 | lm loss: 1.956627E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.305 | TFLOPs: 40.21 | 15: iteration 76140/ 125429 | consumed samples: 19491840 | consumed tokens: 39919288320 | elapsed time per iteration (s): 1.05 | learning rate: 8.137E-05 | global batch size: 256 | lm loss: 1.968576E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.718 | TFLOPs: 40.28 | 15: iteration 76150/ 125429 | consumed samples: 19494400 | consumed tokens: 39924531200 | elapsed time per iteration (s): 1.04 | learning rate: 8.134E-05 | global batch size: 256 | lm loss: 1.972284E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.766 | TFLOPs: 40.61 | 15: iteration 76160/ 125429 | consumed samples: 19496960 | consumed tokens: 39929774080 | elapsed time per iteration (s): 1.03 | learning rate: 8.132E-05 | global batch size: 256 | lm loss: 1.972129E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.792 | TFLOPs: 41.11 | 15: iteration 76170/ 125429 | consumed samples: 19499520 | consumed tokens: 39935016960 | elapsed time per iteration (s): 1.04 | learning rate: 8.130E-05 | global batch size: 256 | lm loss: 1.946917E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 245.735 | TFLOPs: 40.61 | 15: iteration 76180/ 125429 | consumed samples: 19502080 | consumed tokens: 39940259840 | elapsed time per iteration (s): 1.04 | learning rate: 8.128E-05 | global batch size: 256 | lm loss: 1.963742E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.526 | TFLOPs: 40.74 | 15: iteration 76190/ 125429 | consumed samples: 19504640 | consumed tokens: 39945502720 | elapsed time per iteration (s): 1.07 | learning rate: 8.126E-05 | global batch size: 256 | lm loss: 1.950871E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.075 | TFLOPs: 39.51 | 15: iteration 76200/ 125429 | consumed samples: 19507200 | consumed tokens: 39950745600 | elapsed time per iteration (s): 1.04 | learning rate: 8.124E-05 | global batch size: 256 | lm loss: 1.961172E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.153 | TFLOPs: 40.68 | 15: iteration 76210/ 125429 | consumed samples: 19509760 | consumed tokens: 39955988480 | elapsed time per iteration (s): 1.04 | learning rate: 8.121E-05 | global batch size: 256 | lm loss: 1.937653E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.260 | TFLOPs: 40.53 | 15: iteration 76220/ 125429 | consumed samples: 19512320 | consumed tokens: 39961231360 | elapsed time per iteration (s): 1.04 | learning rate: 8.119E-05 | global batch size: 256 | lm loss: 1.958495E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.419 | TFLOPs: 40.56 | 15: iteration 76230/ 125429 | consumed samples: 19514880 | consumed tokens: 39966474240 | elapsed time per iteration (s): 1.03 | learning rate: 
8.117E-05 | global batch size: 256 | lm loss: 1.982897E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.447 | TFLOPs: 40.89 | 15: iteration 76240/ 125429 | consumed samples: 19517440 | consumed tokens: 39971717120 | elapsed time per iteration (s): 1.04 | learning rate: 8.115E-05 | global batch size: 256 | lm loss: 1.936843E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.998 | TFLOPs: 40.82 | 15: iteration 76250/ 125429 | consumed samples: 19520000 | consumed tokens: 39976960000 | elapsed time per iteration (s): 1.05 | learning rate: 8.113E-05 | global batch size: 256 | lm loss: 1.927342E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.006 | TFLOPs: 40.16 | 15: iteration 76260/ 125429 | consumed samples: 19522560 | consumed tokens: 39982202880 | elapsed time per iteration (s): 1.07 | learning rate: 8.111E-05 | global batch size: 256 | lm loss: 1.959095E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.855 | TFLOPs: 39.64 | 15: iteration 76270/ 125429 | consumed samples: 19525120 | consumed tokens: 39987445760 | elapsed time per iteration (s): 1.03 | learning rate: 8.108E-05 | global batch size: 256 | lm loss: 1.956588E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.999 | TFLOPs: 41.15 | 15: iteration 76280/ 125429 | consumed samples: 19527680 | consumed tokens: 39992688640 | elapsed time per iteration (s): 1.04 | learning rate: 8.106E-05 | global batch size: 256 | lm loss: 1.930331E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.431 | TFLOPs: 40.72 | 15: iteration 76290/ 125429 | 
consumed samples: 19530240 | consumed tokens: 39997931520 | elapsed time per iteration (s): 1.04 | learning rate: 8.104E-05 | global batch size: 256 | lm loss: 1.943305E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.787 | TFLOPs: 40.62 | 15: iteration 76300/ 125429 | consumed samples: 19532800 | consumed tokens: 40003174400 | elapsed time per iteration (s): 1.09 | learning rate: 8.102E-05 | global batch size: 256 | lm loss: 1.945258E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.187 | TFLOPs: 38.70 | 15: iteration 76310/ 125429 | consumed samples: 19535360 | consumed tokens: 40008417280 | elapsed time per iteration (s): 1.08 | learning rate: 8.100E-05 | global batch size: 256 | lm loss: 1.930164E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.523 | TFLOPs: 39.25 | 15: iteration 76320/ 125429 | consumed samples: 19537920 | consumed tokens: 40013660160 | elapsed time per iteration (s): 1.05 | learning rate: 8.098E-05 | global batch size: 256 | lm loss: 1.954997E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.510 | TFLOPs: 40.24 | 15: iteration 76330/ 125429 | consumed samples: 19540480 | consumed tokens: 40018903040 | elapsed time per iteration (s): 1.04 | learning rate: 8.096E-05 | global batch size: 256 | lm loss: 1.990343E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.682 | TFLOPs: 40.77 | 15: iteration 76340/ 125429 | consumed samples: 19543040 | consumed tokens: 40024145920 | elapsed time per iteration (s): 1.07 | learning rate: 8.093E-05 | global batch size: 256 | lm loss: 1.991664E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 238.973 | TFLOPs: 39.49 | 15: iteration 76350/ 125429 | consumed samples: 19545600 | consumed tokens: 40029388800 | elapsed time per iteration (s): 1.04 | learning rate: 8.091E-05 | global batch size: 256 | lm loss: 1.978638E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.270 | TFLOPs: 40.86 | 15: iteration 76360/ 125429 | consumed samples: 19548160 | consumed tokens: 40034631680 | elapsed time per iteration (s): 1.07 | learning rate: 8.089E-05 | global batch size: 256 | lm loss: 1.941034E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.366 | TFLOPs: 39.72 | 15: iteration 76370/ 125429 | consumed samples: 19550720 | consumed tokens: 40039874560 | elapsed time per iteration (s): 1.09 | learning rate: 8.087E-05 | global batch size: 256 | lm loss: 1.957525E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.553 | TFLOPs: 38.93 | 15: iteration 76380/ 125429 | consumed samples: 19553280 | consumed tokens: 40045117440 | elapsed time per iteration (s): 1.03 | learning rate: 8.085E-05 | global batch size: 256 | lm loss: 1.936975E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.095 | TFLOPs: 41.16 | 15: iteration 76390/ 125429 | consumed samples: 19555840 | consumed tokens: 40050360320 | elapsed time per iteration (s): 1.03 | learning rate: 8.083E-05 | global batch size: 256 | lm loss: 1.938798E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.034 | TFLOPs: 40.99 | 15: iteration 76400/ 125429 | consumed samples: 19558400 | consumed tokens: 40055603200 | elapsed time per iteration (s): 1.18 | learning rate: 8.080E-05 | global batch 
size: 256 | lm loss: 1.952145E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.302 | TFLOPs: 35.75 | 15: iteration 76410/ 125429 | consumed samples: 19560960 | consumed tokens: 40060846080 | elapsed time per iteration (s): 1.05 | learning rate: 8.078E-05 | global batch size: 256 | lm loss: 1.975064E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.537 | TFLOPs: 40.25 | 15: iteration 76420/ 125429 | consumed samples: 19563520 | consumed tokens: 40066088960 | elapsed time per iteration (s): 1.05 | learning rate: 8.076E-05 | global batch size: 256 | lm loss: 1.959445E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.083 | TFLOPs: 40.34 | 15: iteration 76430/ 125429 | consumed samples: 19566080 | consumed tokens: 40071331840 | elapsed time per iteration (s): 1.06 | learning rate: 8.074E-05 | global batch size: 256 | lm loss: 1.961466E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.539 | TFLOPs: 40.08 | 15: iteration 76440/ 125429 | consumed samples: 19568640 | consumed tokens: 40076574720 | elapsed time per iteration (s): 1.03 | learning rate: 8.072E-05 | global batch size: 256 | lm loss: 1.936980E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.032 | TFLOPs: 40.99 | 15: iteration 76450/ 125429 | consumed samples: 19571200 | consumed tokens: 40081817600 | elapsed time per iteration (s): 1.06 | learning rate: 8.070E-05 | global batch size: 256 | lm loss: 1.951567E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.704 | TFLOPs: 39.78 | 15: iteration 76460/ 125429 | consumed samples: 19573760 | 
consumed tokens: 40087060480 | elapsed time per iteration (s): 1.05 | learning rate: 8.068E-05 | global batch size: 256 | lm loss: 1.948175E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.209 | TFLOPs: 40.36 | 15: iteration 76470/ 125429 | consumed samples: 19576320 | consumed tokens: 40092303360 | elapsed time per iteration (s): 1.03 | learning rate: 8.065E-05 | global batch size: 256 | lm loss: 1.966078E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.634 | TFLOPs: 40.92 | 15: iteration 76480/ 125429 | consumed samples: 19578880 | consumed tokens: 40097546240 | elapsed time per iteration (s): 1.08 | learning rate: 8.063E-05 | global batch size: 256 | lm loss: 1.946054E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.156 | TFLOPs: 39.19 | 15: iteration 76490/ 125429 | consumed samples: 19581440 | consumed tokens: 40102789120 | elapsed time per iteration (s): 1.04 | learning rate: 8.061E-05 | global batch size: 256 | lm loss: 1.937323E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.486 | TFLOPs: 40.73 | 15: iteration 76500/ 125429 | consumed samples: 19584000 | consumed tokens: 40108032000 | elapsed time per iteration (s): 1.02 | learning rate: 8.059E-05 | global batch size: 256 | lm loss: 1.919720E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.709 | TFLOPs: 41.43 | 15: iteration 76510/ 125429 | consumed samples: 19586560 | consumed tokens: 40113274880 | elapsed time per iteration (s): 1.06 | learning rate: 8.057E-05 | global batch size: 256 | lm loss: 1.945620E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 241.674 | TFLOPs: 39.94 | 15: iteration 76520/ 125429 | consumed samples: 19589120 | consumed tokens: 40118517760 | elapsed time per iteration (s): 1.02 | learning rate: 8.055E-05 | global batch size: 256 | lm loss: 1.960398E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.622 | TFLOPs: 41.42 | 15: iteration 76530/ 125429 | consumed samples: 19591680 | consumed tokens: 40123760640 | elapsed time per iteration (s): 1.08 | learning rate: 8.052E-05 | global batch size: 256 | lm loss: 1.964338E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.006 | TFLOPs: 39.17 | 15: iteration 76540/ 125429 | consumed samples: 19594240 | consumed tokens: 40129003520 | elapsed time per iteration (s): 1.05 | learning rate: 8.050E-05 | global batch size: 256 | lm loss: 1.930421E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.645 | TFLOPs: 40.26 | 15: iteration 76550/ 125429 | consumed samples: 19596800 | consumed tokens: 40134246400 | elapsed time per iteration (s): 1.02 | learning rate: 8.048E-05 | global batch size: 256 | lm loss: 1.969320E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.963 | TFLOPs: 41.31 | 15: iteration 76560/ 125429 | consumed samples: 19599360 | consumed tokens: 40139489280 | elapsed time per iteration (s): 1.03 | learning rate: 8.046E-05 | global batch size: 256 | lm loss: 1.964781E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.062 | TFLOPs: 40.99 | 15: iteration 76570/ 125429 | consumed samples: 19601920 | consumed tokens: 40144732160 | elapsed time per iteration (s): 1.02 | learning rate: 8.044E-05 | global batch size: 256 | lm loss: 
1.977427E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.340 | TFLOPs: 41.37 | 15: iteration 76580/ 125429 | consumed samples: 19604480 | consumed tokens: 40149975040 | elapsed time per iteration (s): 1.03 | learning rate: 8.042E-05 | global batch size: 256 | lm loss: 1.949164E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.543 | TFLOPs: 40.91 | 15: iteration 76590/ 125429 | consumed samples: 19607040 | consumed tokens: 40155217920 | elapsed time per iteration (s): 1.08 | learning rate: 8.040E-05 | global batch size: 256 | lm loss: 1.949058E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.787 | TFLOPs: 39.13 | 15: iteration 76600/ 125429 | consumed samples: 19609600 | consumed tokens: 40160460800 | elapsed time per iteration (s): 1.04 | learning rate: 8.037E-05 | global batch size: 256 | lm loss: 1.925004E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.981 | TFLOPs: 40.48 | 15: iteration 76610/ 125429 | consumed samples: 19612160 | consumed tokens: 40165703680 | elapsed time per iteration (s): 1.10 | learning rate: 8.035E-05 | global batch size: 256 | lm loss: 1.946393E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.740 | TFLOPs: 38.46 | 15: iteration 76620/ 125429 | consumed samples: 19614720 | consumed tokens: 40170946560 | elapsed time per iteration (s): 1.04 | learning rate: 8.033E-05 | global batch size: 256 | lm loss: 1.949791E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.341 | TFLOPs: 40.88 | 15: iteration 76630/ 125429 | consumed samples: 19617280 | consumed tokens: 
40176189440 | elapsed time per iteration (s): 1.03 | learning rate: 8.031E-05 | global batch size: 256 | lm loss: 1.967363E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.281 | TFLOPs: 41.20 | 15: iteration 76640/ 125429 | consumed samples: 19619840 | consumed tokens: 40181432320 | elapsed time per iteration (s): 1.03 | learning rate: 8.029E-05 | global batch size: 256 | lm loss: 1.975958E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.762 | TFLOPs: 41.11 | 15: iteration 76650/ 125429 | consumed samples: 19622400 | consumed tokens: 40186675200 | elapsed time per iteration (s): 1.05 | learning rate: 8.027E-05 | global batch size: 256 | lm loss: 1.948307E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.296 | TFLOPs: 40.21 | 15: iteration 76660/ 125429 | consumed samples: 19624960 | consumed tokens: 40191918080 | elapsed time per iteration (s): 1.03 | learning rate: 8.025E-05 | global batch size: 256 | lm loss: 1.946867E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.667 | TFLOPs: 41.09 | 15: iteration 76670/ 125429 | consumed samples: 19627520 | consumed tokens: 40197160960 | elapsed time per iteration (s): 1.06 | learning rate: 8.022E-05 | global batch size: 256 | lm loss: 1.939196E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.652 | TFLOPs: 39.77 | 15: iteration 76680/ 125429 | consumed samples: 19630080 | consumed tokens: 40202403840 | elapsed time per iteration (s): 1.03 | learning rate: 8.020E-05 | global batch size: 256 | lm loss: 1.941412E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 248.890 | TFLOPs: 41.13 | 15: iteration 76690/ 125429 | consumed samples: 19632640 | consumed tokens: 40207646720 | elapsed time per iteration (s): 1.04 | learning rate: 8.018E-05 | global batch size: 256 | lm loss: 1.953311E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.214 | TFLOPs: 40.85 | 15: iteration 76700/ 125429 | consumed samples: 19635200 | consumed tokens: 40212889600 | elapsed time per iteration (s): 1.05 | learning rate: 8.016E-05 | global batch size: 256 | lm loss: 1.937337E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.909 | TFLOPs: 40.14 | 15: iteration 76710/ 125429 | consumed samples: 19637760 | consumed tokens: 40218132480 | elapsed time per iteration (s): 1.03 | learning rate: 8.014E-05 | global batch size: 256 | lm loss: 1.925107E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.103 | TFLOPs: 41.17 | 15: iteration 76720/ 125429 | consumed samples: 19640320 | consumed tokens: 40223375360 | elapsed time per iteration (s): 1.11 | learning rate: 8.012E-05 | global batch size: 256 | lm loss: 1.936989E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.493 | TFLOPs: 38.26 | 15: iteration 76730/ 125429 | consumed samples: 19642880 | consumed tokens: 40228618240 | elapsed time per iteration (s): 1.05 | learning rate: 8.009E-05 | global batch size: 256 | lm loss: 1.960287E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.614 | TFLOPs: 40.26 | 15: iteration 76740/ 125429 | consumed samples: 19645440 | consumed tokens: 40233861120 | elapsed time per iteration (s): 1.04 | learning rate: 8.007E-05 | global batch size: 256 | lm loss: 1.967364E+00 | grad 
norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.988 | TFLOPs: 40.82 | 15: iteration 76750/ 125429 | consumed samples: 19648000 | consumed tokens: 40239104000 | elapsed time per iteration (s): 1.04 | learning rate: 8.005E-05 | global batch size: 256 | lm loss: 1.964981E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.736 | TFLOPs: 40.78 | 15: iteration 76760/ 125429 | consumed samples: 19650560 | consumed tokens: 40244346880 | elapsed time per iteration (s): 1.07 | learning rate: 8.003E-05 | global batch size: 256 | lm loss: 1.954986E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.207 | TFLOPs: 39.53 | 15: iteration 76770/ 125429 | consumed samples: 19653120 | consumed tokens: 40249589760 | elapsed time per iteration (s): 1.04 | learning rate: 8.001E-05 | global batch size: 256 | lm loss: 1.953973E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.174 | TFLOPs: 40.52 | 15: iteration 76780/ 125429 | consumed samples: 19655680 | consumed tokens: 40254832640 | elapsed time per iteration (s): 1.03 | learning rate: 7.999E-05 | global batch size: 256 | lm loss: 1.932161E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.230 | TFLOPs: 41.02 | 15: iteration 76790/ 125429 | consumed samples: 19658240 | consumed tokens: 40260075520 | elapsed time per iteration (s): 1.04 | learning rate: 7.997E-05 | global batch size: 256 | lm loss: 1.942085E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.811 | TFLOPs: 40.79 | 15: iteration 76800/ 125429 | consumed samples: 19660800 | consumed tokens: 40265318400 | elapsed time 
per iteration (s): 1.03 | learning rate: 7.994E-05 | global batch size: 256 | lm loss: 1.939370E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.182 | TFLOPs: 41.01 | 15: iteration 76810/ 125429 | consumed samples: 19663360 | consumed tokens: 40270561280 | elapsed time per iteration (s): 1.05 | learning rate: 7.992E-05 | global batch size: 256 | lm loss: 1.942588E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.770 | TFLOPs: 40.28 | 15: iteration 76820/ 125429 | consumed samples: 19665920 | consumed tokens: 40275804160 | elapsed time per iteration (s): 1.04 | learning rate: 7.990E-05 | global batch size: 256 | lm loss: 1.946905E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.718 | TFLOPs: 40.77 | 15: iteration 76830/ 125429 | consumed samples: 19668480 | consumed tokens: 40281047040 | elapsed time per iteration (s): 1.06 | learning rate: 7.988E-05 | global batch size: 256 | lm loss: 1.980901E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.431 | TFLOPs: 39.90 | 15: iteration 76840/ 125429 | consumed samples: 19671040 | consumed tokens: 40286289920 | elapsed time per iteration (s): 1.06 | learning rate: 7.986E-05 | global batch size: 256 | lm loss: 1.942810E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.684 | TFLOPs: 39.77 | 15: iteration 76850/ 125429 | consumed samples: 19673600 | consumed tokens: 40291532800 | elapsed time per iteration (s): 1.04 | learning rate: 7.984E-05 | global batch size: 256 | lm loss: 1.964820E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.659 | TFLOPs: 
40.60 | 15: iteration 76860/ 125429 | consumed samples: 19676160 | consumed tokens: 40296775680 | elapsed time per iteration (s): 1.02 | learning rate: 7.982E-05 | global batch size: 256 | lm loss: 1.946916E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.429 | TFLOPs: 41.55 | 15: iteration 76870/ 125429 | consumed samples: 19678720 | consumed tokens: 40302018560 | elapsed time per iteration (s): 1.07 | learning rate: 7.979E-05 | global batch size: 256 | lm loss: 1.974339E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.022 | TFLOPs: 39.67 | 15: iteration 76880/ 125429 | consumed samples: 19681280 | consumed tokens: 40307261440 | elapsed time per iteration (s): 1.04 | learning rate: 7.977E-05 | global batch size: 256 | lm loss: 1.955183E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.880 | TFLOPs: 40.80 | 15: iteration 76890/ 125429 | consumed samples: 19683840 | consumed tokens: 40312504320 | elapsed time per iteration (s): 1.04 | learning rate: 7.975E-05 | global batch size: 256 | lm loss: 1.959944E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.577 | TFLOPs: 40.58 | 15: iteration 76900/ 125429 | consumed samples: 19686400 | consumed tokens: 40317747200 | elapsed time per iteration (s): 1.05 | learning rate: 7.973E-05 | global batch size: 256 | lm loss: 1.921615E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.770 | TFLOPs: 40.28 | 15: iteration 76910/ 125429 | consumed samples: 19688960 | consumed tokens: 40322990080 | elapsed time per iteration (s): 1.09 | learning rate: 7.971E-05 | global batch size: 256 | lm loss: 1.967405E+00 | grad norm: 0.150 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.214 | TFLOPs: 38.87 | 15: iteration 76920/ 125429 | consumed samples: 19691520 | consumed tokens: 40328232960 | elapsed time per iteration (s): 1.05 | learning rate: 7.969E-05 | global batch size: 256 | lm loss: 1.972066E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.945 | TFLOPs: 40.31 | 15: iteration 76930/ 125429 | consumed samples: 19694080 | consumed tokens: 40333475840 | elapsed time per iteration (s): 1.02 | learning rate: 7.967E-05 | global batch size: 256 | lm loss: 1.987127E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.621 | TFLOPs: 41.42 | 15: iteration 76940/ 125429 | consumed samples: 19696640 | consumed tokens: 40338718720 | elapsed time per iteration (s): 1.05 | learning rate: 7.964E-05 | global batch size: 256 | lm loss: 1.945274E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.511 | TFLOPs: 40.41 | 15: iteration 76950/ 125429 | consumed samples: 19699200 | consumed tokens: 40343961600 | elapsed time per iteration (s): 1.04 | learning rate: 7.962E-05 | global batch size: 256 | lm loss: 1.955516E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.382 | TFLOPs: 40.72 | 15: iteration 76960/ 125429 | consumed samples: 19701760 | consumed tokens: 40349204480 | elapsed time per iteration (s): 1.06 | learning rate: 7.960E-05 | global batch size: 256 | lm loss: 1.949865E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.847 | TFLOPs: 39.80 | 15: iteration 76970/ 125429 | consumed samples: 19704320 | consumed tokens: 40354447360 | elapsed time per iteration (s): 1.04 | 
learning rate: 7.958E-05 | global batch size: 256 | lm loss: 1.947248E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.157 | TFLOPs: 40.84 | 15: iteration 76980/ 125429 | consumed samples: 19706880 | consumed tokens: 40359690240 | elapsed time per iteration (s): 1.03 | learning rate: 7.956E-05 | global batch size: 256 | lm loss: 1.957011E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.442 | TFLOPs: 41.06 | 15: iteration 76990/ 125429 | consumed samples: 19709440 | consumed tokens: 40364933120 | elapsed time per iteration (s): 1.03 | learning rate: 7.954E-05 | global batch size: 256 | lm loss: 1.939019E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.785 | TFLOPs: 41.11 | 15: iteration 77000/ 125429 | consumed samples: 19712000 | consumed tokens: 40370176000 | elapsed time per iteration (s): 1.03 | learning rate: 7.952E-05 | global batch size: 256 | lm loss: 1.975219E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.472 | TFLOPs: 41.23 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 77000 | lm loss value: 1.891811E+00 | lm loss PPL: 6.631368E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 77000 to checkpoints_1b5 0: [2022-11-26 18:51:38,969] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step77000 is begin to save! 0: [2022-11-26 18:51:38,977] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 18:51:39,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_01-model_00-model_states.pt. 0: [2022-11-26 18:51:39,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_03-model_00-model_states.pt... 0: [2022-11-26 18:51:39,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_03-model_00-model_states.pt. 0: [2022-11-26 18:51:39,328] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_04-model_00-model_states.pt... 0: [2022-11-26 18:51:39,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_04-model_00-model_states.pt. 0: [2022-11-26 18:51:39,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_05-model_00-model_states.pt... 0: [2022-11-26 18:51:39,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_05-model_00-model_states.pt. 0: [2022-11-26 18:51:39,538] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_06-model_00-model_states.pt... 0: [2022-11-26 18:51:39,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_06-model_00-model_states.pt. 0: [2022-11-26 18:51:39,650] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_07-model_00-model_states.pt... 0: [2022-11-26 18:51:39,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_07-model_00-model_states.pt. 0: [2022-11-26 18:51:39,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 18:51:39,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_08-model_00-model_states.pt. 0: [2022-11-26 18:51:39,873] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_09-model_00-model_states.pt... 0: [2022-11-26 18:51:39,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_09-model_00-model_states.pt. 0: [2022-11-26 18:51:39,983] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_10-model_00-model_states.pt... 0: [2022-11-26 18:51:40,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_10-model_00-model_states.pt. 0: [2022-11-26 18:51:40,093] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_11-model_00-model_states.pt... 0: [2022-11-26 18:51:40,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_11-model_00-model_states.pt. 0: [2022-11-26 18:51:40,206] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_12-model_00-model_states.pt... 0: [2022-11-26 18:51:40,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_12-model_00-model_states.pt. 0: [2022-11-26 18:51:40,315] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_13-model_00-model_states.pt... 0: [2022-11-26 18:51:40,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_13-model_00-model_states.pt. 0: [2022-11-26 18:51:40,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 18:51:40,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_14-model_00-model_states.pt. 0: [2022-11-26 18:51:40,538] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_15-model_00-model_states.pt... 0: [2022-11-26 18:51:40,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_15-model_00-model_states.pt. 0: [2022-11-26 18:51:40,643] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_16-model_00-model_states.pt... 0: [2022-11-26 18:51:40,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_16-model_00-model_states.pt. 0: [2022-11-26 18:51:40,749] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_17-model_00-model_states.pt... 0: [2022-11-26 18:51:40,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_17-model_00-model_states.pt. 0: [2022-11-26 18:51:40,859] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_18-model_00-model_states.pt... 0: [2022-11-26 18:51:40,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_18-model_00-model_states.pt. 0: [2022-11-26 18:51:40,972] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_19-model_00-model_states.pt... 0: [2022-11-26 18:51:41,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_19-model_00-model_states.pt. 0: [2022-11-26 18:51:41,080] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 18:51:41,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_20-model_00-model_states.pt. 0: [2022-11-26 18:51:41,221] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_21-model_00-model_states.pt... 0: [2022-11-26 18:51:41,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_21-model_00-model_states.pt. 0: [2022-11-26 18:51:41,328] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_22-model_00-model_states.pt... 0: [2022-11-26 18:51:41,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_22-model_00-model_states.pt. 0: [2022-11-26 18:51:41,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_23-model_00-model_states.pt... 0: [2022-11-26 18:51:41,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_23-model_00-model_states.pt. 0: [2022-11-26 18:51:41,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_24-model_00-model_states.pt... 0: [2022-11-26 18:51:41,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_24-model_00-model_states.pt. 0: [2022-11-26 18:51:41,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_25-model_00-model_states.pt... 0: [2022-11-26 18:51:41,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_25-model_00-model_states.pt. 0: [2022-11-26 18:51:41,765] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 18:51:41,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_26-model_00-model_states.pt. 0: [2022-11-26 18:51:41,866] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_27-model_00-model_states.pt... 0: [2022-11-26 18:51:41,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_27-model_00-model_states.pt. 0: [2022-11-26 18:51:41,970] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_28-model_00-model_states.pt... 0: [2022-11-26 18:51:42,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_28-model_00-model_states.pt. 0: [2022-11-26 18:51:42,080] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_29-model_00-model_states.pt... 0: [2022-11-26 18:51:42,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_29-model_00-model_states.pt. 0: [2022-11-26 18:51:42,187] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_30-model_00-model_states.pt... 0: [2022-11-26 18:51:42,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_30-model_00-model_states.pt. 0: [2022-11-26 18:51:42,293] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/layer_32-model_00-model_states.pt... 0: [2022-11-26 18:51:42,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 18:51:42,295] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step77000/mp_rank_00_model_states.pt 0: [2022-11-26 18:51:42,295] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/mp_rank_00_model_states.pt... 0: [2022-11-26 18:51:42,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/mp_rank_00_model_states.pt. 0: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
1: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
5: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
9: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
13: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
10: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
3: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 1: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 5: [2022-11-26 18:51:42,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step77000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 12: [2022-11-26 18:51:42,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 18:51:42,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 18:51:42,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 8: [2022-11-26 18:51:42,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:51:42,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 18:51:42,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 10: [2022-11-26 18:51:42,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:51:42,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 18:51:42,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 11: [2022-11-26 18:51:42,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:51:42,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:51:42,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 18:51:42,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
1: [2022-11-26 18:51:42,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:51:42,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:51:42,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 18:51:42,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 9: [2022-11-26 18:51:42,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:51:42,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 18:51:42,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 10: [2022-11-26 18:51:42,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:51:42,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 18:51:42,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 6: [2022-11-26 18:51:42,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 18:51:42,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 18:51:42,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 5: [2022-11-26 18:51:42,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:51:42,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 18:51:42,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 10: [2022-11-26 18:51:42,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:51:42,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 18:51:42,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 13: [2022-11-26 18:51:42,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:51:42,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 18:51:42,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 10: [2022-11-26 18:51:42,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 18:51:42,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 18:51:42,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 9: [2022-11-26 18:51:42,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:51:42,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 18:51:42,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 2: [2022-11-26 18:51:42,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:51:42,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 18:51:42,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 8: [2022-11-26 18:51:42,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:51:42,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 18:51:42,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 7: [2022-11-26 18:51:42,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 18:51:42,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 18:51:42,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 2: [2022-11-26 18:51:42,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:51:42,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 18:51:42,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 14: [2022-11-26 18:51:42,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:51:42,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 18:51:42,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 9: [2022-11-26 18:51:42,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:51:42,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 18:51:42,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 13: [2022-11-26 18:51:42,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 18:51:42,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 18:51:42,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 2: [2022-11-26 18:51:42,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:51:42,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 18:51:42,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 8: [2022-11-26 18:51:42,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:51:42,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 18:51:42,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 13: [2022-11-26 18:51:42,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:51:42,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:51:42,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 18:51:42,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
6: [2022-11-26 18:51:42,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 18:51:42,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 13: [2022-11-26 18:51:42,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:51:42,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 18:51:42,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 5: [2022-11-26 18:51:42,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:51:42,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 18:51:42,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 8: [2022-11-26 18:51:42,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:51:42,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 18:51:42,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 7: [2022-11-26 18:51:42,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 18:51:42,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 18:51:42,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 8: [2022-11-26 18:51:42,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:51:42,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 18:51:42,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 7: [2022-11-26 18:51:42,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 8: [2022-11-26 18:51:42,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:51:42,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 18:51:42,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 8: [2022-11-26 18:51:42,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 18:51:42,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
1: [2022-11-26 18:51:42,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 18:51:42,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 1: [2022-11-26 18:51:42,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:51:42,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:51:42,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 18:51:42,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 1: [2022-11-26 18:51:42,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:51:42,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 9: [2022-11-26 18:51:42,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:51:42,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 9: [2022-11-26 18:51:42,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 1: [2022-11-26 18:51:42,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
9: [2022-11-26 18:51:42,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 18:51:42,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 1: [2022-11-26 18:51:42,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 18:51:42,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 9: [2022-11-26 18:51:42,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 11: [2022-11-26 18:51:42,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 18:51:42,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 11: [2022-11-26 18:51:42,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:51:42,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 18:51:42,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 11: [2022-11-26 18:51:42,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-26 18:51:42,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 18:51:42,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 15: [2022-11-26 18:51:42,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:51:42,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:51:42,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 18:51:42,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 12: [2022-11-26 18:51:42,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:51:42,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:51:42,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 18:51:42,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 18:51:42,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 18:51:42,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 18:51:42,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 12: [2022-11-26 18:51:42,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 12: [2022-11-26 18:51:42,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 3: [2022-11-26 18:51:42,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:51:42,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:51:42,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 18:51:42,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 18:51:42,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 3: [2022-11-26 18:51:42,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
7: [2022-11-26 18:51:42,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:51:42,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 18:51:42,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 2: [2022-11-26 18:51:42,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:51:42,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 18:51:42,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 13: [2022-11-26 18:51:42,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:51:42,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:51:42,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 18:51:42,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 13: [2022-11-26 18:51:42,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 8: [2022-11-26 18:51:42,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
13: [2022-11-26 18:51:42,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 8: [2022-11-26 18:51:42,518] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 18:51:42,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 0: [2022-11-26 18:51:42,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:51:42,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 18:51:42,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 5: [2022-11-26 18:51:42,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:51:42,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 18:51:42,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 5: [2022-11-26 18:51:42,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:51:42,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 18:51:42,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
10: [2022-11-26 18:51:42,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:51:42,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 18:51:42,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 5: [2022-11-26 18:51:42,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:51:42,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 18:51:42,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 11: [2022-11-26 18:51:42,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:51:42,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:51:42,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:51:42,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 18:51:42,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 18:51:42,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 18:51:42,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 14: [2022-11-26 18:51:42,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 14: [2022-11-26 18:51:42,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:51:42,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 18:51:42,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:51:42,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 15: [2022-11-26 18:51:42,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 14: [2022-11-26 18:51:42,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 15: [2022-11-26 18:51:42,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 14: [2022-11-26 18:51:42,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
15: [2022-11-26 18:51:42,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:51:42,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:51:42,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:51:42,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 15: [2022-11-26 18:51:42,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 18:51:42,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 5: [2022-11-26 18:51:42,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 15: [2022-11-26 18:51:42,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 15: [2022-11-26 18:51:42,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 11: [2022-11-26 18:51:42,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 18:51:42,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 5: [2022-11-26 18:51:42,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 18:51:42,525] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 18:51:42,525] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 11: [2022-11-26 18:51:42,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:51:42,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 18:51:42,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 9: [2022-11-26 18:51:42,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:51:42,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:51:42,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:51:42,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:51:42,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
9: [2022-11-26 18:51:42,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 6: [2022-11-26 18:51:42,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 9: [2022-11-26 18:51:42,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 6: [2022-11-26 18:51:42,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 18:51:42,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 6: [2022-11-26 18:51:42,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 6: [2022-11-26 18:51:42,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 18:51:42,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 18:51:42,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 3: [2022-11-26 18:51:42,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 6: [2022-11-26 18:51:42,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:51:42,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
6: [2022-11-26 18:51:42,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 18:51:42,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 6: [2022-11-26 18:51:42,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:51:42,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 6: [2022-11-26 18:51:42,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 2: [2022-11-26 18:51:42,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 6: [2022-11-26 18:51:42,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 2: [2022-11-26 18:51:42,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:51:42,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 18:51:42,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 2: [2022-11-26 18:51:42,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 18:51:42,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 18:51:42,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 14: [2022-11-26 18:51:42,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:51:42,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 18:51:42,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 9: [2022-11-26 18:51:42,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:51:42,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 18:51:42,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 11: [2022-11-26 18:51:42,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:51:42,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:51:42,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 18:51:42,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
14: [2022-11-26 18:51:42,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 18:51:42,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 18:51:42,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 2: [2022-11-26 18:51:42,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 18:51:42,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 18:51:42,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 3: [2022-11-26 18:51:42,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:51:42,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 18:51:42,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 15: [2022-11-26 18:51:42,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:51:42,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 18:51:42,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 18:51:42,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 3: [2022-11-26 18:51:42,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:51:42,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:51:42,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 8: [2022-11-26 18:51:42,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:51:42,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 3: [2022-11-26 18:51:42,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 8: [2022-11-26 18:51:42,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 13: [2022-11-26 18:51:42,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 8: [2022-11-26 18:51:42,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 13: [2022-11-26 18:51:42,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 18:51:42,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 18:51:42,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 13: [2022-11-26 18:51:42,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 18:51:42,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 18:51:42,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 0: [2022-11-26 18:51:42,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:51:42,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:51:42,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 18:51:42,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 18:51:42,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 0: [2022-11-26 18:51:42,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 0: [2022-11-26 18:51:42,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 18:51:42,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:51:42,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 18:51:42,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 18:51:42,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 18:51:42,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 18:51:42,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 0: [2022-11-26 18:51:42,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 0: [2022-11-26 18:51:42,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 5: [2022-11-26 18:51:42,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 18:51:42,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 18:51:42,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 14: [2022-11-26 18:51:42,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 18:51:42,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 18:51:42,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 3: [2022-11-26 18:51:42,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 18:51:42,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 18:51:42,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 7: [2022-11-26 18:51:42,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:51:42,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 0: [2022-11-26 18:51:42,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:51:42,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 7: [2022-11-26 18:51:42,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:51:42,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 18:51:42,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
7: [2022-11-26 18:51:42,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 18:51:42,543] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 18:51:42,543] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 15: [2022-11-26 18:51:42,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 11: [2022-11-26 18:51:42,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 15: [2022-11-26 18:51:42,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 11: [2022-11-26 18:51:42,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 15: [2022-11-26 18:51:42,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:51:42,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:51:42,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 11: [2022-11-26 18:51:42,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 15: [2022-11-26 18:51:42,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
11: [2022-11-26 18:51:42,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 15: [2022-11-26 18:51:42,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 11: [2022-11-26 18:51:42,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:51:42,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 11: [2022-11-26 18:51:42,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 15: [2022-11-26 18:51:42,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 11: [2022-11-26 18:51:42,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 15: [2022-11-26 18:51:42,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 18:51:42,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 18:51:42,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 15: [2022-11-26 18:51:42,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-26 18:51:42,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 18:51:42,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 9: [2022-11-26 18:51:42,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 18:51:42,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 18:51:42,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 10: [2022-11-26 18:51:42,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:51:42,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:51:42,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 12: [2022-11-26 18:51:42,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 10: [2022-11-26 18:51:42,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 12: [2022-11-26 18:51:42,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 10: [2022-11-26 18:51:42,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 18:51:42,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 18:51:42,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 12: [2022-11-26 18:51:42,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 10: [2022-11-26 18:51:42,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:51:42,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 10: [2022-11-26 18:51:42,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 12: [2022-11-26 18:51:42,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 10: [2022-11-26 18:51:42,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 12: [2022-11-26 18:51:42,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 18:51:42,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 18:51:42,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
1: [2022-11-26 18:51:42,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 18:51:42,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 1: [2022-11-26 18:51:42,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:51:42,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:51:42,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 18:51:42,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 18:51:42,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 18:51:42,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 18:51:42,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 1: [2022-11-26 18:51:42,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 1: [2022-11-26 18:51:42,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
0: [2022-11-26 18:51:42,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 18:51:42,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 4: [2022-11-26 18:51:42,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:51:42,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:51:42,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:51:42,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:51:42,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:51:42,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:51:42,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 18:51:42,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 18:51:42,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 18:51:42,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 18:51:42,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 18:51:42,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 18:51:42,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 18:51:42,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 18:51:42,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 18:51:42,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step77000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 18:51:42,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 4: [2022-11-26 18:51:42,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 4: [2022-11-26 18:51:42,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 4: [2022-11-26 18:51:42,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 
4: [2022-11-26 18:51:42,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 4: [2022-11-26 18:51:42,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 4: [2022-11-26 18:51:42,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 4: [2022-11-26 18:51:42,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step77000 is ready now! 0: successfully saved checkpoint at iteration 77000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3794.49 15: iteration 77010/ 125429 | consumed samples: 19714560 | consumed tokens: 40375418880 | elapsed time per iteration (s): 1.45 | learning rate: 7.949E-05 | global batch size: 256 | lm loss: 1.918096E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 176.669 | TFLOPs: 29.20 | 15: iteration 77020/ 125429 | consumed samples: 19717120 | consumed tokens: 40380661760 | elapsed time per iteration (s): 1.07 | learning rate: 7.947E-05 | global batch size: 256 | lm loss: 1.967703E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.548 | TFLOPs: 39.59 | 15: iteration 77030/ 125429 | consumed samples: 19719680 | consumed tokens: 40385904640 | elapsed time per iteration (s): 1.07 | learning rate: 7.945E-05 | global batch size: 256 | lm loss: 1.940878E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.215 | TFLOPs: 39.70 | 15: iteration 77040/ 125429 | consumed samples: 19722240 | consumed tokens: 40391147520 | elapsed time per iteration (s): 1.04 | learning rate: 7.943E-05 | global batch size: 256 | lm loss: 1.961363E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.559 | 
TFLOPs: 40.75 | 15: iteration 77050/ 125429 | consumed samples: 19724800 | consumed tokens: 40396390400 | elapsed time per iteration (s): 1.02 | learning rate: 7.941E-05 | global batch size: 256 | lm loss: 1.949593E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.885 | TFLOPs: 41.46 | 15: iteration 77060/ 125429 | consumed samples: 19727360 | consumed tokens: 40401633280 | elapsed time per iteration (s): 1.15 | learning rate: 7.939E-05 | global batch size: 256 | lm loss: 1.953561E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.045 | TFLOPs: 36.69 | 15: iteration 77070/ 125429 | consumed samples: 19729920 | consumed tokens: 40406876160 | elapsed time per iteration (s): 1.05 | learning rate: 7.937E-05 | global batch size: 256 | lm loss: 1.936296E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.997 | TFLOPs: 40.16 | 15: iteration 77080/ 125429 | consumed samples: 19732480 | consumed tokens: 40412119040 | elapsed time per iteration (s): 1.04 | learning rate: 7.934E-05 | global batch size: 256 | lm loss: 1.963873E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.597 | TFLOPs: 40.75 | 15: iteration 77090/ 125429 | consumed samples: 19735040 | consumed tokens: 40417361920 | elapsed time per iteration (s): 1.15 | learning rate: 7.932E-05 | global batch size: 256 | lm loss: 1.947049E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.143 | TFLOPs: 36.71 | 15: iteration 77100/ 125429 | consumed samples: 19737600 | consumed tokens: 40422604800 | elapsed time per iteration (s): 1.06 | learning rate: 7.930E-05 | global batch size: 256 | lm loss: 1.956490E+00 | grad norm: 0.145 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.424 | TFLOPs: 39.73 | 15: iteration 77110/ 125429 | consumed samples: 19740160 | consumed tokens: 40427847680 | elapsed time per iteration (s): 1.06 | learning rate: 7.928E-05 | global batch size: 256 | lm loss: 1.931555E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.762 | TFLOPs: 39.95 | 15: iteration 77120/ 125429 | consumed samples: 19742720 | consumed tokens: 40433090560 | elapsed time per iteration (s): 1.02 | learning rate: 7.926E-05 | global batch size: 256 | lm loss: 1.988734E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.054 | TFLOPs: 41.32 | 15: iteration 77130/ 125429 | consumed samples: 19745280 | consumed tokens: 40438333440 | elapsed time per iteration (s): 1.02 | learning rate: 7.924E-05 | global batch size: 256 | lm loss: 1.941230E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.114 | TFLOPs: 41.50 | 15: iteration 77140/ 125429 | consumed samples: 19747840 | consumed tokens: 40443576320 | elapsed time per iteration (s): 1.04 | learning rate: 7.922E-05 | global batch size: 256 | lm loss: 1.959042E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.797 | TFLOPs: 40.62 | 15: iteration 77150/ 125429 | consumed samples: 19750400 | consumed tokens: 40448819200 | elapsed time per iteration (s): 1.05 | learning rate: 7.919E-05 | global batch size: 256 | lm loss: 1.938207E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.786 | TFLOPs: 40.12 | 15: iteration 77160/ 125429 | consumed samples: 19752960 | consumed tokens: 40454062080 | elapsed time per iteration (s): 
1.06 | learning rate: 7.917E-05 | global batch size: 256 | lm loss: 1.952915E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.634 | TFLOPs: 40.10 | 15: iteration 77170/ 125429 | consumed samples: 19755520 | consumed tokens: 40459304960 | elapsed time per iteration (s): 1.04 | learning rate: 7.915E-05 | global batch size: 256 | lm loss: 1.952844E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.840 | TFLOPs: 40.79 | 15: iteration 77180/ 125429 | consumed samples: 19758080 | consumed tokens: 40464547840 | elapsed time per iteration (s): 1.03 | learning rate: 7.913E-05 | global batch size: 256 | lm loss: 1.943919E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.743 | TFLOPs: 41.27 | 15: iteration 77190/ 125429 | consumed samples: 19760640 | consumed tokens: 40469790720 | elapsed time per iteration (s): 1.03 | learning rate: 7.911E-05 | global batch size: 256 | lm loss: 1.930956E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.143 | TFLOPs: 41.17 | 15: iteration 77200/ 125429 | consumed samples: 19763200 | consumed tokens: 40475033600 | elapsed time per iteration (s): 1.02 | learning rate: 7.909E-05 | global batch size: 256 | lm loss: 1.934892E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.793 | TFLOPs: 41.28 | 15: iteration 77210/ 125429 | consumed samples: 19765760 | consumed tokens: 40480276480 | elapsed time per iteration (s): 1.03 | learning rate: 7.907E-05 | global batch size: 256 | lm loss: 1.953253E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.334 | TFLOPs: 41.20 | 15: iteration 
77220/ 125429 | consumed samples: 19768320 | consumed tokens: 40485519360 | elapsed time per iteration (s): 1.02 | learning rate: 7.904E-05 | global batch size: 256 | lm loss: 1.942949E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.042 | TFLOPs: 41.32 | 15: iteration 77230/ 125429 | consumed samples: 19770880 | consumed tokens: 40490762240 | elapsed time per iteration (s): 1.03 | learning rate: 7.902E-05 | global batch size: 256 | lm loss: 1.961333E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.897 | TFLOPs: 41.13 | 15: iteration 77240/ 125429 | consumed samples: 19773440 | consumed tokens: 40496005120 | elapsed time per iteration (s): 1.05 | learning rate: 7.900E-05 | global batch size: 256 | lm loss: 1.957392E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.946 | TFLOPs: 40.31 | 15: iteration 77250/ 125429 | consumed samples: 19776000 | consumed tokens: 40501248000 | elapsed time per iteration (s): 1.04 | learning rate: 7.898E-05 | global batch size: 256 | lm loss: 1.972607E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.175 | TFLOPs: 40.85 | 15: iteration 77260/ 125429 | consumed samples: 19778560 | consumed tokens: 40506490880 | elapsed time per iteration (s): 1.04 | learning rate: 7.896E-05 | global batch size: 256 | lm loss: 1.929794E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.798 | TFLOPs: 40.62 | 15: iteration 77270/ 125429 | consumed samples: 19781120 | consumed tokens: 40511733760 | elapsed time per iteration (s): 1.04 | learning rate: 7.894E-05 | global batch size: 256 | lm loss: 1.964909E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 245.463 | TFLOPs: 40.56 | 15: iteration 77280/ 125429 | consumed samples: 19783680 | consumed tokens: 40516976640 | elapsed time per iteration (s): 1.03 | learning rate: 7.892E-05 | global batch size: 256 | lm loss: 1.938367E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.460 | TFLOPs: 41.06 | 15: iteration 77290/ 125429 | consumed samples: 19786240 | consumed tokens: 40522219520 | elapsed time per iteration (s): 1.16 | learning rate: 7.890E-05 | global batch size: 256 | lm loss: 1.983332E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 220.717 | TFLOPs: 36.48 | 15: iteration 77300/ 125429 | consumed samples: 19788800 | consumed tokens: 40527462400 | elapsed time per iteration (s): 1.02 | learning rate: 7.887E-05 | global batch size: 256 | lm loss: 1.959400E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.371 | TFLOPs: 41.38 | 15: iteration 77310/ 125429 | consumed samples: 19791360 | consumed tokens: 40532705280 | elapsed time per iteration (s): 1.04 | learning rate: 7.885E-05 | global batch size: 256 | lm loss: 1.913802E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.553 | TFLOPs: 40.74 | 15: iteration 77320/ 125429 | consumed samples: 19793920 | consumed tokens: 40537948160 | elapsed time per iteration (s): 1.02 | learning rate: 7.883E-05 | global batch size: 256 | lm loss: 1.972562E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.782 | TFLOPs: 41.28 | 15: iteration 77330/ 125429 | consumed samples: 19796480 | consumed tokens: 40543191040 | elapsed time per iteration (s): 1.04 | learning rate: 
7.881E-05 | global batch size: 256 | lm loss: 1.967724E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.175 | TFLOPs: 40.85 | 15: iteration 77340/ 125429 | consumed samples: 19799040 | consumed tokens: 40548433920 | elapsed time per iteration (s): 1.04 | learning rate: 7.879E-05 | global batch size: 256 | lm loss: 1.964832E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.338 | TFLOPs: 40.87 | 15: iteration 77350/ 125429 | consumed samples: 19801600 | consumed tokens: 40553676800 | elapsed time per iteration (s): 1.04 | learning rate: 7.877E-05 | global batch size: 256 | lm loss: 1.941999E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.380 | TFLOPs: 40.72 | 15: iteration 77360/ 125429 | consumed samples: 19804160 | consumed tokens: 40558919680 | elapsed time per iteration (s): 1.02 | learning rate: 7.875E-05 | global batch size: 256 | lm loss: 1.955938E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.692 | TFLOPs: 41.59 | 15: iteration 77370/ 125429 | consumed samples: 19806720 | consumed tokens: 40564162560 | elapsed time per iteration (s): 1.03 | learning rate: 7.872E-05 | global batch size: 256 | lm loss: 1.973982E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.654 | TFLOPs: 41.26 | 15: iteration 77380/ 125429 | consumed samples: 19809280 | consumed tokens: 40569405440 | elapsed time per iteration (s): 1.05 | learning rate: 7.870E-05 | global batch size: 256 | lm loss: 1.950220E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.882 | TFLOPs: 40.14 | 15: iteration 77390/ 125429 | 
consumed samples: 19811840 | consumed tokens: 40574648320 | elapsed time per iteration (s): 1.17 | learning rate: 7.868E-05 | global batch size: 256 | lm loss: 1.936260E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 218.801 | TFLOPs: 36.16 | 15: iteration 77400/ 125429 | consumed samples: 19814400 | consumed tokens: 40579891200 | elapsed time per iteration (s): 1.38 | learning rate: 7.866E-05 | global batch size: 256 | lm loss: 1.955181E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 185.362 | TFLOPs: 30.63 | 15: iteration 77410/ 125429 | consumed samples: 19816960 | consumed tokens: 40585134080 | elapsed time per iteration (s): 1.23 | learning rate: 7.864E-05 | global batch size: 256 | lm loss: 1.920941E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 207.615 | TFLOPs: 34.31 | 15: iteration 77420/ 125429 | consumed samples: 19819520 | consumed tokens: 40590376960 | elapsed time per iteration (s): 1.02 | learning rate: 7.862E-05 | global batch size: 256 | lm loss: 1.954668E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.915 | TFLOPs: 41.63 | 15: iteration 77430/ 125429 | consumed samples: 19822080 | consumed tokens: 40595619840 | elapsed time per iteration (s): 1.46 | learning rate: 7.860E-05 | global batch size: 256 | lm loss: 1.949656E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 174.913 | TFLOPs: 28.91 | 15: iteration 77440/ 125429 | consumed samples: 19824640 | consumed tokens: 40600862720 | elapsed time per iteration (s): 1.02 | learning rate: 7.857E-05 | global batch size: 256 | lm loss: 1.953189E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 250.052 | TFLOPs: 41.32 | 15: iteration 77450/ 125429 | consumed samples: 19827200 | consumed tokens: 40606105600 | elapsed time per iteration (s): 1.03 | learning rate: 7.855E-05 | global batch size: 256 | lm loss: 1.907318E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.956 | TFLOPs: 40.98 | 15: iteration 77460/ 125429 | consumed samples: 19829760 | consumed tokens: 40611348480 | elapsed time per iteration (s): 1.05 | learning rate: 7.853E-05 | global batch size: 256 | lm loss: 1.947408E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.624 | TFLOPs: 40.26 | 15: iteration 77470/ 125429 | consumed samples: 19832320 | consumed tokens: 40616591360 | elapsed time per iteration (s): 1.05 | learning rate: 7.851E-05 | global batch size: 256 | lm loss: 1.951755E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.159 | TFLOPs: 40.18 | 15: iteration 77480/ 125429 | consumed samples: 19834880 | consumed tokens: 40621834240 | elapsed time per iteration (s): 1.03 | learning rate: 7.849E-05 | global batch size: 256 | lm loss: 1.955329E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.863 | TFLOPs: 41.13 | 15: iteration 77490/ 125429 | consumed samples: 19837440 | consumed tokens: 40627077120 | elapsed time per iteration (s): 1.06 | learning rate: 7.847E-05 | global batch size: 256 | lm loss: 1.969757E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.244 | TFLOPs: 40.03 | 15: iteration 77500/ 125429 | consumed samples: 19840000 | consumed tokens: 40632320000 | elapsed time per iteration (s): 1.02 | learning rate: 7.845E-05 | global batch 
size: 256 | lm loss: 1.925718E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.154 | TFLOPs: 41.34 | 15: iteration 77510/ 125429 | consumed samples: 19842560 | consumed tokens: 40637562880 | elapsed time per iteration (s): 1.06 | learning rate: 7.843E-05 | global batch size: 256 | lm loss: 1.958768E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.356 | TFLOPs: 39.89 | 15: iteration 77520/ 125429 | consumed samples: 19845120 | consumed tokens: 40642805760 | elapsed time per iteration (s): 1.02 | learning rate: 7.840E-05 | global batch size: 256 | lm loss: 1.964037E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.477 | TFLOPs: 41.56 | 15: iteration 77530/ 125429 | consumed samples: 19847680 | consumed tokens: 40648048640 | elapsed time per iteration (s): 1.03 | learning rate: 7.838E-05 | global batch size: 256 | lm loss: 1.949878E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.533 | TFLOPs: 41.07 | 15: iteration 77540/ 125429 | consumed samples: 19850240 | consumed tokens: 40653291520 | elapsed time per iteration (s): 1.02 | learning rate: 7.836E-05 | global batch size: 256 | lm loss: 1.940326E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.874 | TFLOPs: 41.29 | 15: iteration 77550/ 125429 | consumed samples: 19852800 | consumed tokens: 40658534400 | elapsed time per iteration (s): 1.06 | learning rate: 7.834E-05 | global batch size: 256 | lm loss: 1.945151E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.549 | TFLOPs: 39.92 | 15: iteration 77560/ 125429 | consumed samples: 19855360 | 
consumed tokens: 40663777280 | elapsed time per iteration (s): 1.03 | learning rate: 7.832E-05 | global batch size: 256 | lm loss: 1.947363E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.877 | TFLOPs: 40.96 | 15: iteration 77570/ 125429 | consumed samples: 19857920 | consumed tokens: 40669020160 | elapsed time per iteration (s): 1.24 | learning rate: 7.830E-05 | global batch size: 256 | lm loss: 1.965921E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 206.131 | TFLOPs: 34.06 | 15: iteration 77580/ 125429 | consumed samples: 19860480 | consumed tokens: 40674263040 | elapsed time per iteration (s): 1.04 | learning rate: 7.828E-05 | global batch size: 256 | lm loss: 1.974205E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.859 | TFLOPs: 40.63 | 15: iteration 77590/ 125429 | consumed samples: 19863040 | consumed tokens: 40679505920 | elapsed time per iteration (s): 1.05 | learning rate: 7.825E-05 | global batch size: 256 | lm loss: 1.935073E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.433 | TFLOPs: 40.23 | 15: iteration 77600/ 125429 | consumed samples: 19865600 | consumed tokens: 40684748800 | elapsed time per iteration (s): 1.09 | learning rate: 7.823E-05 | global batch size: 256 | lm loss: 1.949049E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.736 | TFLOPs: 38.96 | 15: iteration 77610/ 125429 | consumed samples: 19868160 | consumed tokens: 40689991680 | elapsed time per iteration (s): 1.07 | learning rate: 7.821E-05 | global batch size: 256 | lm loss: 1.937834E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 240.042 | TFLOPs: 39.67 | 15: iteration 77620/ 125429 | consumed samples: 19870720 | consumed tokens: 40695234560 | elapsed time per iteration (s): 1.04 | learning rate: 7.819E-05 | global batch size: 256 | lm loss: 1.971887E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.021 | TFLOPs: 40.82 | 15: iteration 77630/ 125429 | consumed samples: 19873280 | consumed tokens: 40700477440 | elapsed time per iteration (s): 1.08 | learning rate: 7.817E-05 | global batch size: 256 | lm loss: 1.971757E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.955 | TFLOPs: 39.32 | 15: iteration 77640/ 125429 | consumed samples: 19875840 | consumed tokens: 40705720320 | elapsed time per iteration (s): 1.08 | learning rate: 7.815E-05 | global batch size: 256 | lm loss: 1.934349E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.372 | TFLOPs: 39.23 | 15: iteration 77650/ 125429 | consumed samples: 19878400 | consumed tokens: 40710963200 | elapsed time per iteration (s): 1.03 | learning rate: 7.813E-05 | global batch size: 256 | lm loss: 1.965775E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.943 | TFLOPs: 41.14 | 15: iteration 77660/ 125429 | consumed samples: 19880960 | consumed tokens: 40716206080 | elapsed time per iteration (s): 1.06 | learning rate: 7.811E-05 | global batch size: 256 | lm loss: 1.952905E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.799 | TFLOPs: 39.79 | 15: iteration 77670/ 125429 | consumed samples: 19883520 | consumed tokens: 40721448960 | elapsed time per iteration (s): 1.07 | learning rate: 7.808E-05 | global batch size: 256 | lm loss: 
1.949543E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.460 | TFLOPs: 39.41 | 15: iteration 77680/ 125429 | consumed samples: 19886080 | consumed tokens: 40726691840 | elapsed time per iteration (s): 1.05 | learning rate: 7.806E-05 | global batch size: 256 | lm loss: 1.934840E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.833 | TFLOPs: 40.13 | 15: iteration 77690/ 125429 | consumed samples: 19888640 | consumed tokens: 40731934720 | elapsed time per iteration (s): 1.07 | learning rate: 7.804E-05 | global batch size: 256 | lm loss: 1.924300E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.066 | TFLOPs: 39.67 | 15: iteration 77700/ 125429 | consumed samples: 19891200 | consumed tokens: 40737177600 | elapsed time per iteration (s): 1.05 | learning rate: 7.802E-05 | global batch size: 256 | lm loss: 1.960023E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.451 | TFLOPs: 40.40 | 15: iteration 77710/ 125429 | consumed samples: 19893760 | consumed tokens: 40742420480 | elapsed time per iteration (s): 1.05 | learning rate: 7.800E-05 | global batch size: 256 | lm loss: 1.966369E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.928 | TFLOPs: 40.48 | 15: iteration 77720/ 125429 | consumed samples: 19896320 | consumed tokens: 40747663360 | elapsed time per iteration (s): 1.06 | learning rate: 7.798E-05 | global batch size: 256 | lm loss: 1.954006E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.588 | TFLOPs: 39.92 | 15: iteration 77730/ 125429 | consumed samples: 19898880 | consumed tokens: 
40752906240 | elapsed time per iteration (s): 1.03 | learning rate: 7.796E-05 | global batch size: 256 | lm loss: 1.928561E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.446 | TFLOPs: 41.06 | 15: iteration 77740/ 125429 | consumed samples: 19901440 | consumed tokens: 40758149120 | elapsed time per iteration (s): 1.07 | learning rate: 7.794E-05 | global batch size: 256 | lm loss: 1.923983E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.104 | TFLOPs: 39.51 | 15: iteration 77750/ 125429 | consumed samples: 19904000 | consumed tokens: 40763392000 | elapsed time per iteration (s): 1.07 | learning rate: 7.791E-05 | global batch size: 256 | lm loss: 1.938102E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.527 | TFLOPs: 39.58 | 15: iteration 77760/ 125429 | consumed samples: 19906560 | consumed tokens: 40768634880 | elapsed time per iteration (s): 1.06 | learning rate: 7.789E-05 | global batch size: 256 | lm loss: 1.945802E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.529 | TFLOPs: 39.91 | 15: iteration 77770/ 125429 | consumed samples: 19909120 | consumed tokens: 40773877760 | elapsed time per iteration (s): 1.04 | learning rate: 7.787E-05 | global batch size: 256 | lm loss: 1.943460E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.486 | TFLOPs: 40.73 | 15: iteration 77780/ 125429 | consumed samples: 19911680 | consumed tokens: 40779120640 | elapsed time per iteration (s): 1.07 | learning rate: 7.785E-05 | global batch size: 256 | lm loss: 1.955117E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 239.396 | TFLOPs: 39.56 | 15: iteration 77790/ 125429 | consumed samples: 19914240 | consumed tokens: 40784363520 | elapsed time per iteration (s): 1.02 | learning rate: 7.783E-05 | global batch size: 256 | lm loss: 1.946638E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.151 | TFLOPs: 41.34 | 15: iteration 77800/ 125429 | consumed samples: 19916800 | consumed tokens: 40789606400 | elapsed time per iteration (s): 1.04 | learning rate: 7.781E-05 | global batch size: 256 | lm loss: 1.951884E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.169 | TFLOPs: 40.68 | 15: iteration 77810/ 125429 | consumed samples: 19919360 | consumed tokens: 40794849280 | elapsed time per iteration (s): 1.03 | learning rate: 7.779E-05 | global batch size: 256 | lm loss: 1.906652E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.415 | TFLOPs: 40.89 | 15: iteration 77820/ 125429 | consumed samples: 19921920 | consumed tokens: 40800092160 | elapsed time per iteration (s): 1.03 | learning rate: 7.777E-05 | global batch size: 256 | lm loss: 1.946079E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.531 | TFLOPs: 40.91 | 15: iteration 77830/ 125429 | consumed samples: 19924480 | consumed tokens: 40805335040 | elapsed time per iteration (s): 1.04 | learning rate: 7.774E-05 | global batch size: 256 | lm loss: 1.952041E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.937 | TFLOPs: 40.81 | 15: iteration 77840/ 125429 | consumed samples: 19927040 | consumed tokens: 40810577920 | elapsed time per iteration (s): 1.02 | learning rate: 7.772E-05 | global batch size: 256 | lm loss: 1.959849E+00 | grad 
norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.671 | TFLOPs: 41.43 |
15: iteration 77850/ 125429 | consumed samples: 19929600 | consumed tokens: 40815820800 | elapsed time per iteration (s): 1.03 | learning rate: 7.770E-05 | global batch size: 256 | lm loss: 1.943664E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.520 | TFLOPs: 41.07 |
15: iteration 77860/ 125429 | consumed samples: 19932160 | consumed tokens: 40821063680 | elapsed time per iteration (s): 1.03 | learning rate: 7.768E-05 | global batch size: 256 | lm loss: 1.944536E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.689 | TFLOPs: 40.93 |
15: iteration 77870/ 125429 | consumed samples: 19934720 | consumed tokens: 40826306560 | elapsed time per iteration (s): 1.06 | learning rate: 7.766E-05 | global batch size: 256 | lm loss: 1.968873E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.135 | TFLOPs: 40.01 |
15: iteration 77880/ 125429 | consumed samples: 19937280 | consumed tokens: 40831549440 | elapsed time per iteration (s): 1.03 | learning rate: 7.764E-05 | global batch size: 256 | lm loss: 1.945911E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.862 | TFLOPs: 41.13 |
15: iteration 77890/ 125429 | consumed samples: 19939840 | consumed tokens: 40836792320 | elapsed time per iteration (s): 1.08 | learning rate: 7.762E-05 | global batch size: 256 | lm loss: 1.948826E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.683 | TFLOPs: 39.28 |
15: iteration 77900/ 125429 | consumed samples: 19942400 | consumed tokens: 40842035200 | elapsed time per iteration (s): 1.05 | learning rate: 7.760E-05 | global batch size: 256 | lm loss: 1.949308E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.945 | TFLOPs: 40.15 |
15: iteration 77910/ 125429 | consumed samples: 19944960 | consumed tokens: 40847278080 | elapsed time per iteration (s): 1.08 | learning rate: 7.757E-05 | global batch size: 256 | lm loss: 1.949236E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.202 | TFLOPs: 39.20 |
15: iteration 77920/ 125429 | consumed samples: 19947520 | consumed tokens: 40852520960 | elapsed time per iteration (s): 1.07 | learning rate: 7.755E-05 | global batch size: 256 | lm loss: 1.939192E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.911 | TFLOPs: 39.48 |
15: iteration 77930/ 125429 | consumed samples: 19950080 | consumed tokens: 40857763840 | elapsed time per iteration (s): 1.05 | learning rate: 7.753E-05 | global batch size: 256 | lm loss: 1.941053E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.787 | TFLOPs: 40.45 |
15: iteration 77940/ 125429 | consumed samples: 19952640 | consumed tokens: 40863006720 | elapsed time per iteration (s): 1.05 | learning rate: 7.751E-05 | global batch size: 256 | lm loss: 1.926006E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.984 | TFLOPs: 40.32 |
15: iteration 77950/ 125429 | consumed samples: 19955200 | consumed tokens: 40868249600 | elapsed time per iteration (s): 1.06 | learning rate: 7.749E-05 | global batch size: 256 | lm loss: 1.960539E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.168 | TFLOPs: 40.02 |
15: iteration 77960/ 125429 | consumed samples: 19957760 | consumed tokens: 40873492480 | elapsed time per iteration (s): 1.08 | learning rate: 7.747E-05 | global batch size: 256 | lm loss: 1.960145E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.533 | TFLOPs: 39.25 |
15: iteration 77970/ 125429 | consumed samples: 19960320 | consumed tokens: 40878735360 | elapsed time per iteration (s): 1.05 | learning rate: 7.745E-05 | global batch size: 256 | lm loss: 1.923243E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.874 | TFLOPs: 40.30 |
15: iteration 77980/ 125429 | consumed samples: 19962880 | consumed tokens: 40883978240 | elapsed time per iteration (s): 1.08 | learning rate: 7.743E-05 | global batch size: 256 | lm loss: 1.972727E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.090 | TFLOPs: 39.35 |
15: iteration 77990/ 125429 | consumed samples: 19965440 | consumed tokens: 40889221120 | elapsed time per iteration (s): 1.05 | learning rate: 7.740E-05 | global batch size: 256 | lm loss: 1.958432E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.691 | TFLOPs: 40.11 |
0: [2022-11-26 19:09:23,373] [INFO] [logging.py:68:log_dist] [Rank 0] step=78000, skipped=0, lr=[7.738311304382406e-05, 7.738311304382406e-05, 7.738311304382406e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
15: iteration 78000/ 125429 | consumed samples: 19968000 | consumed tokens: 40894464000 | elapsed time per iteration (s): 1.03 | learning rate: 7.738E-05 | global batch size: 256 | lm loss: 1.965895E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.079 | TFLOPs: 41.00 |
0: steps: 78000 loss: 2.0384 iter time (s): 1.052 samples/sec: 243.461
15: -------------------------------------------------------------------------------------------
15: valid loss at iteration 78000 | lm loss value: 1.925789E+00 | lm loss PPL: 6.860563E+00 |
15: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 78000 to checkpoints_1b5
0: [2022-11-26 19:09:23,721] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step78000 is begin to save!
0: [2022-11-26 19:09:23,726] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_01-model_00-model_states.pt...
0: [2022-11-26 19:09:24,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_01-model_00-model_states.pt.
0: [2022-11-26 19:09:24,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_03-model_00-model_states.pt...
0: [2022-11-26 19:09:24,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_03-model_00-model_states.pt.
0: [2022-11-26 19:09:24,120] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_04-model_00-model_states.pt...
0: [2022-11-26 19:09:24,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_04-model_00-model_states.pt.
0: [2022-11-26 19:09:24,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_05-model_00-model_states.pt...
0: [2022-11-26 19:09:24,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_05-model_00-model_states.pt.
0: [2022-11-26 19:09:24,342] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_06-model_00-model_states.pt...
0: [2022-11-26 19:09:24,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_06-model_00-model_states.pt. 0: [2022-11-26 19:09:24,457] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_07-model_00-model_states.pt... 0: [2022-11-26 19:09:24,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_07-model_00-model_states.pt. 0: [2022-11-26 19:09:24,568] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_08-model_00-model_states.pt... 0: [2022-11-26 19:09:24,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_08-model_00-model_states.pt. 0: [2022-11-26 19:09:24,676] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_09-model_00-model_states.pt... 0: [2022-11-26 19:09:24,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_09-model_00-model_states.pt. 0: [2022-11-26 19:09:24,791] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_10-model_00-model_states.pt... 0: [2022-11-26 19:09:24,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_10-model_00-model_states.pt. 0: [2022-11-26 19:09:24,905] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_11-model_00-model_states.pt... 0: [2022-11-26 19:09:25,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_11-model_00-model_states.pt. 0: [2022-11-26 19:09:25,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_12-model_00-model_states.pt... 
0: [2022-11-26 19:09:25,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_12-model_00-model_states.pt. 0: [2022-11-26 19:09:25,120] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_13-model_00-model_states.pt... 0: [2022-11-26 19:09:25,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_13-model_00-model_states.pt. 0: [2022-11-26 19:09:25,225] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_14-model_00-model_states.pt... 0: [2022-11-26 19:09:25,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_14-model_00-model_states.pt. 0: [2022-11-26 19:09:25,333] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_15-model_00-model_states.pt... 0: [2022-11-26 19:09:25,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_15-model_00-model_states.pt. 0: [2022-11-26 19:09:25,442] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_16-model_00-model_states.pt... 0: [2022-11-26 19:09:25,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_16-model_00-model_states.pt. 0: [2022-11-26 19:09:25,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_17-model_00-model_states.pt... 0: [2022-11-26 19:09:25,655] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_17-model_00-model_states.pt. 0: [2022-11-26 19:09:25,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_18-model_00-model_states.pt... 
0: [2022-11-26 19:09:25,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_18-model_00-model_states.pt. 0: [2022-11-26 19:09:25,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_19-model_00-model_states.pt... 0: [2022-11-26 19:09:25,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_19-model_00-model_states.pt. 0: [2022-11-26 19:09:25,874] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_20-model_00-model_states.pt... 0: [2022-11-26 19:09:25,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_20-model_00-model_states.pt. 0: [2022-11-26 19:09:25,978] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_21-model_00-model_states.pt... 0: [2022-11-26 19:09:26,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_21-model_00-model_states.pt. 0: [2022-11-26 19:09:26,090] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_22-model_00-model_states.pt... 0: [2022-11-26 19:09:26,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_22-model_00-model_states.pt. 0: [2022-11-26 19:09:26,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_23-model_00-model_states.pt... 0: [2022-11-26 19:09:26,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_23-model_00-model_states.pt. 0: [2022-11-26 19:09:26,305] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_24-model_00-model_states.pt... 
0: [2022-11-26 19:09:26,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_24-model_00-model_states.pt. 0: [2022-11-26 19:09:26,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_25-model_00-model_states.pt... 0: [2022-11-26 19:09:26,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_25-model_00-model_states.pt. 0: [2022-11-26 19:09:26,529] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_26-model_00-model_states.pt... 0: [2022-11-26 19:09:26,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_26-model_00-model_states.pt. 0: [2022-11-26 19:09:26,637] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_27-model_00-model_states.pt... 0: [2022-11-26 19:09:26,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_27-model_00-model_states.pt. 0: [2022-11-26 19:09:26,742] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_28-model_00-model_states.pt... 0: [2022-11-26 19:09:26,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_28-model_00-model_states.pt. 0: [2022-11-26 19:09:26,849] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_29-model_00-model_states.pt... 0: [2022-11-26 19:09:26,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_29-model_00-model_states.pt. 0: [2022-11-26 19:09:26,954] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_30-model_00-model_states.pt... 
0: [2022-11-26 19:09:27,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_30-model_00-model_states.pt. 0: [2022-11-26 19:09:27,059] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/layer_32-model_00-model_states.pt... 0: [2022-11-26 19:09:27,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/layer_32-model_00-model_states.pt. 0: [2022-11-26 19:09:27,064] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step78000/mp_rank_00_model_states.pt 0: [2022-11-26 19:09:27,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/mp_rank_00_model_states.pt... 0: [2022-11-26 19:09:27,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/mp_rank_00_model_states.pt. 0: [2022-11-26 19:09:27,124] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:09:27,124] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
5: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:09:27,124] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:09:27,124] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:09:27,124] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:09:27,124] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:09:27,124] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
14: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
6: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
3: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 
11: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
0: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:09:27,124] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
7: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
9: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
9: [2022-11-26 19:09:27,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step78000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:09:27,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:09:27,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 19:09:27,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 7: [2022-11-26 19:09:27,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:09:27,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 19:09:27,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 7: [2022-11-26 19:09:27,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:09:27,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 19:09:27,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 2: [2022-11-26 19:09:27,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 19:09:27,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 19:09:27,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 0: [2022-11-26 19:09:27,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:09:27,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 19:09:27,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 2: [2022-11-26 19:09:27,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:09:27,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 19:09:27,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 10: [2022-11-26 19:09:27,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:09:27,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 19:09:27,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 12: [2022-11-26 19:09:27,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 19:09:27,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 19:09:27,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 0: [2022-11-26 19:09:27,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:09:27,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 19:09:27,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 7: [2022-11-26 19:09:27,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:09:27,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 19:09:27,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 2: [2022-11-26 19:09:27,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:09:27,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 19:09:27,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 6: [2022-11-26 19:09:27,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 19:09:27,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 19:09:27,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 6: [2022-11-26 19:09:27,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:09:27,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 19:09:27,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 2: [2022-11-26 19:09:27,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:09:27,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 19:09:27,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 5: [2022-11-26 19:09:27,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:09:27,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 19:09:27,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 8: [2022-11-26 19:09:27,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 19:09:27,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 19:09:27,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 5: [2022-11-26 19:09:27,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:09:27,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 19:09:27,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 12: [2022-11-26 19:09:27,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:09:27,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 19:09:27,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 8: [2022-11-26 19:09:27,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:09:27,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 19:09:27,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 8: [2022-11-26 19:09:27,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-26 19:09:27,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 19:09:27,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 8: [2022-11-26 19:09:27,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:09:27,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 19:09:27,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 8: [2022-11-26 19:09:27,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:09:27,306] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 19:09:27,306] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 0: [2022-11-26 19:09:27,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:09:27,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 19:09:27,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 10: [2022-11-26 19:09:27,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-26 19:09:27,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 19:09:27,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 10: [2022-11-26 19:09:27,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:09:27,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 19:09:27,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 10: [2022-11-26 19:09:27,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:09:27,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 19:09:27,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 4: [2022-11-26 19:09:27,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:09:27,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 19:09:27,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 19:09:27,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 19:09:27,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 4: [2022-11-26 19:09:27,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:09:27,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 4: [2022-11-26 19:09:27,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 19:09:27,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 12: [2022-11-26 19:09:27,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:09:27,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 19:09:27,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 7: [2022-11-26 19:09:27,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 19:09:27,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 12: [2022-11-26 19:09:27,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:09:27,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 12: [2022-11-26 19:09:27,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 19:09:27,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 4: [2022-11-26 19:09:27,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:09:27,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 19:09:27,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 9: [2022-11-26 19:09:27,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:09:27,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 19:09:27,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 12: [2022-11-26 19:09:27,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 19:09:27,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 19:09:27,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 6: [2022-11-26 19:09:27,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:09:27,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 19:09:27,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 11: [2022-11-26 19:09:27,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:09:27,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:09:27,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:09:27,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:09:27,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 19:09:27,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 5: [2022-11-26 19:09:27,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 19:09:27,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:09:27,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 19:09:27,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 19:09:27,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 5: [2022-11-26 19:09:27,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 5: [2022-11-26 19:09:27,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:09:27,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:09:27,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 19:09:27,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 19:09:27,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 5: [2022-11-26 19:09:27,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 10: [2022-11-26 19:09:27,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 19:09:27,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 19:09:27,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 8: [2022-11-26 19:09:27,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:09:27,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 19:09:27,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 9: [2022-11-26 19:09:27,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:09:27,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 19:09:27,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 0: [2022-11-26 19:09:27,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:09:27,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 19:09:27,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 10: [2022-11-26 19:09:27,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 19:09:27,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 19:09:27,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 2: [2022-11-26 19:09:27,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:09:27,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 19:09:27,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 0: [2022-11-26 19:09:27,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:09:27,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 19:09:27,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 2: [2022-11-26 19:09:27,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:09:27,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 19:09:27,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 19:09:27,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 19:09:27,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 2: [2022-11-26 19:09:27,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 5: [2022-11-26 19:09:27,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:09:27,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 19:09:27,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 4: [2022-11-26 19:09:27,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:09:27,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 5: [2022-11-26 19:09:27,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:09:27,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 19:09:27,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
6: [2022-11-26 19:09:27,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:09:27,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 19:09:27,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 7: [2022-11-26 19:09:27,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:09:27,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 19:09:27,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 7: [2022-11-26 19:09:27,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:09:27,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:09:27,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 19:09:27,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 19:09:27,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 19:09:27,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 19:09:27,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 7: [2022-11-26 19:09:27,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 7: [2022-11-26 19:09:27,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 13: [2022-11-26 19:09:27,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:09:27,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 19:09:27,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 13: [2022-11-26 19:09:27,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:09:27,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 19:09:27,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
13: [2022-11-26 19:09:27,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:09:27,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 19:09:27,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 8: [2022-11-26 19:09:27,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:09:27,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:09:27,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 19:09:27,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 13: [2022-11-26 19:09:27,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:09:27,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 12: [2022-11-26 19:09:27,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:09:27,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
8: [2022-11-26 19:09:27,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 19:09:27,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 12: [2022-11-26 19:09:27,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 19:09:27,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 11: [2022-11-26 19:09:27,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 19:09:27,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 19:09:27,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 19:09:27,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:09:27,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:09:27,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:09:27,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 11: [2022-11-26 19:09:27,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
11: [2022-11-26 19:09:27,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 11: [2022-11-26 19:09:27,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 19:09:27,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 19:09:27,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 19:09:27,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 11: [2022-11-26 19:09:27,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 11: [2022-11-26 19:09:27,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 11: [2022-11-26 19:09:27,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:09:27,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 19:09:27,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 11: [2022-11-26 19:09:27,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 19:09:27,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 19:09:27,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 8: [2022-11-26 19:09:27,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:09:27,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 19:09:27,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 2: [2022-11-26 19:09:27,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:09:27,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 19:09:27,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 10: [2022-11-26 19:09:27,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:09:27,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 19:09:27,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 10: [2022-11-26 19:09:27,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 19:09:27,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 19:09:27,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 12: [2022-11-26 19:09:27,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:09:27,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:09:27,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 19:09:27,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 19:09:27,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 12: [2022-11-26 19:09:27,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 0: [2022-11-26 19:09:27,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:09:27,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:09:27,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 19:09:27,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
4: [2022-11-26 19:09:27,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 4: [2022-11-26 19:09:27,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:09:27,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 19:09:27,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 4: [2022-11-26 19:09:27,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:09:27,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 19:09:27,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 4: [2022-11-26 19:09:27,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:09:27,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 19:09:27,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 9: [2022-11-26 19:09:27,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:09:27,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 19:09:27,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:09:27,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:09:27,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 19:09:27,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 19:09:27,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 19:09:27,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 19:09:27,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:09:27,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 9: [2022-11-26 19:09:27,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 9: [2022-11-26 19:09:27,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 9: [2022-11-26 19:09:27,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
9: [2022-11-26 19:09:27,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 19:09:27,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 14: [2022-11-26 19:09:27,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:09:27,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:09:27,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:09:27,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:09:27,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 19:09:27,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 19:09:27,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 19:09:27,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 19:09:27,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 19:09:27,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 19:09:27,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 14: [2022-11-26 19:09:27,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 14: [2022-11-26 19:09:27,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 14: [2022-11-26 19:09:27,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 14: [2022-11-26 19:09:27,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 14: [2022-11-26 19:09:27,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:09:27,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-26 19:09:27,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:09:27,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 19:09:27,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 19:09:27,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 19:09:27,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 14: [2022-11-26 19:09:27,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 14: [2022-11-26 19:09:27,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 15: [2022-11-26 19:09:27,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:09:27,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:09:27,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:09:27,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:09:27,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 19:09:27,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:09:27,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:09:27,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:09:27,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 19:09:27,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 19:09:27,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 19:09:27,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 19:09:27,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 19:09:27,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 19:09:27,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 19:09:27,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
15: [2022-11-26 19:09:27,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 15: [2022-11-26 19:09:27,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 15: [2022-11-26 19:09:27,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 19:09:27,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 15: [2022-11-26 19:09:27,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 15: [2022-11-26 19:09:27,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 15: [2022-11-26 19:09:27,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 15: [2022-11-26 19:09:27,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 6: [2022-11-26 19:09:27,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:09:27,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 19:09:27,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 6: [2022-11-26 19:09:27,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 19:09:27,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 19:09:27,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 6: [2022-11-26 19:09:27,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:09:27,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 19:09:27,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 6: [2022-11-26 19:09:27,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:09:27,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 19:09:27,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 13: [2022-11-26 19:09:27,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:09:27,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 19:09:27,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 13: [2022-11-26 19:09:27,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 19:09:27,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 19:09:27,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 13: [2022-11-26 19:09:27,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:09:27,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 19:09:27,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 0: [2022-11-26 19:09:27,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 19:09:27,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 1: [2022-11-26 19:09:27,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:09:27,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:09:27,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:09:27,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:09:27,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 19:09:27,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:09:27,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:09:27,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:09:27,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 19:09:27,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 19:09:27,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 19:09:27,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 19:09:27,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 1: [2022-11-26 19:09:27,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 1: [2022-11-26 19:09:27,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
1: [2022-11-26 19:09:27,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 19:09:27,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 19:09:27,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 19:09:27,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 19:09:27,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 1: [2022-11-26 19:09:27,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 1: [2022-11-26 19:09:27,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 1: [2022-11-26 19:09:27,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 1: [2022-11-26 19:09:27,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 3: [2022-11-26 19:09:27,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:09:27,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:09:27,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 19:09:27,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:09:27,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:09:27,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:09:27,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:09:27,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:09:27,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 19:09:27,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 19:09:27,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 19:09:27,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 19:09:27,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 19:09:27,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 19:09:27,438] [INFO] 
[engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 19:09:27,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 3: [2022-11-26 19:09:27,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 3: [2022-11-26 19:09:27,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step78000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 19:09:27,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 3: [2022-11-26 19:09:27,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 3: [2022-11-26 19:09:27,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 3: [2022-11-26 19:09:27,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 3: [2022-11-26 19:09:27,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 3: [2022-11-26 19:09:27,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step78000 is ready now! 
0: successfully saved checkpoint at iteration 78000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3738.25 15: iteration 78010/ 125429 | consumed samples: 19970560 | consumed tokens: 40899706880 | elapsed time per iteration (s): 1.42 | learning rate: 7.736E-05 | global batch size: 256 | lm loss: 1.930761E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 180.067 | TFLOPs: 29.76 | 15: iteration 78020/ 125429 | consumed samples: 19973120 | consumed tokens: 40904949760 | elapsed time per iteration (s): 1.05 | learning rate: 7.734E-05 | global batch size: 256 | lm loss: 1.950079E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.398 | TFLOPs: 40.39 | 15: iteration 78030/ 125429 | consumed samples: 19975680 | consumed tokens: 40910192640 | elapsed time per iteration (s): 1.06 | learning rate: 7.732E-05 | global batch size: 256 | lm loss: 1.955492E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.393 | TFLOPs: 40.06 | 15: iteration 78040/ 125429 | consumed samples: 19978240 | consumed tokens: 40915435520 | elapsed time per iteration (s): 1.04 | learning rate: 7.730E-05 | global batch size: 256 | lm loss: 1.944088E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.051 | TFLOPs: 40.83 | 15: iteration 78050/ 125429 | consumed samples: 19980800 | consumed tokens: 40920678400 | elapsed time per iteration (s): 1.04 | learning rate: 7.728E-05 | global batch size: 256 | lm loss: 1.941895E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.485 | TFLOPs: 40.57 | 15: iteration 78060/ 125429 | consumed samples: 19983360 | consumed tokens: 40925921280 | elapsed time per iteration (s): 1.05 | 
learning rate: 7.726E-05 | global batch size: 256 | lm loss: 1.947276E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.590 | TFLOPs: 40.42 | 15: iteration 78070/ 125429 | consumed samples: 19985920 | consumed tokens: 40931164160 | elapsed time per iteration (s): 1.04 | learning rate: 7.723E-05 | global batch size: 256 | lm loss: 1.962799E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.302 | TFLOPs: 40.87 | 15: iteration 78080/ 125429 | consumed samples: 19988480 | consumed tokens: 40936407040 | elapsed time per iteration (s): 1.03 | learning rate: 7.721E-05 | global batch size: 256 | lm loss: 1.962812E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.796 | TFLOPs: 40.95 | 15: iteration 78090/ 125429 | consumed samples: 19991040 | consumed tokens: 40941649920 | elapsed time per iteration (s): 1.02 | learning rate: 7.719E-05 | global batch size: 256 | lm loss: 1.952468E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.645 | TFLOPs: 41.42 | 15: iteration 78100/ 125429 | consumed samples: 19993600 | consumed tokens: 40946892800 | elapsed time per iteration (s): 1.05 | learning rate: 7.717E-05 | global batch size: 256 | lm loss: 1.942792E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.846 | TFLOPs: 40.30 | 15: iteration 78110/ 125429 | consumed samples: 19996160 | consumed tokens: 40952135680 | elapsed time per iteration (s): 1.07 | learning rate: 7.715E-05 | global batch size: 256 | lm loss: 1.949249E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.433 | TFLOPs: 39.57 | 15: iteration 78120/ 
125429 | consumed samples: 19998720 | consumed tokens: 40957378560 | elapsed time per iteration (s): 1.05 | learning rate: 7.713E-05 | global batch size: 256 | lm loss: 1.958262E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.174 | TFLOPs: 40.35 | 15: iteration 78130/ 125429 | consumed samples: 20001280 | consumed tokens: 40962621440 | elapsed time per iteration (s): 1.07 | learning rate: 7.711E-05 | global batch size: 256 | lm loss: 1.951272E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.190 | TFLOPs: 39.69 | 15: iteration 78140/ 125429 | consumed samples: 20003840 | consumed tokens: 40967864320 | elapsed time per iteration (s): 1.05 | learning rate: 7.709E-05 | global batch size: 256 | lm loss: 1.951903E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.899 | TFLOPs: 40.47 | 15: iteration 78150/ 125429 | consumed samples: 20006400 | consumed tokens: 40973107200 | elapsed time per iteration (s): 1.02 | learning rate: 7.707E-05 | global batch size: 256 | lm loss: 1.914388E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.223 | TFLOPs: 41.35 | 15: iteration 78160/ 125429 | consumed samples: 20008960 | consumed tokens: 40978350080 | elapsed time per iteration (s): 1.03 | learning rate: 7.704E-05 | global batch size: 256 | lm loss: 1.955113E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.558 | TFLOPs: 41.08 | 15: iteration 78170/ 125429 | consumed samples: 20011520 | consumed tokens: 40983592960 | elapsed time per iteration (s): 1.04 | learning rate: 7.702E-05 | global batch size: 256 | lm loss: 1.929107E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 246.569 | TFLOPs: 40.75 | 15: iteration 78180/ 125429 | consumed samples: 20014080 | consumed tokens: 40988835840 | elapsed time per iteration (s): 1.03 | learning rate: 7.700E-05 | global batch size: 256 | lm loss: 1.932861E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.400 | TFLOPs: 41.22 | 15: iteration 78190/ 125429 | consumed samples: 20016640 | consumed tokens: 40994078720 | elapsed time per iteration (s): 1.05 | learning rate: 7.698E-05 | global batch size: 256 | lm loss: 1.949111E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.500 | TFLOPs: 40.41 | 15: iteration 78200/ 125429 | consumed samples: 20019200 | consumed tokens: 40999321600 | elapsed time per iteration (s): 1.04 | learning rate: 7.696E-05 | global batch size: 256 | lm loss: 1.937358E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.418 | TFLOPs: 40.72 | 15: iteration 78210/ 125429 | consumed samples: 20021760 | consumed tokens: 41004564480 | elapsed time per iteration (s): 1.04 | learning rate: 7.694E-05 | global batch size: 256 | lm loss: 1.945960E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.316 | TFLOPs: 40.54 | 15: iteration 78220/ 125429 | consumed samples: 20024320 | consumed tokens: 41009807360 | elapsed time per iteration (s): 1.04 | learning rate: 7.692E-05 | global batch size: 256 | lm loss: 1.918865E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.674 | TFLOPs: 40.76 | 15: iteration 78230/ 125429 | consumed samples: 20026880 | consumed tokens: 41015050240 | elapsed time per iteration (s): 1.03 | learning rate: 
7.690E-05 | global batch size: 256 | lm loss: 1.928643E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.114 | TFLOPs: 41.00 | 15: iteration 78240/ 125429 | consumed samples: 20029440 | consumed tokens: 41020293120 | elapsed time per iteration (s): 1.04 | learning rate: 7.687E-05 | global batch size: 256 | lm loss: 1.961716E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.954 | TFLOPs: 40.81 | 15: iteration 78250/ 125429 | consumed samples: 20032000 | consumed tokens: 41025536000 | elapsed time per iteration (s): 1.02 | learning rate: 7.685E-05 | global batch size: 256 | lm loss: 1.959060E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.465 | TFLOPs: 41.39 | 15: iteration 78260/ 125429 | consumed samples: 20034560 | consumed tokens: 41030778880 | elapsed time per iteration (s): 1.06 | learning rate: 7.683E-05 | global batch size: 256 | lm loss: 1.961955E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.374 | TFLOPs: 40.05 | 15: iteration 78270/ 125429 | consumed samples: 20037120 | consumed tokens: 41036021760 | elapsed time per iteration (s): 1.05 | learning rate: 7.681E-05 | global batch size: 256 | lm loss: 1.957642E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.724 | TFLOPs: 40.11 | 15: iteration 78280/ 125429 | consumed samples: 20039680 | consumed tokens: 41041264640 | elapsed time per iteration (s): 1.03 | learning rate: 7.679E-05 | global batch size: 256 | lm loss: 1.939849E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.648 | TFLOPs: 41.26 | 15: iteration 78290/ 125429 | 
consumed samples: 20042240 | consumed tokens: 41046507520 | elapsed time per iteration (s): 1.04 | learning rate: 7.677E-05 | global batch size: 256 | lm loss: 1.919656E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.451 | TFLOPs: 40.73 | 15: iteration 78300/ 125429 | consumed samples: 20044800 | consumed tokens: 41051750400 | elapsed time per iteration (s): 1.05 | learning rate: 7.675E-05 | global batch size: 256 | lm loss: 1.963959E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.246 | TFLOPs: 40.36 | 15: iteration 78310/ 125429 | consumed samples: 20047360 | consumed tokens: 41056993280 | elapsed time per iteration (s): 1.02 | learning rate: 7.673E-05 | global batch size: 256 | lm loss: 1.951298E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.275 | TFLOPs: 41.36 | 15: iteration 78320/ 125429 | consumed samples: 20049920 | consumed tokens: 41062236160 | elapsed time per iteration (s): 1.05 | learning rate: 7.671E-05 | global batch size: 256 | lm loss: 1.948528E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.694 | TFLOPs: 40.44 | 15: iteration 78330/ 125429 | consumed samples: 20052480 | consumed tokens: 41067479040 | elapsed time per iteration (s): 1.02 | learning rate: 7.668E-05 | global batch size: 256 | lm loss: 1.945267E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.502 | TFLOPs: 41.40 | 15: iteration 78340/ 125429 | consumed samples: 20055040 | consumed tokens: 41072721920 | elapsed time per iteration (s): 1.07 | learning rate: 7.666E-05 | global batch size: 256 | lm loss: 1.950174E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 239.885 | TFLOPs: 39.64 | 15: iteration 78350/ 125429 | consumed samples: 20057600 | consumed tokens: 41077964800 | elapsed time per iteration (s): 1.06 | learning rate: 7.664E-05 | global batch size: 256 | lm loss: 1.936227E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.051 | TFLOPs: 40.00 | 15: iteration 78360/ 125429 | consumed samples: 20060160 | consumed tokens: 41083207680 | elapsed time per iteration (s): 1.05 | learning rate: 7.662E-05 | global batch size: 256 | lm loss: 1.934904E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.052 | TFLOPs: 40.33 | 15: iteration 78370/ 125429 | consumed samples: 20062720 | consumed tokens: 41088450560 | elapsed time per iteration (s): 1.03 | learning rate: 7.660E-05 | global batch size: 256 | lm loss: 1.949354E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.734 | TFLOPs: 41.11 | 15: iteration 78380/ 125429 | consumed samples: 20065280 | consumed tokens: 41093693440 | elapsed time per iteration (s): 1.05 | learning rate: 7.658E-05 | global batch size: 256 | lm loss: 1.968573E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.262 | TFLOPs: 40.20 | 15: iteration 78390/ 125429 | consumed samples: 20067840 | consumed tokens: 41098936320 | elapsed time per iteration (s): 1.03 | learning rate: 7.656E-05 | global batch size: 256 | lm loss: 1.909777E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.035 | TFLOPs: 41.15 | 15: iteration 78400/ 125429 | consumed samples: 20070400 | consumed tokens: 41104179200 | elapsed time per iteration (s): 1.06 | learning rate: 7.654E-05 | global batch 
size: 256 | lm loss: 1.947402E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.256 | TFLOPs: 40.03 | 15: iteration 78410/ 125429 | consumed samples: 20072960 | consumed tokens: 41109422080 | elapsed time per iteration (s): 1.03 | learning rate: 7.651E-05 | global batch size: 256 | lm loss: 1.952055E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.691 | TFLOPs: 40.93 | 15: iteration 78420/ 125429 | consumed samples: 20075520 | consumed tokens: 41114664960 | elapsed time per iteration (s): 1.05 | learning rate: 7.649E-05 | global batch size: 256 | lm loss: 1.959482E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.904 | TFLOPs: 40.14 | 15: iteration 78430/ 125429 | consumed samples: 20078080 | consumed tokens: 41119907840 | elapsed time per iteration (s): 1.04 | learning rate: 7.647E-05 | global batch size: 256 | lm loss: 1.958613E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.287 | TFLOPs: 40.87 | 15: iteration 78440/ 125429 | consumed samples: 20080640 | consumed tokens: 41125150720 | elapsed time per iteration (s): 1.06 | learning rate: 7.645E-05 | global batch size: 256 | lm loss: 1.943750E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.544 | TFLOPs: 40.08 | 15: iteration 78450/ 125429 | consumed samples: 20083200 | consumed tokens: 41130393600 | elapsed time per iteration (s): 1.05 | learning rate: 7.643E-05 | global batch size: 256 | lm loss: 1.936739E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.935 | TFLOPs: 40.48 | 15: iteration 78460/ 125429 | consumed samples: 20085760 | 
consumed tokens: 41135636480 | elapsed time per iteration (s): 1.03 | learning rate: 7.641E-05 | global batch size: 256 | lm loss: 1.945631E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.893 | TFLOPs: 40.97 | 15: iteration 78470/ 125429 | consumed samples: 20088320 | consumed tokens: 41140879360 | elapsed time per iteration (s): 1.08 | learning rate: 7.639E-05 | global batch size: 256 | lm loss: 1.937496E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.565 | TFLOPs: 39.26 | 15: iteration 78480/ 125429 | consumed samples: 20090880 | consumed tokens: 41146122240 | elapsed time per iteration (s): 1.08 | learning rate: 7.637E-05 | global batch size: 256 | lm loss: 1.984767E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.121 | TFLOPs: 39.19 | 15: iteration 78490/ 125429 | consumed samples: 20093440 | consumed tokens: 41151365120 | elapsed time per iteration (s): 1.04 | learning rate: 7.635E-05 | global batch size: 256 | lm loss: 1.972935E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.072 | TFLOPs: 40.67 | 15: iteration 78500/ 125429 | consumed samples: 20096000 | consumed tokens: 41156608000 | elapsed time per iteration (s): 1.03 | learning rate: 7.632E-05 | global batch size: 256 | lm loss: 1.960337E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.921 | TFLOPs: 40.97 | 15: iteration 78510/ 125429 | consumed samples: 20098560 | consumed tokens: 41161850880 | elapsed time per iteration (s): 1.04 | learning rate: 7.630E-05 | global batch size: 256 | lm loss: 1.951380E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 246.666 | TFLOPs: 40.76 | 15: iteration 78520/ 125429 | consumed samples: 20101120 | consumed tokens: 41167093760 | elapsed time per iteration (s): 1.08 | learning rate: 7.628E-05 | global batch size: 256 | lm loss: 1.961319E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.979 | TFLOPs: 39.00 | 15: iteration 78530/ 125429 | consumed samples: 20103680 | consumed tokens: 41172336640 | elapsed time per iteration (s): 1.04 | learning rate: 7.626E-05 | global batch size: 256 | lm loss: 1.976875E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.521 | TFLOPs: 40.74 | 15: iteration 78540/ 125429 | consumed samples: 20106240 | consumed tokens: 41177579520 | elapsed time per iteration (s): 1.05 | learning rate: 7.624E-05 | global batch size: 256 | lm loss: 1.951693E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.020 | TFLOPs: 40.16 | 15: iteration 78550/ 125429 | consumed samples: 20108800 | consumed tokens: 41182822400 | elapsed time per iteration (s): 1.03 | learning rate: 7.622E-05 | global batch size: 256 | lm loss: 1.967978E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.972 | TFLOPs: 40.98 | 15: iteration 78560/ 125429 | consumed samples: 20111360 | consumed tokens: 41188065280 | elapsed time per iteration (s): 1.04 | learning rate: 7.620E-05 | global batch size: 256 | lm loss: 1.939670E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.402 | TFLOPs: 40.55 | 15: iteration 78570/ 125429 | consumed samples: 20113920 | consumed tokens: 41193308160 | elapsed time per iteration (s): 1.03 | learning rate: 7.618E-05 | global batch size: 256 | lm loss: 
1.931389E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.131 | TFLOPs: 41.01 | 15: iteration 78580/ 125429 | consumed samples: 20116480 | consumed tokens: 41198551040 | elapsed time per iteration (s): 1.10 | learning rate: 7.616E-05 | global batch size: 256 | lm loss: 1.936956E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.705 | TFLOPs: 38.29 | 15: iteration 78590/ 125429 | consumed samples: 20119040 | consumed tokens: 41203793920 | elapsed time per iteration (s): 1.05 | learning rate: 7.613E-05 | global batch size: 256 | lm loss: 1.927131E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.009 | TFLOPs: 40.32 | 15: iteration 78600/ 125429 | consumed samples: 20121600 | consumed tokens: 41209036800 | elapsed time per iteration (s): 1.07 | learning rate: 7.611E-05 | global batch size: 256 | lm loss: 1.918052E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.607 | TFLOPs: 39.60 | 15: iteration 78610/ 125429 | consumed samples: 20124160 | consumed tokens: 41214279680 | elapsed time per iteration (s): 1.06 | learning rate: 7.609E-05 | global batch size: 256 | lm loss: 1.954496E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.450 | TFLOPs: 39.74 | 15: iteration 78620/ 125429 | consumed samples: 20126720 | consumed tokens: 41219522560 | elapsed time per iteration (s): 1.04 | learning rate: 7.607E-05 | global batch size: 256 | lm loss: 1.949802E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.094 | TFLOPs: 40.50 | 15: iteration 78630/ 125429 | consumed samples: 20129280 | consumed tokens: 
41224765440 | elapsed time per iteration (s): 1.06 | learning rate: 7.605E-05 | global batch size: 256 | lm loss: 1.946288E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.925 | TFLOPs: 39.81 | 15: iteration 78640/ 125429 | consumed samples: 20131840 | consumed tokens: 41230008320 | elapsed time per iteration (s): 1.05 | learning rate: 7.603E-05 | global batch size: 256 | lm loss: 1.947724E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.773 | TFLOPs: 40.29 | 15: iteration 78650/ 125429 | consumed samples: 20134400 | consumed tokens: 41235251200 | elapsed time per iteration (s): 1.06 | learning rate: 7.601E-05 | global batch size: 256 | lm loss: 1.954051E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.237 | TFLOPs: 39.87 | 15: iteration 78660/ 125429 | consumed samples: 20136960 | consumed tokens: 41240494080 | elapsed time per iteration (s): 1.02 | learning rate: 7.599E-05 | global batch size: 256 | lm loss: 1.957640E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.899 | TFLOPs: 41.30 | 15: iteration 78670/ 125429 | consumed samples: 20139520 | consumed tokens: 41245736960 | elapsed time per iteration (s): 1.06 | learning rate: 7.597E-05 | global batch size: 256 | lm loss: 1.949052E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.807 | TFLOPs: 39.80 | 15: iteration 78680/ 125429 | consumed samples: 20142080 | consumed tokens: 41250979840 | elapsed time per iteration (s): 1.05 | learning rate: 7.594E-05 | global batch size: 256 | lm loss: 1.954715E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 244.467 | TFLOPs: 40.40 | 15: iteration 78690/ 125429 | consumed samples: 20144640 | consumed tokens: 41256222720 | elapsed time per iteration (s): 1.04 | learning rate: 7.592E-05 | global batch size: 256 | lm loss: 1.941218E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.854 | TFLOPs: 40.79 | 15: iteration 78700/ 125429 | consumed samples: 20147200 | consumed tokens: 41261465600 | elapsed time per iteration (s): 1.04 | learning rate: 7.590E-05 | global batch size: 256 | lm loss: 1.954251E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.583 | TFLOPs: 40.75 | 15: iteration 78710/ 125429 | consumed samples: 20149760 | consumed tokens: 41266708480 | elapsed time per iteration (s): 1.03 | learning rate: 7.588E-05 | global batch size: 256 | lm loss: 1.933191E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.442 | TFLOPs: 41.22 | 15: iteration 78720/ 125429 | consumed samples: 20152320 | consumed tokens: 41271951360 | elapsed time per iteration (s): 1.02 | learning rate: 7.586E-05 | global batch size: 256 | lm loss: 1.951781E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.423 | TFLOPs: 41.38 | 15: iteration 78730/ 125429 | consumed samples: 20154880 | consumed tokens: 41277194240 | elapsed time per iteration (s): 1.04 | learning rate: 7.584E-05 | global batch size: 256 | lm loss: 1.958183E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.310 | TFLOPs: 40.70 | 15: iteration 78740/ 125429 | consumed samples: 20157440 | consumed tokens: 41282437120 | elapsed time per iteration (s): 1.03 | learning rate: 7.582E-05 | global batch size: 256 | lm loss: 1.954541E+00 | grad 
norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.917 | TFLOPs: 40.97 | 15: iteration 78750/ 125429 | consumed samples: 20160000 | consumed tokens: 41287680000 | elapsed time per iteration (s): 1.06 | learning rate: 7.580E-05 | global batch size: 256 | lm loss: 1.934291E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.791 | TFLOPs: 39.96 | 15: iteration 78760/ 125429 | consumed samples: 20162560 | consumed tokens: 41292922880 | elapsed time per iteration (s): 1.03 | learning rate: 7.578E-05 | global batch size: 256 | lm loss: 1.937311E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.515 | TFLOPs: 41.07 | 15: iteration 78770/ 125429 | consumed samples: 20165120 | consumed tokens: 41298165760 | elapsed time per iteration (s): 1.07 | learning rate: 7.576E-05 | global batch size: 256 | lm loss: 1.949886E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.588 | TFLOPs: 39.43 | 15: iteration 78780/ 125429 | consumed samples: 20167680 | consumed tokens: 41303408640 | elapsed time per iteration (s): 1.05 | learning rate: 7.573E-05 | global batch size: 256 | lm loss: 1.951514E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.730 | TFLOPs: 40.44 | 15: iteration 78790/ 125429 | consumed samples: 20170240 | consumed tokens: 41308651520 | elapsed time per iteration (s): 1.04 | learning rate: 7.571E-05 | global batch size: 256 | lm loss: 1.916661E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.133 | TFLOPs: 40.51 | 15: iteration 78800/ 125429 | consumed samples: 20172800 | consumed tokens: 41313894400 | elapsed time 
per iteration (s): 1.03 | learning rate: 7.569E-05 | global batch size: 256 | lm loss: 1.911225E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.533 | TFLOPs: 40.91 | 15: iteration 78810/ 125429 | consumed samples: 20175360 | consumed tokens: 41319137280 | elapsed time per iteration (s): 1.09 | learning rate: 7.567E-05 | global batch size: 256 | lm loss: 1.946959E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.934 | TFLOPs: 38.82 | 15: iteration 78820/ 125429 | consumed samples: 20177920 | consumed tokens: 41324380160 | elapsed time per iteration (s): 1.04 | learning rate: 7.565E-05 | global batch size: 256 | lm loss: 1.951974E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.536 | TFLOPs: 40.58 | 15: iteration 78830/ 125429 | consumed samples: 20180480 | consumed tokens: 41329623040 | elapsed time per iteration (s): 1.05 | learning rate: 7.563E-05 | global batch size: 256 | lm loss: 1.958928E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.688 | TFLOPs: 40.11 | 15: iteration 78840/ 125429 | consumed samples: 20183040 | consumed tokens: 41334865920 | elapsed time per iteration (s): 1.02 | learning rate: 7.561E-05 | global batch size: 256 | lm loss: 1.955720E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.007 | TFLOPs: 41.32 | 15: iteration 78850/ 125429 | consumed samples: 20185600 | consumed tokens: 41340108800 | elapsed time per iteration (s): 1.03 | learning rate: 7.559E-05 | global batch size: 256 | lm loss: 1.953394E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.649 | TFLOPs: 
41.09 | 15: iteration 78860/ 125429 | consumed samples: 20188160 | consumed tokens: 41345351680 | elapsed time per iteration (s): 1.04 | learning rate: 7.557E-05 | global batch size: 256 | lm loss: 1.967978E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.906 | TFLOPs: 40.80 | 15: iteration 78870/ 125429 | consumed samples: 20190720 | consumed tokens: 41350594560 | elapsed time per iteration (s): 1.02 | learning rate: 7.554E-05 | global batch size: 256 | lm loss: 1.942192E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.192 | TFLOPs: 41.35 | 15: iteration 78880/ 125429 | consumed samples: 20193280 | consumed tokens: 41355837440 | elapsed time per iteration (s): 1.05 | learning rate: 7.552E-05 | global batch size: 256 | lm loss: 1.923692E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.516 | TFLOPs: 40.24 | 15: iteration 78890/ 125429 | consumed samples: 20195840 | consumed tokens: 41361080320 | elapsed time per iteration (s): 1.07 | learning rate: 7.550E-05 | global batch size: 256 | lm loss: 1.978879E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.125 | TFLOPs: 39.52 | 15: iteration 78900/ 125429 | consumed samples: 20198400 | consumed tokens: 41366323200 | elapsed time per iteration (s): 1.04 | learning rate: 7.548E-05 | global batch size: 256 | lm loss: 1.932491E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.473 | TFLOPs: 40.73 | 15: iteration 78910/ 125429 | consumed samples: 20200960 | consumed tokens: 41371566080 | elapsed time per iteration (s): 1.07 | learning rate: 7.546E-05 | global batch size: 256 | lm loss: 1.966042E+00 | grad norm: 0.138 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.890 | TFLOPs: 39.64 | 15: iteration 78920/ 125429 | consumed samples: 20203520 | consumed tokens: 41376808960 | elapsed time per iteration (s): 1.05 | learning rate: 7.544E-05 | global batch size: 256 | lm loss: 1.932216E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.379 | TFLOPs: 40.22 | 15: iteration 78930/ 125429 | consumed samples: 20206080 | consumed tokens: 41382051840 | elapsed time per iteration (s): 1.03 | learning rate: 7.542E-05 | global batch size: 256 | lm loss: 1.940956E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.099 | TFLOPs: 41.00 | 15: iteration 78940/ 125429 | consumed samples: 20208640 | consumed tokens: 41387294720 | elapsed time per iteration (s): 1.02 | learning rate: 7.540E-05 | global batch size: 256 | lm loss: 1.972344E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.646 | TFLOPs: 41.59 | 15: iteration 78950/ 125429 | consumed samples: 20211200 | consumed tokens: 41392537600 | elapsed time per iteration (s): 1.03 | learning rate: 7.538E-05 | global batch size: 256 | lm loss: 1.952076E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.082 | TFLOPs: 41.00 | 15: iteration 78960/ 125429 | consumed samples: 20213760 | consumed tokens: 41397780480 | elapsed time per iteration (s): 1.07 | learning rate: 7.536E-05 | global batch size: 256 | lm loss: 1.926409E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.530 | TFLOPs: 39.42 | 15: iteration 78970/ 125429 | consumed samples: 20216320 | consumed tokens: 41403023360 | elapsed time per iteration (s): 1.03 | 
learning rate: 7.533E-05 | global batch size: 256 | lm loss: 1.938008E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.708 | TFLOPs: 41.10 | 15: iteration 78980/ 125429 | consumed samples: 20218880 | consumed tokens: 41408266240 | elapsed time per iteration (s): 1.05 | learning rate: 7.531E-05 | global batch size: 256 | lm loss: 1.951094E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.513 | TFLOPs: 40.41 | 15: iteration 78990/ 125429 | consumed samples: 20221440 | consumed tokens: 41413509120 | elapsed time per iteration (s): 1.07 | learning rate: 7.529E-05 | global batch size: 256 | lm loss: 1.936302E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.367 | TFLOPs: 39.39 | 15: iteration 79000/ 125429 | consumed samples: 20224000 | consumed tokens: 41418752000 | elapsed time per iteration (s): 1.03 | learning rate: 7.527E-05 | global batch size: 256 | lm loss: 1.938606E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.405 | TFLOPs: 41.22 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 79000 | lm loss value: 1.942725E+00 | lm loss PPL: 6.977738E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 79000 to checkpoints_1b5 0: [2022-11-26 19:26:52,669] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step79000 is begin to save! 0: [2022-11-26 19:26:52,677] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 19:26:52,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_01-model_00-model_states.pt. 0: [2022-11-26 19:26:52,906] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_03-model_00-model_states.pt... 0: [2022-11-26 19:26:53,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_03-model_00-model_states.pt. 0: [2022-11-26 19:26:53,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_04-model_00-model_states.pt... 0: [2022-11-26 19:26:53,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_04-model_00-model_states.pt. 0: [2022-11-26 19:26:53,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_05-model_00-model_states.pt... 0: [2022-11-26 19:26:53,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_05-model_00-model_states.pt. 0: [2022-11-26 19:26:53,213] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_06-model_00-model_states.pt... 0: [2022-11-26 19:26:53,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_06-model_00-model_states.pt. 0: [2022-11-26 19:26:53,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_07-model_00-model_states.pt... 0: [2022-11-26 19:26:53,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_07-model_00-model_states.pt. 0: [2022-11-26 19:26:53,426] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 19:26:53,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_08-model_00-model_states.pt. 0: [2022-11-26 19:26:53,532] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_09-model_00-model_states.pt... 0: [2022-11-26 19:26:53,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_09-model_00-model_states.pt. 0: [2022-11-26 19:26:53,641] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_10-model_00-model_states.pt... 0: [2022-11-26 19:26:53,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_10-model_00-model_states.pt. 0: [2022-11-26 19:26:53,748] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_11-model_00-model_states.pt... 0: [2022-11-26 19:26:53,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_11-model_00-model_states.pt. 0: [2022-11-26 19:26:53,856] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_12-model_00-model_states.pt... 0: [2022-11-26 19:26:53,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_12-model_00-model_states.pt. 0: [2022-11-26 19:26:53,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_13-model_00-model_states.pt... 0: [2022-11-26 19:26:54,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_13-model_00-model_states.pt. 0: [2022-11-26 19:26:54,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 19:26:54,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_14-model_00-model_states.pt. 0: [2022-11-26 19:26:54,180] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_15-model_00-model_states.pt... 0: [2022-11-26 19:26:54,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_15-model_00-model_states.pt. 0: [2022-11-26 19:26:54,281] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_16-model_00-model_states.pt... 0: [2022-11-26 19:26:54,384] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_16-model_00-model_states.pt. 0: [2022-11-26 19:26:54,384] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_17-model_00-model_states.pt... 0: [2022-11-26 19:26:54,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_17-model_00-model_states.pt. 0: [2022-11-26 19:26:54,493] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_18-model_00-model_states.pt... 0: [2022-11-26 19:26:54,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_18-model_00-model_states.pt. 0: [2022-11-26 19:26:54,599] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_19-model_00-model_states.pt... 0: [2022-11-26 19:26:54,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_19-model_00-model_states.pt. 0: [2022-11-26 19:26:54,707] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 19:26:54,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_20-model_00-model_states.pt. 0: [2022-11-26 19:26:54,813] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_21-model_00-model_states.pt... 0: [2022-11-26 19:26:54,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_21-model_00-model_states.pt. 0: [2022-11-26 19:26:54,911] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_22-model_00-model_states.pt... 0: [2022-11-26 19:26:55,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_22-model_00-model_states.pt. 0: [2022-11-26 19:26:55,020] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_23-model_00-model_states.pt... 0: [2022-11-26 19:26:55,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_23-model_00-model_states.pt. 0: [2022-11-26 19:26:55,124] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_24-model_00-model_states.pt... 0: [2022-11-26 19:26:55,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_24-model_00-model_states.pt. 0: [2022-11-26 19:26:55,228] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_25-model_00-model_states.pt... 0: [2022-11-26 19:26:55,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_25-model_00-model_states.pt. 0: [2022-11-26 19:26:55,330] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 19:26:55,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_26-model_00-model_states.pt. 0: [2022-11-26 19:26:55,436] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_27-model_00-model_states.pt... 0: [2022-11-26 19:26:55,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_27-model_00-model_states.pt. 0: [2022-11-26 19:26:55,538] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_28-model_00-model_states.pt... 0: [2022-11-26 19:26:55,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_28-model_00-model_states.pt. 0: [2022-11-26 19:26:55,643] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_29-model_00-model_states.pt... 0: [2022-11-26 19:26:55,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_29-model_00-model_states.pt. 0: [2022-11-26 19:26:55,741] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_30-model_00-model_states.pt... 0: [2022-11-26 19:26:55,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_30-model_00-model_states.pt. 0: [2022-11-26 19:26:55,846] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/layer_32-model_00-model_states.pt... 0: [2022-11-26 19:26:55,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 19:26:55,851] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step79000/mp_rank_00_model_states.pt 0: [2022-11-26 19:26:55,851] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/mp_rank_00_model_states.pt... 0: [2022-11-26 19:26:55,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/mp_rank_00_model_states.pt. 0: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
5: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
9: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
11: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
1: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
6: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
8: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:26:55,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step79000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:26:56,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:26:56,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 19:26:56,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 19:26:56,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 0: [2022-11-26 19:26:56,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:26:56,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 19:26:56,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 11: [2022-11-26 19:26:56,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:26:56,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:26:56,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 19:26:56,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 0: [2022-11-26 19:26:56,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:26:56,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 19:26:56,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
7: [2022-11-26 19:26:56,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:26:56,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 19:26:56,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 7: [2022-11-26 19:26:56,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:26:56,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 19:26:56,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 15: [2022-11-26 19:26:56,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:26:56,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 19:26:56,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 0: [2022-11-26 19:26:56,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:26:56,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 19:26:56,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
0: [2022-11-26 19:26:56,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:26:56,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 19:26:56,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 0: [2022-11-26 19:26:56,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:26:56,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 19:26:56,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 7: [2022-11-26 19:26:56,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:26:56,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 19:26:56,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 0: [2022-11-26 19:26:56,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:26:56,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 19:26:56,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
11: [2022-11-26 19:26:56,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 19:26:56,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 11: [2022-11-26 19:26:56,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:26:56,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 19:26:56,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 11: [2022-11-26 19:26:56,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:26:56,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:26:56,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 19:26:56,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 15: [2022-11-26 19:26:56,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:26:56,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 11: [2022-11-26 19:26:56,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
11: [2022-11-26 19:26:56,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:26:56,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 11: [2022-11-26 19:26:56,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:26:56,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 11: [2022-11-26 19:26:56,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 19:26:56,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 19:26:56,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 11: [2022-11-26 19:26:56,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 11: [2022-11-26 19:26:56,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:26:56,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:26:56,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 19:26:56,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
9: [2022-11-26 19:26:56,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:26:56,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 19:26:56,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 7: [2022-11-26 19:26:56,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:26:56,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 19:26:56,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 6: [2022-11-26 19:26:56,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:26:56,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 19:26:56,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 10: [2022-11-26 19:26:56,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:26:56,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 19:26:56,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
4: [2022-11-26 19:26:56,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:26:56,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 19:26:56,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 4: [2022-11-26 19:26:56,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:26:56,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 19:26:56,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 1: [2022-11-26 19:26:56,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:26:56,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 19:26:56,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 13: [2022-11-26 19:26:56,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:26:56,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 8: [2022-11-26 19:26:56,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
13: [2022-11-26 19:26:56,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 8: [2022-11-26 19:26:56,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:26:56,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:26:56,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 9: [2022-11-26 19:26:56,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:26:56,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:26:56,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 19:26:56,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 19:26:56,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 9: [2022-11-26 19:26:56,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 13: [2022-11-26 19:26:56,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 8: [2022-11-26 19:26:56,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
8: [2022-11-26 19:26:56,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 9: [2022-11-26 19:26:56,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 13: [2022-11-26 19:26:56,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 9: [2022-11-26 19:26:56,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:26:56,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 19:26:56,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:26:56,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:26:56,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 9: [2022-11-26 19:26:56,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 19:26:56,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 19:26:56,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 9: [2022-11-26 19:26:56,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 8: [2022-11-26 19:26:56,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 19:26:56,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 19:26:56,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 13: [2022-11-26 19:26:56,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:26:56,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 19:26:56,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 13: [2022-11-26 19:26:56,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:26:56,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 19:26:56,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 6: [2022-11-26 19:26:56,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:26:56,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 19:26:56,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 6: [2022-11-26 19:26:56,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 19:26:56,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 19:26:56,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 15: [2022-11-26 19:26:56,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:26:56,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 19:26:56,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 11: [2022-11-26 19:26:56,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 19:26:56,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 11: [2022-11-26 19:26:56,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:26:56,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 19:26:56,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 4: [2022-11-26 19:26:56,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 19:26:56,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 19:26:56,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 6: [2022-11-26 19:26:56,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:26:56,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 19:26:56,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 10: [2022-11-26 19:26:56,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:26:56,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 19:26:56,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 15: [2022-11-26 19:26:56,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:26:56,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 19:26:56,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 7: [2022-11-26 19:26:56,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 19:26:56,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 19:26:56,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 9: [2022-11-26 19:26:56,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:26:56,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 19:26:56,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 14: [2022-11-26 19:26:56,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:26:56,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 19:26:56,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 14: [2022-11-26 19:26:56,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:26:56,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 19:26:56,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 14: [2022-11-26 19:26:56,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 19:26:56,088] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 19:26:56,088] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 14: [2022-11-26 19:26:56,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:26:56,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 19:26:56,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 14: [2022-11-26 19:26:56,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:26:56,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 19:26:56,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 14: [2022-11-26 19:26:56,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:26:56,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 19:26:56,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 6: [2022-11-26 19:26:56,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 19:26:56,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 19:26:56,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 8: [2022-11-26 19:26:56,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:26:56,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 19:26:56,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 9: [2022-11-26 19:26:56,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:26:56,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 19:26:56,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 1: [2022-11-26 19:26:56,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:26:56,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:26:56,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 19:26:56,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
1: [2022-11-26 19:26:56,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 19:26:56,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 4: [2022-11-26 19:26:56,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:26:56,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:26:56,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 19:26:56,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 4: [2022-11-26 19:26:56,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 19:26:56,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 4: [2022-11-26 19:26:56,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:26:56,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 19:26:56,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 4: [2022-11-26 19:26:56,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
1: [2022-11-26 19:26:56,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:26:56,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 19:26:56,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 1: [2022-11-26 19:26:56,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 19:26:56,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 10: [2022-11-26 19:26:56,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:26:56,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 19:26:56,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 8: [2022-11-26 19:26:56,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:26:56,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:26:56,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-26 19:26:56,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 19:26:56,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 19:26:56,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 19:26:56,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 8: [2022-11-26 19:26:56,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 8: [2022-11-26 19:26:56,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 10: [2022-11-26 19:26:56,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:26:56,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 19:26:56,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 1: [2022-11-26 19:26:56,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:26:56,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 19:26:56,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 7: [2022-11-26 19:26:56,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:26:56,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 7: [2022-11-26 19:26:56,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 19:26:56,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 10: [2022-11-26 19:26:56,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:26:56,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:26:56,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 7: [2022-11-26 19:26:56,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 10: [2022-11-26 19:26:56,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 7: [2022-11-26 19:26:56,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
1: [2022-11-26 19:26:56,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 19:26:56,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 1: [2022-11-26 19:26:56,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:26:56,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 19:26:56,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 14: [2022-11-26 19:26:56,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:26:56,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 19:26:56,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 15: [2022-11-26 19:26:56,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:26:56,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 19:26:56,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 10: [2022-11-26 19:26:56,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 19:26:56,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 19:26:56,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 10: [2022-11-26 19:26:56,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:26:56,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 19:26:56,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 10: [2022-11-26 19:26:56,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:26:56,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 19:26:56,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 4: [2022-11-26 19:26:56,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:26:56,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 19:26:56,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 1: [2022-11-26 19:26:56,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 19:26:56,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 19:26:56,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 6: [2022-11-26 19:26:56,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:26:56,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 19:26:56,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 1: [2022-11-26 19:26:56,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:26:56,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 19:26:56,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 6: [2022-11-26 19:26:56,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:26:56,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 19:26:56,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 6: [2022-11-26 19:26:56,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 19:26:56,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 19:26:56,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 13: [2022-11-26 19:26:56,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:26:56,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:26:56,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 19:26:56,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 19:26:56,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 13: [2022-11-26 19:26:56,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 13: [2022-11-26 19:26:56,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:26:56,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 19:26:56,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 13: [2022-11-26 19:26:56,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 19:26:56,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 19:26:56,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 15: [2022-11-26 19:26:56,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:26:56,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 19:26:56,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 15: [2022-11-26 19:26:56,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:26:56,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 19:26:56,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 4: [2022-11-26 19:26:56,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:26:56,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 19:26:56,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 5: [2022-11-26 19:26:56,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 19:26:56,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:26:56,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:26:56,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 19:26:56,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 19:26:56,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 19:26:56,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 5: [2022-11-26 19:26:56,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 5: [2022-11-26 19:26:56,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 5: [2022-11-26 19:26:56,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:26:56,165] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 19:26:56,165] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 5: [2022-11-26 19:26:56,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 19:26:56,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 19:26:56,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 5: [2022-11-26 19:26:56,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:26:56,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 19:26:56,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 5: [2022-11-26 19:26:56,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:26:56,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 19:26:56,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 5: [2022-11-26 19:26:56,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:26:56,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 19:26:56,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 3: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:26:56,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 19:26:56,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 19:26:56,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 19:26:56,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 3: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 3: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 3: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 12: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 19:26:56,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 19:26:56,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 19:26:56,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 19:26:56,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 19:26:56,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 19:26:56,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 19:26:56,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 19:26:56,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 12: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 12: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 12: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
12: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 12: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 12: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 12: [2022-11-26 19:26:56,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 3: [2022-11-26 19:26:56,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:26:56,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 19:26:56,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 3: [2022-11-26 19:26:56,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:26:56,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:26:56,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 19:26:56,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 19:26:56,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 19:26:56,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 19:26:56,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 3: [2022-11-26 19:26:56,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 3: [2022-11-26 19:26:56,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 0: [2022-11-26 19:26:56,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 19:26:56,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 2: [2022-11-26 19:26:56,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:26:56,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:26:56,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:26:56,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-26 19:26:56,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:26:56,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:26:56,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:26:56,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:26:56,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 19:26:56,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 19:26:56,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 19:26:56,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 19:26:56,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 19:26:56,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 19:26:56,305] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 19:26:56,305] [INFO] 
[engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step79000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 19:26:56,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 2: [2022-11-26 19:26:56,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 2: [2022-11-26 19:26:56,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 2: [2022-11-26 19:26:56,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 2: [2022-11-26 19:26:56,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 2: [2022-11-26 19:26:56,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 2: [2022-11-26 19:26:56,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 2: [2022-11-26 19:26:56,305] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step79000 is ready now! 
0: successfully saved checkpoint at iteration 79000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3684.70
15: iteration 79010/ 125429 | consumed samples: 20226560 | consumed tokens: 41423994880 | elapsed time per iteration (s): 1.45 | learning rate: 7.525E-05 | global batch size: 256 | lm loss: 1.947666E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 176.180 | TFLOPs: 29.12 |
15: iteration 79020/ 125429 | consumed samples: 20229120 | consumed tokens: 41429237760 | elapsed time per iteration (s): 1.02 | learning rate: 7.523E-05 | global batch size: 256 | lm loss: 1.951541E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.137 | TFLOPs: 41.34 |
15: iteration 79030/ 125429 | consumed samples: 20231680 | consumed tokens: 41434480640 | elapsed time per iteration (s): 1.07 | learning rate: 7.521E-05 | global batch size: 256 | lm loss: 1.946624E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.593 | TFLOPs: 39.59 |
15: iteration 79040/ 125429 | consumed samples: 20234240 | consumed tokens: 41439723520 | elapsed time per iteration (s): 1.02 | learning rate: 7.519E-05 | global batch size: 256 | lm loss: 1.950682E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.638 | TFLOPs: 41.42 |
15: iteration 79050/ 125429 | consumed samples: 20236800 | consumed tokens: 41444966400 | elapsed time per iteration (s): 1.05 | learning rate: 7.517E-05 | global batch size: 256 | lm loss: 1.973579E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.645 | TFLOPs: 40.43 |
15: iteration 79060/ 125429 | consumed samples: 20239360 | consumed tokens: 41450209280 | elapsed time per iteration (s): 1.06 | learning rate: 7.515E-05 | global batch size: 256 | lm loss: 1.940898E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.646 | TFLOPs: 40.10 |
15: iteration 79070/ 125429 | consumed samples: 20241920 | consumed tokens: 41455452160 | elapsed time per iteration (s): 1.07 | learning rate: 7.512E-05 | global batch size: 256 | lm loss: 1.916025E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.372 | TFLOPs: 39.72 |
15: iteration 79080/ 125429 | consumed samples: 20244480 | consumed tokens: 41460695040 | elapsed time per iteration (s): 1.06 | learning rate: 7.510E-05 | global batch size: 256 | lm loss: 1.944968E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.201 | TFLOPs: 39.86 |
15: iteration 79090/ 125429 | consumed samples: 20247040 | consumed tokens: 41465937920 | elapsed time per iteration (s): 1.07 | learning rate: 7.508E-05 | global batch size: 256 | lm loss: 1.948758E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.517 | TFLOPs: 39.58 |
15: iteration 79100/ 125429 | consumed samples: 20249600 | consumed tokens: 41471180800 | elapsed time per iteration (s): 1.06 | learning rate: 7.506E-05 | global batch size: 256 | lm loss: 1.935704E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.682 | TFLOPs: 39.94 |
15: iteration 79110/ 125429 | consumed samples: 20252160 | consumed tokens: 41476423680 | elapsed time per iteration (s): 1.04 | learning rate: 7.504E-05 | global batch size: 256 | lm loss: 1.973235E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.142 | TFLOPs: 40.68 |
15: iteration 79120/ 125429 | consumed samples: 20254720 | consumed tokens: 41481666560 | elapsed time per iteration (s): 1.04 | learning rate: 7.502E-05 | global batch size: 256 | lm loss: 1.939831E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.850 | TFLOPs: 40.63 |
15: iteration 79130/ 125429 | consumed samples: 20257280 | consumed tokens: 41486909440 | elapsed time per iteration (s): 1.04 | learning rate: 7.500E-05 | global batch size: 256 | lm loss: 1.961038E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.304 | TFLOPs: 40.70 |
15: iteration 79140/ 125429 | consumed samples: 20259840 | consumed tokens: 41492152320 | elapsed time per iteration (s): 1.03 | learning rate: 7.498E-05 | global batch size: 256 | lm loss: 1.935407E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.506 | TFLOPs: 41.07 |
15: iteration 79150/ 125429 | consumed samples: 20262400 | consumed tokens: 41497395200 | elapsed time per iteration (s): 1.03 | learning rate: 7.496E-05 | global batch size: 256 | lm loss: 1.968088E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.747 | TFLOPs: 40.94 |
15: iteration 79160/ 125429 | consumed samples: 20264960 | consumed tokens: 41502638080 | elapsed time per iteration (s): 1.04 | learning rate: 7.494E-05 | global batch size: 256 | lm loss: 1.948895E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.487 | TFLOPs: 40.73 |
15: iteration 79170/ 125429 | consumed samples: 20267520 | consumed tokens: 41507880960 | elapsed time per iteration (s): 1.04 | learning rate: 7.491E-05 | global batch size: 256 | lm loss: 1.921465E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped
iterations: 0 | number of nan iterations: 0 | samples per second: 246.574 | TFLOPs: 40.75 | 15: iteration 79180/ 125429 | consumed samples: 20270080 | consumed tokens: 41513123840 | elapsed time per iteration (s): 1.03 | learning rate: 7.489E-05 | global batch size: 256 | lm loss: 1.947414E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.917 | TFLOPs: 40.97 | 15: iteration 79190/ 125429 | consumed samples: 20272640 | consumed tokens: 41518366720 | elapsed time per iteration (s): 1.04 | learning rate: 7.487E-05 | global batch size: 256 | lm loss: 1.971004E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.385 | TFLOPs: 40.72 | 15: iteration 79200/ 125429 | consumed samples: 20275200 | consumed tokens: 41523609600 | elapsed time per iteration (s): 1.07 | learning rate: 7.485E-05 | global batch size: 256 | lm loss: 1.939681E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.961 | TFLOPs: 39.66 | 15: iteration 79210/ 125429 | consumed samples: 20277760 | consumed tokens: 41528852480 | elapsed time per iteration (s): 1.05 | learning rate: 7.483E-05 | global batch size: 256 | lm loss: 1.945778E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.309 | TFLOPs: 40.21 | 15: iteration 79220/ 125429 | consumed samples: 20280320 | consumed tokens: 41534095360 | elapsed time per iteration (s): 1.04 | learning rate: 7.481E-05 | global batch size: 256 | lm loss: 1.961983E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.838 | TFLOPs: 40.63 | 15: iteration 79230/ 125429 | consumed samples: 20282880 | consumed tokens: 41539338240 | elapsed time per iteration (s): 1.06 | learning rate: 
7.479E-05 | global batch size: 256 | lm loss: 1.960784E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.040 | TFLOPs: 40.00 | 15: iteration 79240/ 125429 | consumed samples: 20285440 | consumed tokens: 41544581120 | elapsed time per iteration (s): 1.03 | learning rate: 7.477E-05 | global batch size: 256 | lm loss: 1.922146E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.055 | TFLOPs: 41.16 | 15: iteration 79250/ 125429 | consumed samples: 20288000 | consumed tokens: 41549824000 | elapsed time per iteration (s): 1.05 | learning rate: 7.475E-05 | global batch size: 256 | lm loss: 1.944900E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.104 | TFLOPs: 40.17 | 15: iteration 79260/ 125429 | consumed samples: 20290560 | consumed tokens: 41555066880 | elapsed time per iteration (s): 1.02 | learning rate: 7.473E-05 | global batch size: 256 | lm loss: 1.950977E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.210 | TFLOPs: 41.35 | 15: iteration 79270/ 125429 | consumed samples: 20293120 | consumed tokens: 41560309760 | elapsed time per iteration (s): 1.08 | learning rate: 7.471E-05 | global batch size: 256 | lm loss: 1.946729E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.032 | TFLOPs: 39.01 | 15: iteration 79280/ 125429 | consumed samples: 20295680 | consumed tokens: 41565552640 | elapsed time per iteration (s): 1.06 | learning rate: 7.468E-05 | global batch size: 256 | lm loss: 1.958123E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.705 | TFLOPs: 39.94 | 15: iteration 79290/ 125429 | 
consumed samples: 20298240 | consumed tokens: 41570795520 | elapsed time per iteration (s): 1.04 | learning rate: 7.466E-05 | global batch size: 256 | lm loss: 1.976101E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.549 | TFLOPs: 40.58 | 15: iteration 79300/ 125429 | consumed samples: 20300800 | consumed tokens: 41576038400 | elapsed time per iteration (s): 1.04 | learning rate: 7.464E-05 | global batch size: 256 | lm loss: 1.945325E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.638 | TFLOPs: 40.76 | 15: iteration 79310/ 125429 | consumed samples: 20303360 | consumed tokens: 41581281280 | elapsed time per iteration (s): 1.09 | learning rate: 7.462E-05 | global batch size: 256 | lm loss: 1.904830E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.561 | TFLOPs: 38.93 | 15: iteration 79320/ 125429 | consumed samples: 20305920 | consumed tokens: 41586524160 | elapsed time per iteration (s): 1.04 | learning rate: 7.460E-05 | global batch size: 256 | lm loss: 1.940665E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.961 | TFLOPs: 40.81 | 15: iteration 79330/ 125429 | consumed samples: 20308480 | consumed tokens: 41591767040 | elapsed time per iteration (s): 1.10 | learning rate: 7.458E-05 | global batch size: 256 | lm loss: 1.941805E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.640 | TFLOPs: 38.61 | 15: iteration 79340/ 125429 | consumed samples: 20311040 | consumed tokens: 41597009920 | elapsed time per iteration (s): 1.06 | learning rate: 7.456E-05 | global batch size: 256 | lm loss: 1.968040E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 240.646 | TFLOPs: 39.77 | 15: iteration 79350/ 125429 | consumed samples: 20313600 | consumed tokens: 41602252800 | elapsed time per iteration (s): 1.05 | learning rate: 7.454E-05 | global batch size: 256 | lm loss: 1.946440E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.056 | TFLOPs: 40.17 | 15: iteration 79360/ 125429 | consumed samples: 20316160 | consumed tokens: 41607495680 | elapsed time per iteration (s): 1.04 | learning rate: 7.452E-05 | global batch size: 256 | lm loss: 1.963710E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.746 | TFLOPs: 40.78 | 15: iteration 79370/ 125429 | consumed samples: 20318720 | consumed tokens: 41612738560 | elapsed time per iteration (s): 1.06 | learning rate: 7.450E-05 | global batch size: 256 | lm loss: 1.936693E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.141 | TFLOPs: 40.02 | 15: iteration 79380/ 125429 | consumed samples: 20321280 | consumed tokens: 41617981440 | elapsed time per iteration (s): 1.02 | learning rate: 7.447E-05 | global batch size: 256 | lm loss: 1.948145E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.960 | TFLOPs: 41.31 | 15: iteration 79390/ 125429 | consumed samples: 20323840 | consumed tokens: 41623224320 | elapsed time per iteration (s): 1.04 | learning rate: 7.445E-05 | global batch size: 256 | lm loss: 1.974543E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.134 | TFLOPs: 40.84 | 15: iteration 79400/ 125429 | consumed samples: 20326400 | consumed tokens: 41628467200 | elapsed time per iteration (s): 1.16 | learning rate: 7.443E-05 | global batch 
size: 256 | lm loss: 1.970260E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.123 | TFLOPs: 36.54 | 15: iteration 79410/ 125429 | consumed samples: 20328960 | consumed tokens: 41633710080 | elapsed time per iteration (s): 1.07 | learning rate: 7.441E-05 | global batch size: 256 | lm loss: 1.957159E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.623 | TFLOPs: 39.60 | 15: iteration 79420/ 125429 | consumed samples: 20331520 | consumed tokens: 41638952960 | elapsed time per iteration (s): 1.03 | learning rate: 7.439E-05 | global batch size: 256 | lm loss: 1.957309E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.706 | TFLOPs: 40.94 | 15: iteration 79430/ 125429 | consumed samples: 20334080 | consumed tokens: 41644195840 | elapsed time per iteration (s): 1.02 | learning rate: 7.437E-05 | global batch size: 256 | lm loss: 1.955247E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.501 | TFLOPs: 41.56 | 15: iteration 79440/ 125429 | consumed samples: 20336640 | consumed tokens: 41649438720 | elapsed time per iteration (s): 1.05 | learning rate: 7.435E-05 | global batch size: 256 | lm loss: 1.948895E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.955 | TFLOPs: 40.15 | 15: iteration 79450/ 125429 | consumed samples: 20339200 | consumed tokens: 41654681600 | elapsed time per iteration (s): 2.87 | learning rate: 7.433E-05 | global batch size: 256 | lm loss: 1.943436E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 89.349 | TFLOPs: 14.77 | 15: iteration 79460/ 125429 | consumed samples: 20341760 | 
consumed tokens: 41659924480 | elapsed time per iteration (s): 1.09 | learning rate: 7.431E-05 | global batch size: 256 | lm loss: 1.937086E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.663 | TFLOPs: 38.95 | 15: iteration 79470/ 125429 | consumed samples: 20344320 | consumed tokens: 41665167360 | elapsed time per iteration (s): 1.03 | learning rate: 7.429E-05 | global batch size: 256 | lm loss: 1.972910E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.477 | TFLOPs: 40.90 | 15: iteration 79480/ 125429 | consumed samples: 20346880 | consumed tokens: 41670410240 | elapsed time per iteration (s): 1.06 | learning rate: 7.427E-05 | global batch size: 256 | lm loss: 1.919707E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.973 | TFLOPs: 39.99 | 15: iteration 79490/ 125429 | consumed samples: 20349440 | consumed tokens: 41675653120 | elapsed time per iteration (s): 1.04 | learning rate: 7.424E-05 | global batch size: 256 | lm loss: 1.938860E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.284 | TFLOPs: 40.87 | 15: iteration 79500/ 125429 | consumed samples: 20352000 | consumed tokens: 41680896000 | elapsed time per iteration (s): 1.05 | learning rate: 7.422E-05 | global batch size: 256 | lm loss: 1.941951E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.668 | TFLOPs: 40.10 | 15: iteration 79510/ 125429 | consumed samples: 20354560 | consumed tokens: 41686138880 | elapsed time per iteration (s): 1.04 | learning rate: 7.420E-05 | global batch size: 256 | lm loss: 1.911837E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 246.600 | TFLOPs: 40.75 | 15: iteration 79520/ 125429 | consumed samples: 20357120 | consumed tokens: 41691381760 | elapsed time per iteration (s): 1.02 | learning rate: 7.418E-05 | global batch size: 256 | lm loss: 1.954316E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.097 | TFLOPs: 41.50 | 15: iteration 79530/ 125429 | consumed samples: 20359680 | consumed tokens: 41696624640 | elapsed time per iteration (s): 1.02 | learning rate: 7.416E-05 | global batch size: 256 | lm loss: 1.915198E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.829 | TFLOPs: 41.29 | 15: iteration 79540/ 125429 | consumed samples: 20362240 | consumed tokens: 41701867520 | elapsed time per iteration (s): 1.03 | learning rate: 7.414E-05 | global batch size: 256 | lm loss: 1.943262E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.674 | TFLOPs: 41.26 | 15: iteration 79550/ 125429 | consumed samples: 20364800 | consumed tokens: 41707110400 | elapsed time per iteration (s): 1.02 | learning rate: 7.412E-05 | global batch size: 256 | lm loss: 1.949432E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.461 | TFLOPs: 41.39 | 15: iteration 79560/ 125429 | consumed samples: 20367360 | consumed tokens: 41712353280 | elapsed time per iteration (s): 1.04 | learning rate: 7.410E-05 | global batch size: 256 | lm loss: 1.950382E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.069 | TFLOPs: 40.83 | 15: iteration 79570/ 125429 | consumed samples: 20369920 | consumed tokens: 41717596160 | elapsed time per iteration (s): 1.02 | learning rate: 7.408E-05 | global batch size: 256 | lm loss: 
1.945493E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.324 | TFLOPs: 41.37 | 15: iteration 79580/ 125429 | consumed samples: 20372480 | consumed tokens: 41722839040 | elapsed time per iteration (s): 1.06 | learning rate: 7.406E-05 | global batch size: 256 | lm loss: 1.967839E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.396 | TFLOPs: 39.89 | 15: iteration 79590/ 125429 | consumed samples: 20375040 | consumed tokens: 41728081920 | elapsed time per iteration (s): 1.04 | learning rate: 7.404E-05 | global batch size: 256 | lm loss: 1.963395E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.071 | TFLOPs: 40.83 | 15: iteration 79600/ 125429 | consumed samples: 20377600 | consumed tokens: 41733324800 | elapsed time per iteration (s): 1.05 | learning rate: 7.402E-05 | global batch size: 256 | lm loss: 1.944232E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.914 | TFLOPs: 40.14 | 15: iteration 79610/ 125429 | consumed samples: 20380160 | consumed tokens: 41738567680 | elapsed time per iteration (s): 1.03 | learning rate: 7.399E-05 | global batch size: 256 | lm loss: 1.968398E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.411 | TFLOPs: 41.22 | 15: iteration 79620/ 125429 | consumed samples: 20382720 | consumed tokens: 41743810560 | elapsed time per iteration (s): 1.07 | learning rate: 7.397E-05 | global batch size: 256 | lm loss: 1.961515E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.131 | TFLOPs: 39.68 | 15: iteration 79630/ 125429 | consumed samples: 20385280 | consumed tokens: 
41749053440 | elapsed time per iteration (s): 1.03 | learning rate: 7.395E-05 | global batch size: 256 | lm loss: 1.932465E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.713 | TFLOPs: 41.27 | 15: iteration 79640/ 125429 | consumed samples: 20387840 | consumed tokens: 41754296320 | elapsed time per iteration (s): 1.04 | learning rate: 7.393E-05 | global batch size: 256 | lm loss: 1.974022E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.901 | TFLOPs: 40.64 | 15: iteration 79650/ 125429 | consumed samples: 20390400 | consumed tokens: 41759539200 | elapsed time per iteration (s): 1.05 | learning rate: 7.391E-05 | global batch size: 256 | lm loss: 1.944799E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.728 | TFLOPs: 40.44 | 15: iteration 79660/ 125429 | consumed samples: 20392960 | consumed tokens: 41764782080 | elapsed time per iteration (s): 1.03 | learning rate: 7.389E-05 | global batch size: 256 | lm loss: 1.951794E+00 | grad norm: 0.133 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.029 | TFLOPs: 41.15 | 15: iteration 79670/ 125429 | consumed samples: 20395520 | consumed tokens: 41770024960 | elapsed time per iteration (s): 1.13 | learning rate: 7.387E-05 | global batch size: 256 | lm loss: 1.955694E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.928 | TFLOPs: 37.34 | 15: iteration 79680/ 125429 | consumed samples: 20398080 | consumed tokens: 41775267840 | elapsed time per iteration (s): 1.04 | learning rate: 7.385E-05 | global batch size: 256 | lm loss: 1.920039E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 246.263 | TFLOPs: 40.70 | 15: iteration 79690/ 125429 | consumed samples: 20400640 | consumed tokens: 41780510720 | elapsed time per iteration (s): 1.03 | learning rate: 7.383E-05 | global batch size: 256 | lm loss: 1.950684E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.195 | TFLOPs: 41.02 | 15: iteration 79700/ 125429 | consumed samples: 20403200 | consumed tokens: 41785753600 | elapsed time per iteration (s): 1.02 | learning rate: 7.381E-05 | global batch size: 256 | lm loss: 1.948343E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.159 | TFLOPs: 41.51 | 15: iteration 79710/ 125429 | consumed samples: 20405760 | consumed tokens: 41790996480 | elapsed time per iteration (s): 1.03 | learning rate: 7.379E-05 | global batch size: 256 | lm loss: 1.956588E+00 | grad norm: 0.134 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.370 | TFLOPs: 41.21 | 15: iteration 79720/ 125429 | consumed samples: 20408320 | consumed tokens: 41796239360 | elapsed time per iteration (s): 1.04 | learning rate: 7.376E-05 | global batch size: 256 | lm loss: 1.910443E+00 | grad norm: 0.131 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.922 | TFLOPs: 40.81 | 15: iteration 79730/ 125429 | consumed samples: 20410880 | consumed tokens: 41801482240 | elapsed time per iteration (s): 1.05 | learning rate: 7.374E-05 | global batch size: 256 | lm loss: 1.922615E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.321 | TFLOPs: 40.38 | 15: iteration 79740/ 125429 | consumed samples: 20413440 | consumed tokens: 41806725120 | elapsed time per iteration (s): 1.03 | learning rate: 7.372E-05 | global batch size: 256 | lm loss: 1.913883E+00 | grad 
norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.776 | TFLOPs: 41.11 | 15: iteration 79750/ 125429 | consumed samples: 20416000 | consumed tokens: 41811968000 | elapsed time per iteration (s): 1.03 | learning rate: 7.370E-05 | global batch size: 256 | lm loss: 1.961057E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.736 | TFLOPs: 41.27 | 15: iteration 79760/ 125429 | consumed samples: 20418560 | consumed tokens: 41817210880 | elapsed time per iteration (s): 1.03 | learning rate: 7.368E-05 | global batch size: 256 | lm loss: 1.966356E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.375 | TFLOPs: 41.05 | 15: iteration 79770/ 125429 | consumed samples: 20421120 | consumed tokens: 41822453760 | elapsed time per iteration (s): 1.05 | learning rate: 7.366E-05 | global batch size: 256 | lm loss: 1.951341E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.821 | TFLOPs: 40.29 | 15: iteration 79780/ 125429 | consumed samples: 20423680 | consumed tokens: 41827696640 | elapsed time per iteration (s): 1.06 | learning rate: 7.364E-05 | global batch size: 256 | lm loss: 1.943420E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.736 | TFLOPs: 39.78 | 15: iteration 79790/ 125429 | consumed samples: 20426240 | consumed tokens: 41832939520 | elapsed time per iteration (s): 1.06 | learning rate: 7.362E-05 | global batch size: 256 | lm loss: 1.931790E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.064 | TFLOPs: 40.00 | 15: iteration 79800/ 125429 | consumed samples: 20428800 | consumed tokens: 41838182400 | elapsed time 
per iteration (s): 1.03 | learning rate: 7.360E-05 | global batch size: 256 | lm loss: 1.946007E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.789 | TFLOPs: 41.11 | 15: iteration 79810/ 125429 | consumed samples: 20431360 | consumed tokens: 41843425280 | elapsed time per iteration (s): 1.02 | learning rate: 7.358E-05 | global batch size: 256 | lm loss: 1.939183E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.551 | TFLOPs: 41.41 | 15: iteration 79820/ 125429 | consumed samples: 20433920 | consumed tokens: 41848668160 | elapsed time per iteration (s): 1.04 | learning rate: 7.356E-05 | global batch size: 256 | lm loss: 1.957000E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.802 | TFLOPs: 40.79 | 15: iteration 79830/ 125429 | consumed samples: 20436480 | consumed tokens: 41853911040 | elapsed time per iteration (s): 1.03 | learning rate: 7.354E-05 | global batch size: 256 | lm loss: 1.951710E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.397 | TFLOPs: 41.21 | 15: iteration 79840/ 125429 | consumed samples: 20439040 | consumed tokens: 41859153920 | elapsed time per iteration (s): 1.04 | learning rate: 7.352E-05 | global batch size: 256 | lm loss: 1.926803E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.091 | TFLOPs: 40.83 | 15: iteration 79850/ 125429 | consumed samples: 20441600 | consumed tokens: 41864396800 | elapsed time per iteration (s): 1.02 | learning rate: 7.349E-05 | global batch size: 256 | lm loss: 1.910014E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.286 | TFLOPs: 
41.36 | 15: iteration 79860/ 125429 | consumed samples: 20444160 | consumed tokens: 41869639680 | elapsed time per iteration (s): 1.05 | learning rate: 7.347E-05 | global batch size: 256 | lm loss: 1.908780E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.093 | TFLOPs: 40.34 | 15: iteration 79870/ 125429 | consumed samples: 20446720 | consumed tokens: 41874882560 | elapsed time per iteration (s): 1.06 | learning rate: 7.345E-05 | global batch size: 256 | lm loss: 1.946086E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.314 | TFLOPs: 40.04 | 15: iteration 79880/ 125429 | consumed samples: 20449280 | consumed tokens: 41880125440 | elapsed time per iteration (s): 1.02 | learning rate: 7.343E-05 | global batch size: 256 | lm loss: 1.933787E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.827 | TFLOPs: 41.45 | 15: iteration 79890/ 125429 | consumed samples: 20451840 | consumed tokens: 41885368320 | elapsed time per iteration (s): 1.05 | learning rate: 7.341E-05 | global batch size: 256 | lm loss: 1.989536E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.174 | TFLOPs: 40.19 | 15: iteration 79900/ 125429 | consumed samples: 20454400 | consumed tokens: 41890611200 | elapsed time per iteration (s): 1.05 | learning rate: 7.339E-05 | global batch size: 256 | lm loss: 1.925082E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.654 | TFLOPs: 40.43 | 15: iteration 79910/ 125429 | consumed samples: 20456960 | consumed tokens: 41895854080 | elapsed time per iteration (s): 1.05 | learning rate: 7.337E-05 | global batch size: 256 | lm loss: 1.919042E+00 | grad norm: 0.140 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.261 | TFLOPs: 40.37 | 15: iteration 79920/ 125429 | consumed samples: 20459520 | consumed tokens: 41901096960 | elapsed time per iteration (s): 1.04 | learning rate: 7.335E-05 | global batch size: 256 | lm loss: 1.957577E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.473 | TFLOPs: 40.73 | 15: iteration 79930/ 125429 | consumed samples: 20462080 | consumed tokens: 41906339840 | elapsed time per iteration (s): 1.05 | learning rate: 7.333E-05 | global batch size: 256 | lm loss: 1.932589E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.597 | TFLOPs: 40.26 | 15: iteration 79940/ 125429 | consumed samples: 20464640 | consumed tokens: 41911582720 | elapsed time per iteration (s): 1.03 | learning rate: 7.331E-05 | global batch size: 256 | lm loss: 1.918012E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.731 | TFLOPs: 41.27 | 15: iteration 79950/ 125429 | consumed samples: 20467200 | consumed tokens: 41916825600 | elapsed time per iteration (s): 1.03 | learning rate: 7.329E-05 | global batch size: 256 | lm loss: 1.937978E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.854 | TFLOPs: 40.96 | 15: iteration 79960/ 125429 | consumed samples: 20469760 | consumed tokens: 41922068480 | elapsed time per iteration (s): 1.02 | learning rate: 7.327E-05 | global batch size: 256 | lm loss: 1.930145E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.214 | TFLOPs: 41.35 | 15: iteration 79970/ 125429 | consumed samples: 20472320 | consumed tokens: 41927311360 | elapsed time per iteration (s): 1.02 | 
learning rate: 7.324E-05 | global batch size: 256 | lm loss: 1.930995E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.813 | TFLOPs: 41.45 | 15: iteration 79980/ 125429 | consumed samples: 20474880 | consumed tokens: 41932554240 | elapsed time per iteration (s): 1.05 | learning rate: 7.322E-05 | global batch size: 256 | lm loss: 1.949232E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.197 | TFLOPs: 40.36 | 15: iteration 79990/ 125429 | consumed samples: 20477440 | consumed tokens: 41937797120 | elapsed time per iteration (s): 1.03 | learning rate: 7.320E-05 | global batch size: 256 | lm loss: 1.964237E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.970 | TFLOPs: 40.98 | 0: [2022-11-26 19:44:39,041] [INFO] [logging.py:68:log_dist] [Rank 0] step=80000, skipped=0, lr=[7.31822952285044e-05, 7.31822952285044e-05, 7.31822952285044e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 80000/ 125429 | consumed samples: 20480000 | consumed tokens: 41943040000 | elapsed time per iteration (s): 1.04 | learning rate: 7.318E-05 | global batch size: 256 | lm loss: 1.953826E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.769 | TFLOPs: 40.78 | 0: steps: 80000 loss: 1.9303 iter time (s): 1.051 samples/sec: 243.580 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 80000 | lm loss value: 1.898695E+00 | lm loss PPL: 6.677176E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 80000 to checkpoints_1b5 0: [2022-11-26 19:44:39,398] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step80000 is begin to save! 0: [2022-11-26 19:44:39,406] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_01-model_00-model_states.pt... 0: [2022-11-26 19:44:39,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_01-model_00-model_states.pt. 0: [2022-11-26 19:44:39,665] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_03-model_00-model_states.pt... 0: [2022-11-26 19:44:39,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_03-model_00-model_states.pt. 0: [2022-11-26 19:44:39,773] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_04-model_00-model_states.pt... 0: [2022-11-26 19:44:39,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_04-model_00-model_states.pt. 0: [2022-11-26 19:44:39,880] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_05-model_00-model_states.pt... 0: [2022-11-26 19:44:39,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_05-model_00-model_states.pt. 0: [2022-11-26 19:44:39,996] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_06-model_00-model_states.pt... 0: [2022-11-26 19:44:40,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_06-model_00-model_states.pt. 0: [2022-11-26 19:44:40,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_07-model_00-model_states.pt... 0: [2022-11-26 19:44:40,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 19:44:40,221] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_08-model_00-model_states.pt... 0: [2022-11-26 19:44:40,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_08-model_00-model_states.pt. 0: [2022-11-26 19:44:40,331] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_09-model_00-model_states.pt... 0: [2022-11-26 19:44:40,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_09-model_00-model_states.pt. 0: [2022-11-26 19:44:40,441] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_10-model_00-model_states.pt... 0: [2022-11-26 19:44:40,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_10-model_00-model_states.pt. 0: [2022-11-26 19:44:40,553] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_11-model_00-model_states.pt... 0: [2022-11-26 19:44:40,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_11-model_00-model_states.pt. 0: [2022-11-26 19:44:40,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_12-model_00-model_states.pt... 0: [2022-11-26 19:44:40,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_12-model_00-model_states.pt. 0: [2022-11-26 19:44:40,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_13-model_00-model_states.pt... 0: [2022-11-26 19:44:40,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 19:44:40,879] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_14-model_00-model_states.pt... 0: [2022-11-26 19:44:40,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_14-model_00-model_states.pt. 0: [2022-11-26 19:44:40,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_15-model_00-model_states.pt... 0: [2022-11-26 19:44:41,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_15-model_00-model_states.pt. 0: [2022-11-26 19:44:41,100] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_16-model_00-model_states.pt... 0: [2022-11-26 19:44:41,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_16-model_00-model_states.pt. 0: [2022-11-26 19:44:41,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_17-model_00-model_states.pt... 0: [2022-11-26 19:44:41,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_17-model_00-model_states.pt. 0: [2022-11-26 19:44:41,317] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_18-model_00-model_states.pt... 0: [2022-11-26 19:44:41,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_18-model_00-model_states.pt. 0: [2022-11-26 19:44:41,428] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_19-model_00-model_states.pt... 0: [2022-11-26 19:44:41,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 19:44:41,538] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_20-model_00-model_states.pt... 0: [2022-11-26 19:44:41,646] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_20-model_00-model_states.pt. 0: [2022-11-26 19:44:41,646] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_21-model_00-model_states.pt... 0: [2022-11-26 19:44:41,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_21-model_00-model_states.pt. 0: [2022-11-26 19:44:41,755] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_22-model_00-model_states.pt... 0: [2022-11-26 19:44:41,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_22-model_00-model_states.pt. 0: [2022-11-26 19:44:41,863] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_23-model_00-model_states.pt... 0: [2022-11-26 19:44:41,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_23-model_00-model_states.pt. 0: [2022-11-26 19:44:41,971] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_24-model_00-model_states.pt... 0: [2022-11-26 19:44:42,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_24-model_00-model_states.pt. 0: [2022-11-26 19:44:42,079] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_25-model_00-model_states.pt... 0: [2022-11-26 19:44:42,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 19:44:42,186] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_26-model_00-model_states.pt... 0: [2022-11-26 19:44:42,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_26-model_00-model_states.pt. 0: [2022-11-26 19:44:42,294] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_27-model_00-model_states.pt... 0: [2022-11-26 19:44:42,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_27-model_00-model_states.pt. 0: [2022-11-26 19:44:42,402] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_28-model_00-model_states.pt... 0: [2022-11-26 19:44:42,511] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_28-model_00-model_states.pt. 0: [2022-11-26 19:44:42,511] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_29-model_00-model_states.pt... 0: [2022-11-26 19:44:42,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_29-model_00-model_states.pt. 0: [2022-11-26 19:44:42,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_30-model_00-model_states.pt... 0: [2022-11-26 19:44:42,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_30-model_00-model_states.pt. 0: [2022-11-26 19:44:42,722] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/layer_32-model_00-model_states.pt... 0: [2022-11-26 19:44:42,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 19:44:42,731] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step80000/mp_rank_00_model_states.pt 0: [2022-11-26 19:44:42,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/mp_rank_00_model_states.pt... 0: [2022-11-26 19:44:42,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/mp_rank_00_model_states.pt. 0: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
12: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
5: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
9: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
6: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
2: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
1: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
13: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 14: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
2: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 19:44:42,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step80000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 0: [2022-11-26 19:44:42,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:44:42,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 19:44:42,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 19:44:42,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 1: [2022-11-26 19:44:42,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:44:42,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 19:44:42,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 0: [2022-11-26 19:44:42,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:44:42,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 19:44:42,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 3: [2022-11-26 19:44:42,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:44:42,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 19:44:42,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 3: [2022-11-26 19:44:42,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 19:44:42,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 19:44:42,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 10: [2022-11-26 19:44:42,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:44:42,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 19:44:42,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 10: [2022-11-26 19:44:42,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:44:42,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:44:42,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 0: [2022-11-26 19:44:42,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 10: [2022-11-26 19:44:42,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 0: [2022-11-26 19:44:42,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 15: [2022-11-26 19:44:42,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 19:44:42,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 19:44:42,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 14: [2022-11-26 19:44:42,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:44:42,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 19:44:42,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 14: [2022-11-26 19:44:42,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:44:42,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 19:44:42,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 5: [2022-11-26 19:44:42,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:44:42,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 19:44:42,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 2: [2022-11-26 19:44:42,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 19:44:42,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 19:44:42,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 15: [2022-11-26 19:44:42,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:44:42,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 19:44:42,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 1: [2022-11-26 19:44:42,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:44:42,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 19:44:42,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 1: [2022-11-26 19:44:42,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:44:42,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:44:42,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 19:44:42,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 19:44:42,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 19:44:42,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 19:44:42,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 1: [2022-11-26 19:44:42,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 8: [2022-11-26 19:44:42,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:44:42,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 5: [2022-11-26 19:44:42,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:44:42,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 5: [2022-11-26 19:44:42,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 8: [2022-11-26 19:44:42,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 5: [2022-11-26 19:44:42,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
3: [2022-11-26 19:44:42,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:44:42,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 5: [2022-11-26 19:44:42,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:44:42,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 5: [2022-11-26 19:44:42,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 19:44:42,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 5: [2022-11-26 19:44:42,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:44:42,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 19:44:42,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 2: [2022-11-26 19:44:42,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:44:42,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 19:44:42,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
15: [2022-11-26 19:44:42,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:44:42,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 19:44:42,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 4: [2022-11-26 19:44:42,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:44:42,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 19:44:42,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 4: [2022-11-26 19:44:42,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:44:42,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 19:44:42,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 4: [2022-11-26 19:44:42,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:44:42,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 19:44:42,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
4: [2022-11-26 19:44:42,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:44:42,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 19:44:42,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 4: [2022-11-26 19:44:42,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:44:42,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 19:44:42,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 10: [2022-11-26 19:44:42,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:44:42,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 19:44:42,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 15: [2022-11-26 19:44:42,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:44:42,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 19:44:42,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
10: [2022-11-26 19:44:42,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:44:42,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 19:44:42,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 0: [2022-11-26 19:44:42,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:44:42,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 19:44:42,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 15: [2022-11-26 19:44:42,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:44:42,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 19:44:42,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 5: [2022-11-26 19:44:42,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:44:42,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 19:44:42,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
2: [2022-11-26 19:44:42,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:44:42,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 19:44:42,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 10: [2022-11-26 19:44:42,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:44:42,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 19:44:42,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 2: [2022-11-26 19:44:42,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:44:42,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 19:44:42,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 10: [2022-11-26 19:44:42,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:44:42,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 19:44:42,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
2: [2022-11-26 19:44:42,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:44:42,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 19:44:42,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 5: [2022-11-26 19:44:42,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:44:42,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 19:44:42,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 1: [2022-11-26 19:44:42,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 19:44:42,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 19:44:42,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 10: [2022-11-26 19:44:42,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:44:42,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 19:44:42,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
2: [2022-11-26 19:44:42,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:44:42,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 19:44:42,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 2: [2022-11-26 19:44:42,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:44:42,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 19:44:42,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 7: [2022-11-26 19:44:42,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:44:42,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:44:42,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 19:44:42,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 19:44:42,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 7: [2022-11-26 19:44:42,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
8: [2022-11-26 19:44:42,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:44:42,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:44:42,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 19:44:42,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 8: [2022-11-26 19:44:42,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:44:42,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 19:44:42,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:44:42,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 8: [2022-11-26 19:44:42,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 19:44:42,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 19:44:42,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 8: [2022-11-26 19:44:42,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
3: [2022-11-26 19:44:42,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:44:42,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:44:42,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:44:42,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 19:44:42,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 19:44:42,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 19:44:42,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 3: [2022-11-26 19:44:42,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 3: [2022-11-26 19:44:42,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 3: [2022-11-26 19:44:42,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:44:42,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 19:44:42,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
2: [2022-11-26 19:44:42,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:44:42,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:44:42,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 2: [2022-11-26 19:44:42,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 7: [2022-11-26 19:44:42,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 2: [2022-11-26 19:44:42,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 7: [2022-11-26 19:44:42,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 19:44:42,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 7: [2022-11-26 19:44:42,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 7: [2022-11-26 19:44:42,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:44:42,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 19:44:42,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
5: [2022-11-26 19:44:42,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:44:42,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:44:42,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 7: [2022-11-26 19:44:42,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 5: [2022-11-26 19:44:42,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 7: [2022-11-26 19:44:42,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 7: [2022-11-26 19:44:42,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:44:42,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 19:44:42,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 14: [2022-11-26 19:44:42,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:44:42,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 19:44:42,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
14: [2022-11-26 19:44:42,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:44:42,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 19:44:42,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 14: [2022-11-26 19:44:42,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:44:42,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 19:44:42,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 15: [2022-11-26 19:44:42,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:44:42,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:44:42,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 19:44:42,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 0: [2022-11-26 19:44:42,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 19:44:42,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 19:44:42,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 1: [2022-11-26 19:44:42,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:44:42,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:44:42,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 19:44:42,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 0: [2022-11-26 19:44:42,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:44:42,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 19:44:42,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 10: [2022-11-26 19:44:42,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 19:44:42,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 19:44:42,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
13: [2022-11-26 19:44:42,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:44:42,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:44:42,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:44:42,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:44:42,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:44:42,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 19:44:42,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 19:44:42,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 19:44:42,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 19:44:42,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 19:44:42,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 19:44:42,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 19:44:42,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 13: [2022-11-26 19:44:42,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 13: [2022-11-26 19:44:42,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 13: [2022-11-26 19:44:42,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 13: [2022-11-26 19:44:42,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 13: [2022-11-26 19:44:42,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
13: [2022-11-26 19:44:42,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:44:42,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 19:44:42,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 19:44:42,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 19:44:42,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 13: [2022-11-26 19:44:42,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 14: [2022-11-26 19:44:42,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 19:44:42,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 19:44:42,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 3: [2022-11-26 19:44:42,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 19:44:42,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 19:44:42,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
1: [2022-11-26 19:44:42,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 19:44:42,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 0: [2022-11-26 19:44:42,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 19:44:42,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 19:44:42,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 15: [2022-11-26 19:44:42,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 19:44:42,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 15: [2022-11-26 19:44:42,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 19:44:42,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 19:44:42,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 15: [2022-11-26 19:44:42,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-26 19:44:42,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 19:44:42,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 7: [2022-11-26 19:44:42,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 19:44:42,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 19:44:42,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 8: [2022-11-26 19:44:42,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 19:44:42,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 19:44:42,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 12: [2022-11-26 19:44:42,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:44:42,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 19:44:42,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 12: [2022-11-26 19:44:42,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 19:44:42,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 19:44:42,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 1: [2022-11-26 19:44:42,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:44:42,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:44:42,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 19:44:42,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 12: [2022-11-26 19:44:42,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:44:42,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 19:44:42,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 8: [2022-11-26 19:44:42,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 19:44:42,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 1: [2022-11-26 19:44:42,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 19:44:42,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 8: [2022-11-26 19:44:42,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 4: [2022-11-26 19:44:42,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:44:42,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 19:44:42,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 4: [2022-11-26 19:44:42,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:44:42,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 19:44:42,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 12: [2022-11-26 19:44:42,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:44:42,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 19:44:42,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:44:42,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 19:44:42,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 19:44:42,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 19:44:42,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 19:44:42,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 12: [2022-11-26 19:44:42,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 12: [2022-11-26 19:44:42,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 12: [2022-11-26 19:44:42,982] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 19:44:42,982] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 14: [2022-11-26 19:44:42,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 19:44:42,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 19:44:42,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 5: [2022-11-26 19:44:42,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 19:44:42,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 19:44:42,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 6: [2022-11-26 19:44:43,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:44:43,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:44:43,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:44:43,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:44:43,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:44:43,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:44:43,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 19:44:43,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 19:44:43,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 19:44:43,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 19:44:43,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 19:44:43,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 19:44:43,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 19:44:43,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 19:44:43,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 6: [2022-11-26 19:44:43,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 19:44:43,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 19:44:43,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
6: [2022-11-26 19:44:43,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 6: [2022-11-26 19:44:43,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 6: [2022-11-26 19:44:43,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 6: [2022-11-26 19:44:43,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 6: [2022-11-26 19:44:43,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 6: [2022-11-26 19:44:43,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 4: [2022-11-26 19:44:42,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 19:44:42,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 19:44:42,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 11: [2022-11-26 19:44:43,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:44:43,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 19:44:43,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:44:43,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
11: [2022-11-26 19:44:43,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 19:44:43,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:44:43,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 11: [2022-11-26 19:44:43,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:44:43,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:44:43,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:44:43,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:44:43,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 19:44:43,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 19:44:43,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 19:44:43,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
11: [2022-11-26 19:44:43,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 19:44:43,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 19:44:43,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 19:44:43,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 19:44:43,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 11: [2022-11-26 19:44:43,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 11: [2022-11-26 19:44:43,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 11: [2022-11-26 19:44:43,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 11: [2022-11-26 19:44:43,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 9: [2022-11-26 19:44:43,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:44:43,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:44:43,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 19:44:43,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:44:43,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:44:43,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:44:43,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 19:44:43,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 19:44:43,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 19:44:43,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 19:44:43,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 19:44:43,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 19:44:43,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 19:44:43,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 19:44:43,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 19:44:43,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 9: [2022-11-26 19:44:43,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 9: [2022-11-26 19:44:43,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 9: [2022-11-26 19:44:43,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 9: [2022-11-26 19:44:43,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 9: [2022-11-26 19:44:43,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 9: [2022-11-26 19:44:43,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 9: [2022-11-26 19:44:43,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 19:44:43,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now! 
0: [2022-11-26 19:44:43,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step80000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
0: [2022-11-26 19:44:43,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step80000 is ready now!
0: successfully saved checkpoint at iteration 80000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3691.59
15: iteration 80010/ 125429 | consumed samples: 20482560 | consumed tokens: 41948282880 | elapsed time per iteration (s): 1.43 | learning rate: 7.316E-05 | global batch size: 256 | lm loss: 1.945444E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 178.501 | TFLOPs: 29.50 |
15: iteration 80020/ 125429 | consumed samples: 20485120 | consumed tokens: 41953525760 | elapsed time per iteration (s): 1.04 | learning rate: 7.314E-05 | global batch size: 256 | lm loss: 1.933138E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.245 | TFLOPs: 40.86 |
15: iteration 80030/ 125429 | consumed samples: 20487680 | consumed tokens: 41958768640 | elapsed time per iteration (s): 1.04 | learning rate: 7.312E-05 | global batch size: 256 | lm loss: 1.953107E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.627 | TFLOPs: 40.76 |
15: iteration 80040/ 125429 | consumed samples: 20490240 | consumed tokens: 41964011520 | elapsed time per iteration (s): 1.02 | learning rate: 7.310E-05 | global batch size: 256 | lm loss: 1.957571E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.903 | TFLOPs: 41.30 |
15: iteration 80050/ 125429 | consumed samples: 20492800 | consumed tokens: 41969254400 | elapsed time per iteration (s): 1.05 | learning rate: 7.308E-05 | global batch size: 256 | lm loss: 1.950507E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.391 | TFLOPs: 40.22 |
15: iteration 80060/ 125429 | consumed samples: 20495360 | consumed tokens: 41974497280 | elapsed time per iteration (s): 1.07 | learning rate: 7.306E-05 | global batch size: 256 | lm loss: 1.932902E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.156 | TFLOPs: 39.52 |
15: iteration 80070/ 125429 | consumed samples: 20497920 | consumed tokens: 41979740160 | elapsed time per iteration (s): 1.03 | learning rate: 7.304E-05 | global batch size: 256 | lm loss: 1.944414E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.747 | TFLOPs: 41.11 |
15: iteration 80080/ 125429 | consumed samples: 20500480 | consumed tokens: 41984983040 | elapsed time per iteration (s): 1.06 | learning rate: 7.302E-05 | global batch size: 256 | lm loss: 1.945027E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.839 | TFLOPs: 39.97 |
15: iteration 80090/ 125429 | consumed samples: 20503040 | consumed tokens: 41990225920 | elapsed time per iteration (s): 1.16 | learning rate: 7.300E-05 | global batch size: 256 | lm loss: 1.941718E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 220.595 | TFLOPs: 36.46 |
15: iteration 80100/ 125429 | consumed samples: 20505600 | consumed tokens: 41995468800 | elapsed time per iteration (s): 1.03 | learning rate: 7.297E-05 | global batch size: 256 | lm loss: 1.942287E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.310 | TFLOPs: 41.04 |
15: iteration 80110/ 125429 | consumed samples: 20508160 | consumed tokens: 42000711680 | elapsed time per iteration (s): 1.04 | learning rate: 7.295E-05 | global batch size: 256 | lm loss: 1.944216E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.524 | TFLOPs: 40.57 |
15: iteration 80120/ 125429 | consumed samples: 20510720 | consumed tokens: 42005954560 | elapsed time per iteration (s): 1.03 | learning rate: 7.293E-05 | global batch size: 256 | lm loss: 1.952264E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.530 | TFLOPs: 41.24 |
15: iteration 80130/ 125429 | consumed samples: 20513280 | consumed tokens: 42011197440 | elapsed time per iteration (s): 1.03 | learning rate: 7.291E-05 | global batch size: 256 | lm loss: 1.964135E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.568 | TFLOPs: 40.91 |
15: iteration 80140/ 125429 | consumed samples: 20515840 | consumed tokens: 42016440320 | elapsed time per iteration (s): 1.03 | learning rate: 7.289E-05 | global batch size: 256 | lm loss: 1.919791E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.004 | TFLOPs: 40.98 |
15: iteration 80150/ 125429 | consumed samples: 20518400 | consumed tokens: 42021683200 | elapsed time per iteration (s): 1.04 | learning rate: 7.287E-05 | global batch size: 256 | lm loss: 1.928842E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.858 | TFLOPs: 40.80 |
15: iteration 80160/ 125429 | consumed samples: 20520960 | consumed tokens: 42026926080 | elapsed time per iteration (s): 1.04 | learning rate: 7.285E-05 | global batch size: 256 | lm loss: 1.948279E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples
per second: 246.480 | TFLOPs: 40.73 | 15: iteration 80170/ 125429 | consumed samples: 20523520 | consumed tokens: 42032168960 | elapsed time per iteration (s): 1.04 | learning rate: 7.283E-05 | global batch size: 256 | lm loss: 1.959120E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.232 | TFLOPs: 40.69 | 15: iteration 80180/ 125429 | consumed samples: 20526080 | consumed tokens: 42037411840 | elapsed time per iteration (s): 1.02 | learning rate: 7.281E-05 | global batch size: 256 | lm loss: 1.927807E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.295 | TFLOPs: 41.53 | 15: iteration 80190/ 125429 | consumed samples: 20528640 | consumed tokens: 42042654720 | elapsed time per iteration (s): 1.04 | learning rate: 7.279E-05 | global batch size: 256 | lm loss: 1.923710E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.213 | TFLOPs: 40.85 | 15: iteration 80200/ 125429 | consumed samples: 20531200 | consumed tokens: 42047897600 | elapsed time per iteration (s): 1.06 | learning rate: 7.277E-05 | global batch size: 256 | lm loss: 1.949646E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.366 | TFLOPs: 40.05 | 15: iteration 80210/ 125429 | consumed samples: 20533760 | consumed tokens: 42053140480 | elapsed time per iteration (s): 1.03 | learning rate: 7.275E-05 | global batch size: 256 | lm loss: 1.930470E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.336 | TFLOPs: 41.04 | 15: iteration 80220/ 125429 | consumed samples: 20536320 | consumed tokens: 42058383360 | elapsed time per iteration (s): 1.02 | learning rate: 7.273E-05 | global batch size: 256 | lm loss: 1.946511E+00 | 
grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.915 | TFLOPs: 41.30 | 15: iteration 80230/ 125429 | consumed samples: 20538880 | consumed tokens: 42063626240 | elapsed time per iteration (s): 1.04 | learning rate: 7.271E-05 | global batch size: 256 | lm loss: 1.940608E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.927 | TFLOPs: 40.64 | 15: iteration 80240/ 125429 | consumed samples: 20541440 | consumed tokens: 42068869120 | elapsed time per iteration (s): 1.05 | learning rate: 7.268E-05 | global batch size: 256 | lm loss: 1.917575E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.132 | TFLOPs: 40.34 | 15: iteration 80250/ 125429 | consumed samples: 20544000 | consumed tokens: 42074112000 | elapsed time per iteration (s): 1.04 | learning rate: 7.266E-05 | global batch size: 256 | lm loss: 1.947679E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.581 | TFLOPs: 40.75 | 15: iteration 80260/ 125429 | consumed samples: 20546560 | consumed tokens: 42079354880 | elapsed time per iteration (s): 1.05 | learning rate: 7.264E-05 | global batch size: 256 | lm loss: 1.962563E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.757 | TFLOPs: 40.45 | 15: iteration 80270/ 125429 | consumed samples: 20549120 | consumed tokens: 42084597760 | elapsed time per iteration (s): 1.06 | learning rate: 7.262E-05 | global batch size: 256 | lm loss: 1.926992E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.726 | TFLOPs: 39.95 | 15: iteration 80280/ 125429 | consumed samples: 20551680 | consumed tokens: 42089840640 | elapsed 
time per iteration (s): 1.04 | learning rate: 7.260E-05 | global batch size: 256 | lm loss: 1.944539E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.274 | TFLOPs: 40.70 | 15: iteration 80290/ 125429 | consumed samples: 20554240 | consumed tokens: 42095083520 | elapsed time per iteration (s): 1.03 | learning rate: 7.258E-05 | global batch size: 256 | lm loss: 1.941094E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.896 | TFLOPs: 41.13 | 15: iteration 80300/ 125429 | consumed samples: 20556800 | consumed tokens: 42100326400 | elapsed time per iteration (s): 1.03 | learning rate: 7.256E-05 | global batch size: 256 | lm loss: 1.968673E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.911 | TFLOPs: 40.97 | 15: iteration 80310/ 125429 | consumed samples: 20559360 | consumed tokens: 42105569280 | elapsed time per iteration (s): 1.04 | learning rate: 7.254E-05 | global batch size: 256 | lm loss: 1.932987E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.215 | TFLOPs: 40.52 | 15: iteration 80320/ 125429 | consumed samples: 20561920 | consumed tokens: 42110812160 | elapsed time per iteration (s): 1.03 | learning rate: 7.252E-05 | global batch size: 256 | lm loss: 1.935018E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.664 | TFLOPs: 40.93 | 15: iteration 80330/ 125429 | consumed samples: 20564480 | consumed tokens: 42116055040 | elapsed time per iteration (s): 1.03 | learning rate: 7.250E-05 | global batch size: 256 | lm loss: 1.915672E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.494 | TFLOPs: 
40.90 | 15: iteration 80340/ 125429 | consumed samples: 20567040 | consumed tokens: 42121297920 | elapsed time per iteration (s): 1.05 | learning rate: 7.248E-05 | global batch size: 256 | lm loss: 1.943233E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.090 | TFLOPs: 40.34 | 15: iteration 80350/ 125429 | consumed samples: 20569600 | consumed tokens: 42126540800 | elapsed time per iteration (s): 1.04 | learning rate: 7.246E-05 | global batch size: 256 | lm loss: 1.954893E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.506 | TFLOPs: 40.74 | 15: iteration 80360/ 125429 | consumed samples: 20572160 | consumed tokens: 42131783680 | elapsed time per iteration (s): 1.07 | learning rate: 7.244E-05 | global batch size: 256 | lm loss: 1.950027E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.805 | TFLOPs: 39.63 | 15: iteration 80370/ 125429 | consumed samples: 20574720 | consumed tokens: 42137026560 | elapsed time per iteration (s): 1.03 | learning rate: 7.242E-05 | global batch size: 256 | lm loss: 1.940925E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.019 | TFLOPs: 40.99 | 15: iteration 80380/ 125429 | consumed samples: 20577280 | consumed tokens: 42142269440 | elapsed time per iteration (s): 1.04 | learning rate: 7.239E-05 | global batch size: 256 | lm loss: 1.945919E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.626 | TFLOPs: 40.76 | 15: iteration 80390/ 125429 | consumed samples: 20579840 | consumed tokens: 42147512320 | elapsed time per iteration (s): 1.03 | learning rate: 7.237E-05 | global batch size: 256 | lm loss: 1.943801E+00 | grad norm: 0.141 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.800 | TFLOPs: 41.12 | 15: iteration 80400/ 125429 | consumed samples: 20582400 | consumed tokens: 42152755200 | elapsed time per iteration (s): 1.03 | learning rate: 7.235E-05 | global batch size: 256 | lm loss: 1.918643E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.385 | TFLOPs: 40.88 | 15: iteration 80410/ 125429 | consumed samples: 20584960 | consumed tokens: 42157998080 | elapsed time per iteration (s): 1.04 | learning rate: 7.233E-05 | global batch size: 256 | lm loss: 1.941940E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.731 | TFLOPs: 40.77 | 15: iteration 80420/ 125429 | consumed samples: 20587520 | consumed tokens: 42163240960 | elapsed time per iteration (s): 1.06 | learning rate: 7.231E-05 | global batch size: 256 | lm loss: 1.945865E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.355 | TFLOPs: 40.05 | 15: iteration 80430/ 125429 | consumed samples: 20590080 | consumed tokens: 42168483840 | elapsed time per iteration (s): 1.03 | learning rate: 7.229E-05 | global batch size: 256 | lm loss: 1.965702E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.387 | TFLOPs: 40.88 | 15: iteration 80440/ 125429 | consumed samples: 20592640 | consumed tokens: 42173726720 | elapsed time per iteration (s): 1.24 | learning rate: 7.227E-05 | global batch size: 256 | lm loss: 1.937042E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 206.520 | TFLOPs: 34.13 | 15: iteration 80450/ 125429 | consumed samples: 20595200 | consumed tokens: 42178969600 | elapsed time per iteration (s): 1.08 | 
learning rate: 7.225E-05 | global batch size: 256 | lm loss: 1.917344E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.706 | TFLOPs: 39.12 | 15: iteration 80460/ 125429 | consumed samples: 20597760 | consumed tokens: 42184212480 | elapsed time per iteration (s): 1.05 | learning rate: 7.223E-05 | global batch size: 256 | lm loss: 1.929232E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.331 | TFLOPs: 40.38 | 15: iteration 80470/ 125429 | consumed samples: 20600320 | consumed tokens: 42189455360 | elapsed time per iteration (s): 1.03 | learning rate: 7.221E-05 | global batch size: 256 | lm loss: 1.935538E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.073 | TFLOPs: 41.00 | 15: iteration 80480/ 125429 | consumed samples: 20602880 | consumed tokens: 42194698240 | elapsed time per iteration (s): 1.05 | learning rate: 7.219E-05 | global batch size: 256 | lm loss: 1.924138E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.972 | TFLOPs: 40.15 | 15: iteration 80490/ 125429 | consumed samples: 20605440 | consumed tokens: 42199941120 | elapsed time per iteration (s): 1.06 | learning rate: 7.217E-05 | global batch size: 256 | lm loss: 1.923402E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.446 | TFLOPs: 39.74 | 15: iteration 80500/ 125429 | consumed samples: 20608000 | consumed tokens: 42205184000 | elapsed time per iteration (s): 1.04 | learning rate: 7.215E-05 | global batch size: 256 | lm loss: 1.929335E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.099 | TFLOPs: 40.50 | 15: iteration 80510/ 
125429 | consumed samples: 20610560 | consumed tokens: 42210426880 | elapsed time per iteration (s): 1.04 | learning rate: 7.213E-05 | global batch size: 256 | lm loss: 1.951300E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.997 | TFLOPs: 40.49 | 15: iteration 80520/ 125429 | consumed samples: 20613120 | consumed tokens: 42215669760 | elapsed time per iteration (s): 1.02 | learning rate: 7.211E-05 | global batch size: 256 | lm loss: 1.959981E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.534 | TFLOPs: 41.57 | 15: iteration 80530/ 125429 | consumed samples: 20615680 | consumed tokens: 42220912640 | elapsed time per iteration (s): 1.03 | learning rate: 7.208E-05 | global batch size: 256 | lm loss: 1.924224E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.059 | TFLOPs: 41.16 | 15: iteration 80540/ 125429 | consumed samples: 20618240 | consumed tokens: 42226155520 | elapsed time per iteration (s): 1.04 | learning rate: 7.206E-05 | global batch size: 256 | lm loss: 1.943426E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.347 | TFLOPs: 40.55 | 15: iteration 80550/ 125429 | consumed samples: 20620800 | consumed tokens: 42231398400 | elapsed time per iteration (s): 1.06 | learning rate: 7.204E-05 | global batch size: 256 | lm loss: 1.957564E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.973 | TFLOPs: 39.99 | 15: iteration 80560/ 125429 | consumed samples: 20623360 | consumed tokens: 42236641280 | elapsed time per iteration (s): 1.05 | learning rate: 7.202E-05 | global batch size: 256 | lm loss: 1.946798E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 244.898 | TFLOPs: 40.47 | 15: iteration 80570/ 125429 | consumed samples: 20625920 | consumed tokens: 42241884160 | elapsed time per iteration (s): 1.03 | learning rate: 7.200E-05 | global batch size: 256 | lm loss: 1.962894E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.980 | TFLOPs: 40.98 | 15: iteration 80580/ 125429 | consumed samples: 20628480 | consumed tokens: 42247127040 | elapsed time per iteration (s): 1.07 | learning rate: 7.198E-05 | global batch size: 256 | lm loss: 1.951523E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.673 | TFLOPs: 39.61 | 15: iteration 80590/ 125429 | consumed samples: 20631040 | consumed tokens: 42252369920 | elapsed time per iteration (s): 1.03 | learning rate: 7.196E-05 | global batch size: 256 | lm loss: 1.959144E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.316 | TFLOPs: 41.04 | 15: iteration 80600/ 125429 | consumed samples: 20633600 | consumed tokens: 42257612800 | elapsed time per iteration (s): 1.05 | learning rate: 7.194E-05 | global batch size: 256 | lm loss: 1.955809E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.278 | TFLOPs: 40.37 | 15: iteration 80610/ 125429 | consumed samples: 20636160 | consumed tokens: 42262855680 | elapsed time per iteration (s): 1.05 | learning rate: 7.192E-05 | global batch size: 256 | lm loss: 1.932188E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.531 | TFLOPs: 40.41 | 15: iteration 80620/ 125429 | consumed samples: 20638720 | consumed tokens: 42268098560 | elapsed time per iteration (s): 1.07 | learning rate: 
7.190E-05 | global batch size: 256 | lm loss: 1.944836E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.774 | TFLOPs: 39.46 | 15: iteration 80630/ 125429 | consumed samples: 20641280 | consumed tokens: 42273341440 | elapsed time per iteration (s): 1.02 | learning rate: 7.188E-05 | global batch size: 256 | lm loss: 1.945242E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.536 | TFLOPs: 41.40 | 15: iteration 80640/ 125429 | consumed samples: 20643840 | consumed tokens: 42278584320 | elapsed time per iteration (s): 1.05 | learning rate: 7.186E-05 | global batch size: 256 | lm loss: 1.954215E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.470 | TFLOPs: 40.24 | 15: iteration 80650/ 125429 | consumed samples: 20646400 | consumed tokens: 42283827200 | elapsed time per iteration (s): 1.06 | learning rate: 7.184E-05 | global batch size: 256 | lm loss: 1.954363E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.966 | TFLOPs: 39.99 | 15: iteration 80660/ 125429 | consumed samples: 20648960 | consumed tokens: 42289070080 | elapsed time per iteration (s): 1.05 | learning rate: 7.182E-05 | global batch size: 256 | lm loss: 1.945871E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.449 | TFLOPs: 40.40 | 15: iteration 80670/ 125429 | consumed samples: 20651520 | consumed tokens: 42294312960 | elapsed time per iteration (s): 1.05 | learning rate: 7.180E-05 | global batch size: 256 | lm loss: 1.955237E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.672 | TFLOPs: 40.27 | 15: iteration 80680/ 125429 | 
consumed samples: 20654080 | consumed tokens: 42299555840 | elapsed time per iteration (s): 1.05 | learning rate: 7.177E-05 | global batch size: 256 | lm loss: 1.936766E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.566 | TFLOPs: 40.42 | 15: iteration 80690/ 125429 | consumed samples: 20656640 | consumed tokens: 42304798720 | elapsed time per iteration (s): 1.06 | learning rate: 7.175E-05 | global batch size: 256 | lm loss: 1.976791E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.734 | TFLOPs: 39.95 | 15: iteration 80700/ 125429 | consumed samples: 20659200 | consumed tokens: 42310041600 | elapsed time per iteration (s): 1.02 | learning rate: 7.173E-05 | global batch size: 256 | lm loss: 1.961241E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.994 | TFLOPs: 41.48 | 15: iteration 80710/ 125429 | consumed samples: 20661760 | consumed tokens: 42315284480 | elapsed time per iteration (s): 1.04 | learning rate: 7.171E-05 | global batch size: 256 | lm loss: 1.960095E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.011 | TFLOPs: 40.66 | 15: iteration 80720/ 125429 | consumed samples: 20664320 | consumed tokens: 42320527360 | elapsed time per iteration (s): 1.02 | learning rate: 7.169E-05 | global batch size: 256 | lm loss: 1.941650E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.234 | TFLOPs: 41.35 | 15: iteration 80730/ 125429 | consumed samples: 20666880 | consumed tokens: 42325770240 | elapsed time per iteration (s): 1.03 | learning rate: 7.167E-05 | global batch size: 256 | lm loss: 1.964564E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 249.296 | TFLOPs: 41.20 | 15: iteration 80740/ 125429 | consumed samples: 20669440 | consumed tokens: 42331013120 | elapsed time per iteration (s): 1.03 | learning rate: 7.165E-05 | global batch size: 256 | lm loss: 1.948679E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.489 | TFLOPs: 41.06 | 15: iteration 80750/ 125429 | consumed samples: 20672000 | consumed tokens: 42336256000 | elapsed time per iteration (s): 1.15 | learning rate: 7.163E-05 | global batch size: 256 | lm loss: 1.949878E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.151 | TFLOPs: 36.71 | 15: iteration 80760/ 125429 | consumed samples: 20674560 | consumed tokens: 42341498880 | elapsed time per iteration (s): 1.03 | learning rate: 7.161E-05 | global batch size: 256 | lm loss: 1.937243E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.392 | TFLOPs: 40.88 | 15: iteration 80770/ 125429 | consumed samples: 20677120 | consumed tokens: 42346741760 | elapsed time per iteration (s): 1.04 | learning rate: 7.159E-05 | global batch size: 256 | lm loss: 1.976292E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.618 | TFLOPs: 40.59 | 15: iteration 80780/ 125429 | consumed samples: 20679680 | consumed tokens: 42351984640 | elapsed time per iteration (s): 1.17 | learning rate: 7.157E-05 | global batch size: 256 | lm loss: 1.947480E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 219.139 | TFLOPs: 36.21 | 15: iteration 80790/ 125429 | consumed samples: 20682240 | consumed tokens: 42357227520 | elapsed time per iteration (s): 1.06 | learning rate: 7.155E-05 | global batch 
size: 256 | lm loss: 1.935532E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.701 | TFLOPs: 39.78 | 15: iteration 80800/ 125429 | consumed samples: 20684800 | consumed tokens: 42362470400 | elapsed time per iteration (s): 1.05 | learning rate: 7.153E-05 | global batch size: 256 | lm loss: 1.904835E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.110 | TFLOPs: 40.34 | 15: iteration 80810/ 125429 | consumed samples: 20687360 | consumed tokens: 42367713280 | elapsed time per iteration (s): 1.03 | learning rate: 7.151E-05 | global batch size: 256 | lm loss: 1.937224E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.626 | TFLOPs: 40.92 | 15: iteration 80820/ 125429 | consumed samples: 20689920 | consumed tokens: 42372956160 | elapsed time per iteration (s): 1.13 | learning rate: 7.149E-05 | global batch size: 256 | lm loss: 1.922937E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.299 | TFLOPs: 37.40 | 15: iteration 80830/ 125429 | consumed samples: 20692480 | consumed tokens: 42378199040 | elapsed time per iteration (s): 1.04 | learning rate: 7.147E-05 | global batch size: 256 | lm loss: 1.926950E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.134 | TFLOPs: 40.68 | 15: iteration 80840/ 125429 | consumed samples: 20695040 | consumed tokens: 42383441920 | elapsed time per iteration (s): 1.05 | learning rate: 7.145E-05 | global batch size: 256 | lm loss: 1.921903E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.740 | TFLOPs: 40.28 | 15: iteration 80850/ 125429 | consumed samples: 20697600 | 
consumed tokens: 42388684800 | elapsed time per iteration (s): 1.05 | learning rate: 7.142E-05 | global batch size: 256 | lm loss: 1.965871E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.472 | TFLOPs: 40.24 | 15: iteration 80860/ 125429 | consumed samples: 20700160 | consumed tokens: 42393927680 | elapsed time per iteration (s): 1.03 | learning rate: 7.140E-05 | global batch size: 256 | lm loss: 1.942408E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.338 | TFLOPs: 41.21 | 15: iteration 80870/ 125429 | consumed samples: 20702720 | consumed tokens: 42399170560 | elapsed time per iteration (s): 1.04 | learning rate: 7.138E-05 | global batch size: 256 | lm loss: 1.966044E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.165 | TFLOPs: 40.52 | 15: iteration 80880/ 125429 | consumed samples: 20705280 | consumed tokens: 42404413440 | elapsed time per iteration (s): 1.03 | learning rate: 7.136E-05 | global batch size: 256 | lm loss: 1.929973E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.734 | TFLOPs: 41.11 | 15: iteration 80890/ 125429 | consumed samples: 20707840 | consumed tokens: 42409656320 | elapsed time per iteration (s): 1.02 | learning rate: 7.134E-05 | global batch size: 256 | lm loss: 1.955435E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.180 | TFLOPs: 41.34 | 15: iteration 80900/ 125429 | consumed samples: 20710400 | consumed tokens: 42414899200 | elapsed time per iteration (s): 1.05 | learning rate: 7.132E-05 | global batch size: 256 | lm loss: 1.917052E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 243.550 | TFLOPs: 40.25 | 15: iteration 80910/ 125429 | consumed samples: 20712960 | consumed tokens: 42420142080 | elapsed time per iteration (s): 1.08 | learning rate: 7.130E-05 | global batch size: 256 | lm loss: 1.934612E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.127 | TFLOPs: 39.02 | 15: iteration 80920/ 125429 | consumed samples: 20715520 | consumed tokens: 42425384960 | elapsed time per iteration (s): 1.04 | learning rate: 7.128E-05 | global batch size: 256 | lm loss: 1.960710E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.476 | TFLOPs: 40.57 | 15: iteration 80930/ 125429 | consumed samples: 20718080 | consumed tokens: 42430627840 | elapsed time per iteration (s): 1.04 | learning rate: 7.126E-05 | global batch size: 256 | lm loss: 1.951031E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.752 | TFLOPs: 40.78 | 15: iteration 80940/ 125429 | consumed samples: 20720640 | consumed tokens: 42435870720 | elapsed time per iteration (s): 1.05 | learning rate: 7.124E-05 | global batch size: 256 | lm loss: 1.947104E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.904 | TFLOPs: 40.31 | 15: iteration 80950/ 125429 | consumed samples: 20723200 | consumed tokens: 42441113600 | elapsed time per iteration (s): 1.03 | learning rate: 7.122E-05 | global batch size: 256 | lm loss: 1.937151E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.260 | TFLOPs: 41.03 | 15: iteration 80960/ 125429 | consumed samples: 20725760 | consumed tokens: 42446356480 | elapsed time per iteration (s): 1.07 | learning rate: 7.120E-05 | global batch size: 256 | lm loss: 
1.934807E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.531 | TFLOPs: 39.42 | 15: iteration 80970/ 125429 | consumed samples: 20728320 | consumed tokens: 42451599360 | elapsed time per iteration (s): 1.05 | learning rate: 7.118E-05 | global batch size: 256 | lm loss: 1.939044E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.958 | TFLOPs: 40.32 | 15: iteration 80980/ 125429 | consumed samples: 20730880 | consumed tokens: 42456842240 | elapsed time per iteration (s): 1.15 | learning rate: 7.116E-05 | global batch size: 256 | lm loss: 1.919643E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.071 | TFLOPs: 36.70 | 15: iteration 80990/ 125429 | consumed samples: 20733440 | consumed tokens: 42462085120 | elapsed time per iteration (s): 1.02 | learning rate: 7.114E-05 | global batch size: 256 | lm loss: 1.956089E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.467 | TFLOPs: 41.56 | 15: iteration 81000/ 125429 | consumed samples: 20736000 | consumed tokens: 42467328000 | elapsed time per iteration (s): 1.05 | learning rate: 7.112E-05 | global batch size: 256 | lm loss: 1.955267E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.178 | TFLOPs: 40.35 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 81000 | lm loss value: 1.884989E+00 | lm loss PPL: 6.586285E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 81000 to checkpoints_1b5 0: [2022-11-26 20:02:13,040] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step81000 is begin to save! 0: [2022-11-26 20:02:13,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_01-model_00-model_states.pt... 0: [2022-11-26 20:02:13,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_01-model_00-model_states.pt. 0: [2022-11-26 20:02:13,285] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_03-model_00-model_states.pt... 0: [2022-11-26 20:02:13,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_03-model_00-model_states.pt. 0: [2022-11-26 20:02:13,388] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_04-model_00-model_states.pt... 0: [2022-11-26 20:02:13,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_04-model_00-model_states.pt. 0: [2022-11-26 20:02:13,491] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_05-model_00-model_states.pt... 0: [2022-11-26 20:02:13,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_05-model_00-model_states.pt. 0: [2022-11-26 20:02:13,597] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_06-model_00-model_states.pt... 0: [2022-11-26 20:02:13,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_06-model_00-model_states.pt. 0: [2022-11-26 20:02:13,702] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_07-model_00-model_states.pt... 0: [2022-11-26 20:02:13,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 20:02:13,806] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_08-model_00-model_states.pt... 0: [2022-11-26 20:02:13,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_08-model_00-model_states.pt. 0: [2022-11-26 20:02:13,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_09-model_00-model_states.pt... 0: [2022-11-26 20:02:14,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_09-model_00-model_states.pt. 0: [2022-11-26 20:02:14,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_10-model_00-model_states.pt... 0: [2022-11-26 20:02:14,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_10-model_00-model_states.pt. 0: [2022-11-26 20:02:14,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_11-model_00-model_states.pt... 0: [2022-11-26 20:02:14,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_11-model_00-model_states.pt. 0: [2022-11-26 20:02:14,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_12-model_00-model_states.pt... 0: [2022-11-26 20:02:14,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_12-model_00-model_states.pt. 0: [2022-11-26 20:02:14,331] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_13-model_00-model_states.pt... 0: [2022-11-26 20:02:14,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 20:02:14,434] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_14-model_00-model_states.pt... 0: [2022-11-26 20:02:14,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_14-model_00-model_states.pt. 0: [2022-11-26 20:02:14,538] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_15-model_00-model_states.pt... 0: [2022-11-26 20:02:14,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_15-model_00-model_states.pt. 0: [2022-11-26 20:02:14,640] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_16-model_00-model_states.pt... 0: [2022-11-26 20:02:14,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_16-model_00-model_states.pt. 0: [2022-11-26 20:02:14,747] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_17-model_00-model_states.pt... 0: [2022-11-26 20:02:14,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_17-model_00-model_states.pt. 0: [2022-11-26 20:02:14,851] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_18-model_00-model_states.pt... 0: [2022-11-26 20:02:14,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_18-model_00-model_states.pt. 0: [2022-11-26 20:02:14,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_19-model_00-model_states.pt... 0: [2022-11-26 20:02:15,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 20:02:15,059] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_20-model_00-model_states.pt... 0: [2022-11-26 20:02:15,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_20-model_00-model_states.pt. 0: [2022-11-26 20:02:15,158] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_21-model_00-model_states.pt... 0: [2022-11-26 20:02:15,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_21-model_00-model_states.pt. 0: [2022-11-26 20:02:15,262] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_22-model_00-model_states.pt... 0: [2022-11-26 20:02:15,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_22-model_00-model_states.pt. 0: [2022-11-26 20:02:15,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_23-model_00-model_states.pt... 0: [2022-11-26 20:02:15,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_23-model_00-model_states.pt. 0: [2022-11-26 20:02:15,466] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_24-model_00-model_states.pt... 0: [2022-11-26 20:02:15,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_24-model_00-model_states.pt. 0: [2022-11-26 20:02:15,571] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_25-model_00-model_states.pt... 0: [2022-11-26 20:02:15,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 20:02:15,676] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_26-model_00-model_states.pt... 0: [2022-11-26 20:02:15,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_26-model_00-model_states.pt. 0: [2022-11-26 20:02:15,777] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_27-model_00-model_states.pt... 0: [2022-11-26 20:02:15,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_27-model_00-model_states.pt. 0: [2022-11-26 20:02:15,883] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_28-model_00-model_states.pt... 0: [2022-11-26 20:02:15,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_28-model_00-model_states.pt. 0: [2022-11-26 20:02:15,988] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_29-model_00-model_states.pt... 0: [2022-11-26 20:02:16,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_29-model_00-model_states.pt. 0: [2022-11-26 20:02:16,088] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_30-model_00-model_states.pt... 0: [2022-11-26 20:02:16,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_30-model_00-model_states.pt. 0: [2022-11-26 20:02:16,195] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/layer_32-model_00-model_states.pt... 0: [2022-11-26 20:02:16,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 20:02:16,197] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step81000/mp_rank_00_model_states.pt 0: [2022-11-26 20:02:16,197] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/mp_rank_00_model_states.pt... 0: [2022-11-26 20:02:16,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/mp_rank_00_model_states.pt. 0: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
5: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
9: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
11: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 
1: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
11: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:02:16,239] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step81000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:02:16,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 20:02:16,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 20:02:16,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 3: [2022-11-26 20:02:16,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:02:16,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 20:02:16,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 12: [2022-11-26 20:02:16,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:02:16,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 20:02:16,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 0: [2022-11-26 20:02:16,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:02:16,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 20:02:16,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 6: [2022-11-26 20:02:16,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 20:02:16,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 20:02:16,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 5: [2022-11-26 20:02:16,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:02:16,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 20:02:16,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 9: [2022-11-26 20:02:16,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:02:16,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 20:02:16,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 5: [2022-11-26 20:02:16,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:02:16,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 20:02:16,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 14: [2022-11-26 20:02:16,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 20:02:16,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 20:02:16,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 12: [2022-11-26 20:02:16,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:02:16,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 20:02:16,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 13: [2022-11-26 20:02:16,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:02:16,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:02:16,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 20:02:16,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 1: [2022-11-26 20:02:16,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:02:16,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 20:02:16,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
5: [2022-11-26 20:02:16,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:02:16,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 20:02:16,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 14: [2022-11-26 20:02:16,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:02:16,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 20:02:16,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 4: [2022-11-26 20:02:16,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:02:16,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:02:16,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:02:16,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 20:02:16,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 1: [2022-11-26 20:02:16,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
4: [2022-11-26 20:02:16,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 20:02:16,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 4: [2022-11-26 20:02:16,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 4: [2022-11-26 20:02:16,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 0: [2022-11-26 20:02:16,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:02:16,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 20:02:16,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 8: [2022-11-26 20:02:16,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:02:16,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 20:02:16,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 8: [2022-11-26 20:02:16,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:02:16,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 20:02:16,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
2: [2022-11-26 20:02:16,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:02:16,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 20:02:16,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 13: [2022-11-26 20:02:16,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:02:16,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 20:02:16,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:02:16,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 13: [2022-11-26 20:02:16,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 20:02:16,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 2: [2022-11-26 20:02:16,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:02:16,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 20:02:16,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
0: [2022-11-26 20:02:16,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:02:16,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 20:02:16,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 15: [2022-11-26 20:02:16,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:02:16,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:02:16,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 20:02:16,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 3: [2022-11-26 20:02:16,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:02:16,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 20:02:16,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 5: [2022-11-26 20:02:16,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 20:02:16,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 20:02:16,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 15: [2022-11-26 20:02:16,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 20:02:16,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 14: [2022-11-26 20:02:16,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:02:16,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:02:16,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 14: [2022-11-26 20:02:16,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 20:02:16,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 0: [2022-11-26 20:02:16,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 0: [2022-11-26 20:02:16,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-26 20:02:16,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 20:02:16,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 7: [2022-11-26 20:02:16,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:02:16,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 20:02:16,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 9: [2022-11-26 20:02:16,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:02:16,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:02:16,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 9: [2022-11-26 20:02:16,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 15: [2022-11-26 20:02:16,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 9: [2022-11-26 20:02:16,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 7: [2022-11-26 20:02:16,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:02:16,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 20:02:16,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 13: [2022-11-26 20:02:16,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:02:16,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 20:02:16,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 13: [2022-11-26 20:02:16,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:02:16,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 20:02:16,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 1: [2022-11-26 20:02:16,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 20:02:16,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 1: [2022-11-26 20:02:16,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 20:02:16,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 20:02:16,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 12: [2022-11-26 20:02:16,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:02:16,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:02:16,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 20:02:16,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 12: [2022-11-26 20:02:16,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 20:02:16,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 3: [2022-11-26 20:02:16,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:02:16,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 20:02:16,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 6: [2022-11-26 20:02:16,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 20:02:16,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 20:02:16,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 12: [2022-11-26 20:02:16,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:02:16,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 20:02:16,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 6: [2022-11-26 20:02:16,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:02:16,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 20:02:16,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 6: [2022-11-26 20:02:16,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:02:16,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-26 20:02:16,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 20:02:16,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 20:02:16,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 6: [2022-11-26 20:02:16,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 15: [2022-11-26 20:02:16,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:02:16,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:02:16,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 20:02:16,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 2: [2022-11-26 20:02:16,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:02:16,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 20:02:16,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 1: [2022-11-26 20:02:16,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 20:02:16,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 20:02:16,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 8: [2022-11-26 20:02:16,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:02:16,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 20:02:16,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 1: [2022-11-26 20:02:16,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:02:16,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 20:02:16,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 9: [2022-11-26 20:02:16,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:02:16,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 20:02:16,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 10: [2022-11-26 20:02:16,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:02:16,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:02:16,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:02:16,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:02:16,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:02:16,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:02:16,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 20:02:16,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 20:02:16,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 20:02:16,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
10: [2022-11-26 20:02:16,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 20:02:16,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 20:02:16,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 20:02:16,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 10: [2022-11-26 20:02:16,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 10: [2022-11-26 20:02:16,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 10: [2022-11-26 20:02:16,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 10: [2022-11-26 20:02:16,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 5: [2022-11-26 20:02:16,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:02:16,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 20:02:16,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 2: [2022-11-26 20:02:16,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-26 20:02:16,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 20:02:16,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 7: [2022-11-26 20:02:16,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:02:16,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:02:16,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 20:02:16,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 20:02:16,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 7: [2022-11-26 20:02:16,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 0: [2022-11-26 20:02:16,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:02:16,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 20:02:16,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 11: [2022-11-26 20:02:16,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 20:02:16,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 0: [2022-11-26 20:02:16,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:02:16,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 20:02:16,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 9: [2022-11-26 20:02:16,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:02:16,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 20:02:16,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 14: [2022-11-26 20:02:16,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:02:16,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 20:02:16,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 14: [2022-11-26 20:02:16,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-26 20:02:16,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 20:02:16,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 5: [2022-11-26 20:02:16,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:02:16,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 20:02:16,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 12: [2022-11-26 20:02:16,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:02:16,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 20:02:16,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 12: [2022-11-26 20:02:16,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:02:16,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 20:02:16,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 13: [2022-11-26 20:02:16,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 20:02:16,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 11: [2022-11-26 20:02:16,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 13: [2022-11-26 20:02:16,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 6: [2022-11-26 20:02:16,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:02:16,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:02:16,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 2: [2022-11-26 20:02:16,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 6: [2022-11-26 20:02:16,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 2: [2022-11-26 20:02:16,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 4: [2022-11-26 20:02:16,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:02:16,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 20:02:16,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
4: [2022-11-26 20:02:16,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:02:16,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 20:02:16,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 4: [2022-11-26 20:02:16,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:02:16,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 9: [2022-11-26 20:02:16,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:02:16,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 9: [2022-11-26 20:02:16,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 20:02:16,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 15: [2022-11-26 20:02:16,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 20:02:16,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 15: [2022-11-26 20:02:16,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 20:02:16,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 20:02:16,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 15: [2022-11-26 20:02:16,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:02:16,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 20:02:16,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 11: [2022-11-26 20:02:16,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:02:16,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:02:16,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 20:02:16,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 8: [2022-11-26 20:02:16,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
11: [2022-11-26 20:02:16,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 8: [2022-11-26 20:02:16,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 11: [2022-11-26 20:02:16,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 8: [2022-11-26 20:02:16,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 11: [2022-11-26 20:02:16,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:02:16,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:02:16,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:02:16,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 11: [2022-11-26 20:02:16,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:02:16,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
11: [2022-11-26 20:02:16,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 20:02:16,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 20:02:16,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 20:02:16,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 11: [2022-11-26 20:02:16,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 11: [2022-11-26 20:02:16,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 3: [2022-11-26 20:02:16,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:02:16,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 20:02:16,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 3: [2022-11-26 20:02:16,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:02:16,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 20:02:16,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
0: [2022-11-26 20:02:16,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 20:02:16,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 9: [2022-11-26 20:02:16,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:02:16,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 20:02:16,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 15: [2022-11-26 20:02:16,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:02:16,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 20:02:16,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 3: [2022-11-26 20:02:16,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:02:16,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 20:02:16,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 10: [2022-11-26 20:02:16,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:02:16,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 20:02:16,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 14: [2022-11-26 20:02:16,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:02:16,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 20:02:16,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 5: [2022-11-26 20:02:16,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:02:16,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:02:16,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 20:02:16,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 9: [2022-11-26 20:02:16,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:02:16,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 20:02:16,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
5: [2022-11-26 20:02:16,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:02:16,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 20:02:16,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 13: [2022-11-26 20:02:16,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:02:16,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:02:16,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 20:02:16,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 2: [2022-11-26 20:02:16,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 20:02:16,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 13: [2022-11-26 20:02:16,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 20:02:16,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 2: [2022-11-26 20:02:16,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 20:02:16,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 20:02:16,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 1: [2022-11-26 20:02:16,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:02:16,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:02:16,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 20:02:16,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 1: [2022-11-26 20:02:16,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 20:02:16,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 1: [2022-11-26 20:02:16,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:02:16,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 20:02:16,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 14: [2022-11-26 20:02:16,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
2: [2022-11-26 20:02:16,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:02:16,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 20:02:16,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 2: [2022-11-26 20:02:16,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 20:02:16,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 11: [2022-11-26 20:02:16,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:02:16,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:02:16,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 20:02:16,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 15: [2022-11-26 20:02:16,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:02:16,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:02:16,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 20:02:16,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 4: [2022-11-26 20:02:16,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:02:16,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 20:02:16,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 9: [2022-11-26 20:02:16,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:02:16,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 7: [2022-11-26 20:02:16,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:02:16,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 7: [2022-11-26 20:02:16,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 20:02:16,481] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 0: [2022-11-26 20:02:16,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 20:02:16,481] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 20:02:16,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 10: [2022-11-26 20:02:16,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:02:16,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 20:02:16,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 11: [2022-11-26 20:02:16,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 20:02:16,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 11: [2022-11-26 20:02:16,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:02:16,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 20:02:16,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 15: [2022-11-26 20:02:16,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 20:02:16,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
15: [2022-11-26 20:02:16,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:02:16,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 20:02:16,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 6: [2022-11-26 20:02:16,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:02:16,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 20:02:16,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 4: [2022-11-26 20:02:16,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:02:16,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 20:02:16,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 12: [2022-11-26 20:02:16,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:02:16,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 20:02:16,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
12: [2022-11-26 20:02:16,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:02:16,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 20:02:16,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 13: [2022-11-26 20:02:16,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:02:16,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 20:02:16,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 3: [2022-11-26 20:02:16,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:02:16,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 20:02:16,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 8: [2022-11-26 20:02:16,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:02:16,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step81000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 20:02:16,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step81000 is ready now! 
0: successfully saved checkpoint at iteration 81000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3539.13 15: iteration 81010/ 125429 | consumed samples: 20738560 | consumed tokens: 42472570880 | elapsed time per iteration (s): 1.42 | learning rate: 7.110E-05 | global batch size: 256 | lm loss: 1.950288E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 179.749 | TFLOPs: 29.70 | 15: iteration 81020/ 125429 | consumed samples: 20741120 | consumed tokens: 42477813760 | elapsed time per iteration (s): 1.05 | learning rate: 7.108E-05 | global batch size: 256 | lm loss: 1.939046E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.349 | TFLOPs: 40.22 | 15: iteration 81030/ 125429 | consumed samples: 20743680 | consumed tokens: 42483056640 | elapsed time per iteration (s): 1.07 | learning rate: 7.105E-05 | global batch size: 256 | lm loss: 1.936275E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.246 | TFLOPs: 39.70 | 15: iteration 81040/ 125429 | consumed samples: 20746240 | consumed tokens: 42488299520 | elapsed time per iteration (s): 1.04 | learning rate: 7.103E-05 | global batch size: 256 | lm loss: 1.924874E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.127 | TFLOPs: 40.51 | 15: iteration 81050/ 125429 | consumed samples: 20748800 | consumed tokens: 42493542400 | elapsed time per iteration (s): 1.05 | learning rate: 7.101E-05 | global batch size: 256 | lm loss: 1.954219E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.882 | TFLOPs: 40.30 | 15: iteration 81060/ 125429 | consumed samples: 20751360 | consumed tokens: 42498785280 | elapsed time per iteration (s): 1.03 | 
learning rate: 7.099E-05 | global batch size: 256 | lm loss: 1.884361E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.435 | TFLOPs: 41.06 | 15: iteration 81070/ 125429 | consumed samples: 20753920 | consumed tokens: 42504028160 | elapsed time per iteration (s): 1.03 | learning rate: 7.097E-05 | global batch size: 256 | lm loss: 1.948047E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.740 | TFLOPs: 40.94 | 15: iteration 81080/ 125429 | consumed samples: 20756480 | consumed tokens: 42509271040 | elapsed time per iteration (s): 1.27 | learning rate: 7.095E-05 | global batch size: 256 | lm loss: 1.954802E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 200.808 | TFLOPs: 33.19 | 15: iteration 81090/ 125429 | consumed samples: 20759040 | consumed tokens: 42514513920 | elapsed time per iteration (s): 1.16 | learning rate: 7.093E-05 | global batch size: 256 | lm loss: 1.955980E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.347 | TFLOPs: 36.58 | 15: iteration 81100/ 125429 | consumed samples: 20761600 | consumed tokens: 42519756800 | elapsed time per iteration (s): 1.19 | learning rate: 7.091E-05 | global batch size: 256 | lm loss: 1.926905E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.357 | TFLOPs: 35.42 | 15: iteration 81110/ 125429 | consumed samples: 20764160 | consumed tokens: 42524999680 | elapsed time per iteration (s): 1.02 | learning rate: 7.089E-05 | global batch size: 256 | lm loss: 1.910534E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.913 | TFLOPs: 41.47 | 15: iteration 81120/ 
125429 | consumed samples: 20766720 | consumed tokens: 42530242560 | elapsed time per iteration (s): 1.05 | learning rate: 7.087E-05 | global batch size: 256 | lm loss: 1.936555E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.828 | TFLOPs: 40.46 | 15: iteration 81130/ 125429 | consumed samples: 20769280 | consumed tokens: 42535485440 | elapsed time per iteration (s): 1.03 | learning rate: 7.085E-05 | global batch size: 256 | lm loss: 1.929129E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.578 | TFLOPs: 40.91 | 15: iteration 81140/ 125429 | consumed samples: 20771840 | consumed tokens: 42540728320 | elapsed time per iteration (s): 1.03 | learning rate: 7.083E-05 | global batch size: 256 | lm loss: 1.986682E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.915 | TFLOPs: 41.14 | 15: iteration 81150/ 125429 | consumed samples: 20774400 | consumed tokens: 42545971200 | elapsed time per iteration (s): 2.74 | learning rate: 7.081E-05 | global batch size: 256 | lm loss: 1.935919E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 93.278 | TFLOPs: 15.41 | 15: iteration 81160/ 125429 | consumed samples: 20776960 | consumed tokens: 42551214080 | elapsed time per iteration (s): 1.04 | learning rate: 7.079E-05 | global batch size: 256 | lm loss: 1.964956E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.077 | TFLOPs: 40.83 | 15: iteration 81170/ 125429 | consumed samples: 20779520 | consumed tokens: 42556456960 | elapsed time per iteration (s): 1.02 | learning rate: 7.077E-05 | global batch size: 256 | lm loss: 1.917900E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 251.925 | TFLOPs: 41.63 | 15: iteration 81180/ 125429 | consumed samples: 20782080 | consumed tokens: 42561699840 | elapsed time per iteration (s): 1.02 | learning rate: 7.075E-05 | global batch size: 256 | lm loss: 1.938059E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.845 | TFLOPs: 41.29 | 15: iteration 81190/ 125429 | consumed samples: 20784640 | consumed tokens: 42566942720 | elapsed time per iteration (s): 1.04 | learning rate: 7.073E-05 | global batch size: 256 | lm loss: 1.949730E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.093 | TFLOPs: 40.67 | 15: iteration 81200/ 125429 | consumed samples: 20787200 | consumed tokens: 42572185600 | elapsed time per iteration (s): 1.08 | learning rate: 7.071E-05 | global batch size: 256 | lm loss: 1.938107E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.474 | TFLOPs: 39.08 | 15: iteration 81210/ 125429 | consumed samples: 20789760 | consumed tokens: 42577428480 | elapsed time per iteration (s): 1.03 | learning rate: 7.069E-05 | global batch size: 256 | lm loss: 1.963217E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.375 | TFLOPs: 41.05 | 15: iteration 81220/ 125429 | consumed samples: 20792320 | consumed tokens: 42582671360 | elapsed time per iteration (s): 1.05 | learning rate: 7.067E-05 | global batch size: 256 | lm loss: 1.941095E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.329 | TFLOPs: 40.21 | 15: iteration 81230/ 125429 | consumed samples: 20794880 | consumed tokens: 42587914240 | elapsed time per iteration (s): 1.05 | learning rate: 
7.064E-05 | global batch size: 256 | lm loss: 1.950492E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.282 | TFLOPs: 40.37 | 15: iteration 81240/ 125429 | consumed samples: 20797440 | consumed tokens: 42593157120 | elapsed time per iteration (s): 1.02 | learning rate: 7.062E-05 | global batch size: 256 | lm loss: 1.937920E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.178 | TFLOPs: 41.51 | 15: iteration 81250/ 125429 | consumed samples: 20800000 | consumed tokens: 42598400000 | elapsed time per iteration (s): 1.05 | learning rate: 7.060E-05 | global batch size: 256 | lm loss: 1.928260E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.567 | TFLOPs: 40.42 | 15: iteration 81260/ 125429 | consumed samples: 20802560 | consumed tokens: 42603642880 | elapsed time per iteration (s): 1.05 | learning rate: 7.058E-05 | global batch size: 256 | lm loss: 1.955783E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.853 | TFLOPs: 40.13 | 15: iteration 81270/ 125429 | consumed samples: 20805120 | consumed tokens: 42608885760 | elapsed time per iteration (s): 1.03 | learning rate: 7.056E-05 | global batch size: 256 | lm loss: 1.967192E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.003 | TFLOPs: 41.15 | 15: iteration 81280/ 125429 | consumed samples: 20807680 | consumed tokens: 42614128640 | elapsed time per iteration (s): 1.05 | learning rate: 7.054E-05 | global batch size: 256 | lm loss: 1.945476E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.290 | TFLOPs: 40.37 | 15: iteration 81290/ 125429 | 
consumed samples: 20810240 | consumed tokens: 42619371520 | elapsed time per iteration (s): 1.08 | learning rate: 7.052E-05 | global batch size: 256 | lm loss: 1.927034E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.982 | TFLOPs: 39.33 | 15: iteration 81300/ 125429 | consumed samples: 20812800 | consumed tokens: 42624614400 | elapsed time per iteration (s): 1.12 | learning rate: 7.050E-05 | global batch size: 256 | lm loss: 1.932779E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.728 | TFLOPs: 37.80 | 15: iteration 81310/ 125429 | consumed samples: 20815360 | consumed tokens: 42629857280 | elapsed time per iteration (s): 1.09 | learning rate: 7.048E-05 | global batch size: 256 | lm loss: 1.940517E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.326 | TFLOPs: 38.89 | 15: iteration 81320/ 125429 | consumed samples: 20817920 | consumed tokens: 42635100160 | elapsed time per iteration (s): 1.16 | learning rate: 7.046E-05 | global batch size: 256 | lm loss: 1.952477E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 220.077 | TFLOPs: 36.37 | 15: iteration 81330/ 125429 | consumed samples: 20820480 | consumed tokens: 42640343040 | elapsed time per iteration (s): 1.02 | learning rate: 7.044E-05 | global batch size: 256 | lm loss: 1.952528E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.642 | TFLOPs: 41.42 | 15: iteration 81340/ 125429 | consumed samples: 20823040 | consumed tokens: 42645585920 | elapsed time per iteration (s): 1.04 | learning rate: 7.042E-05 | global batch size: 256 | lm loss: 1.943510E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 245.032 | TFLOPs: 40.49 | 15: iteration 81350/ 125429 | consumed samples: 20825600 | consumed tokens: 42650828800 | elapsed time per iteration (s): 1.08 | learning rate: 7.040E-05 | global batch size: 256 | lm loss: 1.907259E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.038 | TFLOPs: 39.01 | 15: iteration 81360/ 125429 | consumed samples: 20828160 | consumed tokens: 42656071680 | elapsed time per iteration (s): 1.08 | learning rate: 7.038E-05 | global batch size: 256 | lm loss: 1.931977E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.107 | TFLOPs: 39.18 | 15: iteration 81370/ 125429 | consumed samples: 20830720 | consumed tokens: 42661314560 | elapsed time per iteration (s): 1.04 | learning rate: 7.036E-05 | global batch size: 256 | lm loss: 1.925791E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.006 | TFLOPs: 40.49 | 15: iteration 81380/ 125429 | consumed samples: 20833280 | consumed tokens: 42666557440 | elapsed time per iteration (s): 1.07 | learning rate: 7.034E-05 | global batch size: 256 | lm loss: 1.950084E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.523 | TFLOPs: 39.58 | 15: iteration 81390/ 125429 | consumed samples: 20835840 | consumed tokens: 42671800320 | elapsed time per iteration (s): 1.05 | learning rate: 7.032E-05 | global batch size: 256 | lm loss: 1.900846E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.889 | TFLOPs: 40.47 | 15: iteration 81400/ 125429 | consumed samples: 20838400 | consumed tokens: 42677043200 | elapsed time per iteration (s): 1.03 | learning rate: 7.030E-05 | global batch 
size: 256 | lm loss: 1.972499E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.987 | TFLOPs: 41.15 | 15: iteration 81410/ 125429 | consumed samples: 20840960 | consumed tokens: 42682286080 | elapsed time per iteration (s): 1.05 | learning rate: 7.028E-05 | global batch size: 256 | lm loss: 1.942890E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.880 | TFLOPs: 40.14 | 15: iteration 81420/ 125429 | consumed samples: 20843520 | consumed tokens: 42687528960 | elapsed time per iteration (s): 1.05 | learning rate: 7.026E-05 | global batch size: 256 | lm loss: 1.920392E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.754 | TFLOPs: 40.12 | 15: iteration 81430/ 125429 | consumed samples: 20846080 | consumed tokens: 42692771840 | elapsed time per iteration (s): 1.05 | learning rate: 7.024E-05 | global batch size: 256 | lm loss: 1.945649E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.300 | TFLOPs: 40.37 | 15: iteration 81440/ 125429 | consumed samples: 20848640 | consumed tokens: 42698014720 | elapsed time per iteration (s): 1.07 | learning rate: 7.022E-05 | global batch size: 256 | lm loss: 1.953534E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.067 | TFLOPs: 39.67 | 15: iteration 81450/ 125429 | consumed samples: 20851200 | consumed tokens: 42703257600 | elapsed time per iteration (s): 1.10 | learning rate: 7.020E-05 | global batch size: 256 | lm loss: 1.928687E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.983 | TFLOPs: 38.34 | 15: iteration 81460/ 125429 | consumed samples: 20853760 | 
consumed tokens: 42708500480 | elapsed time per iteration (s): 1.03 | learning rate: 7.017E-05 | global batch size: 256 | lm loss: 1.905944E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.047 | TFLOPs: 40.99 | 15: iteration 81470/ 125429 | consumed samples: 20856320 | consumed tokens: 42713743360 | elapsed time per iteration (s): 1.04 | learning rate: 7.015E-05 | global batch size: 256 | lm loss: 1.952904E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.972 | TFLOPs: 40.81 | 15: iteration 81480/ 125429 | consumed samples: 20858880 | consumed tokens: 42718986240 | elapsed time per iteration (s): 1.06 | learning rate: 7.013E-05 | global batch size: 256 | lm loss: 1.943764E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.638 | TFLOPs: 40.10 | 15: iteration 81490/ 125429 | consumed samples: 20861440 | consumed tokens: 42724229120 | elapsed time per iteration (s): 1.05 | learning rate: 7.011E-05 | global batch size: 256 | lm loss: 1.942463E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.418 | TFLOPs: 40.23 | 15: iteration 81500/ 125429 | consumed samples: 20864000 | consumed tokens: 42729472000 | elapsed time per iteration (s): 1.07 | learning rate: 7.009E-05 | global batch size: 256 | lm loss: 1.956727E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.142 | TFLOPs: 39.52 | 15: iteration 81510/ 125429 | consumed samples: 20866560 | consumed tokens: 42734714880 | elapsed time per iteration (s): 1.10 | learning rate: 7.007E-05 | global batch size: 256 | lm loss: 1.972255E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 232.935 | TFLOPs: 38.49 | 15: iteration 81520/ 125429 | consumed samples: 20869120 | consumed tokens: 42739957760 | elapsed time per iteration (s): 1.05 | learning rate: 7.005E-05 | global batch size: 256 | lm loss: 1.920770E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.410 | TFLOPs: 40.39 | 15: iteration 81530/ 125429 | consumed samples: 20871680 | consumed tokens: 42745200640 | elapsed time per iteration (s): 1.03 | learning rate: 7.003E-05 | global batch size: 256 | lm loss: 1.948460E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.495 | TFLOPs: 41.07 | 15: iteration 81540/ 125429 | consumed samples: 20874240 | consumed tokens: 42750443520 | elapsed time per iteration (s): 1.03 | learning rate: 7.001E-05 | global batch size: 256 | lm loss: 1.944762E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.719 | TFLOPs: 41.10 | 15: iteration 81550/ 125429 | consumed samples: 20876800 | consumed tokens: 42755686400 | elapsed time per iteration (s): 1.04 | learning rate: 6.999E-05 | global batch size: 256 | lm loss: 1.952618E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.767 | TFLOPs: 40.78 | 15: iteration 81560/ 125429 | consumed samples: 20879360 | consumed tokens: 42760929280 | elapsed time per iteration (s): 1.06 | learning rate: 6.997E-05 | global batch size: 256 | lm loss: 1.943151E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.371 | TFLOPs: 39.89 | 15: iteration 81570/ 125429 | consumed samples: 20881920 | consumed tokens: 42766172160 | elapsed time per iteration (s): 1.05 | learning rate: 6.995E-05 | global batch size: 256 | lm loss: 
1.938967E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.738 | TFLOPs: 40.11 | 15: iteration 81580/ 125429 | consumed samples: 20884480 | consumed tokens: 42771415040 | elapsed time per iteration (s): 1.05 | learning rate: 6.993E-05 | global batch size: 256 | lm loss: 1.924000E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.721 | TFLOPs: 40.44 | 15: iteration 81590/ 125429 | consumed samples: 20887040 | consumed tokens: 42776657920 | elapsed time per iteration (s): 1.09 | learning rate: 6.991E-05 | global batch size: 256 | lm loss: 1.930716E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.865 | TFLOPs: 38.81 | 15: iteration 81600/ 125429 | consumed samples: 20889600 | consumed tokens: 42781900800 | elapsed time per iteration (s): 1.05 | learning rate: 6.989E-05 | global batch size: 256 | lm loss: 1.959206E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.677 | TFLOPs: 40.43 | 15: iteration 81610/ 125429 | consumed samples: 20892160 | consumed tokens: 42787143680 | elapsed time per iteration (s): 1.03 | learning rate: 6.987E-05 | global batch size: 256 | lm loss: 1.909996E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.149 | TFLOPs: 41.01 | 15: iteration 81620/ 125429 | consumed samples: 20894720 | consumed tokens: 42792386560 | elapsed time per iteration (s): 1.05 | learning rate: 6.985E-05 | global batch size: 256 | lm loss: 1.959289E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.780 | TFLOPs: 40.29 | 15: iteration 81630/ 125429 | consumed samples: 20897280 | consumed tokens: 
42797629440 | elapsed time per iteration (s): 1.07 | learning rate: 6.983E-05 | global batch size: 256 | lm loss: 1.954731E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.514 | TFLOPs: 39.42 | 15: iteration 81640/ 125429 | consumed samples: 20899840 | consumed tokens: 42802872320 | elapsed time per iteration (s): 1.06 | learning rate: 6.981E-05 | global batch size: 256 | lm loss: 1.928065E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.790 | TFLOPs: 39.79 | 15: iteration 81650/ 125429 | consumed samples: 20902400 | consumed tokens: 42808115200 | elapsed time per iteration (s): 1.02 | learning rate: 6.979E-05 | global batch size: 256 | lm loss: 1.933973E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.775 | TFLOPs: 41.61 | 15: iteration 81660/ 125429 | consumed samples: 20904960 | consumed tokens: 42813358080 | elapsed time per iteration (s): 1.07 | learning rate: 6.977E-05 | global batch size: 256 | lm loss: 1.939276E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.662 | TFLOPs: 39.44 | 15: iteration 81670/ 125429 | consumed samples: 20907520 | consumed tokens: 42818600960 | elapsed time per iteration (s): 1.07 | learning rate: 6.975E-05 | global batch size: 256 | lm loss: 1.938626E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.579 | TFLOPs: 39.59 | 15: iteration 81680/ 125429 | consumed samples: 20910080 | consumed tokens: 42823843840 | elapsed time per iteration (s): 1.05 | learning rate: 6.973E-05 | global batch size: 256 | lm loss: 1.968612E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 242.670 | TFLOPs: 40.10 | 15: iteration 81690/ 125429 | consumed samples: 20912640 | consumed tokens: 42829086720 | elapsed time per iteration (s): 1.09 | learning rate: 6.971E-05 | global batch size: 256 | lm loss: 1.912982E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.035 | TFLOPs: 38.84 | 15: iteration 81700/ 125429 | consumed samples: 20915200 | consumed tokens: 42834329600 | elapsed time per iteration (s): 1.03 | learning rate: 6.969E-05 | global batch size: 256 | lm loss: 1.955547E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.218 | TFLOPs: 41.02 | 15: iteration 81710/ 125429 | consumed samples: 20917760 | consumed tokens: 42839572480 | elapsed time per iteration (s): 1.07 | learning rate: 6.966E-05 | global batch size: 256 | lm loss: 1.965298E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.941 | TFLOPs: 39.65 | 15: iteration 81720/ 125429 | consumed samples: 20920320 | consumed tokens: 42844815360 | elapsed time per iteration (s): 1.07 | learning rate: 6.964E-05 | global batch size: 256 | lm loss: 1.954166E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.196 | TFLOPs: 39.69 | 15: iteration 81730/ 125429 | consumed samples: 20922880 | consumed tokens: 42850058240 | elapsed time per iteration (s): 1.07 | learning rate: 6.962E-05 | global batch size: 256 | lm loss: 1.969587E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.528 | TFLOPs: 39.58 | 15: iteration 81740/ 125429 | consumed samples: 20925440 | consumed tokens: 42855301120 | elapsed time per iteration (s): 1.04 | learning rate: 6.960E-05 | global batch size: 256 | lm loss: 1.930856E+00 | grad 
norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.994 | TFLOPs: 40.65 | 15: iteration 81750/ 125429 | consumed samples: 20928000 | consumed tokens: 42860544000 | elapsed time per iteration (s): 1.08 | learning rate: 6.958E-05 | global batch size: 256 | lm loss: 1.971735E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.055 | TFLOPs: 39.34 | 15: iteration 81760/ 125429 | consumed samples: 20930560 | consumed tokens: 42865786880 | elapsed time per iteration (s): 1.06 | learning rate: 6.956E-05 | global batch size: 256 | lm loss: 1.955684E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.633 | TFLOPs: 40.10 | 15: iteration 81770/ 125429 | consumed samples: 20933120 | consumed tokens: 42871029760 | elapsed time per iteration (s): 1.09 | learning rate: 6.954E-05 | global batch size: 256 | lm loss: 1.957874E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.699 | TFLOPs: 38.79 | 15: iteration 81780/ 125429 | consumed samples: 20935680 | consumed tokens: 42876272640 | elapsed time per iteration (s): 1.06 | learning rate: 6.952E-05 | global batch size: 256 | lm loss: 1.949567E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.572 | TFLOPs: 39.92 | 15: iteration 81790/ 125429 | consumed samples: 20938240 | consumed tokens: 42881515520 | elapsed time per iteration (s): 1.04 | learning rate: 6.950E-05 | global batch size: 256 | lm loss: 1.949363E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.830 | TFLOPs: 40.79 | 15: iteration 81800/ 125429 | consumed samples: 20940800 | consumed tokens: 42886758400 | elapsed time 
per iteration (s): 1.03 | learning rate: 6.948E-05 | global batch size: 256 | lm loss: 1.926842E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.705 | TFLOPs: 40.94 | 15: iteration 81810/ 125429 | consumed samples: 20943360 | consumed tokens: 42892001280 | elapsed time per iteration (s): 1.03 | learning rate: 6.946E-05 | global batch size: 256 | lm loss: 1.915426E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.373 | TFLOPs: 40.88 | 15: iteration 81820/ 125429 | consumed samples: 20945920 | consumed tokens: 42897244160 | elapsed time per iteration (s): 1.06 | learning rate: 6.944E-05 | global batch size: 256 | lm loss: 1.960123E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.376 | TFLOPs: 39.89 | 15: iteration 81830/ 125429 | consumed samples: 20948480 | consumed tokens: 42902487040 | elapsed time per iteration (s): 1.05 | learning rate: 6.942E-05 | global batch size: 256 | lm loss: 1.944004E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.591 | TFLOPs: 40.42 | 15: iteration 81840/ 125429 | consumed samples: 20951040 | consumed tokens: 42907729920 | elapsed time per iteration (s): 1.03 | learning rate: 6.940E-05 | global batch size: 256 | lm loss: 1.947911E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.731 | TFLOPs: 41.10 | 15: iteration 81850/ 125429 | consumed samples: 20953600 | consumed tokens: 42912972800 | elapsed time per iteration (s): 1.06 | learning rate: 6.938E-05 | global batch size: 256 | lm loss: 1.928613E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.816 | TFLOPs: 
39.96 | 15: iteration 81860/ 125429 | consumed samples: 20956160 | consumed tokens: 42918215680 | elapsed time per iteration (s): 1.05 | learning rate: 6.936E-05 | global batch size: 256 | lm loss: 1.943261E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.747 | TFLOPs: 40.12 | 15: iteration 81870/ 125429 | consumed samples: 20958720 | consumed tokens: 42923458560 | elapsed time per iteration (s): 1.06 | learning rate: 6.934E-05 | global batch size: 256 | lm loss: 1.943205E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.456 | TFLOPs: 40.07 | 15: iteration 81880/ 125429 | consumed samples: 20961280 | consumed tokens: 42928701440 | elapsed time per iteration (s): 1.06 | learning rate: 6.932E-05 | global batch size: 256 | lm loss: 1.928579E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.027 | TFLOPs: 39.83 | 15: iteration 81890/ 125429 | consumed samples: 20963840 | consumed tokens: 42933944320 | elapsed time per iteration (s): 1.04 | learning rate: 6.930E-05 | global batch size: 256 | lm loss: 1.934616E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.603 | TFLOPs: 40.59 | 15: iteration 81900/ 125429 | consumed samples: 20966400 | consumed tokens: 42939187200 | elapsed time per iteration (s): 1.08 | learning rate: 6.928E-05 | global batch size: 256 | lm loss: 1.936345E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.383 | TFLOPs: 39.23 | 15: iteration 81910/ 125429 | consumed samples: 20968960 | consumed tokens: 42944430080 | elapsed time per iteration (s): 1.04 | learning rate: 6.926E-05 | global batch size: 256 | lm loss: 1.941432E+00 | grad norm: 0.142 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.558 | TFLOPs: 40.58 | 15: iteration 81920/ 125429 | consumed samples: 20971520 | consumed tokens: 42949672960 | elapsed time per iteration (s): 3.56 | learning rate: 6.924E-05 | global batch size: 256 | lm loss: 1.947007E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 71.903 | TFLOPs: 11.88 | 15: iteration 81930/ 125429 | consumed samples: 20974080 | consumed tokens: 42954915840 | elapsed time per iteration (s): 1.03 | learning rate: 6.922E-05 | global batch size: 256 | lm loss: 1.938255E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.712 | TFLOPs: 41.27 | 15: iteration 81940/ 125429 | consumed samples: 20976640 | consumed tokens: 42960158720 | elapsed time per iteration (s): 1.07 | learning rate: 6.920E-05 | global batch size: 256 | lm loss: 1.931831E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.366 | TFLOPs: 39.56 | 15: iteration 81950/ 125429 | consumed samples: 20979200 | consumed tokens: 42965401600 | elapsed time per iteration (s): 1.03 | learning rate: 6.918E-05 | global batch size: 256 | lm loss: 1.933367E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.561 | TFLOPs: 41.08 | 15: iteration 81960/ 125429 | consumed samples: 20981760 | consumed tokens: 42970644480 | elapsed time per iteration (s): 1.03 | learning rate: 6.916E-05 | global batch size: 256 | lm loss: 1.945409E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.425 | TFLOPs: 41.05 | 15: iteration 81970/ 125429 | consumed samples: 20984320 | consumed tokens: 42975887360 | elapsed time per iteration (s): 1.05 | 
learning rate: 6.914E-05 | global batch size: 256 | lm loss: 1.954247E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.804 | TFLOPs: 40.46 | 15: iteration 81980/ 125429 | consumed samples: 20986880 | consumed tokens: 42981130240 | elapsed time per iteration (s): 1.08 | learning rate: 6.912E-05 | global batch size: 256 | lm loss: 1.918449E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.779 | TFLOPs: 39.29 | 15: iteration 81990/ 125429 | consumed samples: 20989440 | consumed tokens: 42986373120 | elapsed time per iteration (s): 1.06 | learning rate: 6.910E-05 | global batch size: 256 | lm loss: 1.909005E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.976 | TFLOPs: 39.82 | 0: [2022-11-26 20:20:36,198] [INFO] [logging.py:68:log_dist] [Rank 0] step=82000, skipped=0, lr=[6.907572213994528e-05, 6.907572213994528e-05, 6.907572213994528e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 82000/ 125429 | consumed samples: 20992000 | consumed tokens: 42991616000 | elapsed time per iteration (s): 1.04 | learning rate: 6.908E-05 | global batch size: 256 | lm loss: 1.945797E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.901 | TFLOPs: 40.80 | 0: steps: 82000 loss: 1.9504 iter time (s): 1.072 samples/sec: 238.781 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 82000 | lm loss value: 1.990115E+00 | lm loss PPL: 7.316377E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 82000 to checkpoints_1b5 0: [2022-11-26 20:20:36,568] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step82000 is begin to save! 0: [2022-11-26 20:20:36,574] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_01-model_00-model_states.pt... 0: [2022-11-26 20:20:36,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_01-model_00-model_states.pt. 0: [2022-11-26 20:20:36,856] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_03-model_00-model_states.pt... 0: [2022-11-26 20:20:36,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_03-model_00-model_states.pt. 0: [2022-11-26 20:20:36,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_04-model_00-model_states.pt... 0: [2022-11-26 20:20:37,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_04-model_00-model_states.pt. 0: [2022-11-26 20:20:37,075] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_05-model_00-model_states.pt... 0: [2022-11-26 20:20:37,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_05-model_00-model_states.pt. 0: [2022-11-26 20:20:37,190] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_06-model_00-model_states.pt... 0: [2022-11-26 20:20:37,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_06-model_00-model_states.pt. 0: [2022-11-26 20:20:37,297] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_07-model_00-model_states.pt... 0: [2022-11-26 20:20:37,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 20:20:37,409] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_08-model_00-model_states.pt... 0: [2022-11-26 20:20:37,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_08-model_00-model_states.pt. 0: [2022-11-26 20:20:37,521] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_09-model_00-model_states.pt... 0: [2022-11-26 20:20:37,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_09-model_00-model_states.pt. 0: [2022-11-26 20:20:37,635] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_10-model_00-model_states.pt... 0: [2022-11-26 20:20:37,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_10-model_00-model_states.pt. 0: [2022-11-26 20:20:37,750] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_11-model_00-model_states.pt... 0: [2022-11-26 20:20:37,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_11-model_00-model_states.pt. 0: [2022-11-26 20:20:37,864] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_12-model_00-model_states.pt... 0: [2022-11-26 20:20:37,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_12-model_00-model_states.pt. 0: [2022-11-26 20:20:37,978] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_13-model_00-model_states.pt... 0: [2022-11-26 20:20:38,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 20:20:38,093] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_14-model_00-model_states.pt... 0: [2022-11-26 20:20:38,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_14-model_00-model_states.pt. 0: [2022-11-26 20:20:38,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_15-model_00-model_states.pt... 0: [2022-11-26 20:20:38,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_15-model_00-model_states.pt. 0: [2022-11-26 20:20:38,324] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_16-model_00-model_states.pt... 0: [2022-11-26 20:20:38,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_16-model_00-model_states.pt. 0: [2022-11-26 20:20:38,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_17-model_00-model_states.pt... 0: [2022-11-26 20:20:38,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_17-model_00-model_states.pt. 0: [2022-11-26 20:20:38,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_18-model_00-model_states.pt... 0: [2022-11-26 20:20:38,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_18-model_00-model_states.pt. 0: [2022-11-26 20:20:38,674] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_19-model_00-model_states.pt... 0: [2022-11-26 20:20:38,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 20:20:38,790] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_20-model_00-model_states.pt... 0: [2022-11-26 20:20:38,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_20-model_00-model_states.pt. 0: [2022-11-26 20:20:38,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_21-model_00-model_states.pt... 0: [2022-11-26 20:20:39,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_21-model_00-model_states.pt. 0: [2022-11-26 20:20:39,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_22-model_00-model_states.pt... 0: [2022-11-26 20:20:39,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_22-model_00-model_states.pt. 0: [2022-11-26 20:20:39,135] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_23-model_00-model_states.pt... 0: [2022-11-26 20:20:39,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_23-model_00-model_states.pt. 0: [2022-11-26 20:20:39,249] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_24-model_00-model_states.pt... 0: [2022-11-26 20:20:39,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_24-model_00-model_states.pt. 0: [2022-11-26 20:20:39,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_25-model_00-model_states.pt... 0: [2022-11-26 20:20:39,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 20:20:39,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_26-model_00-model_states.pt... 0: [2022-11-26 20:20:39,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_26-model_00-model_states.pt. 0: [2022-11-26 20:20:39,588] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_27-model_00-model_states.pt... 0: [2022-11-26 20:20:39,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_27-model_00-model_states.pt. 0: [2022-11-26 20:20:39,699] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_28-model_00-model_states.pt... 0: [2022-11-26 20:20:39,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_28-model_00-model_states.pt. 0: [2022-11-26 20:20:39,810] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_29-model_00-model_states.pt... 0: [2022-11-26 20:20:39,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_29-model_00-model_states.pt. 0: [2022-11-26 20:20:39,927] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_30-model_00-model_states.pt... 0: [2022-11-26 20:20:40,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_30-model_00-model_states.pt. 0: [2022-11-26 20:20:40,039] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/layer_32-model_00-model_states.pt... 0: [2022-11-26 20:20:40,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 20:20:40,044] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step82000/mp_rank_00_model_states.pt 0: [2022-11-26 20:20:40,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/mp_rank_00_model_states.pt... 0: [2022-11-26 20:20:40,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/mp_rank_00_model_states.pt. 0: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
1: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:20:40,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:20:40,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:20:40,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:20:40,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:20:40,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:20:40,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
5: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
9: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
6: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
13: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
11: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:20:40,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:20:40,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:20:40,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:20:40,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:20:40,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
10: [2022-11-26 20:20:40,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:20:40,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
6: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
11: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:20:40,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step82000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:20:40,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:20:40,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 20:20:40,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 0: [2022-11-26 20:20:40,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:20:40,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 20:20:40,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 3: [2022-11-26 20:20:40,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:20:40,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 20:20:40,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 14: [2022-11-26 20:20:40,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:20:40,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 20:20:40,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 0: [2022-11-26 20:20:40,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
0: [2022-11-26 20:20:40,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:20:40,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 20:20:40,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 14: [2022-11-26 20:20:40,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:20:40,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 20:20:40,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 6: [2022-11-26 20:20:40,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:20:40,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 20:20:40,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 5: [2022-11-26 20:20:40,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:20:40,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 20:20:40,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
3: [2022-11-26 20:20:40,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:20:40,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 20:20:40,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 3: [2022-11-26 20:20:40,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:20:40,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:20:40,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 20:20:40,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 20:20:40,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 3: [2022-11-26 20:20:40,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 5: [2022-11-26 20:20:40,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:20:40,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 20:20:40,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
8: [2022-11-26 20:20:40,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:20:40,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 20:20:40,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 8: [2022-11-26 20:20:40,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:20:40,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:20:40,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 20:20:40,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 0: [2022-11-26 20:20:40,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:20:40,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 7: [2022-11-26 20:20:40,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:20:40,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
7: [2022-11-26 20:20:40,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 20:20:40,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 12: [2022-11-26 20:20:40,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:20:40,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 20:20:40,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 12: [2022-11-26 20:20:40,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:20:40,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:20:40,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 20:20:40,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 20:20:40,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 12: [2022-11-26 20:20:40,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 3: [2022-11-26 20:20:40,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-26 20:20:40,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 20:20:40,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 14: [2022-11-26 20:20:40,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:20:40,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 20:20:40,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 12: [2022-11-26 20:20:40,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:20:40,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 20:20:40,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 6: [2022-11-26 20:20:40,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:20:40,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 20:20:40,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 9: [2022-11-26 20:20:40,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 20:20:40,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:20:40,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 20:20:40,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 6: [2022-11-26 20:20:40,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:20:40,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:20:40,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 9: [2022-11-26 20:20:40,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 2: [2022-11-26 20:20:40,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:20:40,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 20:20:40,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 2: [2022-11-26 20:20:40,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 20:20:40,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
2: [2022-11-26 20:20:40,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:20:40,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:20:40,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 6: [2022-11-26 20:20:40,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 2: [2022-11-26 20:20:40,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 20:20:40,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 20:20:40,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 2: [2022-11-26 20:20:40,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 7: [2022-11-26 20:20:40,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:20:40,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 20:20:40,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 10: [2022-11-26 20:20:40,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:20:40,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 20:20:40,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 0: [2022-11-26 20:20:40,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:20:40,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:20:40,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 20:20:40,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 20:20:40,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 0: [2022-11-26 20:20:40,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 11: [2022-11-26 20:20:40,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:20:40,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:20:40,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:20:40,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 20:20:40,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 5: [2022-11-26 20:20:40,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 20:20:40,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 5: [2022-11-26 20:20:40,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:20:40,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:20:40,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 20:20:40,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 5: [2022-11-26 20:20:40,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 20:20:40,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 2: [2022-11-26 20:20:40,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 20:20:40,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 20:20:40,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 2: [2022-11-26 20:20:40,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:20:40,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:20:40,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 20:20:40,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 20:20:40,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 2: [2022-11-26 20:20:40,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 7: [2022-11-26 20:20:40,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:20:40,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 20:20:40,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 0: [2022-11-26 20:20:40,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 20:20:40,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 20:20:40,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 12: [2022-11-26 20:20:40,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:20:40,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 6: [2022-11-26 20:20:40,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:20:40,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 6: [2022-11-26 20:20:40,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 20:20:40,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 9: [2022-11-26 20:20:40,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:20:40,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 20:20:40,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:20:40,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
9: [2022-11-26 20:20:40,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 0: [2022-11-26 20:20:40,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:20:40,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 0: [2022-11-26 20:20:40,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 20:20:40,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 12: [2022-11-26 20:20:40,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:20:40,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 5: [2022-11-26 20:20:40,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:20:40,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 5: [2022-11-26 20:20:40,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 20:20:40,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 6: [2022-11-26 20:20:40,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 20:20:40,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 20:20:40,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 3: [2022-11-26 20:20:40,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:20:40,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 20:20:40,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 12: [2022-11-26 20:20:40,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:20:40,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 20:20:40,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 5: [2022-11-26 20:20:40,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:20:40,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 20:20:40,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 10: [2022-11-26 20:20:40,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:20:40,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 20:20:40,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 5: [2022-11-26 20:20:40,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:20:40,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:20:40,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 10: [2022-11-26 20:20:40,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 5: [2022-11-26 20:20:40,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 10: [2022-11-26 20:20:40,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 3: [2022-11-26 20:20:40,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:20:40,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 20:20:40,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 3: [2022-11-26 20:20:40,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 20:20:40,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 20:20:40,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 6: [2022-11-26 20:20:40,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:20:40,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 20:20:40,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 6: [2022-11-26 20:20:40,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:20:40,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 20:20:40,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 2: [2022-11-26 20:20:40,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:20:40,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 20:20:40,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 10: [2022-11-26 20:20:40,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:20:40,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 20:20:40,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 10: [2022-11-26 20:20:40,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:20:40,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 20:20:40,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 7: [2022-11-26 20:20:40,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:20:40,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 20:20:40,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 10: [2022-11-26 20:20:40,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:20:40,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 20:20:40,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 7: [2022-11-26 20:20:40,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:20:40,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 20:20:40,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 9: [2022-11-26 20:20:40,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:20:40,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 20:20:40,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 9: [2022-11-26 20:20:40,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:20:40,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 20:20:40,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 7: [2022-11-26 20:20:40,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:20:40,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 20:20:40,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 14: [2022-11-26 20:20:40,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 20:20:40,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:20:40,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 20:20:40,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 14: [2022-11-26 20:20:40,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 20:20:40,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 9: [2022-11-26 20:20:40,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:20:40,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 20:20:40,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 9: [2022-11-26 20:20:40,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:20:40,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 20:20:40,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 10: [2022-11-26 20:20:40,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:20:40,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 20:20:40,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 12: [2022-11-26 20:20:40,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:20:40,302] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 20:20:40,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 2: [2022-11-26 20:20:40,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:20:40,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 20:20:40,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 13: [2022-11-26 20:20:40,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:20:40,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 20:20:40,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:20:40,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
8: [2022-11-26 20:20:40,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 20:20:40,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 11: [2022-11-26 20:20:40,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 20:20:40,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 8: [2022-11-26 20:20:40,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:20:40,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:20:40,268] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 11: [2022-11-26 20:20:40,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 8: [2022-11-26 20:20:40,268] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 11: [2022-11-26 20:20:40,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 8: [2022-11-26 20:20:40,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:20:40,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
8: [2022-11-26 20:20:40,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 11: [2022-11-26 20:20:40,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 8: [2022-11-26 20:20:40,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 11: [2022-11-26 20:20:40,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 8: [2022-11-26 20:20:40,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:20:40,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:20:40,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 11: [2022-11-26 20:20:40,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 8: [2022-11-26 20:20:40,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 11: [2022-11-26 20:20:40,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 8: [2022-11-26 20:20:40,297] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:20:40,287] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
8: [2022-11-26 20:20:40,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 11: [2022-11-26 20:20:40,287] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 8: [2022-11-26 20:20:40,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 11: [2022-11-26 20:20:40,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 8: [2022-11-26 20:20:40,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:20:40,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:20:40,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 11: [2022-11-26 20:20:40,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 8: [2022-11-26 20:20:40,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 11: [2022-11-26 20:20:40,296] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 11: [2022-11-26 20:20:40,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 20:20:40,296] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 20:20:40,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 11: [2022-11-26 20:20:40,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:20:40,297] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 20:20:40,297] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 14: [2022-11-26 20:20:40,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:20:40,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:20:40,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 20:20:40,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 20:20:40,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 14: [2022-11-26 20:20:40,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 7: [2022-11-26 20:20:40,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:20:40,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 20:20:40,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 0: [2022-11-26 20:20:40,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 20:20:40,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 13: [2022-11-26 20:20:40,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 20:20:40,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 13: [2022-11-26 20:20:40,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:20:40,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 20:20:40,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 13: [2022-11-26 20:20:40,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:20:40,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 20:20:40,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 20:20:40,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:20:40,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 20:20:40,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 13: [2022-11-26 20:20:40,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 13: [2022-11-26 20:20:40,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 20:20:40,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 13: [2022-11-26 20:20:40,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:20:40,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 20:20:40,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 13: [2022-11-26 20:20:40,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:20:40,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 20:20:40,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 20:20:40,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 20:20:40,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 13: [2022-11-26 20:20:40,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 4: [2022-11-26 20:20:40,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:20:40,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 20:20:40,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 4: [2022-11-26 20:20:40,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:20:40,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 20:20:40,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 4: [2022-11-26 20:20:40,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 20:20:40,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 20:20:40,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 4: [2022-11-26 20:20:40,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:20:40,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 20:20:40,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 4: [2022-11-26 20:20:40,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:20:40,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 20:20:40,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 4: [2022-11-26 20:20:40,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:20:40,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:20:40,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 20:20:40,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 20:20:40,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 20:20:40,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 20:20:40,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 4: [2022-11-26 20:20:40,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 4: [2022-11-26 20:20:40,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 1: [2022-11-26 20:20:40,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:20:40,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:20:40,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:20:40,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:20:40,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 20:20:40,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 20:20:40,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 20:20:40,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 20:20:40,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 20:20:40,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 20:20:40,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:20:40,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:20:40,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 1: [2022-11-26 20:20:40,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 1: [2022-11-26 20:20:40,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 1: [2022-11-26 20:20:40,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 1: [2022-11-26 20:20:40,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
1: [2022-11-26 20:20:40,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 20:20:40,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 20:20:40,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 1: [2022-11-26 20:20:40,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 1: [2022-11-26 20:20:40,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:20:40,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 20:20:40,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 15: [2022-11-26 20:20:40,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:20:40,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:20:40,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:20:40,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:20:40,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 20:20:40,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:20:40,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 20:20:40,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 20:20:40,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 20:20:40,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 20:20:40,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 20:20:40,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 15: [2022-11-26 20:20:40,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 15: [2022-11-26 20:20:40,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 20:20:40,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:20:40,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 15: [2022-11-26 20:20:40,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
15: [2022-11-26 20:20:40,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 15: [2022-11-26 20:20:40,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 15: [2022-11-26 20:20:40,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 20:20:40,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:20:40,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 15: [2022-11-26 20:20:40,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step82000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 20:20:40,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step82000 is ready now! 
0: successfully saved checkpoint at iteration 82000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3990.91 15: iteration 82010/ 125429 | consumed samples: 20994560 | consumed tokens: 42996858880 | elapsed time per iteration (s): 1.47 | learning rate: 6.906E-05 | global batch size: 256 | lm loss: 1.967497E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 174.639 | TFLOPs: 28.86 | 15: iteration 82020/ 125429 | consumed samples: 20997120 | consumed tokens: 43002101760 | elapsed time per iteration (s): 1.08 | learning rate: 6.904E-05 | global batch size: 256 | lm loss: 1.955375E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.129 | TFLOPs: 39.35 | 15: iteration 82030/ 125429 | consumed samples: 20999680 | consumed tokens: 43007344640 | elapsed time per iteration (s): 1.06 | learning rate: 6.901E-05 | global batch size: 256 | lm loss: 1.982890E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.075 | TFLOPs: 39.84 | 15: iteration 82040/ 125429 | consumed samples: 21002240 | consumed tokens: 43012587520 | elapsed time per iteration (s): 1.08 | learning rate: 6.899E-05 | global batch size: 256 | lm loss: 1.894479E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.520 | TFLOPs: 39.09 | 15: iteration 82050/ 125429 | consumed samples: 21004800 | consumed tokens: 43017830400 | elapsed time per iteration (s): 1.04 | learning rate: 6.897E-05 | global batch size: 256 | lm loss: 1.958273E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.429 | TFLOPs: 40.72 | 15: iteration 82060/ 125429 | consumed samples: 21007360 | consumed tokens: 43023073280 | elapsed time per iteration (s): 1.07 | 
learning rate: 6.895E-05 | global batch size: 256 | lm loss: 1.964833E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.046 | TFLOPs: 39.67 | 15: iteration 82070/ 125429 | consumed samples: 21009920 | consumed tokens: 43028316160 | elapsed time per iteration (s): 1.04 | learning rate: 6.893E-05 | global batch size: 256 | lm loss: 1.944476E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.154 | TFLOPs: 40.84 | 15: iteration 82080/ 125429 | consumed samples: 21012480 | consumed tokens: 43033559040 | elapsed time per iteration (s): 1.13 | learning rate: 6.891E-05 | global batch size: 256 | lm loss: 1.921074E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.408 | TFLOPs: 37.42 | 15: iteration 82090/ 125429 | consumed samples: 21015040 | consumed tokens: 43038801920 | elapsed time per iteration (s): 1.10 | learning rate: 6.889E-05 | global batch size: 256 | lm loss: 1.958245E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.643 | TFLOPs: 38.45 | 15: iteration 82100/ 125429 | consumed samples: 21017600 | consumed tokens: 43044044800 | elapsed time per iteration (s): 1.06 | learning rate: 6.887E-05 | global batch size: 256 | lm loss: 1.918454E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.604 | TFLOPs: 39.93 | 15: iteration 82110/ 125429 | consumed samples: 21020160 | consumed tokens: 43049287680 | elapsed time per iteration (s): 1.04 | learning rate: 6.885E-05 | global batch size: 256 | lm loss: 1.925831E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.784 | TFLOPs: 40.78 | 15: iteration 82120/ 
125429 | consumed samples: 21022720 | consumed tokens: 43054530560 | elapsed time per iteration (s): 1.06 | learning rate: 6.883E-05 | global batch size: 256 | lm loss: 1.955008E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.446 | TFLOPs: 39.74 | 15: iteration 82130/ 125429 | consumed samples: 21025280 | consumed tokens: 43059773440 | elapsed time per iteration (s): 1.05 | learning rate: 6.881E-05 | global batch size: 256 | lm loss: 1.929131E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.050 | TFLOPs: 40.17 | 15: iteration 82140/ 125429 | consumed samples: 21027840 | consumed tokens: 43065016320 | elapsed time per iteration (s): 1.03 | learning rate: 6.879E-05 | global batch size: 256 | lm loss: 1.948627E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.383 | TFLOPs: 40.88 | 15: iteration 82150/ 125429 | consumed samples: 21030400 | consumed tokens: 43070259200 | elapsed time per iteration (s): 1.26 | learning rate: 6.877E-05 | global batch size: 256 | lm loss: 1.949206E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 203.187 | TFLOPs: 33.58 | 15: iteration 82160/ 125429 | consumed samples: 21032960 | consumed tokens: 43075502080 | elapsed time per iteration (s): 1.07 | learning rate: 6.875E-05 | global batch size: 256 | lm loss: 1.974021E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.212 | TFLOPs: 39.37 | 15: iteration 82170/ 125429 | consumed samples: 21035520 | consumed tokens: 43080744960 | elapsed time per iteration (s): 1.08 | learning rate: 6.873E-05 | global batch size: 256 | lm loss: 1.966847E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 237.006 | TFLOPs: 39.17 | 15: iteration 82180/ 125429 | consumed samples: 21038080 | consumed tokens: 43085987840 | elapsed time per iteration (s): 1.03 | learning rate: 6.871E-05 | global batch size: 256 | lm loss: 1.938883E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.564 | TFLOPs: 41.24 | 15: iteration 82190/ 125429 | consumed samples: 21040640 | consumed tokens: 43091230720 | elapsed time per iteration (s): 1.06 | learning rate: 6.869E-05 | global batch size: 256 | lm loss: 1.945898E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.582 | TFLOPs: 39.92 | 15: iteration 82200/ 125429 | consumed samples: 21043200 | consumed tokens: 43096473600 | elapsed time per iteration (s): 1.04 | learning rate: 6.867E-05 | global batch size: 256 | lm loss: 1.961274E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.881 | TFLOPs: 40.80 | 15: iteration 82210/ 125429 | consumed samples: 21045760 | consumed tokens: 43101716480 | elapsed time per iteration (s): 1.05 | learning rate: 6.865E-05 | global batch size: 256 | lm loss: 1.931407E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.482 | TFLOPs: 40.40 | 15: iteration 82220/ 125429 | consumed samples: 21048320 | consumed tokens: 43106959360 | elapsed time per iteration (s): 1.10 | learning rate: 6.863E-05 | global batch size: 256 | lm loss: 1.949705E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.108 | TFLOPs: 38.36 | 15: iteration 82230/ 125429 | consumed samples: 21050880 | consumed tokens: 43112202240 | elapsed time per iteration (s): 1.03 | learning rate: 
6.861E-05 | global batch size: 256 | lm loss: 1.927874E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.705 | TFLOPs: 41.27 | 15: iteration 82240/ 125429 | consumed samples: 21053440 | consumed tokens: 43117445120 | elapsed time per iteration (s): 1.05 | learning rate: 6.859E-05 | global batch size: 256 | lm loss: 1.946199E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.742 | TFLOPs: 40.28 | 15: iteration 82250/ 125429 | consumed samples: 21056000 | consumed tokens: 43122688000 | elapsed time per iteration (s): 1.04 | learning rate: 6.857E-05 | global batch size: 256 | lm loss: 1.947774E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.971 | TFLOPs: 40.65 | 15: iteration 82260/ 125429 | consumed samples: 21058560 | consumed tokens: 43127930880 | elapsed time per iteration (s): 1.05 | learning rate: 6.855E-05 | global batch size: 256 | lm loss: 1.960041E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.315 | TFLOPs: 40.37 | 15: iteration 82270/ 125429 | consumed samples: 21061120 | consumed tokens: 43133173760 | elapsed time per iteration (s): 1.05 | learning rate: 6.853E-05 | global batch size: 256 | lm loss: 1.965179E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.808 | TFLOPs: 40.13 | 15: iteration 82280/ 125429 | consumed samples: 21063680 | consumed tokens: 43138416640 | elapsed time per iteration (s): 1.02 | learning rate: 6.851E-05 | global batch size: 256 | lm loss: 1.954722E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.170 | TFLOPs: 41.34 | 15: iteration 82290/ 125429 | 
consumed samples: 21066240 | consumed tokens: 43143659520 | elapsed time per iteration (s): 1.09 | learning rate: 6.849E-05 | global batch size: 256 | lm loss: 1.945929E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.854 | TFLOPs: 38.98 | 15: iteration 82300/ 125429 | consumed samples: 21068800 | consumed tokens: 43148902400 | elapsed time per iteration (s): 1.03 | learning rate: 6.847E-05 | global batch size: 256 | lm loss: 1.928954E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.290 | TFLOPs: 41.03 | 15: iteration 82310/ 125429 | consumed samples: 21071360 | consumed tokens: 43154145280 | elapsed time per iteration (s): 1.11 | learning rate: 6.845E-05 | global batch size: 256 | lm loss: 1.917334E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.414 | TFLOPs: 38.08 | 15: iteration 82320/ 125429 | consumed samples: 21073920 | consumed tokens: 43159388160 | elapsed time per iteration (s): 1.02 | learning rate: 6.843E-05 | global batch size: 256 | lm loss: 1.920730E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.854 | TFLOPs: 41.29 | 15: iteration 82330/ 125429 | consumed samples: 21076480 | consumed tokens: 43164631040 | elapsed time per iteration (s): 1.08 | learning rate: 6.841E-05 | global batch size: 256 | lm loss: 1.919282E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.870 | TFLOPs: 39.14 | 15: iteration 82340/ 125429 | consumed samples: 21079040 | consumed tokens: 43169873920 | elapsed time per iteration (s): 1.04 | learning rate: 6.839E-05 | global batch size: 256 | lm loss: 1.967763E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 245.195 | TFLOPs: 40.52 | 15: iteration 82350/ 125429 | consumed samples: 21081600 | consumed tokens: 43175116800 | elapsed time per iteration (s): 1.06 | learning rate: 6.837E-05 | global batch size: 256 | lm loss: 1.916267E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.352 | TFLOPs: 39.89 | 15: iteration 82360/ 125429 | consumed samples: 21084160 | consumed tokens: 43180359680 | elapsed time per iteration (s): 1.04 | learning rate: 6.835E-05 | global batch size: 256 | lm loss: 1.937900E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.265 | TFLOPs: 40.70 | 15: iteration 82370/ 125429 | consumed samples: 21086720 | consumed tokens: 43185602560 | elapsed time per iteration (s): 1.10 | learning rate: 6.833E-05 | global batch size: 256 | lm loss: 1.948923E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.144 | TFLOPs: 38.36 | 15: iteration 82380/ 125429 | consumed samples: 21089280 | consumed tokens: 43190845440 | elapsed time per iteration (s): 1.03 | learning rate: 6.831E-05 | global batch size: 256 | lm loss: 1.918142E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.913 | TFLOPs: 40.97 | 15: iteration 82390/ 125429 | consumed samples: 21091840 | consumed tokens: 43196088320 | elapsed time per iteration (s): 1.04 | learning rate: 6.829E-05 | global batch size: 256 | lm loss: 1.931641E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.962 | TFLOPs: 40.65 | 15: iteration 82400/ 125429 | consumed samples: 21094400 | consumed tokens: 43201331200 | elapsed time per iteration (s): 1.10 | learning rate: 6.827E-05 | global batch 
size: 256 | lm loss: 1.921610E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.968 | TFLOPs: 38.33 | 15: iteration 82410/ 125429 | consumed samples: 21096960 | consumed tokens: 43206574080 | elapsed time per iteration (s): 1.11 | learning rate: 6.825E-05 | global batch size: 256 | lm loss: 1.924538E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.723 | TFLOPs: 37.96 | 15: iteration 82420/ 125429 | consumed samples: 21099520 | consumed tokens: 43211816960 | elapsed time per iteration (s): 1.08 | learning rate: 6.823E-05 | global batch size: 256 | lm loss: 1.932528E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.576 | TFLOPs: 39.26 | 15: iteration 82430/ 125429 | consumed samples: 21102080 | consumed tokens: 43217059840 | elapsed time per iteration (s): 1.05 | learning rate: 6.821E-05 | global batch size: 256 | lm loss: 1.957088E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.641 | TFLOPs: 40.26 | 15: iteration 82440/ 125429 | consumed samples: 21104640 | consumed tokens: 43222302720 | elapsed time per iteration (s): 1.04 | learning rate: 6.819E-05 | global batch size: 256 | lm loss: 1.926515E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.012 | TFLOPs: 40.82 | 15: iteration 82450/ 125429 | consumed samples: 21107200 | consumed tokens: 43227545600 | elapsed time per iteration (s): 1.05 | learning rate: 6.817E-05 | global batch size: 256 | lm loss: 1.956336E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.330 | TFLOPs: 40.38 | 15: iteration 82460/ 125429 | consumed samples: 21109760 | 
consumed tokens: 43232788480 | elapsed time per iteration (s): 1.04 | learning rate: 6.815E-05 | global batch size: 256 | lm loss: 1.954862E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.696 | TFLOPs: 40.60 | 15: iteration 82470/ 125429 | consumed samples: 21112320 | consumed tokens: 43238031360 | elapsed time per iteration (s): 1.06 | learning rate: 6.813E-05 | global batch size: 256 | lm loss: 1.915395E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.609 | TFLOPs: 40.09 | 15: iteration 82480/ 125429 | consumed samples: 21114880 | consumed tokens: 43243274240 | elapsed time per iteration (s): 1.06 | learning rate: 6.811E-05 | global batch size: 256 | lm loss: 1.922524E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.414 | TFLOPs: 39.73 | 15: iteration 82490/ 125429 | consumed samples: 21117440 | consumed tokens: 43248517120 | elapsed time per iteration (s): 1.03 | learning rate: 6.809E-05 | global batch size: 256 | lm loss: 1.925040E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.370 | TFLOPs: 40.88 | 15: iteration 82500/ 125429 | consumed samples: 21120000 | consumed tokens: 43253760000 | elapsed time per iteration (s): 1.05 | learning rate: 6.807E-05 | global batch size: 256 | lm loss: 1.963301E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.274 | TFLOPs: 40.37 | 15: iteration 82510/ 125429 | consumed samples: 21122560 | consumed tokens: 43259002880 | elapsed time per iteration (s): 1.06 | learning rate: 6.804E-05 | global batch size: 256 | lm loss: 1.928373E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 241.342 | TFLOPs: 39.88 | 15: iteration 82520/ 125429 | consumed samples: 21125120 | consumed tokens: 43264245760 | elapsed time per iteration (s): 1.06 | learning rate: 6.802E-05 | global batch size: 256 | lm loss: 1.948784E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.381 | TFLOPs: 40.06 | 15: iteration 82530/ 125429 | consumed samples: 21127680 | consumed tokens: 43269488640 | elapsed time per iteration (s): 1.04 | learning rate: 6.800E-05 | global batch size: 256 | lm loss: 1.957208E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.591 | TFLOPs: 40.59 | 15: iteration 82540/ 125429 | consumed samples: 21130240 | consumed tokens: 43274731520 | elapsed time per iteration (s): 1.03 | learning rate: 6.798E-05 | global batch size: 256 | lm loss: 1.929201E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.111 | TFLOPs: 41.17 | 15: iteration 82550/ 125429 | consumed samples: 21132800 | consumed tokens: 43279974400 | elapsed time per iteration (s): 1.04 | learning rate: 6.796E-05 | global batch size: 256 | lm loss: 1.955669E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.964 | TFLOPs: 40.65 | 15: iteration 82560/ 125429 | consumed samples: 21135360 | consumed tokens: 43285217280 | elapsed time per iteration (s): 1.08 | learning rate: 6.794E-05 | global batch size: 256 | lm loss: 1.942183E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.938 | TFLOPs: 39.16 | 15: iteration 82570/ 125429 | consumed samples: 21137920 | consumed tokens: 43290460160 | elapsed time per iteration (s): 1.03 | learning rate: 6.792E-05 | global batch size: 256 | lm loss: 
1.938282E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.542 | TFLOPs: 40.91 | 15: iteration 82580/ 125429 | consumed samples: 21140480 | consumed tokens: 43295703040 | elapsed time per iteration (s): 1.04 | learning rate: 6.790E-05 | global batch size: 256 | lm loss: 1.973592E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.115 | TFLOPs: 40.51 | 15: iteration 82590/ 125429 | consumed samples: 21143040 | consumed tokens: 43300945920 | elapsed time per iteration (s): 1.06 | learning rate: 6.788E-05 | global batch size: 256 | lm loss: 1.937249E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.235 | TFLOPs: 40.03 | 15: iteration 82600/ 125429 | consumed samples: 21145600 | consumed tokens: 43306188800 | elapsed time per iteration (s): 1.08 | learning rate: 6.786E-05 | global batch size: 256 | lm loss: 1.960907E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.259 | TFLOPs: 39.21 | 15: iteration 82610/ 125429 | consumed samples: 21148160 | consumed tokens: 43311431680 | elapsed time per iteration (s): 1.03 | learning rate: 6.784E-05 | global batch size: 256 | lm loss: 1.933368E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.589 | TFLOPs: 40.92 | 15: iteration 82620/ 125429 | consumed samples: 21150720 | consumed tokens: 43316674560 | elapsed time per iteration (s): 1.03 | learning rate: 6.782E-05 | global batch size: 256 | lm loss: 1.943402E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.142 | TFLOPs: 41.17 | 15: iteration 82630/ 125429 | consumed samples: 21153280 | consumed tokens: 
43321917440 | elapsed time per iteration (s): 1.04 | learning rate: 6.780E-05 | global batch size: 256 | lm loss: 1.953084E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.767 | TFLOPs: 40.78 | 15: iteration 82640/ 125429 | consumed samples: 21155840 | consumed tokens: 43327160320 | elapsed time per iteration (s): 1.04 | learning rate: 6.778E-05 | global batch size: 256 | lm loss: 1.947703E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.891 | TFLOPs: 40.64 | 15: iteration 82650/ 125429 | consumed samples: 21158400 | consumed tokens: 43332403200 | elapsed time per iteration (s): 1.03 | learning rate: 6.776E-05 | global batch size: 256 | lm loss: 1.939937E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.948 | TFLOPs: 41.14 | 15: iteration 82660/ 125429 | consumed samples: 21160960 | consumed tokens: 43337646080 | elapsed time per iteration (s): 1.05 | learning rate: 6.774E-05 | global batch size: 256 | lm loss: 1.955365E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.475 | TFLOPs: 40.40 | 15: iteration 82670/ 125429 | consumed samples: 21163520 | consumed tokens: 43342888960 | elapsed time per iteration (s): 1.11 | learning rate: 6.772E-05 | global batch size: 256 | lm loss: 1.930617E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.065 | TFLOPs: 38.19 | 15: iteration 82680/ 125429 | consumed samples: 21166080 | consumed tokens: 43348131840 | elapsed time per iteration (s): 1.06 | learning rate: 6.770E-05 | global batch size: 256 | lm loss: 1.927646E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 240.474 | TFLOPs: 39.74 | 15: iteration 82690/ 125429 | consumed samples: 21168640 | consumed tokens: 43353374720 | elapsed time per iteration (s): 1.04 | learning rate: 6.768E-05 | global batch size: 256 | lm loss: 1.938908E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.218 | TFLOPs: 40.69 | 15: iteration 82700/ 125429 | consumed samples: 21171200 | consumed tokens: 43358617600 | elapsed time per iteration (s): 1.06 | learning rate: 6.766E-05 | global batch size: 256 | lm loss: 1.952314E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.022 | TFLOPs: 39.83 | 15: iteration 82710/ 125429 | consumed samples: 21173760 | consumed tokens: 43363860480 | elapsed time per iteration (s): 1.03 | learning rate: 6.764E-05 | global batch size: 256 | lm loss: 1.923434E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.331 | TFLOPs: 41.04 | 15: iteration 82720/ 125429 | consumed samples: 21176320 | consumed tokens: 43369103360 | elapsed time per iteration (s): 1.04 | learning rate: 6.762E-05 | global batch size: 256 | lm loss: 1.932090E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.216 | TFLOPs: 40.69 | 15: iteration 82730/ 125429 | consumed samples: 21178880 | consumed tokens: 43374346240 | elapsed time per iteration (s): 1.04 | learning rate: 6.760E-05 | global batch size: 256 | lm loss: 1.943212E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.682 | TFLOPs: 40.60 | 15: iteration 82740/ 125429 | consumed samples: 21181440 | consumed tokens: 43379589120 | elapsed time per iteration (s): 1.04 | learning rate: 6.758E-05 | global batch size: 256 | lm loss: 1.948786E+00 | grad 
norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.239 | TFLOPs: 40.86 | 15: iteration 82750/ 125429 | consumed samples: 21184000 | consumed tokens: 43384832000 | elapsed time per iteration (s): 1.07 | learning rate: 6.756E-05 | global batch size: 256 | lm loss: 1.956974E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.269 | TFLOPs: 39.38 | 15: iteration 82760/ 125429 | consumed samples: 21186560 | consumed tokens: 43390074880 | elapsed time per iteration (s): 1.03 | learning rate: 6.754E-05 | global batch size: 256 | lm loss: 1.966229E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.707 | TFLOPs: 41.27 | 15: iteration 82770/ 125429 | consumed samples: 21189120 | consumed tokens: 43395317760 | elapsed time per iteration (s): 1.04 | learning rate: 6.752E-05 | global batch size: 256 | lm loss: 1.940775E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.429 | TFLOPs: 40.56 | 15: iteration 82780/ 125429 | consumed samples: 21191680 | consumed tokens: 43400560640 | elapsed time per iteration (s): 1.04 | learning rate: 6.750E-05 | global batch size: 256 | lm loss: 1.949538E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.803 | TFLOPs: 40.62 | 15: iteration 82790/ 125429 | consumed samples: 21194240 | consumed tokens: 43405803520 | elapsed time per iteration (s): 1.04 | learning rate: 6.748E-05 | global batch size: 256 | lm loss: 1.979386E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.007 | TFLOPs: 40.82 | 15: iteration 82800/ 125429 | consumed samples: 21196800 | consumed tokens: 43411046400 | elapsed time 
per iteration (s): 1.04 | learning rate: 6.746E-05 | global batch size: 256 | lm loss: 1.937209E+00 | grad norm: 0.132 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.992 | TFLOPs: 40.49 | 15: iteration 82810/ 125429 | consumed samples: 21199360 | consumed tokens: 43416289280 | elapsed time per iteration (s): 1.03 | learning rate: 6.744E-05 | global batch size: 256 | lm loss: 1.930088E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.386 | TFLOPs: 40.88 | 15: iteration 82820/ 125429 | consumed samples: 21201920 | consumed tokens: 43421532160 | elapsed time per iteration (s): 1.04 | learning rate: 6.742E-05 | global batch size: 256 | lm loss: 1.921949E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.580 | TFLOPs: 40.75 | 15: iteration 82830/ 125429 | consumed samples: 21204480 | consumed tokens: 43426775040 | elapsed time per iteration (s): 1.09 | learning rate: 6.740E-05 | global batch size: 256 | lm loss: 1.946868E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.728 | TFLOPs: 38.79 | 15: iteration 82840/ 125429 | consumed samples: 21207040 | consumed tokens: 43432017920 | elapsed time per iteration (s): 1.05 | learning rate: 6.738E-05 | global batch size: 256 | lm loss: 1.929806E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.426 | TFLOPs: 40.23 | 15: iteration 82850/ 125429 | consumed samples: 21209600 | consumed tokens: 43437260800 | elapsed time per iteration (s): 1.04 | learning rate: 6.736E-05 | global batch size: 256 | lm loss: 1.931825E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.548 | TFLOPs: 
40.74 | 15: iteration 82860/ 125429 | consumed samples: 21212160 | consumed tokens: 43442503680 | elapsed time per iteration (s): 1.04 | learning rate: 6.734E-05 | global batch size: 256 | lm loss: 1.947286E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.088 | TFLOPs: 40.83 | 15: iteration 82870/ 125429 | consumed samples: 21214720 | consumed tokens: 43447746560 | elapsed time per iteration (s): 1.13 | learning rate: 6.732E-05 | global batch size: 256 | lm loss: 1.958317E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.089 | TFLOPs: 37.53 | 15: iteration 82880/ 125429 | consumed samples: 21217280 | consumed tokens: 43452989440 | elapsed time per iteration (s): 1.05 | learning rate: 6.730E-05 | global batch size: 256 | lm loss: 1.964014E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.187 | TFLOPs: 40.19 | 15: iteration 82890/ 125429 | consumed samples: 21219840 | consumed tokens: 43458232320 | elapsed time per iteration (s): 1.05 | learning rate: 6.728E-05 | global batch size: 256 | lm loss: 1.955833E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.695 | TFLOPs: 40.44 | 15: iteration 82900/ 125429 | consumed samples: 21222400 | consumed tokens: 43463475200 | elapsed time per iteration (s): 1.04 | learning rate: 6.726E-05 | global batch size: 256 | lm loss: 1.937684E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.146 | TFLOPs: 40.51 | 15: iteration 82910/ 125429 | consumed samples: 21224960 | consumed tokens: 43468718080 | elapsed time per iteration (s): 1.02 | learning rate: 6.724E-05 | global batch size: 256 | lm loss: 1.955946E+00 | grad norm: 0.151 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.983 | TFLOPs: 41.31 | 15: iteration 82920/ 125429 | consumed samples: 21227520 | consumed tokens: 43473960960 | elapsed time per iteration (s): 1.04 | learning rate: 6.722E-05 | global batch size: 256 | lm loss: 1.938162E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.207 | TFLOPs: 40.69 | 15: iteration 82930/ 125429 | consumed samples: 21230080 | consumed tokens: 43479203840 | elapsed time per iteration (s): 1.04 | learning rate: 6.720E-05 | global batch size: 256 | lm loss: 1.923825E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.262 | TFLOPs: 40.86 | 15: iteration 82940/ 125429 | consumed samples: 21232640 | consumed tokens: 43484446720 | elapsed time per iteration (s): 1.06 | learning rate: 6.718E-05 | global batch size: 256 | lm loss: 1.946866E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.603 | TFLOPs: 40.09 | 15: iteration 82950/ 125429 | consumed samples: 21235200 | consumed tokens: 43489689600 | elapsed time per iteration (s): 1.04 | learning rate: 6.716E-05 | global batch size: 256 | lm loss: 1.929200E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.002 | TFLOPs: 40.65 | 15: iteration 82960/ 125429 | consumed samples: 21237760 | consumed tokens: 43494932480 | elapsed time per iteration (s): 1.04 | learning rate: 6.714E-05 | global batch size: 256 | lm loss: 1.935937E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.016 | TFLOPs: 40.66 | 15: iteration 82970/ 125429 | consumed samples: 21240320 | consumed tokens: 43500175360 | elapsed time per iteration (s): 1.06 | 
learning rate: 6.712E-05 | global batch size: 256 | lm loss: 1.960085E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.613 | TFLOPs: 40.09 | 15: iteration 82980/ 125429 | consumed samples: 21242880 | consumed tokens: 43505418240 | elapsed time per iteration (s): 1.07 | learning rate: 6.710E-05 | global batch size: 256 | lm loss: 1.943628E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.344 | TFLOPs: 39.72 | 15: iteration 82990/ 125429 | consumed samples: 21245440 | consumed tokens: 43510661120 | elapsed time per iteration (s): 1.06 | learning rate: 6.708E-05 | global batch size: 256 | lm loss: 1.959135E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.011 | TFLOPs: 39.83 | 15: iteration 83000/ 125429 | consumed samples: 21248000 | consumed tokens: 43515904000 | elapsed time per iteration (s): 1.06 | learning rate: 6.706E-05 | global batch size: 256 | lm loss: 1.938972E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.649 | TFLOPs: 40.10 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 83000 | lm loss value: 1.796503E+00 | lm loss PPL: 6.028531E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 83000 to checkpoints_1b5 0: [2022-11-26 20:38:16,820] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step83000 is begin to save! 0: [2022-11-26 20:38:16,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 20:38:17,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_01-model_00-model_states.pt. 0: [2022-11-26 20:38:17,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_03-model_00-model_states.pt... 0: [2022-11-26 20:38:17,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_03-model_00-model_states.pt. 0: [2022-11-26 20:38:17,306] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_04-model_00-model_states.pt... 0: [2022-11-26 20:38:17,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_04-model_00-model_states.pt. 0: [2022-11-26 20:38:17,456] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_05-model_00-model_states.pt... 0: [2022-11-26 20:38:17,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_05-model_00-model_states.pt. 0: [2022-11-26 20:38:17,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_06-model_00-model_states.pt... 0: [2022-11-26 20:38:17,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_06-model_00-model_states.pt. 0: [2022-11-26 20:38:17,758] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_07-model_00-model_states.pt... 0: [2022-11-26 20:38:17,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_07-model_00-model_states.pt. 0: [2022-11-26 20:38:17,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 20:38:18,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_08-model_00-model_states.pt. 0: [2022-11-26 20:38:18,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_09-model_00-model_states.pt... 0: [2022-11-26 20:38:18,216] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_09-model_00-model_states.pt. 0: [2022-11-26 20:38:18,216] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_10-model_00-model_states.pt... 0: [2022-11-26 20:38:18,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_10-model_00-model_states.pt. 0: [2022-11-26 20:38:18,376] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_11-model_00-model_states.pt... 0: [2022-11-26 20:38:18,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_11-model_00-model_states.pt. 0: [2022-11-26 20:38:18,524] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_12-model_00-model_states.pt... 0: [2022-11-26 20:38:18,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_12-model_00-model_states.pt. 0: [2022-11-26 20:38:18,675] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_13-model_00-model_states.pt... 0: [2022-11-26 20:38:18,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_13-model_00-model_states.pt. 0: [2022-11-26 20:38:18,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 20:38:19,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_14-model_00-model_states.pt. 0: [2022-11-26 20:38:19,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_15-model_00-model_states.pt... 0: [2022-11-26 20:38:19,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_15-model_00-model_states.pt. 0: [2022-11-26 20:38:19,291] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_16-model_00-model_states.pt... 0: [2022-11-26 20:38:19,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_16-model_00-model_states.pt. 0: [2022-11-26 20:38:19,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_17-model_00-model_states.pt... 0: [2022-11-26 20:38:19,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_17-model_00-model_states.pt. 0: [2022-11-26 20:38:19,595] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_18-model_00-model_states.pt... 0: [2022-11-26 20:38:19,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_18-model_00-model_states.pt. 0: [2022-11-26 20:38:19,745] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_19-model_00-model_states.pt... 0: [2022-11-26 20:38:19,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_19-model_00-model_states.pt. 0: [2022-11-26 20:38:19,896] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 20:38:20,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_20-model_00-model_states.pt. 0: [2022-11-26 20:38:20,049] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_21-model_00-model_states.pt... 0: [2022-11-26 20:38:20,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_21-model_00-model_states.pt. 0: [2022-11-26 20:38:20,197] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_22-model_00-model_states.pt... 0: [2022-11-26 20:38:20,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_22-model_00-model_states.pt. 0: [2022-11-26 20:38:20,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_23-model_00-model_states.pt... 0: [2022-11-26 20:38:20,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_23-model_00-model_states.pt. 0: [2022-11-26 20:38:20,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_24-model_00-model_states.pt... 0: [2022-11-26 20:38:20,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_24-model_00-model_states.pt. 0: [2022-11-26 20:38:20,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_25-model_00-model_states.pt... 0: [2022-11-26 20:38:20,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_25-model_00-model_states.pt. 0: [2022-11-26 20:38:20,805] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 20:38:20,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_26-model_00-model_states.pt. 0: [2022-11-26 20:38:20,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_27-model_00-model_states.pt... 0: [2022-11-26 20:38:21,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_27-model_00-model_states.pt. 0: [2022-11-26 20:38:21,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_28-model_00-model_states.pt... 0: [2022-11-26 20:38:21,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_28-model_00-model_states.pt. 0: [2022-11-26 20:38:21,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_29-model_00-model_states.pt... 0: [2022-11-26 20:38:21,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_29-model_00-model_states.pt. 0: [2022-11-26 20:38:21,382] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_30-model_00-model_states.pt... 0: [2022-11-26 20:38:21,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_30-model_00-model_states.pt. 0: [2022-11-26 20:38:21,531] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/layer_32-model_00-model_states.pt... 0: [2022-11-26 20:38:21,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 20:38:21,537] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step83000/mp_rank_00_model_states.pt 0: [2022-11-26 20:38:21,537] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/mp_rank_00_model_states.pt... 0: [2022-11-26 20:38:21,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/mp_rank_00_model_states.pt. 0: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
10: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
13: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
0: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
5: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
9: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
14: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
6: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
11: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
1: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
2: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:38:21,582] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step83000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:38:21,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 20:38:21,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 20:38:21,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 0: [2022-11-26 20:38:21,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:38:21,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 20:38:21,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 7: [2022-11-26 20:38:21,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:38:21,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 20:38:21,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 0: [2022-11-26 20:38:21,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:38:21,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 20:38:21,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 0: [2022-11-26 20:38:21,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 20:38:21,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 20:38:21,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 11: [2022-11-26 20:38:21,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:38:21,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:38:21,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:38:21,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 20:38:21,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 7: [2022-11-26 20:38:21,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:38:21,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 20:38:21,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 7: [2022-11-26 20:38:21,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:38:21,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 20:38:21,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 14: [2022-11-26 20:38:21,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:38:21,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:38:21,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:38:21,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 20:38:21,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 0: [2022-11-26 20:38:21,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:38:21,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 20:38:21,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 7: [2022-11-26 20:38:21,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:38:21,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 20:38:21,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 0: [2022-11-26 20:38:21,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:38:21,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 20:38:21,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 9: [2022-11-26 20:38:21,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:38:21,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:38:21,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:38:21,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 20:38:21,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 20:38:21,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 20:38:21,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
9: [2022-11-26 20:38:21,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 9: [2022-11-26 20:38:21,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 7: [2022-11-26 20:38:21,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:38:21,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 20:38:21,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 9: [2022-11-26 20:38:21,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:38:21,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 20:38:21,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 7: [2022-11-26 20:38:21,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:38:21,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:38:21,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 20:38:21,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 20:38:21,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 7: [2022-11-26 20:38:21,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 6: [2022-11-26 20:38:21,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:38:21,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:38:21,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 20:38:21,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 20:38:21,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 6: [2022-11-26 20:38:21,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 6: [2022-11-26 20:38:21,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:38:21,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 20:38:21,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 20:38:21,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 20:38:21,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 6: [2022-11-26 20:38:21,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 6: [2022-11-26 20:38:21,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:38:21,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 20:38:21,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 6: [2022-11-26 20:38:21,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:38:21,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 20:38:21,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 6: [2022-11-26 20:38:21,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 20:38:21,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 20:38:21,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 6: [2022-11-26 20:38:21,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:38:21,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 20:38:21,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 14: [2022-11-26 20:38:21,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 20:38:21,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 14: [2022-11-26 20:38:21,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:38:21,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 20:38:21,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 14: [2022-11-26 20:38:21,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-26 20:38:21,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 20:38:21,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 14: [2022-11-26 20:38:21,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:38:21,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 20:38:21,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 11: [2022-11-26 20:38:21,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 20:38:21,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 11: [2022-11-26 20:38:21,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:38:21,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 20:38:21,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 11: [2022-11-26 20:38:21,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 20:38:21,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 20:38:21,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 11: [2022-11-26 20:38:21,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:38:21,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 20:38:21,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 11: [2022-11-26 20:38:21,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:38:21,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 20:38:21,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 11: [2022-11-26 20:38:21,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:38:21,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 20:38:21,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 1: [2022-11-26 20:38:21,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
5: [2022-11-26 20:38:21,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:38:21,816] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 20:38:21,816] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 5: [2022-11-26 20:38:21,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:38:21,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:38:21,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:38:21,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 20:38:21,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 20:38:21,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 20:38:21,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 5: [2022-11-26 20:38:21,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 5: [2022-11-26 20:38:21,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
14: [2022-11-26 20:38:21,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:38:21,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 20:38:21,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 14: [2022-11-26 20:38:21,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:38:21,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 20:38:21,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 10: [2022-11-26 20:38:21,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:38:21,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 20:38:21,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 10: [2022-11-26 20:38:21,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:38:21,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 20:38:21,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
1: [2022-11-26 20:38:21,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 20:38:21,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 1: [2022-11-26 20:38:21,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:38:21,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 20:38:21,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 1: [2022-11-26 20:38:21,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:38:21,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 20:38:21,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 1: [2022-11-26 20:38:21,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:38:21,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:38:21,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:38:21,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
8: [2022-11-26 20:38:21,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:38:21,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 20:38:21,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:38:21,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 8: [2022-11-26 20:38:21,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 20:38:21,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:38:21,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 8: [2022-11-26 20:38:21,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 20:38:21,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:38:21,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 8: [2022-11-26 20:38:21,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 20:38:21,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
8: [2022-11-26 20:38:21,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:38:21,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 20:38:21,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 8: [2022-11-26 20:38:21,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:38:21,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 20:38:21,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 8: [2022-11-26 20:38:21,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:38:21,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 20:38:21,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 8: [2022-11-26 20:38:21,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:38:21,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 20:38:21,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
3: [2022-11-26 20:38:21,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:38:21,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 20:38:21,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 3: [2022-11-26 20:38:21,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:38:21,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 20:38:21,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 3: [2022-11-26 20:38:21,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:38:21,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:38:21,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 20:38:21,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 3: [2022-11-26 20:38:21,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 20:38:21,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
4: [2022-11-26 20:38:21,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:38:21,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:38:21,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:38:21,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:38:21,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:38:21,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 20:38:21,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 20:38:21,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 20:38:21,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 20:38:21,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
4: [2022-11-26 20:38:21,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 3: [2022-11-26 20:38:21,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 4: [2022-11-26 20:38:21,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 3: [2022-11-26 20:38:21,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 4: [2022-11-26 20:38:21,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:38:21,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 4: [2022-11-26 20:38:21,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 20:38:21,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 4: [2022-11-26 20:38:21,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:38:21,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 20:38:21,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 4: [2022-11-26 20:38:21,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 20:38:21,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 20:38:21,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 4: [2022-11-26 20:38:21,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:38:21,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 20:38:21,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:38:21,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 4: [2022-11-26 20:38:21,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 20:38:21,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:38:21,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 4: [2022-11-26 20:38:21,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 20:38:21,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 4: [2022-11-26 20:38:21,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 20:38:21,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 20:38:21,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 14: [2022-11-26 20:38:21,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:38:21,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 20:38:21,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 14: [2022-11-26 20:38:21,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:38:21,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 20:38:21,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 1: [2022-11-26 20:38:21,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:38:21,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 20:38:21,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 20:38:21,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 20:38:21,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 20:38:21,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 20:38:21,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 1: [2022-11-26 20:38:21,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 1: [2022-11-26 20:38:21,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 1: [2022-11-26 20:38:21,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 1: [2022-11-26 20:38:21,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:38:21,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 20:38:21,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
15: [2022-11-26 20:38:21,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 20:38:21,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 15: [2022-11-26 20:38:21,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:38:21,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 20:38:21,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 15: [2022-11-26 20:38:21,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:38:21,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:38:21,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:38:21,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 20:38:21,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 20:38:21,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 20:38:21,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
15: [2022-11-26 20:38:21,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 15: [2022-11-26 20:38:21,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 15: [2022-11-26 20:38:21,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:38:21,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:38:21,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:38:21,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 20:38:21,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 20:38:21,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 20:38:21,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 15: [2022-11-26 20:38:21,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 15: [2022-11-26 20:38:21,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 10: [2022-11-26 20:38:21,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:38:21,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 20:38:21,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 11: [2022-11-26 20:38:21,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:38:21,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 20:38:21,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 11: [2022-11-26 20:38:21,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:38:21,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 20:38:21,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 10: [2022-11-26 20:38:21,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:38:21,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 20:38:21,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 10: [2022-11-26 20:38:21,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:38:21,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 20:38:21,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 10: [2022-11-26 20:38:21,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:38:21,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 20:38:21,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 10: [2022-11-26 20:38:21,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:38:21,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 20:38:21,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 10: [2022-11-26 20:38:21,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:38:21,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 20:38:21,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 9: [2022-11-26 20:38:21,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 20:38:21,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 20:38:21,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:38:21,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:38:21,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:38:21,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 9: [2022-11-26 20:38:21,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 20:38:21,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 20:38:21,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 20:38:21,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 9: [2022-11-26 20:38:21,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 9: [2022-11-26 20:38:21,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 5: [2022-11-26 20:38:21,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 20:38:21,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 20:38:21,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 5: [2022-11-26 20:38:21,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:38:21,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 20:38:21,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 5: [2022-11-26 20:38:21,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:38:21,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 20:38:21,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 5: [2022-11-26 20:38:21,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:38:21,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 20:38:21,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 13: [2022-11-26 20:38:21,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 20:38:21,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 20:38:21,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 20:38:21,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 13: [2022-11-26 20:38:21,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 13: [2022-11-26 20:38:21,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 20:38:21,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 13: [2022-11-26 20:38:21,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:38:21,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 20:38:21,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 13: [2022-11-26 20:38:21,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:38:21,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 20:38:21,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
13: [2022-11-26 20:38:21,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:38:21,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 20:38:21,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 13: [2022-11-26 20:38:21,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:38:21,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 20:38:21,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 13: [2022-11-26 20:38:21,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:38:21,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 20:38:21,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 0: [2022-11-26 20:38:21,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 20:38:21,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 2: [2022-11-26 20:38:22,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 20:38:22,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:38:22,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:38:22,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:38:22,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:38:22,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:38:22,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 20:38:22,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 20:38:22,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 20:38:22,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 20:38:22,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 20:38:22,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
2: [2022-11-26 20:38:22,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 20:38:22,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 2: [2022-11-26 20:38:22,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 2: [2022-11-26 20:38:22,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 2: [2022-11-26 20:38:22,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 2: [2022-11-26 20:38:22,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 2: [2022-11-26 20:38:22,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:38:22,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:38:22,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 20:38:22,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 20:38:22,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 2: [2022-11-26 20:38:22,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 12: [2022-11-26 20:38:22,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 20:38:22,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 20:38:22,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:38:22,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:38:22,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 12: [2022-11-26 20:38:22,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 20:38:22,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 20:38:22,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 12: [2022-11-26 20:38:22,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 12: [2022-11-26 20:38:22,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:38:22,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:38:22,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:38:22,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 20:38:22,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:38:22,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 20:38:22,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 20:38:22,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 20:38:22,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 20:38:22,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 12: [2022-11-26 20:38:22,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step83000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 20:38:22,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 12: [2022-11-26 20:38:22,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 12: [2022-11-26 20:38:22,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 12: [2022-11-26 20:38:22,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step83000 is ready now! 
0: successfully saved checkpoint at iteration 83000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 5277.42 15: iteration 83010/ 125429 | consumed samples: 21250560 | consumed tokens: 43521146880 | elapsed time per iteration (s): 1.59 | learning rate: 6.704E-05 | global batch size: 256 | lm loss: 1.926854E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 161.014 | TFLOPs: 26.61 | 15: iteration 83020/ 125429 | consumed samples: 21253120 | consumed tokens: 43526389760 | elapsed time per iteration (s): 1.05 | learning rate: 6.702E-05 | global batch size: 256 | lm loss: 1.930502E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.573 | TFLOPs: 40.25 | 15: iteration 83030/ 125429 | consumed samples: 21255680 | consumed tokens: 43531632640 | elapsed time per iteration (s): 1.05 | learning rate: 6.700E-05 | global batch size: 256 | lm loss: 1.923319E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.666 | TFLOPs: 40.10 | 15: iteration 83040/ 125429 | consumed samples: 21258240 | consumed tokens: 43536875520 | elapsed time per iteration (s): 1.20 | learning rate: 6.698E-05 | global batch size: 256 | lm loss: 1.923093E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 212.805 | TFLOPs: 35.17 | 15: iteration 83050/ 125429 | consumed samples: 21260800 | consumed tokens: 43542118400 | elapsed time per iteration (s): 1.05 | learning rate: 6.696E-05 | global batch size: 256 | lm loss: 1.960386E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.003 | TFLOPs: 40.16 | 15: iteration 83060/ 125429 | consumed samples: 21263360 | consumed tokens: 43547361280 | elapsed time per iteration (s): 1.03 | 
learning rate: 6.694E-05 | global batch size: 256 | lm loss: 1.940036E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.514 | TFLOPs: 41.07 | 15: iteration 83070/ 125429 | consumed samples: 21265920 | consumed tokens: 43552604160 | elapsed time per iteration (s): 1.03 | learning rate: 6.692E-05 | global batch size: 256 | lm loss: 1.954483E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.617 | TFLOPs: 41.25 | 15: iteration 83080/ 125429 | consumed samples: 21268480 | consumed tokens: 43557847040 | elapsed time per iteration (s): 1.06 | learning rate: 6.690E-05 | global batch size: 256 | lm loss: 1.957974E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.225 | TFLOPs: 40.03 | 15: iteration 83090/ 125429 | consumed samples: 21271040 | consumed tokens: 43563089920 | elapsed time per iteration (s): 1.07 | learning rate: 6.688E-05 | global batch size: 256 | lm loss: 1.922847E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.415 | TFLOPs: 39.57 | 15: iteration 83100/ 125429 | consumed samples: 21273600 | consumed tokens: 43568332800 | elapsed time per iteration (s): 1.05 | learning rate: 6.686E-05 | global batch size: 256 | lm loss: 1.953073E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.749 | TFLOPs: 40.45 | 15: iteration 83110/ 125429 | consumed samples: 21276160 | consumed tokens: 43573575680 | elapsed time per iteration (s): 1.04 | learning rate: 6.684E-05 | global batch size: 256 | lm loss: 1.923873E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.845 | TFLOPs: 40.79 | 15: iteration 83120/ 
125429 | consumed samples: 21278720 | consumed tokens: 43578818560 | elapsed time per iteration (s): 1.04 | learning rate: 6.682E-05 | global batch size: 256 | lm loss: 1.948922E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.156 | TFLOPs: 40.68 | 15: iteration 83130/ 125429 | consumed samples: 21281280 | consumed tokens: 43584061440 | elapsed time per iteration (s): 1.04 | learning rate: 6.680E-05 | global batch size: 256 | lm loss: 1.952000E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.937 | TFLOPs: 40.81 | 15: iteration 83140/ 125429 | consumed samples: 21283840 | consumed tokens: 43589304320 | elapsed time per iteration (s): 1.06 | learning rate: 6.678E-05 | global batch size: 256 | lm loss: 1.900237E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.246 | TFLOPs: 39.87 | 15: iteration 83150/ 125429 | consumed samples: 21286400 | consumed tokens: 43594547200 | elapsed time per iteration (s): 1.02 | learning rate: 6.676E-05 | global batch size: 256 | lm loss: 1.934135E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.233 | TFLOPs: 41.52 | 15: iteration 83160/ 125429 | consumed samples: 21288960 | consumed tokens: 43599790080 | elapsed time per iteration (s): 1.04 | learning rate: 6.674E-05 | global batch size: 256 | lm loss: 1.934961E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.997 | TFLOPs: 40.49 | 15: iteration 83170/ 125429 | consumed samples: 21291520 | consumed tokens: 43605032960 | elapsed time per iteration (s): 1.04 | learning rate: 6.672E-05 | global batch size: 256 | lm loss: 1.980294E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 246.308 | TFLOPs: 40.70 | 15: iteration 83180/ 125429 | consumed samples: 21294080 | consumed tokens: 43610275840 | elapsed time per iteration (s): 1.07 | learning rate: 6.670E-05 | global batch size: 256 | lm loss: 1.956913E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.859 | TFLOPs: 39.47 | 15: iteration 83190/ 125429 | consumed samples: 21296640 | consumed tokens: 43615518720 | elapsed time per iteration (s): 1.06 | learning rate: 6.668E-05 | global batch size: 256 | lm loss: 1.910682E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.100 | TFLOPs: 40.01 | 15: iteration 83200/ 125429 | consumed samples: 21299200 | consumed tokens: 43620761600 | elapsed time per iteration (s): 1.04 | learning rate: 6.666E-05 | global batch size: 256 | lm loss: 1.924803E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.285 | TFLOPs: 40.54 | 15: iteration 83210/ 125429 | consumed samples: 21301760 | consumed tokens: 43626004480 | elapsed time per iteration (s): 1.08 | learning rate: 6.664E-05 | global batch size: 256 | lm loss: 1.960996E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.673 | TFLOPs: 39.11 | 15: iteration 83220/ 125429 | consumed samples: 21304320 | consumed tokens: 43631247360 | elapsed time per iteration (s): 1.05 | learning rate: 6.662E-05 | global batch size: 256 | lm loss: 1.914227E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.245 | TFLOPs: 40.20 | 15: iteration 83230/ 125429 | consumed samples: 21306880 | consumed tokens: 43636490240 | elapsed time per iteration (s): 1.09 | learning rate: 
6.660E-05 | global batch size: 256 | lm loss: 1.924718E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.931 | TFLOPs: 38.99 | 15: iteration 83240/ 125429 | consumed samples: 21309440 | consumed tokens: 43641733120 | elapsed time per iteration (s): 1.04 | learning rate: 6.658E-05 | global batch size: 256 | lm loss: 1.957021E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.857 | TFLOPs: 40.63 | 15: iteration 83250/ 125429 | consumed samples: 21312000 | consumed tokens: 43646976000 | elapsed time per iteration (s): 1.07 | learning rate: 6.656E-05 | global batch size: 256 | lm loss: 1.958071E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.465 | TFLOPs: 39.41 | 15: iteration 83260/ 125429 | consumed samples: 21314560 | consumed tokens: 43652218880 | elapsed time per iteration (s): 1.03 | learning rate: 6.654E-05 | global batch size: 256 | lm loss: 1.930983E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.143 | TFLOPs: 41.17 | 15: iteration 83270/ 125429 | consumed samples: 21317120 | consumed tokens: 43657461760 | elapsed time per iteration (s): 1.06 | learning rate: 6.652E-05 | global batch size: 256 | lm loss: 1.940723E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.504 | TFLOPs: 40.08 | 15: iteration 83280/ 125429 | consumed samples: 21319680 | consumed tokens: 43662704640 | elapsed time per iteration (s): 1.03 | learning rate: 6.650E-05 | global batch size: 256 | lm loss: 1.954663E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.894 | TFLOPs: 40.97 | 15: iteration 83290/ 125429 | 
consumed samples: 21322240 | consumed tokens: 43667947520 | elapsed time per iteration (s): 1.05 | learning rate: 6.648E-05 | global batch size: 256 | lm loss: 1.944932E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.078 | TFLOPs: 40.34 | 15: iteration 83300/ 125429 | consumed samples: 21324800 | consumed tokens: 43673190400 | elapsed time per iteration (s): 1.03 | learning rate: 6.646E-05 | global batch size: 256 | lm loss: 1.966627E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.448 | TFLOPs: 41.06 | 15: iteration 83310/ 125429 | consumed samples: 21327360 | consumed tokens: 43678433280 | elapsed time per iteration (s): 1.05 | learning rate: 6.644E-05 | global batch size: 256 | lm loss: 1.906980E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.510 | TFLOPs: 40.24 | 15: iteration 83320/ 125429 | consumed samples: 21329920 | consumed tokens: 43683676160 | elapsed time per iteration (s): 1.05 | learning rate: 6.642E-05 | global batch size: 256 | lm loss: 1.962157E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.629 | TFLOPs: 40.26 | 15: iteration 83330/ 125429 | consumed samples: 21332480 | consumed tokens: 43688919040 | elapsed time per iteration (s): 1.06 | learning rate: 6.640E-05 | global batch size: 256 | lm loss: 1.933774E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.370 | TFLOPs: 40.05 | 15: iteration 83340/ 125429 | consumed samples: 21335040 | consumed tokens: 43694161920 | elapsed time per iteration (s): 1.05 | learning rate: 6.638E-05 | global batch size: 256 | lm loss: 1.938030E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 243.735 | TFLOPs: 40.28 | 15: iteration 83350/ 125429 | consumed samples: 21337600 | consumed tokens: 43699404800 | elapsed time per iteration (s): 1.08 | learning rate: 6.636E-05 | global batch size: 256 | lm loss: 1.952778E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.472 | TFLOPs: 39.24 | 15: iteration 83360/ 125429 | consumed samples: 21340160 | consumed tokens: 43704647680 | elapsed time per iteration (s): 1.05 | learning rate: 6.634E-05 | global batch size: 256 | lm loss: 1.925083E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.289 | TFLOPs: 40.37 | 15: iteration 83370/ 125429 | consumed samples: 21342720 | consumed tokens: 43709890560 | elapsed time per iteration (s): 1.07 | learning rate: 6.632E-05 | global batch size: 256 | lm loss: 1.945172E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.263 | TFLOPs: 39.71 | 15: iteration 83380/ 125429 | consumed samples: 21345280 | consumed tokens: 43715133440 | elapsed time per iteration (s): 1.04 | learning rate: 6.630E-05 | global batch size: 256 | lm loss: 1.952320E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.205 | TFLOPs: 40.85 | 15: iteration 83390/ 125429 | consumed samples: 21347840 | consumed tokens: 43720376320 | elapsed time per iteration (s): 1.08 | learning rate: 6.628E-05 | global batch size: 256 | lm loss: 1.934492E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.818 | TFLOPs: 39.30 | 15: iteration 83400/ 125429 | consumed samples: 21350400 | consumed tokens: 43725619200 | elapsed time per iteration (s): 1.03 | learning rate: 6.626E-05 | global batch 
size: 256 | lm loss: 1.960375E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.452 | TFLOPs: 40.89 | 15: iteration 83410/ 125429 | consumed samples: 21352960 | consumed tokens: 43730862080 | elapsed time per iteration (s): 1.07 | learning rate: 6.624E-05 | global batch size: 256 | lm loss: 1.929149E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.778 | TFLOPs: 39.46 | 15: iteration 83420/ 125429 | consumed samples: 21355520 | consumed tokens: 43736104960 | elapsed time per iteration (s): 1.05 | learning rate: 6.622E-05 | global batch size: 256 | lm loss: 1.941152E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.271 | TFLOPs: 40.20 | 15: iteration 83430/ 125429 | consumed samples: 21358080 | consumed tokens: 43741347840 | elapsed time per iteration (s): 1.11 | learning rate: 6.620E-05 | global batch size: 256 | lm loss: 1.936042E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.666 | TFLOPs: 38.12 | 15: iteration 83440/ 125429 | consumed samples: 21360640 | consumed tokens: 43746590720 | elapsed time per iteration (s): 1.04 | learning rate: 6.618E-05 | global batch size: 256 | lm loss: 1.957286E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.670 | TFLOPs: 40.60 | 15: iteration 83450/ 125429 | consumed samples: 21363200 | consumed tokens: 43751833600 | elapsed time per iteration (s): 1.03 | learning rate: 6.616E-05 | global batch size: 256 | lm loss: 1.937684E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.602 | TFLOPs: 41.25 | 15: iteration 83460/ 125429 | consumed samples: 21365760 | 
consumed tokens: 43757076480 | elapsed time per iteration (s): 1.08 | learning rate: 6.614E-05 | global batch size: 256 | lm loss: 1.924443E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.931 | TFLOPs: 39.32 | 15: iteration 83470/ 125429 | consumed samples: 21368320 | consumed tokens: 43762319360 | elapsed time per iteration (s): 1.06 | learning rate: 6.612E-05 | global batch size: 256 | lm loss: 1.928573E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.452 | TFLOPs: 39.90 | 15: iteration 83480/ 125429 | consumed samples: 21370880 | consumed tokens: 43767562240 | elapsed time per iteration (s): 1.03 | learning rate: 6.610E-05 | global batch size: 256 | lm loss: 1.936946E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.624 | TFLOPs: 40.92 | 15: iteration 83490/ 125429 | consumed samples: 21373440 | consumed tokens: 43772805120 | elapsed time per iteration (s): 1.03 | learning rate: 6.608E-05 | global batch size: 256 | lm loss: 1.931207E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.063 | TFLOPs: 41.16 | 15: iteration 83500/ 125429 | consumed samples: 21376000 | consumed tokens: 43778048000 | elapsed time per iteration (s): 1.08 | learning rate: 6.606E-05 | global batch size: 256 | lm loss: 1.921586E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.454 | TFLOPs: 39.08 | 15: iteration 83510/ 125429 | consumed samples: 21378560 | consumed tokens: 43783290880 | elapsed time per iteration (s): 1.06 | learning rate: 6.604E-05 | global batch size: 256 | lm loss: 1.930010E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 241.813 | TFLOPs: 39.96 | 15: iteration 83520/ 125429 | consumed samples: 21381120 | consumed tokens: 43788533760 | elapsed time per iteration (s): 1.02 | learning rate: 6.602E-05 | global batch size: 256 | lm loss: 1.907815E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.205 | TFLOPs: 41.35 | 15: iteration 83530/ 125429 | consumed samples: 21383680 | consumed tokens: 43793776640 | elapsed time per iteration (s): 1.09 | learning rate: 6.600E-05 | global batch size: 256 | lm loss: 1.957052E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.440 | TFLOPs: 38.74 | 15: iteration 83540/ 125429 | consumed samples: 21386240 | consumed tokens: 43799019520 | elapsed time per iteration (s): 1.09 | learning rate: 6.598E-05 | global batch size: 256 | lm loss: 1.921811E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.956 | TFLOPs: 38.83 | 15: iteration 83550/ 125429 | consumed samples: 21388800 | consumed tokens: 43804262400 | elapsed time per iteration (s): 1.04 | learning rate: 6.596E-05 | global batch size: 256 | lm loss: 1.945135E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.389 | TFLOPs: 40.55 | 15: iteration 83560/ 125429 | consumed samples: 21391360 | consumed tokens: 43809505280 | elapsed time per iteration (s): 1.03 | learning rate: 6.594E-05 | global batch size: 256 | lm loss: 1.911885E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.702 | TFLOPs: 41.27 | 15: iteration 83570/ 125429 | consumed samples: 21393920 | consumed tokens: 43814748160 | elapsed time per iteration (s): 1.07 | learning rate: 6.592E-05 | global batch size: 256 | lm loss: 
1.934455E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.007 | TFLOPs: 39.66 | 15: iteration 83580/ 125429 | consumed samples: 21396480 | consumed tokens: 43819991040 | elapsed time per iteration (s): 1.03 | learning rate: 6.591E-05 | global batch size: 256 | lm loss: 1.969519E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.535 | TFLOPs: 40.91 | 15: iteration 83590/ 125429 | consumed samples: 21399040 | consumed tokens: 43825233920 | elapsed time per iteration (s): 1.03 | learning rate: 6.589E-05 | global batch size: 256 | lm loss: 1.927695E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.657 | TFLOPs: 41.26 | 15: iteration 83600/ 125429 | consumed samples: 21401600 | consumed tokens: 43830476800 | elapsed time per iteration (s): 1.04 | learning rate: 6.587E-05 | global batch size: 256 | lm loss: 1.964701E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.003 | TFLOPs: 40.49 | 15: iteration 83610/ 125429 | consumed samples: 21404160 | consumed tokens: 43835719680 | elapsed time per iteration (s): 1.04 | learning rate: 6.585E-05 | global batch size: 256 | lm loss: 1.948783E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.556 | TFLOPs: 40.75 | 15: iteration 83620/ 125429 | consumed samples: 21406720 | consumed tokens: 43840962560 | elapsed time per iteration (s): 1.06 | learning rate: 6.583E-05 | global batch size: 256 | lm loss: 1.921615E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.533 | TFLOPs: 39.75 | 15: iteration 83630/ 125429 | consumed samples: 21409280 | consumed tokens: 
43846205440 | elapsed time per iteration (s): 1.03 | learning rate: 6.581E-05 | global batch size: 256 | lm loss: 1.938297E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.420 | TFLOPs: 41.05 | 15: iteration 83640/ 125429 | consumed samples: 21411840 | consumed tokens: 43851448320 | elapsed time per iteration (s): 1.06 | learning rate: 6.579E-05 | global batch size: 256 | lm loss: 1.924550E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.381 | TFLOPs: 39.72 | 15: iteration 83650/ 125429 | consumed samples: 21414400 | consumed tokens: 43856691200 | elapsed time per iteration (s): 1.03 | learning rate: 6.577E-05 | global batch size: 256 | lm loss: 1.921708E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.852 | TFLOPs: 40.96 | 15: iteration 83660/ 125429 | consumed samples: 21416960 | consumed tokens: 43861934080 | elapsed time per iteration (s): 1.03 | learning rate: 6.575E-05 | global batch size: 256 | lm loss: 1.953217E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.573 | TFLOPs: 41.24 | 15: iteration 83670/ 125429 | consumed samples: 21419520 | consumed tokens: 43867176960 | elapsed time per iteration (s): 1.03 | learning rate: 6.573E-05 | global batch size: 256 | lm loss: 1.921698E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.388 | TFLOPs: 41.05 | 15: iteration 83680/ 125429 | consumed samples: 21422080 | consumed tokens: 43872419840 | elapsed time per iteration (s): 1.04 | learning rate: 6.571E-05 | global batch size: 256 | lm loss: 1.937963E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 246.254 | TFLOPs: 40.70 | 15: iteration 83690/ 125429 | consumed samples: 21424640 | consumed tokens: 43877662720 | elapsed time per iteration (s): 1.12 | learning rate: 6.569E-05 | global batch size: 256 | lm loss: 1.916400E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.105 | TFLOPs: 37.70 | 15: iteration 83700/ 125429 | consumed samples: 21427200 | consumed tokens: 43882905600 | elapsed time per iteration (s): 1.03 | learning rate: 6.567E-05 | global batch size: 256 | lm loss: 1.936860E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.081 | TFLOPs: 41.00 | 15: iteration 83710/ 125429 | consumed samples: 21429760 | consumed tokens: 43888148480 | elapsed time per iteration (s): 1.05 | learning rate: 6.565E-05 | global batch size: 256 | lm loss: 1.928648E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.052 | TFLOPs: 40.17 | 15: iteration 83720/ 125429 | consumed samples: 21432320 | consumed tokens: 43893391360 | elapsed time per iteration (s): 1.06 | learning rate: 6.563E-05 | global batch size: 256 | lm loss: 1.936385E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.177 | TFLOPs: 40.02 | 15: iteration 83730/ 125429 | consumed samples: 21434880 | consumed tokens: 43898634240 | elapsed time per iteration (s): 1.02 | learning rate: 6.561E-05 | global batch size: 256 | lm loss: 1.926829E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.451 | TFLOPs: 41.39 | 15: iteration 83740/ 125429 | consumed samples: 21437440 | consumed tokens: 43903877120 | elapsed time per iteration (s): 1.17 | learning rate: 6.559E-05 | global batch size: 256 | lm loss: 1.925533E+00 | grad 
norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 219.171 | TFLOPs: 36.22 | 15: iteration 83750/ 125429 | consumed samples: 21440000 | consumed tokens: 43909120000 | elapsed time per iteration (s): 1.04 | learning rate: 6.557E-05 | global batch size: 256 | lm loss: 1.937493E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.060 | TFLOPs: 40.66 | 15: iteration 83760/ 125429 | consumed samples: 21442560 | consumed tokens: 43914362880 | elapsed time per iteration (s): 1.04 | learning rate: 6.555E-05 | global batch size: 256 | lm loss: 1.939763E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.949 | TFLOPs: 40.81 | 15: iteration 83770/ 125429 | consumed samples: 21445120 | consumed tokens: 43919605760 | elapsed time per iteration (s): 1.04 | learning rate: 6.553E-05 | global batch size: 256 | lm loss: 1.950853E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.330 | TFLOPs: 40.71 | 15: iteration 83780/ 125429 | consumed samples: 21447680 | consumed tokens: 43924848640 | elapsed time per iteration (s): 1.04 | learning rate: 6.551E-05 | global batch size: 256 | lm loss: 1.945237E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.182 | TFLOPs: 40.52 | 15: iteration 83790/ 125429 | consumed samples: 21450240 | consumed tokens: 43930091520 | elapsed time per iteration (s): 1.05 | learning rate: 6.549E-05 | global batch size: 256 | lm loss: 1.893240E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.691 | TFLOPs: 40.27 | 15: iteration 83800/ 125429 | consumed samples: 21452800 | consumed tokens: 43935334400 | elapsed time 
per iteration (s): 1.04 | learning rate: 6.547E-05 | global batch size: 256 | lm loss: 1.923027E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.451 | TFLOPs: 40.56 | 15: iteration 83810/ 125429 | consumed samples: 21455360 | consumed tokens: 43940577280 | elapsed time per iteration (s): 1.02 | learning rate: 6.545E-05 | global batch size: 256 | lm loss: 1.948798E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.234 | TFLOPs: 41.52 | 15: iteration 83820/ 125429 | consumed samples: 21457920 | consumed tokens: 43945820160 | elapsed time per iteration (s): 1.03 | learning rate: 6.543E-05 | global batch size: 256 | lm loss: 1.946233E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.816 | TFLOPs: 41.12 | 15: iteration 83830/ 125429 | consumed samples: 21460480 | consumed tokens: 43951063040 | elapsed time per iteration (s): 1.05 | learning rate: 6.541E-05 | global batch size: 256 | lm loss: 1.925302E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.526 | TFLOPs: 40.41 | 15: iteration 83840/ 125429 | consumed samples: 21463040 | consumed tokens: 43956305920 | elapsed time per iteration (s): 1.06 | learning rate: 6.539E-05 | global batch size: 256 | lm loss: 1.959223E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.599 | TFLOPs: 39.93 | 15: iteration 83850/ 125429 | consumed samples: 21465600 | consumed tokens: 43961548800 | elapsed time per iteration (s): 1.05 | learning rate: 6.537E-05 | global batch size: 256 | lm loss: 1.919865E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.806 | TFLOPs: 
40.13 | 15: iteration 83860/ 125429 | consumed samples: 21468160 | consumed tokens: 43966791680 | elapsed time per iteration (s): 1.03 | learning rate: 6.535E-05 | global batch size: 256 | lm loss: 1.931733E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.348 | TFLOPs: 40.88 | 15: iteration 83870/ 125429 | consumed samples: 21470720 | consumed tokens: 43972034560 | elapsed time per iteration (s): 1.03 | learning rate: 6.533E-05 | global batch size: 256 | lm loss: 1.937599E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.836 | TFLOPs: 41.12 | 15: iteration 83880/ 125429 | consumed samples: 21473280 | consumed tokens: 43977277440 | elapsed time per iteration (s): 1.04 | learning rate: 6.531E-05 | global batch size: 256 | lm loss: 1.929204E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.204 | TFLOPs: 40.52 | 15: iteration 83890/ 125429 | consumed samples: 21475840 | consumed tokens: 43982520320 | elapsed time per iteration (s): 1.03 | learning rate: 6.529E-05 | global batch size: 256 | lm loss: 1.905452E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.027 | TFLOPs: 41.15 | 15: iteration 83900/ 125429 | consumed samples: 21478400 | consumed tokens: 43987763200 | elapsed time per iteration (s): 1.05 | learning rate: 6.527E-05 | global batch size: 256 | lm loss: 1.926072E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.723 | TFLOPs: 40.28 | 15: iteration 83910/ 125429 | consumed samples: 21480960 | consumed tokens: 43993006080 | elapsed time per iteration (s): 1.03 | learning rate: 6.525E-05 | global batch size: 256 | lm loss: 1.951238E+00 | grad norm: 0.145 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.642 | TFLOPs: 41.09 | 15: iteration 83920/ 125429 | consumed samples: 21483520 | consumed tokens: 43998248960 | elapsed time per iteration (s): 1.08 | learning rate: 6.523E-05 | global batch size: 256 | lm loss: 1.961287E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.217 | TFLOPs: 39.04 | 15: iteration 83930/ 125429 | consumed samples: 21486080 | consumed tokens: 44003491840 | elapsed time per iteration (s): 1.06 | learning rate: 6.521E-05 | global batch size: 256 | lm loss: 1.915472E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.187 | TFLOPs: 39.86 | 15: iteration 83940/ 125429 | consumed samples: 21488640 | consumed tokens: 44008734720 | elapsed time per iteration (s): 1.04 | learning rate: 6.519E-05 | global batch size: 256 | lm loss: 1.952713E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.531 | TFLOPs: 40.74 | 15: iteration 83950/ 125429 | consumed samples: 21491200 | consumed tokens: 44013977600 | elapsed time per iteration (s): 1.02 | learning rate: 6.517E-05 | global batch size: 256 | lm loss: 1.946375E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.199 | TFLOPs: 41.35 | 15: iteration 83960/ 125429 | consumed samples: 21493760 | consumed tokens: 44019220480 | elapsed time per iteration (s): 1.02 | learning rate: 6.515E-05 | global batch size: 256 | lm loss: 1.928077E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.381 | TFLOPs: 41.38 | 15: iteration 83970/ 125429 | consumed samples: 21496320 | consumed tokens: 44024463360 | elapsed time per iteration (s): 1.04 | 
learning rate: 6.513E-05 | global batch size: 256 | lm loss: 1.928455E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.225 | TFLOPs: 40.86 | 15: iteration 83980/ 125429 | consumed samples: 21498880 | consumed tokens: 44029706240 | elapsed time per iteration (s): 1.04 | learning rate: 6.511E-05 | global batch size: 256 | lm loss: 1.976411E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.869 | TFLOPs: 40.80 | 15: iteration 83990/ 125429 | consumed samples: 21501440 | consumed tokens: 44034949120 | elapsed time per iteration (s): 1.07 | learning rate: 6.509E-05 | global batch size: 256 | lm loss: 1.939074E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.220 | TFLOPs: 39.53 | 0: [2022-11-26 20:55:53,092] [INFO] [logging.py:68:log_dist] [Rank 0] step=84000, skipped=0, lr=[6.507390564631055e-05, 6.507390564631055e-05, 6.507390564631055e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 84000/ 125429 | consumed samples: 21504000 | consumed tokens: 44040192000 | elapsed time per iteration (s): 1.04 | learning rate: 6.507E-05 | global batch size: 256 | lm loss: 1.924520E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.031 | TFLOPs: 40.66 | 0: steps: 84000 loss: 1.9214 iter time (s): 1.051 samples/sec: 243.619 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 84000 | lm loss value: 2.018230E+00 | lm loss PPL: 7.524992E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 84000 to checkpoints_1b5 0: [2022-11-26 20:55:53,604] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step84000 is begin to save! 0: [2022-11-26 20:55:53,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_01-model_00-model_states.pt... 0: [2022-11-26 20:55:53,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_01-model_00-model_states.pt. 0: [2022-11-26 20:55:53,879] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_03-model_00-model_states.pt... 0: [2022-11-26 20:55:53,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_03-model_00-model_states.pt. 0: [2022-11-26 20:55:53,990] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_04-model_00-model_states.pt... 0: [2022-11-26 20:55:54,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_04-model_00-model_states.pt. 0: [2022-11-26 20:55:54,110] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_05-model_00-model_states.pt... 0: [2022-11-26 20:55:54,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_05-model_00-model_states.pt. 0: [2022-11-26 20:55:54,221] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_06-model_00-model_states.pt... 0: [2022-11-26 20:55:54,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_06-model_00-model_states.pt. 0: [2022-11-26 20:55:54,336] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_07-model_00-model_states.pt... 0: [2022-11-26 20:55:54,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 20:55:54,450] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_08-model_00-model_states.pt... 0: [2022-11-26 20:55:54,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_08-model_00-model_states.pt. 0: [2022-11-26 20:55:54,567] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_09-model_00-model_states.pt... 0: [2022-11-26 20:55:54,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_09-model_00-model_states.pt. 0: [2022-11-26 20:55:54,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_10-model_00-model_states.pt... 0: [2022-11-26 20:55:54,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_10-model_00-model_states.pt. 0: [2022-11-26 20:55:54,803] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_11-model_00-model_states.pt... 0: [2022-11-26 20:55:54,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_11-model_00-model_states.pt. 0: [2022-11-26 20:55:54,921] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_12-model_00-model_states.pt... 0: [2022-11-26 20:55:55,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_12-model_00-model_states.pt. 0: [2022-11-26 20:55:55,040] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_13-model_00-model_states.pt... 0: [2022-11-26 20:55:55,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 20:55:55,159] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_14-model_00-model_states.pt... 0: [2022-11-26 20:55:55,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_14-model_00-model_states.pt. 0: [2022-11-26 20:55:55,281] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_15-model_00-model_states.pt... 0: [2022-11-26 20:55:55,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_15-model_00-model_states.pt. 0: [2022-11-26 20:55:55,399] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_16-model_00-model_states.pt... 0: [2022-11-26 20:55:55,518] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_16-model_00-model_states.pt. 0: [2022-11-26 20:55:55,519] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_17-model_00-model_states.pt... 0: [2022-11-26 20:55:55,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_17-model_00-model_states.pt. 0: [2022-11-26 20:55:55,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_18-model_00-model_states.pt... 0: [2022-11-26 20:55:55,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_18-model_00-model_states.pt. 0: [2022-11-26 20:55:55,739] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_19-model_00-model_states.pt... 0: [2022-11-26 20:55:55,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 20:55:55,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_20-model_00-model_states.pt... 0: [2022-11-26 20:55:55,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_20-model_00-model_states.pt. 0: [2022-11-26 20:55:55,981] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_21-model_00-model_states.pt... 0: [2022-11-26 20:55:56,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_21-model_00-model_states.pt. 0: [2022-11-26 20:55:56,099] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_22-model_00-model_states.pt... 0: [2022-11-26 20:55:56,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_22-model_00-model_states.pt. 0: [2022-11-26 20:55:56,211] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_23-model_00-model_states.pt... 0: [2022-11-26 20:55:56,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_23-model_00-model_states.pt. 0: [2022-11-26 20:55:56,328] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_24-model_00-model_states.pt... 0: [2022-11-26 20:55:56,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_24-model_00-model_states.pt. 0: [2022-11-26 20:55:56,443] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_25-model_00-model_states.pt... 0: [2022-11-26 20:55:56,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 20:55:56,549] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_26-model_00-model_states.pt... 0: [2022-11-26 20:55:56,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_26-model_00-model_states.pt. 0: [2022-11-26 20:55:56,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_27-model_00-model_states.pt... 0: [2022-11-26 20:55:56,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_27-model_00-model_states.pt. 0: [2022-11-26 20:55:56,768] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_28-model_00-model_states.pt... 0: [2022-11-26 20:55:56,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_28-model_00-model_states.pt. 0: [2022-11-26 20:55:56,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_29-model_00-model_states.pt... 0: [2022-11-26 20:55:56,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_29-model_00-model_states.pt. 0: [2022-11-26 20:55:56,997] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_30-model_00-model_states.pt... 0: [2022-11-26 20:55:57,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_30-model_00-model_states.pt. 0: [2022-11-26 20:55:57,113] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/layer_32-model_00-model_states.pt... 0: [2022-11-26 20:55:57,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 20:55:57,119] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step84000/mp_rank_00_model_states.pt 0: [2022-11-26 20:55:57,119] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/mp_rank_00_model_states.pt... 0: [2022-11-26 20:55:57,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/mp_rank_00_model_states.pt. 0: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
13: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
0: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
5: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
9: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
12: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 
1: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 7: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 2: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
2: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 0: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 5: [2022-11-26 20:55:57,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step84000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 20:55:57,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:55:57,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:55:57,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 20:55:57,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 0: [2022-11-26 20:55:57,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:55:57,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 20:55:57,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 0: [2022-11-26 20:55:57,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:55:57,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 20:55:57,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 1: [2022-11-26 20:55:57,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:55:57,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 20:55:57,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 20:55:57,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 20:55:57,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 1: [2022-11-26 20:55:57,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 3: [2022-11-26 20:55:57,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:55:57,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 20:55:57,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 3: [2022-11-26 20:55:57,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:55:57,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 20:55:57,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 13: [2022-11-26 20:55:57,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 20:55:57,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 20:55:57,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 2: [2022-11-26 20:55:57,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:55:57,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 20:55:57,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 8: [2022-11-26 20:55:57,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:55:57,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:55:57,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 20:55:57,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 20:55:57,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 8: [2022-11-26 20:55:57,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 8: [2022-11-26 20:55:57,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 20:55:57,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 20:55:57,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 8: [2022-11-26 20:55:57,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:55:57,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 20:55:57,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 15: [2022-11-26 20:55:57,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 20:55:57,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 15: [2022-11-26 20:55:57,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:55:57,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 20:55:57,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 15: [2022-11-26 20:55:57,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 20:55:57,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 20:55:57,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 2: [2022-11-26 20:55:57,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:55:57,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 20:55:57,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 3: [2022-11-26 20:55:57,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:55:57,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:55:57,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:55:57,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
3: [2022-11-26 20:55:57,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 5: [2022-11-26 20:55:57,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 20:55:57,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 20:55:57,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 3: [2022-11-26 20:55:57,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 5: [2022-11-26 20:55:57,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 5: [2022-11-26 20:55:57,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 5: [2022-11-26 20:55:57,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 10: [2022-11-26 20:55:57,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:55:57,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 20:55:57,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 15: [2022-11-26 20:55:57,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
1: [2022-11-26 20:55:57,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:55:57,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 20:55:57,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 7: [2022-11-26 20:55:57,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:55:57,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 20:55:57,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 10: [2022-11-26 20:55:57,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:55:57,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 20:55:57,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 10: [2022-11-26 20:55:57,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:55:57,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:55:57,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 20:55:57,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 20:55:57,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 10: [2022-11-26 20:55:57,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 2: [2022-11-26 20:55:57,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:55:57,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 20:55:57,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 12: [2022-11-26 20:55:57,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:55:57,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:55:57,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 20:55:57,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 20:55:57,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
12: [2022-11-26 20:55:57,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 7: [2022-11-26 20:55:57,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:55:57,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:55:57,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 20:55:57,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 12: [2022-11-26 20:55:57,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:55:57,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 7: [2022-11-26 20:55:57,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 12: [2022-11-26 20:55:57,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 20:55:57,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 12: [2022-11-26 20:55:57,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:55:57,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
12: [2022-11-26 20:55:57,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 2: [2022-11-26 20:55:57,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 12: [2022-11-26 20:55:57,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 2: [2022-11-26 20:55:57,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 3: [2022-11-26 20:55:57,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:55:57,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 20:55:57,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 3: [2022-11-26 20:55:57,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:55:57,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:55:57,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 8: [2022-11-26 20:55:57,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-26 20:55:57,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 13: [2022-11-26 20:55:57,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 3: [2022-11-26 20:55:57,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 8: [2022-11-26 20:55:57,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 13: [2022-11-26 20:55:57,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 13: [2022-11-26 20:55:57,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:55:57,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 20:55:57,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 13: [2022-11-26 20:55:57,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:55:57,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 20:55:57,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 3: [2022-11-26 20:55:57,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-26 20:55:57,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 20:55:57,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 5: [2022-11-26 20:55:57,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:55:57,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 20:55:57,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 5: [2022-11-26 20:55:57,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:55:57,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 20:55:57,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 2: [2022-11-26 20:55:57,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:55:57,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 20:55:57,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 1: [2022-11-26 20:55:57,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 20:55:57,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 20:55:57,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 1: [2022-11-26 20:55:57,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:55:57,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 20:55:57,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 8: [2022-11-26 20:55:57,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:55:57,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 20:55:57,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 5: [2022-11-26 20:55:57,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:55:57,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 20:55:57,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 12: [2022-11-26 20:55:57,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 20:55:57,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 20:55:57,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 12: [2022-11-26 20:55:57,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:55:57,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 20:55:57,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 2: [2022-11-26 20:55:57,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:55:57,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 20:55:57,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 8: [2022-11-26 20:55:57,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:55:57,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 20:55:57,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 10: [2022-11-26 20:55:57,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-26 20:55:57,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:55:57,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 20:55:57,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 20:55:57,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 10: [2022-11-26 20:55:57,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 10: [2022-11-26 20:55:57,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:55:57,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 20:55:57,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 14: [2022-11-26 20:55:57,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:55:57,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:55:57,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:55:57,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 20:55:57,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:55:57,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:55:57,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:55:57,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 20:55:57,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 20:55:57,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 20:55:57,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 20:55:57,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 20:55:57,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 20:55:57,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 20:55:57,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
14: [2022-11-26 20:55:57,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 14: [2022-11-26 20:55:57,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 14: [2022-11-26 20:55:57,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 14: [2022-11-26 20:55:57,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 14: [2022-11-26 20:55:57,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 14: [2022-11-26 20:55:57,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 3: [2022-11-26 20:55:57,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:55:57,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 20:55:57,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 8: [2022-11-26 20:55:57,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 20:55:57,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 20:55:57,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 12: [2022-11-26 20:55:57,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 20:55:57,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 20:55:57,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 5: [2022-11-26 20:55:57,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:55:57,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 20:55:57,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 7: [2022-11-26 20:55:57,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:55:57,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 20:55:57,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 0: [2022-11-26 20:55:57,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:55:57,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 20:55:57,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 0: [2022-11-26 20:55:57,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 20:55:57,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:55:57,356] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 20:55:57,356] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 2: [2022-11-26 20:55:57,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:55:57,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 20:55:57,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 10: [2022-11-26 20:55:57,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 20:55:57,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 20:55:57,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 15: [2022-11-26 20:55:57,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 20:55:57,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 15: [2022-11-26 20:55:57,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 20:55:57,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 20:55:57,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 15: [2022-11-26 20:55:57,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:55:57,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:55:57,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 20:55:57,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 20:55:57,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 15: [2022-11-26 20:55:57,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 15: [2022-11-26 20:55:57,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:55:57,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 12: [2022-11-26 20:55:57,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-26 20:55:57,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 20:55:57,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 13: [2022-11-26 20:55:57,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:55:57,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 20:55:57,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:55:57,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 13: [2022-11-26 20:55:57,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 20:55:57,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 13: [2022-11-26 20:55:57,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 20:55:57,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 20:55:57,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 13: [2022-11-26 20:55:57,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 20:55:57,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 20:55:57,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 7: [2022-11-26 20:55:57,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:55:57,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 20:55:57,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 7: [2022-11-26 20:55:57,370] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 20:55:57,370] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 20:55:57,370] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 2: [2022-11-26 20:55:57,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 20:55:57,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 20:55:57,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 7: [2022-11-26 20:55:57,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 20:55:57,373] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 20:55:57,373] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 14: [2022-11-26 20:55:57,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 20:55:57,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 20:55:57,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 1: [2022-11-26 20:55:57,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 20:55:57,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 1: [2022-11-26 20:55:57,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 20:55:57,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 20:55:57,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 1: [2022-11-26 20:55:57,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 20:55:57,378] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 20:55:57,378] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 5: [2022-11-26 20:55:57,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 20:55:57,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 20:55:57,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 3: [2022-11-26 20:55:57,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 20:55:57,386] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 20:55:57,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 11: [2022-11-26 20:55:57,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:55:57,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 20:55:57,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 11: [2022-11-26 20:55:57,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 20:55:57,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 20:55:57,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 11: [2022-11-26 20:55:57,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:55:57,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 20:55:57,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 11: [2022-11-26 20:55:57,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 15: [2022-11-26 20:55:57,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 20:55:57,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 11: [2022-11-26 20:55:57,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 20:55:57,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 11: [2022-11-26 20:55:57,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 20:55:57,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 20:55:57,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 11: [2022-11-26 20:55:57,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:55:57,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:55:57,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 20:55:57,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 20:55:57,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 20:55:57,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 20:55:57,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 11: [2022-11-26 20:55:57,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 11: [2022-11-26 20:55:57,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 0: [2022-11-26 20:55:57,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 20:55:57,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:55:57,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 20:55:57,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 20:55:57,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 20:55:57,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 20:55:57,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 0: [2022-11-26 20:55:57,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 0: [2022-11-26 20:55:57,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 0: [2022-11-26 20:55:57,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 20:55:57,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 6: [2022-11-26 20:55:57,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:55:57,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 20:55:57,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:55:57,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:55:57,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:55:57,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 20:55:57,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 20:55:57,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 20:55:57,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 6: [2022-11-26 20:55:57,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 6: [2022-11-26 20:55:57,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 20:55:57,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 20:55:57,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 6: [2022-11-26 20:55:57,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
6: [2022-11-26 20:55:57,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 6: [2022-11-26 20:55:57,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:55:57,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 20:55:57,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 6: [2022-11-26 20:55:57,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:55:57,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 20:55:57,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 6: [2022-11-26 20:55:57,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 20:55:57,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 20:55:57,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 4: [2022-11-26 20:55:57,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:55:57,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 20:55:57,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:55:57,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:55:57,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:55:57,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:55:57,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 20:55:57,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 20:55:57,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 20:55:57,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 20:55:57,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 20:55:57,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 20:55:57,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 20:55:57,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 20:55:57,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 4: [2022-11-26 20:55:57,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 4: [2022-11-26 20:55:57,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 20:55:57,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 4: [2022-11-26 20:55:57,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 4: [2022-11-26 20:55:57,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 4: [2022-11-26 20:55:57,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 4: [2022-11-26 20:55:57,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
4: [2022-11-26 20:55:57,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 20:55:57,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 9: [2022-11-26 20:55:57,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:55:57,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:55:57,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:55:57,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:55:57,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:55:57,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:55:57,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 20:55:57,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 20:55:57,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 20:55:57,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 20:55:57,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 20:55:57,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 20:55:57,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 20:55:57,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 20:55:57,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 20:55:57,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 9: [2022-11-26 20:55:57,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step84000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 20:55:57,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 9: [2022-11-26 20:55:57,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 9: [2022-11-26 20:55:57,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 
9: [2022-11-26 20:55:57,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 9: [2022-11-26 20:55:57,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 9: [2022-11-26 20:55:57,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 9: [2022-11-26 20:55:57,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step84000 is ready now! 0: successfully saved checkpoint at iteration 84000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3938.74 15: iteration 84010/ 125429 | consumed samples: 21506560 | consumed tokens: 44045434880 | elapsed time per iteration (s): 1.48 | learning rate: 6.505E-05 | global batch size: 256 | lm loss: 1.933163E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 172.953 | TFLOPs: 28.58 | 15: iteration 84020/ 125429 | consumed samples: 21509120 | consumed tokens: 44050677760 | elapsed time per iteration (s): 1.03 | learning rate: 6.503E-05 | global batch size: 256 | lm loss: 1.941687E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.439 | TFLOPs: 41.22 | 15: iteration 84030/ 125429 | consumed samples: 21511680 | consumed tokens: 44055920640 | elapsed time per iteration (s): 1.05 | learning rate: 6.501E-05 | global batch size: 256 | lm loss: 1.954527E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.743 | TFLOPs: 40.45 | 15: iteration 84040/ 125429 | consumed samples: 21514240 | consumed tokens: 44061163520 | elapsed time per iteration (s): 1.07 | learning rate: 6.500E-05 | global batch size: 256 | lm loss: 1.954156E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.258 | 
TFLOPs: 39.37 | 15: iteration 84050/ 125429 | consumed samples: 21516800 | consumed tokens: 44066406400 | elapsed time per iteration (s): 1.05 | learning rate: 6.498E-05 | global batch size: 256 | lm loss: 1.919766E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.961 | TFLOPs: 40.15 | 15: iteration 84060/ 125429 | consumed samples: 21519360 | consumed tokens: 44071649280 | elapsed time per iteration (s): 1.05 | learning rate: 6.496E-05 | global batch size: 256 | lm loss: 1.955192E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.648 | TFLOPs: 40.26 | 15: iteration 84070/ 125429 | consumed samples: 21521920 | consumed tokens: 44076892160 | elapsed time per iteration (s): 1.04 | learning rate: 6.494E-05 | global batch size: 256 | lm loss: 1.945522E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.433 | TFLOPs: 40.72 | 15: iteration 84080/ 125429 | consumed samples: 21524480 | consumed tokens: 44082135040 | elapsed time per iteration (s): 1.18 | learning rate: 6.492E-05 | global batch size: 256 | lm loss: 1.954698E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.385 | TFLOPs: 35.92 | 15: iteration 84090/ 125429 | consumed samples: 21527040 | consumed tokens: 44087377920 | elapsed time per iteration (s): 1.04 | learning rate: 6.490E-05 | global batch size: 256 | lm loss: 1.927029E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.888 | TFLOPs: 40.80 | 15: iteration 84100/ 125429 | consumed samples: 21529600 | consumed tokens: 44092620800 | elapsed time per iteration (s): 1.02 | learning rate: 6.488E-05 | global batch size: 256 | lm loss: 1.926993E+00 | grad norm: 0.137 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.428 | TFLOPs: 41.55 | 15: iteration 84110/ 125429 | consumed samples: 21532160 | consumed tokens: 44097863680 | elapsed time per iteration (s): 1.05 | learning rate: 6.486E-05 | global batch size: 256 | lm loss: 1.938177E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.471 | TFLOPs: 40.40 | 15: iteration 84120/ 125429 | consumed samples: 21534720 | consumed tokens: 44103106560 | elapsed time per iteration (s): 1.02 | learning rate: 6.484E-05 | global batch size: 256 | lm loss: 1.931023E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.214 | TFLOPs: 41.52 | 15: iteration 84130/ 125429 | consumed samples: 21537280 | consumed tokens: 44108349440 | elapsed time per iteration (s): 1.03 | learning rate: 6.482E-05 | global batch size: 256 | lm loss: 1.935061E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.375 | TFLOPs: 41.05 | 15: iteration 84140/ 125429 | consumed samples: 21539840 | consumed tokens: 44113592320 | elapsed time per iteration (s): 1.03 | learning rate: 6.480E-05 | global batch size: 256 | lm loss: 1.955636E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.506 | TFLOPs: 41.23 | 15: iteration 84150/ 125429 | consumed samples: 21542400 | consumed tokens: 44118835200 | elapsed time per iteration (s): 1.03 | learning rate: 6.478E-05 | global batch size: 256 | lm loss: 1.954059E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.669 | TFLOPs: 40.93 | 15: iteration 84160/ 125429 | consumed samples: 21544960 | consumed tokens: 44124078080 | elapsed time per iteration (s): 
1.04 | learning rate: 6.476E-05 | global batch size: 256 | lm loss: 1.922000E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.757 | TFLOPs: 40.78 | 15: iteration 84170/ 125429 | consumed samples: 21547520 | consumed tokens: 44129320960 | elapsed time per iteration (s): 1.05 | learning rate: 6.474E-05 | global batch size: 256 | lm loss: 1.938079E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.689 | TFLOPs: 40.27 | 15: iteration 84180/ 125429 | consumed samples: 21550080 | consumed tokens: 44134563840 | elapsed time per iteration (s): 1.10 | learning rate: 6.472E-05 | global batch size: 256 | lm loss: 1.949931E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.728 | TFLOPs: 38.46 | 15: iteration 84190/ 125429 | consumed samples: 21552640 | consumed tokens: 44139806720 | elapsed time per iteration (s): 1.28 | learning rate: 6.470E-05 | global batch size: 256 | lm loss: 1.927408E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 199.715 | TFLOPs: 33.00 | 15: iteration 84200/ 125429 | consumed samples: 21555200 | consumed tokens: 44145049600 | elapsed time per iteration (s): 1.05 | learning rate: 6.468E-05 | global batch size: 256 | lm loss: 1.959824E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.475 | TFLOPs: 40.24 | 15: iteration 84210/ 125429 | consumed samples: 21557760 | consumed tokens: 44150292480 | elapsed time per iteration (s): 1.04 | learning rate: 6.466E-05 | global batch size: 256 | lm loss: 1.936417E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.144 | TFLOPs: 40.68 | 15: iteration 
84220/ 125429 | consumed samples: 21560320 | consumed tokens: 44155535360 | elapsed time per iteration (s): 1.06 | learning rate: 6.464E-05 | global batch size: 256 | lm loss: 1.923784E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.279 | TFLOPs: 39.87 | 15: iteration 84230/ 125429 | consumed samples: 21562880 | consumed tokens: 44160778240 | elapsed time per iteration (s): 1.09 | learning rate: 6.462E-05 | global batch size: 256 | lm loss: 1.965051E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.247 | TFLOPs: 38.88 | 15: iteration 84240/ 125429 | consumed samples: 21565440 | consumed tokens: 44166021120 | elapsed time per iteration (s): 1.04 | learning rate: 6.460E-05 | global batch size: 256 | lm loss: 1.941430E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.257 | TFLOPs: 40.70 | 15: iteration 84250/ 125429 | consumed samples: 21568000 | consumed tokens: 44171264000 | elapsed time per iteration (s): 1.04 | learning rate: 6.458E-05 | global batch size: 256 | lm loss: 1.922274E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.026 | TFLOPs: 40.66 | 15: iteration 84260/ 125429 | consumed samples: 21570560 | consumed tokens: 44176506880 | elapsed time per iteration (s): 1.05 | learning rate: 6.456E-05 | global batch size: 256 | lm loss: 1.946977E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.708 | TFLOPs: 40.27 | 15: iteration 84270/ 125429 | consumed samples: 21573120 | consumed tokens: 44181749760 | elapsed time per iteration (s): 1.06 | learning rate: 6.454E-05 | global batch size: 256 | lm loss: 1.943594E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 242.443 | TFLOPs: 40.07 | 15: iteration 84280/ 125429 | consumed samples: 21575680 | consumed tokens: 44186992640 | elapsed time per iteration (s): 1.05 | learning rate: 6.452E-05 | global batch size: 256 | lm loss: 1.951491E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.959 | TFLOPs: 40.15 | 15: iteration 84290/ 125429 | consumed samples: 21578240 | consumed tokens: 44192235520 | elapsed time per iteration (s): 1.02 | learning rate: 6.450E-05 | global batch size: 256 | lm loss: 1.959437E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.868 | TFLOPs: 41.29 | 15: iteration 84300/ 125429 | consumed samples: 21580800 | consumed tokens: 44197478400 | elapsed time per iteration (s): 1.02 | learning rate: 6.448E-05 | global batch size: 256 | lm loss: 1.958274E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.729 | TFLOPs: 41.60 | 15: iteration 84310/ 125429 | consumed samples: 21583360 | consumed tokens: 44202721280 | elapsed time per iteration (s): 1.03 | learning rate: 6.446E-05 | global batch size: 256 | lm loss: 1.951869E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.884 | TFLOPs: 41.13 | 15: iteration 84320/ 125429 | consumed samples: 21585920 | consumed tokens: 44207964160 | elapsed time per iteration (s): 1.05 | learning rate: 6.444E-05 | global batch size: 256 | lm loss: 1.947361E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.877 | TFLOPs: 40.47 | 15: iteration 84330/ 125429 | consumed samples: 21588480 | consumed tokens: 44213207040 | elapsed time per iteration (s): 1.15 | learning rate: 
6.442E-05 | global batch size: 256 | lm loss: 1.923754E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.740 | TFLOPs: 36.64 | 15: iteration 84340/ 125429 | consumed samples: 21591040 | consumed tokens: 44218449920 | elapsed time per iteration (s): 1.05 | learning rate: 6.440E-05 | global batch size: 256 | lm loss: 1.944576E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.976 | TFLOPs: 40.48 | 15: iteration 84350/ 125429 | consumed samples: 21593600 | consumed tokens: 44223692800 | elapsed time per iteration (s): 1.02 | learning rate: 6.439E-05 | global batch size: 256 | lm loss: 1.919546E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.895 | TFLOPs: 41.30 | 15: iteration 84360/ 125429 | consumed samples: 21596160 | consumed tokens: 44228935680 | elapsed time per iteration (s): 1.03 | learning rate: 6.437E-05 | global batch size: 256 | lm loss: 1.931887E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.108 | TFLOPs: 41.17 | 15: iteration 84370/ 125429 | consumed samples: 21598720 | consumed tokens: 44234178560 | elapsed time per iteration (s): 1.03 | learning rate: 6.435E-05 | global batch size: 256 | lm loss: 1.959106E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.275 | TFLOPs: 41.03 | 15: iteration 84380/ 125429 | consumed samples: 21601280 | consumed tokens: 44239421440 | elapsed time per iteration (s): 1.03 | learning rate: 6.433E-05 | global batch size: 256 | lm loss: 1.965009E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.989 | TFLOPs: 40.98 | 15: iteration 84390/ 125429 | 
consumed samples: 21603840 | consumed tokens: 44244664320 | elapsed time per iteration (s): 1.15 | learning rate: 6.431E-05 | global batch size: 256 | lm loss: 1.952981E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.946 | TFLOPs: 36.84 | 15: iteration 84400/ 125429 | consumed samples: 21606400 | consumed tokens: 44249907200 | elapsed time per iteration (s): 1.03 | learning rate: 6.429E-05 | global batch size: 256 | lm loss: 1.948394E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.414 | TFLOPs: 40.89 | 15: iteration 84410/ 125429 | consumed samples: 21608960 | consumed tokens: 44255150080 | elapsed time per iteration (s): 1.05 | learning rate: 6.427E-05 | global batch size: 256 | lm loss: 1.932665E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.336 | TFLOPs: 40.38 | 15: iteration 84420/ 125429 | consumed samples: 21611520 | consumed tokens: 44260392960 | elapsed time per iteration (s): 1.03 | learning rate: 6.425E-05 | global batch size: 256 | lm loss: 1.935563E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.992 | TFLOPs: 40.98 | 15: iteration 84430/ 125429 | consumed samples: 21614080 | consumed tokens: 44265635840 | elapsed time per iteration (s): 1.08 | learning rate: 6.423E-05 | global batch size: 256 | lm loss: 1.958474E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.576 | TFLOPs: 39.10 | 15: iteration 84440/ 125429 | consumed samples: 21616640 | consumed tokens: 44270878720 | elapsed time per iteration (s): 1.08 | learning rate: 6.421E-05 | global batch size: 256 | lm loss: 1.922349E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 236.747 | TFLOPs: 39.12 | 15: iteration 84450/ 125429 | consumed samples: 21619200 | consumed tokens: 44276121600 | elapsed time per iteration (s): 1.03 | learning rate: 6.419E-05 | global batch size: 256 | lm loss: 1.941597E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.017 | TFLOPs: 40.99 | 15: iteration 84460/ 125429 | consumed samples: 21621760 | consumed tokens: 44281364480 | elapsed time per iteration (s): 1.04 | learning rate: 6.417E-05 | global batch size: 256 | lm loss: 1.942164E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.480 | TFLOPs: 40.73 | 15: iteration 84470/ 125429 | consumed samples: 21624320 | consumed tokens: 44286607360 | elapsed time per iteration (s): 1.05 | learning rate: 6.415E-05 | global batch size: 256 | lm loss: 1.943762E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.952 | TFLOPs: 40.15 | 15: iteration 84480/ 125429 | consumed samples: 21626880 | consumed tokens: 44291850240 | elapsed time per iteration (s): 1.02 | learning rate: 6.413E-05 | global batch size: 256 | lm loss: 1.912249E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.232 | TFLOPs: 41.35 | 15: iteration 84490/ 125429 | consumed samples: 21629440 | consumed tokens: 44297093120 | elapsed time per iteration (s): 1.03 | learning rate: 6.411E-05 | global batch size: 256 | lm loss: 1.940783E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.789 | TFLOPs: 40.95 | 15: iteration 84500/ 125429 | consumed samples: 21632000 | consumed tokens: 44302336000 | elapsed time per iteration (s): 1.05 | learning rate: 6.409E-05 | global batch 
size: 256 | lm loss: 1.915625E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.882 | TFLOPs: 40.30 | 15: iteration 84510/ 125429 | consumed samples: 21634560 | consumed tokens: 44307578880 | elapsed time per iteration (s): 1.04 | learning rate: 6.407E-05 | global batch size: 256 | lm loss: 1.924251E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.577 | TFLOPs: 40.75 | 15: iteration 84520/ 125429 | consumed samples: 21637120 | consumed tokens: 44312821760 | elapsed time per iteration (s): 1.06 | learning rate: 6.405E-05 | global batch size: 256 | lm loss: 1.975641E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.733 | TFLOPs: 39.95 | 15: iteration 84530/ 125429 | consumed samples: 21639680 | consumed tokens: 44318064640 | elapsed time per iteration (s): 1.02 | learning rate: 6.403E-05 | global batch size: 256 | lm loss: 1.961581E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.610 | TFLOPs: 41.42 | 15: iteration 84540/ 125429 | consumed samples: 21642240 | consumed tokens: 44323307520 | elapsed time per iteration (s): 1.03 | learning rate: 6.401E-05 | global batch size: 256 | lm loss: 1.934922E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.669 | TFLOPs: 40.93 | 15: iteration 84550/ 125429 | consumed samples: 21644800 | consumed tokens: 44328550400 | elapsed time per iteration (s): 1.04 | learning rate: 6.399E-05 | global batch size: 256 | lm loss: 1.921304E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.105 | TFLOPs: 40.67 | 15: iteration 84560/ 125429 | consumed samples: 21647360 | 
consumed tokens: 44333793280 | elapsed time per iteration (s): 1.03 | learning rate: 6.397E-05 | global batch size: 256 | lm loss: 1.911519E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.820 | TFLOPs: 40.95 | 15: iteration 84570/ 125429 | consumed samples: 21649920 | consumed tokens: 44339036160 | elapsed time per iteration (s): 1.04 | learning rate: 6.395E-05 | global batch size: 256 | lm loss: 1.919700E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.039 | TFLOPs: 40.83 | 15: iteration 84580/ 125429 | consumed samples: 21652480 | consumed tokens: 44344279040 | elapsed time per iteration (s): 1.03 | learning rate: 6.393E-05 | global batch size: 256 | lm loss: 1.927924E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.699 | TFLOPs: 41.26 | 15: iteration 84590/ 125429 | consumed samples: 21655040 | consumed tokens: 44349521920 | elapsed time per iteration (s): 1.03 | learning rate: 6.391E-05 | global batch size: 256 | lm loss: 1.938528E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.492 | TFLOPs: 41.07 | 15: iteration 84600/ 125429 | consumed samples: 21657600 | consumed tokens: 44354764800 | elapsed time per iteration (s): 1.03 | learning rate: 6.390E-05 | global batch size: 256 | lm loss: 1.927134E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.908 | TFLOPs: 41.13 | 15: iteration 84610/ 125429 | consumed samples: 21660160 | consumed tokens: 44360007680 | elapsed time per iteration (s): 1.05 | learning rate: 6.388E-05 | global batch size: 256 | lm loss: 1.921612E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 244.813 | TFLOPs: 40.46 | 15: iteration 84620/ 125429 | consumed samples: 21662720 | consumed tokens: 44365250560 | elapsed time per iteration (s): 1.03 | learning rate: 6.386E-05 | global batch size: 256 | lm loss: 1.939813E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.420 | TFLOPs: 41.05 | 15: iteration 84630/ 125429 | consumed samples: 21665280 | consumed tokens: 44370493440 | elapsed time per iteration (s): 1.17 | learning rate: 6.384E-05 | global batch size: 256 | lm loss: 1.920706E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 219.069 | TFLOPs: 36.20 | 15: iteration 84640/ 125429 | consumed samples: 21667840 | consumed tokens: 44375736320 | elapsed time per iteration (s): 1.03 | learning rate: 6.382E-05 | global batch size: 256 | lm loss: 1.926763E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.587 | TFLOPs: 41.08 | 15: iteration 84650/ 125429 | consumed samples: 21670400 | consumed tokens: 44380979200 | elapsed time per iteration (s): 1.09 | learning rate: 6.380E-05 | global batch size: 256 | lm loss: 1.926709E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.019 | TFLOPs: 38.84 | 15: iteration 84660/ 125429 | consumed samples: 21672960 | consumed tokens: 44386222080 | elapsed time per iteration (s): 1.04 | learning rate: 6.378E-05 | global batch size: 256 | lm loss: 1.929101E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.980 | TFLOPs: 40.48 | 15: iteration 84670/ 125429 | consumed samples: 21675520 | consumed tokens: 44391464960 | elapsed time per iteration (s): 1.05 | learning rate: 6.376E-05 | global batch size: 256 | lm loss: 
1.940072E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.166 | TFLOPs: 40.35 | 15: iteration 84680/ 125429 | consumed samples: 21678080 | consumed tokens: 44396707840 | elapsed time per iteration (s): 1.03 | learning rate: 6.374E-05 | global batch size: 256 | lm loss: 1.924937E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.466 | TFLOPs: 41.06 | 15: iteration 84690/ 125429 | consumed samples: 21680640 | consumed tokens: 44401950720 | elapsed time per iteration (s): 1.04 | learning rate: 6.372E-05 | global batch size: 256 | lm loss: 1.932685E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.156 | TFLOPs: 40.84 | 15: iteration 84700/ 125429 | consumed samples: 21683200 | consumed tokens: 44407193600 | elapsed time per iteration (s): 1.03 | learning rate: 6.370E-05 | global batch size: 256 | lm loss: 1.950801E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.720 | TFLOPs: 41.27 | 15: iteration 84710/ 125429 | consumed samples: 21685760 | consumed tokens: 44412436480 | elapsed time per iteration (s): 1.03 | learning rate: 6.368E-05 | global batch size: 256 | lm loss: 1.957407E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.376 | TFLOPs: 40.88 | 15: iteration 84720/ 125429 | consumed samples: 21688320 | consumed tokens: 44417679360 | elapsed time per iteration (s): 1.04 | learning rate: 6.366E-05 | global batch size: 256 | lm loss: 1.927412E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.336 | TFLOPs: 40.54 | 15: iteration 84730/ 125429 | consumed samples: 21690880 | consumed tokens: 
44422922240 | elapsed time per iteration (s): 1.59 | learning rate: 6.364E-05 | global batch size: 256 | lm loss: 1.927585E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 161.232 | TFLOPs: 26.64 | 15: iteration 84740/ 125429 | consumed samples: 21693440 | consumed tokens: 44428165120 | elapsed time per iteration (s): 1.18 | learning rate: 6.362E-05 | global batch size: 256 | lm loss: 1.927652E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.473 | TFLOPs: 35.77 | 15: iteration 84750/ 125429 | consumed samples: 21696000 | consumed tokens: 44433408000 | elapsed time per iteration (s): 1.05 | learning rate: 6.360E-05 | global batch size: 256 | lm loss: 1.941810E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.988 | TFLOPs: 40.16 | 15: iteration 84760/ 125429 | consumed samples: 21698560 | consumed tokens: 44438650880 | elapsed time per iteration (s): 1.02 | learning rate: 6.358E-05 | global batch size: 256 | lm loss: 1.941526E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.059 | TFLOPs: 41.32 | 15: iteration 84770/ 125429 | consumed samples: 21701120 | consumed tokens: 44443893760 | elapsed time per iteration (s): 1.03 | learning rate: 6.356E-05 | global batch size: 256 | lm loss: 1.945604E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.820 | TFLOPs: 41.12 | 15: iteration 84780/ 125429 | consumed samples: 21703680 | consumed tokens: 44449136640 | elapsed time per iteration (s): 1.04 | learning rate: 6.354E-05 | global batch size: 256 | lm loss: 1.949637E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 245.737 | TFLOPs: 40.61 | 15: iteration 84790/ 125429 | consumed samples: 21706240 | consumed tokens: 44454379520 | elapsed time per iteration (s): 1.03 | learning rate: 6.352E-05 | global batch size: 256 | lm loss: 1.978201E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.913 | TFLOPs: 41.13 | 15: iteration 84800/ 125429 | consumed samples: 21708800 | consumed tokens: 44459622400 | elapsed time per iteration (s): 1.03 | learning rate: 6.350E-05 | global batch size: 256 | lm loss: 1.925208E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.309 | TFLOPs: 41.20 | 15: iteration 84810/ 125429 | consumed samples: 21711360 | consumed tokens: 44464865280 | elapsed time per iteration (s): 1.04 | learning rate: 6.349E-05 | global batch size: 256 | lm loss: 1.940199E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.021 | TFLOPs: 40.82 | 15: iteration 84820/ 125429 | consumed samples: 21713920 | consumed tokens: 44470108160 | elapsed time per iteration (s): 1.04 | learning rate: 6.347E-05 | global batch size: 256 | lm loss: 1.921625E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.816 | TFLOPs: 40.79 | 15: iteration 84830/ 125429 | consumed samples: 21716480 | consumed tokens: 44475351040 | elapsed time per iteration (s): 1.03 | learning rate: 6.345E-05 | global batch size: 256 | lm loss: 1.946897E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.565 | TFLOPs: 41.24 | 15: iteration 84840/ 125429 | consumed samples: 21719040 | consumed tokens: 44480593920 | elapsed time per iteration (s): 1.03 | learning rate: 6.343E-05 | global batch size: 256 | lm loss: 1.929841E+00 | grad 
norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.096 | TFLOPs: 41.00 | 15: iteration 84850/ 125429 | consumed samples: 21721600 | consumed tokens: 44485836800 | elapsed time per iteration (s): 1.02 | learning rate: 6.341E-05 | global batch size: 256 | lm loss: 1.944771E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.577 | TFLOPs: 41.41 | 15: iteration 84860/ 125429 | consumed samples: 21724160 | consumed tokens: 44491079680 | elapsed time per iteration (s): 1.03 | learning rate: 6.339E-05 | global batch size: 256 | lm loss: 1.945764E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.730 | TFLOPs: 40.94 | 15: iteration 84870/ 125429 | consumed samples: 21726720 | consumed tokens: 44496322560 | elapsed time per iteration (s): 1.03 | learning rate: 6.337E-05 | global batch size: 256 | lm loss: 1.935482E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.955 | TFLOPs: 40.98 | 15: iteration 84880/ 125429 | consumed samples: 21729280 | consumed tokens: 44501565440 | elapsed time per iteration (s): 1.04 | learning rate: 6.335E-05 | global batch size: 256 | lm loss: 1.943447E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.476 | TFLOPs: 40.73 | 15: iteration 84890/ 125429 | consumed samples: 21731840 | consumed tokens: 44506808320 | elapsed time per iteration (s): 1.05 | learning rate: 6.333E-05 | global batch size: 256 | lm loss: 1.931219E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.560 | TFLOPs: 40.42 | 15: iteration 84900/ 125429 | consumed samples: 21734400 | consumed tokens: 44512051200 | elapsed time 
per iteration (s): 1.04 | learning rate: 6.331E-05 | global batch size: 256 | lm loss: 1.943012E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.114 | TFLOPs: 40.84 | 15: iteration 84910/ 125429 | consumed samples: 21736960 | consumed tokens: 44517294080 | elapsed time per iteration (s): 1.06 | learning rate: 6.329E-05 | global batch size: 256 | lm loss: 1.939575E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.423 | TFLOPs: 40.06 | 15: iteration 84920/ 125429 | consumed samples: 21739520 | consumed tokens: 44522536960 | elapsed time per iteration (s): 1.05 | learning rate: 6.327E-05 | global batch size: 256 | lm loss: 1.920589E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.376 | TFLOPs: 40.22 | 15: iteration 84930/ 125429 | consumed samples: 21742080 | consumed tokens: 44527779840 | elapsed time per iteration (s): 1.09 | learning rate: 6.325E-05 | global batch size: 256 | lm loss: 1.955245E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.756 | TFLOPs: 38.96 | 15: iteration 84940/ 125429 | consumed samples: 21744640 | consumed tokens: 44533022720 | elapsed time per iteration (s): 1.05 | learning rate: 6.323E-05 | global batch size: 256 | lm loss: 1.950139E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.837 | TFLOPs: 40.13 | 15: iteration 84950/ 125429 | consumed samples: 21747200 | consumed tokens: 44538265600 | elapsed time per iteration (s): 1.05 | learning rate: 6.321E-05 | global batch size: 256 | lm loss: 1.956211E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.455 | TFLOPs: 
40.40 | 15: iteration 84960/ 125429 | consumed samples: 21749760 | consumed tokens: 44543508480 | elapsed time per iteration (s): 1.06 | learning rate: 6.319E-05 | global batch size: 256 | lm loss: 1.904499E+00 | grad norm: 0.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.675 | TFLOPs: 39.77 | 15: iteration 84970/ 125429 | consumed samples: 21752320 | consumed tokens: 44548751360 | elapsed time per iteration (s): 1.12 | learning rate: 6.317E-05 | global batch size: 256 | lm loss: 1.938333E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.561 | TFLOPs: 37.77 | 15: iteration 84980/ 125429 | consumed samples: 21754880 | consumed tokens: 44553994240 | elapsed time per iteration (s): 1.05 | learning rate: 6.315E-05 | global batch size: 256 | lm loss: 1.950468E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.535 | TFLOPs: 40.41 | 15: iteration 84990/ 125429 | consumed samples: 21757440 | consumed tokens: 44559237120 | elapsed time per iteration (s): 1.05 | learning rate: 6.313E-05 | global batch size: 256 | lm loss: 1.930274E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.610 | TFLOPs: 40.42 | 15: iteration 85000/ 125429 | consumed samples: 21760000 | consumed tokens: 44564480000 | elapsed time per iteration (s): 1.05 | learning rate: 6.312E-05 | global batch size: 256 | lm loss: 1.961509E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.418 | TFLOPs: 40.23 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 85000 | lm loss value: 1.894334E+00 | lm loss PPL: 6.648122E+00 | 15: 
------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 85000 to checkpoints_1b5 0: [2022-11-26 21:13:34,532] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step85000 is begin to save! 0: [2022-11-26 21:13:34,540] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_01-model_00-model_states.pt... 0: [2022-11-26 21:13:34,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_01-model_00-model_states.pt. 0: [2022-11-26 21:13:34,797] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_03-model_00-model_states.pt... 0: [2022-11-26 21:13:34,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_03-model_00-model_states.pt. 0: [2022-11-26 21:13:34,909] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_04-model_00-model_states.pt... 0: [2022-11-26 21:13:35,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_04-model_00-model_states.pt. 0: [2022-11-26 21:13:35,028] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_05-model_00-model_states.pt... 0: [2022-11-26 21:13:35,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_05-model_00-model_states.pt. 0: [2022-11-26 21:13:35,144] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_06-model_00-model_states.pt... 0: [2022-11-26 21:13:35,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_06-model_00-model_states.pt. 0: [2022-11-26 21:13:35,263] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_07-model_00-model_states.pt... 
0: [2022-11-26 21:13:35,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_07-model_00-model_states.pt. 0: [2022-11-26 21:13:35,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_08-model_00-model_states.pt... 0: [2022-11-26 21:13:35,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_08-model_00-model_states.pt. 0: [2022-11-26 21:13:35,488] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_09-model_00-model_states.pt... 0: [2022-11-26 21:13:35,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_09-model_00-model_states.pt. 0: [2022-11-26 21:13:35,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_10-model_00-model_states.pt... 0: [2022-11-26 21:13:35,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_10-model_00-model_states.pt. 0: [2022-11-26 21:13:35,715] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_11-model_00-model_states.pt... 0: [2022-11-26 21:13:35,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_11-model_00-model_states.pt. 0: [2022-11-26 21:13:35,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_12-model_00-model_states.pt... 0: [2022-11-26 21:13:35,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_12-model_00-model_states.pt. 0: [2022-11-26 21:13:35,937] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_13-model_00-model_states.pt... 
0: [2022-11-26 21:13:36,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_13-model_00-model_states.pt. 0: [2022-11-26 21:13:36,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_14-model_00-model_states.pt... 0: [2022-11-26 21:13:36,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_14-model_00-model_states.pt. 0: [2022-11-26 21:13:36,155] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_15-model_00-model_states.pt... 0: [2022-11-26 21:13:36,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_15-model_00-model_states.pt. 0: [2022-11-26 21:13:36,264] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_16-model_00-model_states.pt... 0: [2022-11-26 21:13:36,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_16-model_00-model_states.pt. 0: [2022-11-26 21:13:36,375] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_17-model_00-model_states.pt... 0: [2022-11-26 21:13:36,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_17-model_00-model_states.pt. 0: [2022-11-26 21:13:36,482] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_18-model_00-model_states.pt... 0: [2022-11-26 21:13:36,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_18-model_00-model_states.pt. 0: [2022-11-26 21:13:36,591] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_19-model_00-model_states.pt... 
0: [2022-11-26 21:13:36,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_19-model_00-model_states.pt. 0: [2022-11-26 21:13:36,705] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_20-model_00-model_states.pt... 0: [2022-11-26 21:13:36,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_20-model_00-model_states.pt. 0: [2022-11-26 21:13:36,809] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_21-model_00-model_states.pt... 0: [2022-11-26 21:13:36,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_21-model_00-model_states.pt. 0: [2022-11-26 21:13:36,918] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_22-model_00-model_states.pt... 0: [2022-11-26 21:13:37,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_22-model_00-model_states.pt. 0: [2022-11-26 21:13:37,030] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_23-model_00-model_states.pt... 0: [2022-11-26 21:13:37,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_23-model_00-model_states.pt. 0: [2022-11-26 21:13:37,135] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_24-model_00-model_states.pt... 0: [2022-11-26 21:13:37,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_24-model_00-model_states.pt. 0: [2022-11-26 21:13:37,241] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_25-model_00-model_states.pt... 
0: [2022-11-26 21:13:37,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_25-model_00-model_states.pt. 0: [2022-11-26 21:13:37,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_26-model_00-model_states.pt... 0: [2022-11-26 21:13:37,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_26-model_00-model_states.pt. 0: [2022-11-26 21:13:37,459] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_27-model_00-model_states.pt... 0: [2022-11-26 21:13:37,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_27-model_00-model_states.pt. 0: [2022-11-26 21:13:37,566] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_28-model_00-model_states.pt... 0: [2022-11-26 21:13:37,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_28-model_00-model_states.pt. 0: [2022-11-26 21:13:37,674] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_29-model_00-model_states.pt... 0: [2022-11-26 21:13:37,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_29-model_00-model_states.pt. 0: [2022-11-26 21:13:37,780] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_30-model_00-model_states.pt... 0: [2022-11-26 21:13:37,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_30-model_00-model_states.pt. 0: [2022-11-26 21:13:37,889] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/layer_32-model_00-model_states.pt... 
0: [2022-11-26 21:13:37,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/layer_32-model_00-model_states.pt. 0: [2022-11-26 21:13:37,895] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step85000/mp_rank_00_model_states.pt 0: [2022-11-26 21:13:37,895] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/mp_rank_00_model_states.pt... 0: [2022-11-26 21:13:37,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/mp_rank_00_model_states.pt. 0: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
10: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
9: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
12: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
8: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
15: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
3: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
1: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:13:37,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step85000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:13:38,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 21:13:38,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 21:13:38,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 6: [2022-11-26 21:13:38,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:13:38,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 21:13:38,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 3: [2022-11-26 21:13:38,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:13:38,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 21:13:38,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 2: [2022-11-26 21:13:38,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:13:38,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 21:13:38,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 2: [2022-11-26 21:13:38,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-26 21:13:38,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 21:13:38,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 14: [2022-11-26 21:13:38,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:13:38,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 21:13:38,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 5: [2022-11-26 21:13:38,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:13:38,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 21:13:38,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 3: [2022-11-26 21:13:38,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:13:38,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 0: [2022-11-26 21:13:38,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:13:38,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
0: [2022-11-26 21:13:38,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:13:38,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:13:38,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:13:38,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 21:13:38,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 21:13:38,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 21:13:38,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 0: [2022-11-26 21:13:38,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 0: [2022-11-26 21:13:38,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 1: [2022-11-26 21:13:38,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:13:38,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 21:13:38,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 21:13:38,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 1: [2022-11-26 21:13:38,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 21:13:38,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 9: [2022-11-26 21:13:38,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:13:38,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 21:13:38,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 0: [2022-11-26 21:13:38,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:13:38,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 21:13:38,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 3: [2022-11-26 21:13:38,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:13:38,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 21:13:38,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
3: [2022-11-26 21:13:38,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:13:38,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 21:13:38,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 11: [2022-11-26 21:13:38,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:13:38,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:13:38,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 21:13:38,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 6: [2022-11-26 21:13:38,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:13:38,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 21:13:38,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 1: [2022-11-26 21:13:38,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 21:13:38,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 21:13:38,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 12: [2022-11-26 21:13:38,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:13:38,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 21:13:38,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 3: [2022-11-26 21:13:38,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:13:38,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 21:13:38,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 12: [2022-11-26 21:13:38,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:13:38,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 21:13:38,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 12: [2022-11-26 21:13:38,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 21:13:38,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 21:13:38,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 0: [2022-11-26 21:13:38,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:13:38,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 21:13:38,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 6: [2022-11-26 21:13:38,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:13:38,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:13:38,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 4: [2022-11-26 21:13:38,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 21:13:38,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 6: [2022-11-26 21:13:38,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 4: [2022-11-26 21:13:38,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 21:13:38,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 21:13:38,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 4: [2022-11-26 21:13:38,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:13:38,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 0: [2022-11-26 21:13:38,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 4: [2022-11-26 21:13:38,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 4: [2022-11-26 21:13:38,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:13:38,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 21:13:38,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 0: [2022-11-26 21:13:38,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 5: [2022-11-26 21:13:38,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 21:13:38,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 21:13:38,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 4: [2022-11-26 21:13:38,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:13:38,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 21:13:38,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 6: [2022-11-26 21:13:38,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:13:38,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 21:13:38,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 6: [2022-11-26 21:13:38,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:13:38,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 21:13:38,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 14: [2022-11-26 21:13:38,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 21:13:38,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 21:13:38,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 14: [2022-11-26 21:13:38,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:13:38,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 21:13:38,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 14: [2022-11-26 21:13:38,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:13:38,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 21:13:38,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 5: [2022-11-26 21:13:38,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:13:38,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:13:38,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 3: [2022-11-26 21:13:38,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
5: [2022-11-26 21:13:38,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 21:13:38,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 5: [2022-11-26 21:13:38,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 3: [2022-11-26 21:13:38,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 21:13:38,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 5: [2022-11-26 21:13:38,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:13:38,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 21:13:38,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 5: [2022-11-26 21:13:38,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:13:38,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 21:13:38,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 12: [2022-11-26 21:13:38,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 21:13:38,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 21:13:38,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 3: [2022-11-26 21:13:38,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:13:38,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:13:38,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:13:38,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:13:38,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:13:38,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 21:13:38,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 21:13:38,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 9: [2022-11-26 21:13:38,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
3: [2022-11-26 21:13:38,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 2: [2022-11-26 21:13:38,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 21:13:38,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 21:13:38,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 2: [2022-11-26 21:13:38,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 3: [2022-11-26 21:13:38,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 1: [2022-11-26 21:13:38,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:13:38,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 21:13:38,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 1: [2022-11-26 21:13:38,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:13:38,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:13:38,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
1: [2022-11-26 21:13:38,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:13:38,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 1: [2022-11-26 21:13:38,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 21:13:38,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 21:13:38,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 0: [2022-11-26 21:13:38,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 1: [2022-11-26 21:13:38,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 1: [2022-11-26 21:13:38,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 1: [2022-11-26 21:13:38,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 11: [2022-11-26 21:13:38,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 21:13:38,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 11: [2022-11-26 21:13:38,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-26 21:13:38,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 21:13:38,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 11: [2022-11-26 21:13:38,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:13:38,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 21:13:38,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 11: [2022-11-26 21:13:38,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:13:38,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 21:13:38,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 11: [2022-11-26 21:13:38,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:13:38,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 21:13:38,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 5: [2022-11-26 21:13:38,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 21:13:38,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 21:13:38,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 0: [2022-11-26 21:13:38,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:13:38,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 21:13:38,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 2: [2022-11-26 21:13:38,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:13:38,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 21:13:38,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 1: [2022-11-26 21:13:38,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:13:38,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 21:13:38,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 9: [2022-11-26 21:13:38,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 21:13:38,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 21:13:38,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 2: [2022-11-26 21:13:38,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:13:38,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:13:38,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:13:38,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 21:13:38,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 21:13:38,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 21:13:38,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 2: [2022-11-26 21:13:38,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 2: [2022-11-26 21:13:38,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 6: [2022-11-26 21:13:38,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 21:13:38,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 21:13:38,146] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 14: [2022-11-26 21:13:38,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:13:38,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 21:13:38,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 12: [2022-11-26 21:13:38,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:13:38,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 21:13:38,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 12: [2022-11-26 21:13:38,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:13:38,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 21:13:38,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 12: [2022-11-26 21:13:38,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 21:13:38,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 21:13:38,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 14: [2022-11-26 21:13:38,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:13:38,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 21:13:38,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 5: [2022-11-26 21:13:38,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:13:38,152] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 21:13:38,152] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 3: [2022-11-26 21:13:38,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:13:38,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 21:13:38,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 4: [2022-11-26 21:13:38,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 21:13:38,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:13:38,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:13:38,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 21:13:38,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 21:13:38,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 11: [2022-11-26 21:13:38,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:13:38,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 4: [2022-11-26 21:13:38,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 4: [2022-11-26 21:13:38,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 9: [2022-11-26 21:13:38,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:13:38,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 21:13:38,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-26 21:13:38,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:13:38,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:13:38,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 9: [2022-11-26 21:13:38,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 21:13:38,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 21:13:38,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 21:13:38,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 9: [2022-11-26 21:13:38,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 9: [2022-11-26 21:13:38,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 11: [2022-11-26 21:13:38,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 21:13:38,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 11: [2022-11-26 21:13:38,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 21:13:38,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 21:13:38,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 11: [2022-11-26 21:13:38,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:13:38,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 21:13:38,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 7: [2022-11-26 21:13:38,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:13:38,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:13:38,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:13:38,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 21:13:38,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 21:13:38,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 21:13:38,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 21:13:38,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 21:13:38,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 7: [2022-11-26 21:13:38,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 7: [2022-11-26 21:13:38,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 7: [2022-11-26 21:13:38,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 7: [2022-11-26 21:13:38,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:13:38,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-26 21:13:38,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 21:13:38,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 21:13:38,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 7: [2022-11-26 21:13:38,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 6: [2022-11-26 21:13:38,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:13:38,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 21:13:38,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 7: [2022-11-26 21:13:38,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:13:38,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 21:13:38,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 7: [2022-11-26 21:13:38,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 21:13:38,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 21:13:38,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 14: [2022-11-26 21:13:38,158] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:13:38,158] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 21:13:38,158] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 14: [2022-11-26 21:13:38,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:13:38,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 21:13:38,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 8: [2022-11-26 21:13:38,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:13:38,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:13:38,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:13:38,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 21:13:38,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:13:38,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:13:38,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:13:38,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:13:38,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 21:13:38,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 21:13:38,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 21:13:38,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 21:13:38,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 8: [2022-11-26 21:13:38,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 8: [2022-11-26 21:13:38,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 21:13:38,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
8: [2022-11-26 21:13:38,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 8: [2022-11-26 21:13:38,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 21:13:38,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 21:13:38,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 21:13:38,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 8: [2022-11-26 21:13:38,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 8: [2022-11-26 21:13:38,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 8: [2022-11-26 21:13:38,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 10: [2022-11-26 21:13:38,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:13:38,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:13:38,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:13:38,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-26 21:13:38,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:13:38,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:13:38,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:13:38,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 21:13:38,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 21:13:38,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:13:38,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 21:13:38,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 21:13:38,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 21:13:38,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
10: [2022-11-26 21:13:38,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 21:13:38,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 21:13:38,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 10: [2022-11-26 21:13:38,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 10: [2022-11-26 21:13:38,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 10: [2022-11-26 21:13:38,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 10: [2022-11-26 21:13:38,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 21:13:38,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 10: [2022-11-26 21:13:38,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 10: [2022-11-26 21:13:38,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 13: [2022-11-26 21:13:38,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:13:38,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:13:38,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 21:13:38,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:13:38,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:13:38,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:13:38,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 21:13:38,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 21:13:38,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 21:13:38,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 21:13:38,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 21:13:38,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 21:13:38,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:13:38,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
13: [2022-11-26 21:13:38,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 13: [2022-11-26 21:13:38,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 13: [2022-11-26 21:13:38,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:13:38,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 13: [2022-11-26 21:13:38,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 13: [2022-11-26 21:13:38,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 13: [2022-11-26 21:13:38,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 21:13:38,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 21:13:38,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 13: [2022-11-26 21:13:38,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 15: [2022-11-26 21:13:38,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:13:38,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:13:38,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 21:13:38,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 21:13:38,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 21:13:38,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 21:13:38,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 15: [2022-11-26 21:13:38,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 15: [2022-11-26 21:13:38,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 15: [2022-11-26 21:13:38,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:13:38,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:13:38,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:13:38,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-26 21:13:38,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 21:13:38,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 21:13:38,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 21:13:38,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 21:13:38,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 15: [2022-11-26 21:13:38,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 15: [2022-11-26 21:13:38,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 15: [2022-11-26 21:13:38,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:13:38,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 15: [2022-11-26 21:13:38,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step85000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 21:13:38,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85000 is ready now! 
0: successfully saved checkpoint at iteration 85000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3861.85 15: iteration 85010/ 125429 | consumed samples: 21762560 | consumed tokens: 44569722880 | elapsed time per iteration (s): 1.47 | learning rate: 6.310E-05 | global batch size: 256 | lm loss: 1.951501E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 174.344 | TFLOPs: 28.81 | 15: iteration 85020/ 125429 | consumed samples: 21765120 | consumed tokens: 44574965760 | elapsed time per iteration (s): 1.04 | learning rate: 6.308E-05 | global batch size: 256 | lm loss: 1.939863E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.451 | TFLOPs: 40.73 | 15: iteration 85030/ 125429 | consumed samples: 21767680 | consumed tokens: 44580208640 | elapsed time per iteration (s): 1.04 | learning rate: 6.306E-05 | global batch size: 256 | lm loss: 1.951075E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.032 | TFLOPs: 40.66 | 15: iteration 85040/ 125429 | consumed samples: 21770240 | consumed tokens: 44585451520 | elapsed time per iteration (s): 1.05 | learning rate: 6.304E-05 | global batch size: 256 | lm loss: 1.945200E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.593 | TFLOPs: 40.42 | 15: iteration 85050/ 125429 | consumed samples: 21772800 | consumed tokens: 44590694400 | elapsed time per iteration (s): 1.04 | learning rate: 6.302E-05 | global batch size: 256 | lm loss: 1.884255E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.955 | TFLOPs: 40.65 | 15: iteration 85060/ 125429 | consumed samples: 21775360 | consumed tokens: 44595937280 | elapsed time per iteration (s): 1.03 | 
learning rate: 6.300E-05 | global batch size: 256 | lm loss: 1.931080E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.743 | TFLOPs: 41.11 | 15: iteration 85070/ 125429 | consumed samples: 21777920 | consumed tokens: 44601180160 | elapsed time per iteration (s): 1.05 | learning rate: 6.298E-05 | global batch size: 256 | lm loss: 1.943259E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.063 | TFLOPs: 40.17 | 15: iteration 85080/ 125429 | consumed samples: 21780480 | consumed tokens: 44606423040 | elapsed time per iteration (s): 1.05 | learning rate: 6.296E-05 | global batch size: 256 | lm loss: 1.895892E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.105 | TFLOPs: 40.17 | 15: iteration 85090/ 125429 | consumed samples: 21783040 | consumed tokens: 44611665920 | elapsed time per iteration (s): 1.04 | learning rate: 6.294E-05 | global batch size: 256 | lm loss: 1.941313E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.031 | TFLOPs: 40.49 | 15: iteration 85100/ 125429 | consumed samples: 21785600 | consumed tokens: 44616908800 | elapsed time per iteration (s): 1.04 | learning rate: 6.292E-05 | global batch size: 256 | lm loss: 1.933198E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.930 | TFLOPs: 40.81 | 15: iteration 85110/ 125429 | consumed samples: 21788160 | consumed tokens: 44622151680 | elapsed time per iteration (s): 1.04 | learning rate: 6.290E-05 | global batch size: 256 | lm loss: 1.920444E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.183 | TFLOPs: 40.68 | 15: iteration 85120/ 
125429 | consumed samples: 21790720 | consumed tokens: 44627394560 | elapsed time per iteration (s): 1.04 | learning rate: 6.288E-05 | global batch size: 256 | lm loss: 1.924689E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.570 | TFLOPs: 40.75 | 15: iteration 85130/ 125429 | consumed samples: 21793280 | consumed tokens: 44632637440 | elapsed time per iteration (s): 1.06 | learning rate: 6.286E-05 | global batch size: 256 | lm loss: 1.939370E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.768 | TFLOPs: 39.79 | 15: iteration 85140/ 125429 | consumed samples: 21795840 | consumed tokens: 44637880320 | elapsed time per iteration (s): 1.04 | learning rate: 6.284E-05 | global batch size: 256 | lm loss: 1.925523E+00 | grad norm: 0.136 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.896 | TFLOPs: 40.64 | 15: iteration 85150/ 125429 | consumed samples: 21798400 | consumed tokens: 44643123200 | elapsed time per iteration (s): 1.05 | learning rate: 6.282E-05 | global batch size: 256 | lm loss: 1.940366E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.799 | TFLOPs: 40.29 | 15: iteration 85160/ 125429 | consumed samples: 21800960 | consumed tokens: 44648366080 | elapsed time per iteration (s): 1.05 | learning rate: 6.280E-05 | global batch size: 256 | lm loss: 1.951824E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.879 | TFLOPs: 40.47 | 15: iteration 85170/ 125429 | consumed samples: 21803520 | consumed tokens: 44653608960 | elapsed time per iteration (s): 1.08 | learning rate: 6.279E-05 | global batch size: 256 | lm loss: 1.917834E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 237.860 | TFLOPs: 39.31 | 15: iteration 85180/ 125429 | consumed samples: 21806080 | consumed tokens: 44658851840 | elapsed time per iteration (s): 1.04 | learning rate: 6.277E-05 | global batch size: 256 | lm loss: 1.925547E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.140 | TFLOPs: 40.51 | 15: iteration 85190/ 125429 | consumed samples: 21808640 | consumed tokens: 44664094720 | elapsed time per iteration (s): 1.07 | learning rate: 6.275E-05 | global batch size: 256 | lm loss: 1.921654E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.208 | TFLOPs: 39.70 | 15: iteration 85200/ 125429 | consumed samples: 21811200 | consumed tokens: 44669337600 | elapsed time per iteration (s): 1.04 | learning rate: 6.273E-05 | global batch size: 256 | lm loss: 1.957395E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.493 | TFLOPs: 40.73 | 15: iteration 85210/ 125429 | consumed samples: 21813760 | consumed tokens: 44674580480 | elapsed time per iteration (s): 1.06 | learning rate: 6.271E-05 | global batch size: 256 | lm loss: 1.935339E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.785 | TFLOPs: 39.79 | 15: iteration 85220/ 125429 | consumed samples: 21816320 | consumed tokens: 44679823360 | elapsed time per iteration (s): 1.16 | learning rate: 6.269E-05 | global batch size: 256 | lm loss: 1.961828E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 220.681 | TFLOPs: 36.47 | 15: iteration 85230/ 125429 | consumed samples: 21818880 | consumed tokens: 44685066240 | elapsed time per iteration (s): 1.05 | learning rate: 
6.267E-05 | global batch size: 256 | lm loss: 1.939016E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.469 | TFLOPs: 40.40 | 15: iteration 85240/ 125429 | consumed samples: 21821440 | consumed tokens: 44690309120 | elapsed time per iteration (s): 1.02 | learning rate: 6.265E-05 | global batch size: 256 | lm loss: 1.939497E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.460 | TFLOPs: 41.39 | 15: iteration 85250/ 125429 | consumed samples: 21824000 | consumed tokens: 44695552000 | elapsed time per iteration (s): 1.09 | learning rate: 6.263E-05 | global batch size: 256 | lm loss: 1.927321E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.877 | TFLOPs: 38.98 | 15: iteration 85260/ 125429 | consumed samples: 21826560 | consumed tokens: 44700794880 | elapsed time per iteration (s): 1.04 | learning rate: 6.261E-05 | global batch size: 256 | lm loss: 1.958411E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.371 | TFLOPs: 40.55 | 15: iteration 85270/ 125429 | consumed samples: 21829120 | consumed tokens: 44706037760 | elapsed time per iteration (s): 1.07 | learning rate: 6.259E-05 | global batch size: 256 | lm loss: 1.932821E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.887 | TFLOPs: 39.48 | 15: iteration 85280/ 125429 | consumed samples: 21831680 | consumed tokens: 44711280640 | elapsed time per iteration (s): 1.04 | learning rate: 6.257E-05 | global batch size: 256 | lm loss: 1.922569E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.589 | TFLOPs: 40.59 | 15: iteration 85290/ 125429 | 
consumed samples: 21834240 | consumed tokens: 44716523520 | elapsed time per iteration (s): 1.05 | learning rate: 6.255E-05 | global batch size: 256 | lm loss: 1.941913E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.260 | TFLOPs: 40.37 | 15: iteration 85300/ 125429 | consumed samples: 21836800 | consumed tokens: 44721766400 | elapsed time per iteration (s): 1.03 | learning rate: 6.253E-05 | global batch size: 256 | lm loss: 1.938330E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.087 | TFLOPs: 41.16 | 15: iteration 85310/ 125429 | consumed samples: 21839360 | consumed tokens: 44727009280 | elapsed time per iteration (s): 1.03 | learning rate: 6.251E-05 | global batch size: 256 | lm loss: 1.935481E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.343 | TFLOPs: 41.04 | 15: iteration 85320/ 125429 | consumed samples: 21841920 | consumed tokens: 44732252160 | elapsed time per iteration (s): 1.04 | learning rate: 6.250E-05 | global batch size: 256 | lm loss: 1.943289E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.314 | TFLOPs: 40.71 | 15: iteration 85330/ 125429 | consumed samples: 21844480 | consumed tokens: 44737495040 | elapsed time per iteration (s): 1.03 | learning rate: 6.248E-05 | global batch size: 256 | lm loss: 1.940450E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.340 | TFLOPs: 41.04 | 15: iteration 85340/ 125429 | consumed samples: 21847040 | consumed tokens: 44742737920 | elapsed time per iteration (s): 1.05 | learning rate: 6.246E-05 | global batch size: 256 | lm loss: 1.945013E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 244.187 | TFLOPs: 40.35 | 15: iteration 85350/ 125429 | consumed samples: 21849600 | consumed tokens: 44747980800 | elapsed time per iteration (s): 1.05 | learning rate: 6.244E-05 | global batch size: 256 | lm loss: 1.954827E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.403 | TFLOPs: 40.39 | 15: iteration 85360/ 125429 | consumed samples: 21852160 | consumed tokens: 44753223680 | elapsed time per iteration (s): 1.05 | learning rate: 6.242E-05 | global batch size: 256 | lm loss: 1.904120E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.404 | TFLOPs: 40.39 | 15: iteration 85370/ 125429 | consumed samples: 21854720 | consumed tokens: 44758466560 | elapsed time per iteration (s): 1.04 | learning rate: 6.240E-05 | global batch size: 256 | lm loss: 1.929960E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.742 | TFLOPs: 40.78 | 15: iteration 85380/ 125429 | consumed samples: 21857280 | consumed tokens: 44763709440 | elapsed time per iteration (s): 1.03 | learning rate: 6.238E-05 | global batch size: 256 | lm loss: 1.941533E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.627 | TFLOPs: 41.25 | 15: iteration 85390/ 125429 | consumed samples: 21859840 | consumed tokens: 44768952320 | elapsed time per iteration (s): 1.04 | learning rate: 6.236E-05 | global batch size: 256 | lm loss: 1.951668E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.559 | TFLOPs: 40.58 | 15: iteration 85400/ 125429 | consumed samples: 21862400 | consumed tokens: 44774195200 | elapsed time per iteration (s): 1.05 | learning rate: 6.234E-05 | global batch 
size: 256 | lm loss: 1.924258E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.003 | TFLOPs: 40.32 | 15: iteration 85410/ 125429 | consumed samples: 21864960 | consumed tokens: 44779438080 | elapsed time per iteration (s): 1.03 | learning rate: 6.232E-05 | global batch size: 256 | lm loss: 1.915586E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.378 | TFLOPs: 40.88 | 15: iteration 85420/ 125429 | consumed samples: 21867520 | consumed tokens: 44784680960 | elapsed time per iteration (s): 1.04 | learning rate: 6.230E-05 | global batch size: 256 | lm loss: 1.932611E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.438 | TFLOPs: 40.56 | 15: iteration 85430/ 125429 | consumed samples: 21870080 | consumed tokens: 44789923840 | elapsed time per iteration (s): 1.05 | learning rate: 6.228E-05 | global batch size: 256 | lm loss: 1.940303E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.377 | TFLOPs: 40.39 | 15: iteration 85440/ 125429 | consumed samples: 21872640 | consumed tokens: 44795166720 | elapsed time per iteration (s): 1.05 | learning rate: 6.226E-05 | global batch size: 256 | lm loss: 1.934006E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.899 | TFLOPs: 40.47 | 15: iteration 85450/ 125429 | consumed samples: 21875200 | consumed tokens: 44800409600 | elapsed time per iteration (s): 1.06 | learning rate: 6.224E-05 | global batch size: 256 | lm loss: 1.929735E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.815 | TFLOPs: 39.96 | 15: iteration 85460/ 125429 | consumed samples: 21877760 | 
consumed tokens: 44805652480 | elapsed time per iteration (s): 1.06 | learning rate: 6.222E-05 | global batch size: 256 | lm loss: 1.957860E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.166 | TFLOPs: 39.85 | 15: iteration 85470/ 125429 | consumed samples: 21880320 | consumed tokens: 44810895360 | elapsed time per iteration (s): 1.05 | learning rate: 6.221E-05 | global batch size: 256 | lm loss: 1.951447E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.788 | TFLOPs: 40.12 | 15: iteration 85480/ 125429 | consumed samples: 21882880 | consumed tokens: 44816138240 | elapsed time per iteration (s): 1.04 | learning rate: 6.219E-05 | global batch size: 256 | lm loss: 1.933601E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.619 | TFLOPs: 40.59 | 15: iteration 85490/ 125429 | consumed samples: 21885440 | consumed tokens: 44821381120 | elapsed time per iteration (s): 1.05 | learning rate: 6.217E-05 | global batch size: 256 | lm loss: 1.939958E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.929 | TFLOPs: 40.31 | 15: iteration 85500/ 125429 | consumed samples: 21888000 | consumed tokens: 44826624000 | elapsed time per iteration (s): 1.02 | learning rate: 6.215E-05 | global batch size: 256 | lm loss: 1.932125E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.819 | TFLOPs: 41.28 | 15: iteration 85510/ 125429 | consumed samples: 21890560 | consumed tokens: 44831866880 | elapsed time per iteration (s): 1.06 | learning rate: 6.213E-05 | global batch size: 256 | lm loss: 1.944830E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 241.827 | TFLOPs: 39.96 | 15: iteration 85520/ 125429 | consumed samples: 21893120 | consumed tokens: 44837109760 | elapsed time per iteration (s): 1.10 | learning rate: 6.211E-05 | global batch size: 256 | lm loss: 1.965198E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.243 | TFLOPs: 38.55 | 15: iteration 85530/ 125429 | consumed samples: 21895680 | consumed tokens: 44842352640 | elapsed time per iteration (s): 1.06 | learning rate: 6.209E-05 | global batch size: 256 | lm loss: 1.900057E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.118 | TFLOPs: 40.01 | 15: iteration 85540/ 125429 | consumed samples: 21898240 | consumed tokens: 44847595520 | elapsed time per iteration (s): 1.04 | learning rate: 6.207E-05 | global batch size: 256 | lm loss: 1.951668E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.387 | TFLOPs: 40.55 | 15: iteration 85550/ 125429 | consumed samples: 21900800 | consumed tokens: 44852838400 | elapsed time per iteration (s): 2.27 | learning rate: 6.205E-05 | global batch size: 256 | lm loss: 1.961286E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 112.859 | TFLOPs: 18.65 | 15: iteration 85560/ 125429 | consumed samples: 21903360 | consumed tokens: 44858081280 | elapsed time per iteration (s): 1.04 | learning rate: 6.203E-05 | global batch size: 256 | lm loss: 1.952847E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.959 | TFLOPs: 40.65 | 15: iteration 85570/ 125429 | consumed samples: 21905920 | consumed tokens: 44863324160 | elapsed time per iteration (s): 1.07 | learning rate: 6.201E-05 | global batch size: 256 | lm loss: 
1.946788E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.964 | TFLOPs: 39.66 | 15: iteration 85580/ 125429 | consumed samples: 21908480 | consumed tokens: 44868567040 | elapsed time per iteration (s): 1.04 | learning rate: 6.199E-05 | global batch size: 256 | lm loss: 1.964020E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.823 | TFLOPs: 40.79 | 15: iteration 85590/ 125429 | consumed samples: 21911040 | consumed tokens: 44873809920 | elapsed time per iteration (s): 1.06 | learning rate: 6.197E-05 | global batch size: 256 | lm loss: 1.952026E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.065 | TFLOPs: 40.00 | 15: iteration 85600/ 125429 | consumed samples: 21913600 | consumed tokens: 44879052800 | elapsed time per iteration (s): 1.03 | learning rate: 6.195E-05 | global batch size: 256 | lm loss: 1.986540E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.112 | TFLOPs: 41.17 | 15: iteration 85610/ 125429 | consumed samples: 21916160 | consumed tokens: 44884295680 | elapsed time per iteration (s): 1.03 | learning rate: 6.194E-05 | global batch size: 256 | lm loss: 1.955949E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.452 | TFLOPs: 41.22 | 15: iteration 85620/ 125429 | consumed samples: 21918720 | consumed tokens: 44889538560 | elapsed time per iteration (s): 1.06 | learning rate: 6.192E-05 | global batch size: 256 | lm loss: 1.930224E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.845 | TFLOPs: 39.97 | 15: iteration 85630/ 125429 | consumed samples: 21921280 | consumed tokens: 
44894781440 | elapsed time per iteration (s): 1.04 | learning rate: 6.190E-05 | global batch size: 256 | lm loss: 1.981176E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.507 | TFLOPs: 40.74 | 15: iteration 85640/ 125429 | consumed samples: 21923840 | consumed tokens: 44900024320 | elapsed time per iteration (s): 1.03 | learning rate: 6.188E-05 | global batch size: 256 | lm loss: 1.959126E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.160 | TFLOPs: 41.01 | 15: iteration 85650/ 125429 | consumed samples: 21926400 | consumed tokens: 44905267200 | elapsed time per iteration (s): 1.03 | learning rate: 6.186E-05 | global batch size: 256 | lm loss: 1.921440E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.451 | TFLOPs: 41.06 | 15: iteration 85660/ 125429 | consumed samples: 21928960 | consumed tokens: 44910510080 | elapsed time per iteration (s): 1.04 | learning rate: 6.184E-05 | global batch size: 256 | lm loss: 1.938385E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.071 | TFLOPs: 40.50 | 15: iteration 85670/ 125429 | consumed samples: 21931520 | consumed tokens: 44915752960 | elapsed time per iteration (s): 1.04 | learning rate: 6.182E-05 | global batch size: 256 | lm loss: 1.924245E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.855 | TFLOPs: 40.63 | 15: iteration 85680/ 125429 | consumed samples: 21934080 | consumed tokens: 44920995840 | elapsed time per iteration (s): 1.07 | learning rate: 6.180E-05 | global batch size: 256 | lm loss: 1.937895E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 239.539 | TFLOPs: 39.59 | 15: iteration 85690/ 125429 | consumed samples: 21936640 | consumed tokens: 44926238720 | elapsed time per iteration (s): 1.05 | learning rate: 6.178E-05 | global batch size: 256 | lm loss: 1.958550E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.689 | TFLOPs: 40.27 | 15: iteration 85700/ 125429 | consumed samples: 21939200 | consumed tokens: 44931481600 | elapsed time per iteration (s): 1.03 | learning rate: 6.176E-05 | global batch size: 256 | lm loss: 1.894785E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.013 | TFLOPs: 41.15 | 15: iteration 85710/ 125429 | consumed samples: 21941760 | consumed tokens: 44936724480 | elapsed time per iteration (s): 1.07 | learning rate: 6.174E-05 | global batch size: 256 | lm loss: 1.898213E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.671 | TFLOPs: 39.61 | 15: iteration 85720/ 125429 | consumed samples: 21944320 | consumed tokens: 44941967360 | elapsed time per iteration (s): 1.04 | learning rate: 6.172E-05 | global batch size: 256 | lm loss: 1.934868E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.882 | TFLOPs: 40.63 | 15: iteration 85730/ 125429 | consumed samples: 21946880 | consumed tokens: 44947210240 | elapsed time per iteration (s): 1.07 | learning rate: 6.170E-05 | global batch size: 256 | lm loss: 1.944416E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.536 | TFLOPs: 39.59 | 15: iteration 85740/ 125429 | consumed samples: 21949440 | consumed tokens: 44952453120 | elapsed time per iteration (s): 1.03 | learning rate: 6.169E-05 | global batch size: 256 | lm loss: 1.944730E+00 | grad 
norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.304 | TFLOPs: 41.03 | 15: iteration 85750/ 125429 | consumed samples: 21952000 | consumed tokens: 44957696000 | elapsed time per iteration (s): 1.06 | learning rate: 6.167E-05 | global batch size: 256 | lm loss: 1.949595E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.391 | TFLOPs: 39.89 | 15: iteration 85760/ 125429 | consumed samples: 21954560 | consumed tokens: 44962938880 | elapsed time per iteration (s): 1.03 | learning rate: 6.165E-05 | global batch size: 256 | lm loss: 1.939011E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.681 | TFLOPs: 41.26 | 15: iteration 85770/ 125429 | consumed samples: 21957120 | consumed tokens: 44968181760 | elapsed time per iteration (s): 1.05 | learning rate: 6.163E-05 | global batch size: 256 | lm loss: 1.921821E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.196 | TFLOPs: 40.36 | 15: iteration 85780/ 125429 | consumed samples: 21959680 | consumed tokens: 44973424640 | elapsed time per iteration (s): 1.03 | learning rate: 6.161E-05 | global batch size: 256 | lm loss: 1.954179E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.543 | TFLOPs: 41.24 | 15: iteration 85790/ 125429 | consumed samples: 21962240 | consumed tokens: 44978667520 | elapsed time per iteration (s): 1.04 | learning rate: 6.159E-05 | global batch size: 256 | lm loss: 1.929878E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.885 | TFLOPs: 40.80 | 15: iteration 85800/ 125429 | consumed samples: 21964800 | consumed tokens: 44983910400 | elapsed time 
per iteration (s): 1.05 | learning rate: 6.157E-05 | global batch size: 256 | lm loss: 1.907894E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.304 | TFLOPs: 40.37 | 15: iteration 85810/ 125429 | consumed samples: 21967360 | consumed tokens: 44989153280 | elapsed time per iteration (s): 1.05 | learning rate: 6.155E-05 | global batch size: 256 | lm loss: 1.966927E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.853 | TFLOPs: 40.13 | 15: iteration 85820/ 125429 | consumed samples: 21969920 | consumed tokens: 44994396160 | elapsed time per iteration (s): 1.04 | learning rate: 6.153E-05 | global batch size: 256 | lm loss: 1.943591E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.454 | TFLOPs: 40.73 | 15: iteration 85830/ 125429 | consumed samples: 21972480 | consumed tokens: 44999639040 | elapsed time per iteration (s): 1.04 | learning rate: 6.151E-05 | global batch size: 256 | lm loss: 1.934126E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.638 | TFLOPs: 40.59 | 15: iteration 85840/ 125429 | consumed samples: 21975040 | consumed tokens: 45004881920 | elapsed time per iteration (s): 1.03 | learning rate: 6.149E-05 | global batch size: 256 | lm loss: 1.955935E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.885 | TFLOPs: 41.13 | 15: iteration 85850/ 125429 | consumed samples: 21977600 | consumed tokens: 45010124800 | elapsed time per iteration (s): 1.06 | learning rate: 6.147E-05 | global batch size: 256 | lm loss: 1.951660E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.967 | TFLOPs: 
39.82 | 15: iteration 85860/ 125429 | consumed samples: 21980160 | consumed tokens: 45015367680 | elapsed time per iteration (s): 1.03 | learning rate: 6.146E-05 | global batch size: 256 | lm loss: 1.977424E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.713 | TFLOPs: 40.94 | 15: iteration 85870/ 125429 | consumed samples: 21982720 | consumed tokens: 45020610560 | elapsed time per iteration (s): 1.06 | learning rate: 6.144E-05 | global batch size: 256 | lm loss: 1.960468E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.835 | TFLOPs: 39.80 | 15: iteration 85880/ 125429 | consumed samples: 21985280 | consumed tokens: 45025853440 | elapsed time per iteration (s): 1.03 | learning rate: 6.142E-05 | global batch size: 256 | lm loss: 1.920990E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.447 | TFLOPs: 40.89 | 15: iteration 85890/ 125429 | consumed samples: 21987840 | consumed tokens: 45031096320 | elapsed time per iteration (s): 1.05 | learning rate: 6.140E-05 | global batch size: 256 | lm loss: 1.924517E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.109 | TFLOPs: 40.18 | 15: iteration 85900/ 125429 | consumed samples: 21990400 | consumed tokens: 45036339200 | elapsed time per iteration (s): 1.03 | learning rate: 6.138E-05 | global batch size: 256 | lm loss: 1.947181E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.793 | TFLOPs: 40.95 | 15: iteration 85910/ 125429 | consumed samples: 21992960 | consumed tokens: 45041582080 | elapsed time per iteration (s): 1.05 | learning rate: 6.136E-05 | global batch size: 256 | lm loss: 1.918600E+00 | grad norm: 0.151 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.104 | TFLOPs: 40.17 | 15: iteration 85920/ 125429 | consumed samples: 21995520 | consumed tokens: 45046824960 | elapsed time per iteration (s): 1.04 | learning rate: 6.134E-05 | global batch size: 256 | lm loss: 1.950053E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.848 | TFLOPs: 40.79 | 15: iteration 85930/ 125429 | consumed samples: 21998080 | consumed tokens: 45052067840 | elapsed time per iteration (s): 1.06 | learning rate: 6.132E-05 | global batch size: 256 | lm loss: 1.937389E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.914 | TFLOPs: 39.98 | 15: iteration 85940/ 125429 | consumed samples: 22000640 | consumed tokens: 45057310720 | elapsed time per iteration (s): 1.08 | learning rate: 6.130E-05 | global batch size: 256 | lm loss: 1.922072E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.912 | TFLOPs: 39.15 | 15: iteration 85950/ 125429 | consumed samples: 22003200 | consumed tokens: 45062553600 | elapsed time per iteration (s): 1.05 | learning rate: 6.128E-05 | global batch size: 256 | lm loss: 1.935407E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.095 | TFLOPs: 40.34 | 15: iteration 85960/ 125429 | consumed samples: 22005760 | consumed tokens: 45067796480 | elapsed time per iteration (s): 1.04 | learning rate: 6.126E-05 | global batch size: 256 | lm loss: 1.920596E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.783 | TFLOPs: 40.78 | 15: iteration 85970/ 125429 | consumed samples: 22008320 | consumed tokens: 45073039360 | elapsed time per iteration (s): 1.04 | 
learning rate: 6.124E-05 | global batch size: 256 | lm loss: 1.952301E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.952 | TFLOPs: 40.65 | 15: iteration 85980/ 125429 | consumed samples: 22010880 | consumed tokens: 45078282240 | elapsed time per iteration (s): 1.05 | learning rate: 6.123E-05 | global batch size: 256 | lm loss: 1.913091E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.640 | TFLOPs: 40.43 | 15: iteration 85990/ 125429 | consumed samples: 22013440 | consumed tokens: 45083525120 | elapsed time per iteration (s): 1.03 | learning rate: 6.121E-05 | global batch size: 256 | lm loss: 1.895895E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.116 | TFLOPs: 41.00 | 0: [2022-11-26 21:31:18,655] [INFO] [logging.py:68:log_dist] [Rank 0] step=86000, skipped=0, lr=[6.118708946334324e-05, 6.118708946334324e-05, 6.118708946334324e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 86000/ 125429 | consumed samples: 22016000 | consumed tokens: 45088768000 | elapsed time per iteration (s): 1.13 | learning rate: 6.119E-05 | global batch size: 256 | lm loss: 1.915737E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.216 | TFLOPs: 37.55 | 0: steps: 86000 loss: 1.9382 iter time (s): 1.056 samples/sec: 242.428 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 86000 | lm loss value: 1.916203E+00 | lm loss PPL: 6.795105E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 86000 to checkpoints_1b5 0: [2022-11-26 21:31:19,118] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step86000 is begin to save! 0: [2022-11-26 21:31:19,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_01-model_00-model_states.pt... 0: [2022-11-26 21:31:19,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_01-model_00-model_states.pt. 0: [2022-11-26 21:31:19,401] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_03-model_00-model_states.pt... 0: [2022-11-26 21:31:19,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_03-model_00-model_states.pt. 0: [2022-11-26 21:31:19,512] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_04-model_00-model_states.pt... 0: [2022-11-26 21:31:19,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_04-model_00-model_states.pt. 0: [2022-11-26 21:31:19,629] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_05-model_00-model_states.pt... 0: [2022-11-26 21:31:19,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_05-model_00-model_states.pt. 0: [2022-11-26 21:31:19,739] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_06-model_00-model_states.pt... 0: [2022-11-26 21:31:19,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_06-model_00-model_states.pt. 0: [2022-11-26 21:31:19,843] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_07-model_00-model_states.pt... 0: [2022-11-26 21:31:19,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 21:31:19,955] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_08-model_00-model_states.pt... 0: [2022-11-26 21:31:20,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_08-model_00-model_states.pt. 0: [2022-11-26 21:31:20,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_09-model_00-model_states.pt... 0: [2022-11-26 21:31:20,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_09-model_00-model_states.pt. 0: [2022-11-26 21:31:20,166] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_10-model_00-model_states.pt... 0: [2022-11-26 21:31:20,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_10-model_00-model_states.pt. 0: [2022-11-26 21:31:20,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_11-model_00-model_states.pt... 0: [2022-11-26 21:31:20,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_11-model_00-model_states.pt. 0: [2022-11-26 21:31:20,379] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_12-model_00-model_states.pt... 0: [2022-11-26 21:31:20,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_12-model_00-model_states.pt. 0: [2022-11-26 21:31:20,489] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_13-model_00-model_states.pt... 0: [2022-11-26 21:31:20,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 21:31:20,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_14-model_00-model_states.pt... 0: [2022-11-26 21:31:20,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_14-model_00-model_states.pt. 0: [2022-11-26 21:31:20,713] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_15-model_00-model_states.pt... 0: [2022-11-26 21:31:20,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_15-model_00-model_states.pt. 0: [2022-11-26 21:31:20,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_16-model_00-model_states.pt... 0: [2022-11-26 21:31:20,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_16-model_00-model_states.pt. 0: [2022-11-26 21:31:20,939] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_17-model_00-model_states.pt... 0: [2022-11-26 21:31:21,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_17-model_00-model_states.pt. 0: [2022-11-26 21:31:21,054] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_18-model_00-model_states.pt... 0: [2022-11-26 21:31:21,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_18-model_00-model_states.pt. 0: [2022-11-26 21:31:21,167] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_19-model_00-model_states.pt... 0: [2022-11-26 21:31:21,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 21:31:21,281] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_20-model_00-model_states.pt... 0: [2022-11-26 21:31:21,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_20-model_00-model_states.pt. 0: [2022-11-26 21:31:21,391] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_21-model_00-model_states.pt... 0: [2022-11-26 21:31:21,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_21-model_00-model_states.pt. 0: [2022-11-26 21:31:21,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_22-model_00-model_states.pt... 0: [2022-11-26 21:31:21,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_22-model_00-model_states.pt. 0: [2022-11-26 21:31:21,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_23-model_00-model_states.pt... 0: [2022-11-26 21:31:21,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_23-model_00-model_states.pt. 0: [2022-11-26 21:31:21,718] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_24-model_00-model_states.pt... 0: [2022-11-26 21:31:21,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_24-model_00-model_states.pt. 0: [2022-11-26 21:31:21,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_25-model_00-model_states.pt... 0: [2022-11-26 21:31:21,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 21:31:21,936] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_26-model_00-model_states.pt... 0: [2022-11-26 21:31:22,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_26-model_00-model_states.pt. 0: [2022-11-26 21:31:22,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_27-model_00-model_states.pt... 0: [2022-11-26 21:31:22,157] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_27-model_00-model_states.pt. 0: [2022-11-26 21:31:22,157] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_28-model_00-model_states.pt... 0: [2022-11-26 21:31:22,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_28-model_00-model_states.pt. 0: [2022-11-26 21:31:22,267] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_29-model_00-model_states.pt... 0: [2022-11-26 21:31:22,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_29-model_00-model_states.pt. 0: [2022-11-26 21:31:22,380] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_30-model_00-model_states.pt... 0: [2022-11-26 21:31:22,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_30-model_00-model_states.pt. 0: [2022-11-26 21:31:22,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/layer_32-model_00-model_states.pt... 0: [2022-11-26 21:31:22,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 21:31:22,493] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step86000/mp_rank_00_model_states.pt 0: [2022-11-26 21:31:22,493] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/mp_rank_00_model_states.pt... 0: [2022-11-26 21:31:22,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/mp_rank_00_model_states.pt. 0: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:31:22,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:31:22,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:31:22,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-26 21:31:22,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:31:22,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:31:22,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
9: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 
7: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
13: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 
12: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
8: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:31:22,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:31:22,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
3: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
1: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:31:22,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step86000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:31:22,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 21:31:22,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 21:31:22,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 9: [2022-11-26 21:31:22,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:31:22,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:31:22,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 2: [2022-11-26 21:31:22,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:31:22,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 2: [2022-11-26 21:31:22,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 21:31:22,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 15: [2022-11-26 21:31:22,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 21:31:22,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 15: [2022-11-26 21:31:22,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 21:31:22,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 21:31:22,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 15: [2022-11-26 21:31:22,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:31:22,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 21:31:22,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 11: [2022-11-26 21:31:22,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:31:22,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:31:22,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 21:31:22,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 7: [2022-11-26 21:31:22,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:31:22,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 21:31:22,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
9: [2022-11-26 21:31:22,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:31:22,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 21:31:22,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 11: [2022-11-26 21:31:22,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 21:31:22,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 6: [2022-11-26 21:31:22,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:31:22,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 21:31:22,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 9: [2022-11-26 21:31:22,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:31:22,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
9: [2022-11-26 21:31:22,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 5: [2022-11-26 21:31:22,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 9: [2022-11-26 21:31:22,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 5: [2022-11-26 21:31:22,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 0: [2022-11-26 21:31:22,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:31:22,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:31:22,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 21:31:22,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 4: [2022-11-26 21:31:22,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:31:22,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 21:31:22,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 4: [2022-11-26 21:31:22,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 21:31:22,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:31:22,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 21:31:22,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 1: [2022-11-26 21:31:22,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:31:22,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:31:22,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 4: [2022-11-26 21:31:22,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 0: [2022-11-26 21:31:22,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:31:22,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 21:31:22,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 6: [2022-11-26 21:31:22,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-26 21:31:22,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 21:31:22,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 7: [2022-11-26 21:31:22,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:31:22,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:31:22,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 21:31:22,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 2: [2022-11-26 21:31:22,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 21:31:22,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 5: [2022-11-26 21:31:22,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:31:22,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 21:31:22,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 15: [2022-11-26 21:31:22,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 21:31:22,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 21:31:22,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:31:22,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 15: [2022-11-26 21:31:22,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 21:31:22,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 12: [2022-11-26 21:31:22,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:31:22,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 21:31:22,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 12: [2022-11-26 21:31:22,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:31:22,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 21:31:22,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 5: [2022-11-26 21:31:22,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 21:31:22,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 21:31:22,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 2: [2022-11-26 21:31:22,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:31:22,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 21:31:22,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 15: [2022-11-26 21:31:22,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:31:22,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 21:31:22,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 2: [2022-11-26 21:31:22,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:31:22,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 21:31:22,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 8: [2022-11-26 21:31:22,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 21:31:22,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:31:22,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:31:22,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 21:31:22,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 21:31:22,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 21:31:22,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 8: [2022-11-26 21:31:22,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 8: [2022-11-26 21:31:22,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 8: [2022-11-26 21:31:22,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:31:22,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 21:31:22,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 14: [2022-11-26 21:31:22,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 21:31:22,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 21:31:22,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 9: [2022-11-26 21:31:22,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:31:22,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 21:31:22,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 12: [2022-11-26 21:31:22,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:31:22,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:31:22,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 21:31:22,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 21:31:22,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 12: [2022-11-26 21:31:22,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 4: [2022-11-26 21:31:22,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 21:31:22,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 21:31:22,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 11: [2022-11-26 21:31:22,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:31:22,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:31:22,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:31:22,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 21:31:22,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 11: [2022-11-26 21:31:22,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 21:31:22,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 21:31:22,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 11: [2022-11-26 21:31:22,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 12: [2022-11-26 21:31:22,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 21:31:22,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 21:31:22,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 11: [2022-11-26 21:31:22,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:31:22,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 21:31:22,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 13: [2022-11-26 21:31:22,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:31:22,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 21:31:22,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 12: [2022-11-26 21:31:22,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:31:22,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 21:31:22,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 2: [2022-11-26 21:31:22,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-26 21:31:22,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 21:31:22,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:31:22,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:31:22,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 2: [2022-11-26 21:31:22,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 4: [2022-11-26 21:31:22,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 21:31:22,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 2: [2022-11-26 21:31:22,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 8: [2022-11-26 21:31:22,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:31:22,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 21:31:22,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 14: [2022-11-26 21:31:22,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 21:31:22,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:31:22,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:31:22,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 21:31:22,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 21:31:22,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 21:31:22,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 14: [2022-11-26 21:31:22,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 14: [2022-11-26 21:31:22,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 5: [2022-11-26 21:31:22,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:31:22,724] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 21:31:22,724] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 6: [2022-11-26 21:31:22,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-26 21:31:22,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 21:31:22,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 6: [2022-11-26 21:31:22,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:31:22,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 21:31:22,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 9: [2022-11-26 21:31:22,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:31:22,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 21:31:22,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 12: [2022-11-26 21:31:22,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:31:22,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 21:31:22,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 13: [2022-11-26 21:31:22,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
8: [2022-11-26 21:31:22,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:31:22,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 21:31:22,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 0: [2022-11-26 21:31:22,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:31:22,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 9: [2022-11-26 21:31:22,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:31:22,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 9: [2022-11-26 21:31:22,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 21:31:22,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 0: [2022-11-26 21:31:22,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:31:22,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 21:31:22,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
9: [2022-11-26 21:31:22,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:31:22,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 21:31:22,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 10: [2022-11-26 21:31:22,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:31:22,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:31:22,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 21:31:22,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 21:31:22,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 10: [2022-11-26 21:31:22,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 13: [2022-11-26 21:31:22,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 21:31:22,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 13: [2022-11-26 21:31:22,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 21:31:22,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 21:31:22,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 4: [2022-11-26 21:31:22,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:31:22,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:31:22,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:31:22,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 21:31:22,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 21:31:22,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 13: [2022-11-26 21:31:22,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 8: [2022-11-26 21:31:22,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:31:22,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 21:31:22,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 21:31:22,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 13: [2022-11-26 21:31:22,733] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:31:22,733] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 21:31:22,733] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 8: [2022-11-26 21:31:22,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 4: [2022-11-26 21:31:22,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 21:31:22,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:31:22,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 4: [2022-11-26 21:31:22,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 21:31:22,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 8: [2022-11-26 21:31:22,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
11: [2022-11-26 21:31:22,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:31:22,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:31:22,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 21:31:22,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 1: [2022-11-26 21:31:22,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 21:31:22,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 21:31:22,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 1: [2022-11-26 21:31:22,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 1: [2022-11-26 21:31:22,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:31:22,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 21:31:22,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 1: [2022-11-26 21:31:22,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 21:31:22,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:31:22,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 13: [2022-11-26 21:31:22,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:31:22,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 21:31:22,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 1: [2022-11-26 21:31:22,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 13: [2022-11-26 21:31:22,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 21:31:22,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 1: [2022-11-26 21:31:22,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:31:22,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:31:22,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 21:31:22,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 21:31:22,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 21:31:22,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 21:31:22,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 5: [2022-11-26 21:31:22,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:31:22,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:31:22,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 15: [2022-11-26 21:31:22,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 5: [2022-11-26 21:31:22,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 15: [2022-11-26 21:31:22,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 7: [2022-11-26 21:31:22,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:31:22,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
7: [2022-11-26 21:31:22,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 21:31:22,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 5: [2022-11-26 21:31:22,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 21:31:22,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 7: [2022-11-26 21:31:22,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:31:22,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 21:31:22,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 5: [2022-11-26 21:31:22,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:31:22,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 21:31:22,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 15: [2022-11-26 21:31:22,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-26 21:31:22,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 21:31:22,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 12: [2022-11-26 21:31:22,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:31:22,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 21:31:22,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 7: [2022-11-26 21:31:22,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:31:22,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 21:31:22,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 11: [2022-11-26 21:31:22,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 21:31:22,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 11: [2022-11-26 21:31:22,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 21:31:22,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 21:31:22,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 7: [2022-11-26 21:31:22,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:31:22,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 21:31:22,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 10: [2022-11-26 21:31:22,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:31:22,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 21:31:22,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 11: [2022-11-26 21:31:22,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:31:22,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 21:31:22,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 11: [2022-11-26 21:31:22,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 21:31:22,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 21:31:22,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 10: [2022-11-26 21:31:22,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:31:22,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:31:22,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 21:31:22,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 10: [2022-11-26 21:31:22,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 21:31:22,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 14: [2022-11-26 21:31:22,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:31:22,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 21:31:22,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 14: [2022-11-26 21:31:22,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-26 21:31:22,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:31:22,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:31:22,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 8: [2022-11-26 21:31:22,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 14: [2022-11-26 21:31:22,748] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 8: [2022-11-26 21:31:22,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 14: [2022-11-26 21:31:22,748] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 21:31:22,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 10: [2022-11-26 21:31:22,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:31:22,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 21:31:22,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 7: [2022-11-26 21:31:22,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 21:31:22,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 21:31:22,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 1: [2022-11-26 21:31:22,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 1: [2022-11-26 21:31:22,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 6: [2022-11-26 21:31:22,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:31:22,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:31:22,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 21:31:22,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 10: [2022-11-26 21:31:22,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 21:31:22,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 10: [2022-11-26 21:31:22,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:31:22,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 21:31:22,753] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
3: [2022-11-26 21:31:22,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:31:22,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:31:22,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:31:22,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:31:22,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:31:22,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:31:22,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:31:22,753] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 21:31:22,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 21:31:22,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 21:31:22,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 21:31:22,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 21:31:22,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 21:31:22,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 21:31:22,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 21:31:22,753] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 21:31:22,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 3: [2022-11-26 21:31:22,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 3: [2022-11-26 21:31:22,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 3: [2022-11-26 21:31:22,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
3: [2022-11-26 21:31:22,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 3: [2022-11-26 21:31:22,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 3: [2022-11-26 21:31:22,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 3: [2022-11-26 21:31:22,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 7: [2022-11-26 21:31:22,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:31:22,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 21:31:22,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 6: [2022-11-26 21:31:22,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:31:22,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 21:31:22,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 6: [2022-11-26 21:31:22,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:31:22,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 21:31:22,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
6: [2022-11-26 21:31:22,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:31:22,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 21:31:22,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 2: [2022-11-26 21:31:22,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:31:22,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 21:31:22,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 0: [2022-11-26 21:31:22,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 21:31:22,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 5: [2022-11-26 21:31:22,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:31:22,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 21:31:22,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 0: [2022-11-26 21:31:22,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-26 21:31:22,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:31:22,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:31:22,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 21:31:22,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 21:31:22,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 0: [2022-11-26 21:31:22,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step86000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 21:31:22,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 0: [2022-11-26 21:31:22,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step86000 is ready now! 
0: successfully saved checkpoint at iteration 86000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3744.15 15: iteration 86010/ 125429 | consumed samples: 22018560 | consumed tokens: 45094010880 | elapsed time per iteration (s): 1.48 | learning rate: 6.117E-05 | global batch size: 256 | lm loss: 1.932418E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 172.480 | TFLOPs: 28.50 | 15: iteration 86020/ 125429 | consumed samples: 22021120 | consumed tokens: 45099253760 | elapsed time per iteration (s): 1.02 | learning rate: 6.115E-05 | global batch size: 256 | lm loss: 1.910925E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.964 | TFLOPs: 41.47 | 15: iteration 86030/ 125429 | consumed samples: 22023680 | consumed tokens: 45104496640 | elapsed time per iteration (s): 1.08 | learning rate: 6.113E-05 | global batch size: 256 | lm loss: 1.921182E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.416 | TFLOPs: 39.07 | 15: iteration 86040/ 125429 | consumed samples: 22026240 | consumed tokens: 45109739520 | elapsed time per iteration (s): 1.04 | learning rate: 6.111E-05 | global batch size: 256 | lm loss: 1.926917E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.743 | TFLOPs: 40.61 | 15: iteration 86050/ 125429 | consumed samples: 22028800 | consumed tokens: 45114982400 | elapsed time per iteration (s): 1.03 | learning rate: 6.109E-05 | global batch size: 256 | lm loss: 1.927014E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.492 | TFLOPs: 40.90 | 15: iteration 86060/ 125429 | consumed samples: 22031360 | consumed tokens: 45120225280 | elapsed time per iteration (s): 1.04 | 
learning rate: 6.107E-05 | global batch size: 256 | lm loss: 1.923412E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.090 | TFLOPs: 40.67 | 15: iteration 86070/ 125429 | consumed samples: 22033920 | consumed tokens: 45125468160 | elapsed time per iteration (s): 1.08 | learning rate: 6.105E-05 | global batch size: 256 | lm loss: 1.949492E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.807 | TFLOPs: 39.30 | 15: iteration 86080/ 125429 | consumed samples: 22036480 | consumed tokens: 45130711040 | elapsed time per iteration (s): 1.04 | learning rate: 6.103E-05 | global batch size: 256 | lm loss: 1.900786E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.120 | TFLOPs: 40.51 | 15: iteration 86090/ 125429 | consumed samples: 22039040 | consumed tokens: 45135953920 | elapsed time per iteration (s): 1.08 | learning rate: 6.102E-05 | global batch size: 256 | lm loss: 1.910366E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.190 | TFLOPs: 39.20 | 15: iteration 86100/ 125429 | consumed samples: 22041600 | consumed tokens: 45141196800 | elapsed time per iteration (s): 1.07 | learning rate: 6.100E-05 | global batch size: 256 | lm loss: 1.945591E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.306 | TFLOPs: 39.71 | 15: iteration 86110/ 125429 | consumed samples: 22044160 | consumed tokens: 45146439680 | elapsed time per iteration (s): 1.08 | learning rate: 6.098E-05 | global batch size: 256 | lm loss: 1.939258E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.008 | TFLOPs: 39.33 | 15: iteration 86120/ 
125429 | consumed samples: 22046720 | consumed tokens: 45151682560 | elapsed time per iteration (s): 1.07 | learning rate: 6.096E-05 | global batch size: 256 | lm loss: 1.918618E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.419 | TFLOPs: 39.40 | 15: iteration 86130/ 125429 | consumed samples: 22049280 | consumed tokens: 45156925440 | elapsed time per iteration (s): 1.06 | learning rate: 6.094E-05 | global batch size: 256 | lm loss: 1.948516E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.069 | TFLOPs: 40.00 | 15: iteration 86140/ 125429 | consumed samples: 22051840 | consumed tokens: 45162168320 | elapsed time per iteration (s): 1.05 | learning rate: 6.092E-05 | global batch size: 256 | lm loss: 1.954554E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.258 | TFLOPs: 40.37 | 15: iteration 86150/ 125429 | consumed samples: 22054400 | consumed tokens: 45167411200 | elapsed time per iteration (s): 1.03 | learning rate: 6.090E-05 | global batch size: 256 | lm loss: 1.917302E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.397 | TFLOPs: 40.88 | 15: iteration 86160/ 125429 | consumed samples: 22056960 | consumed tokens: 45172654080 | elapsed time per iteration (s): 1.08 | learning rate: 6.088E-05 | global batch size: 256 | lm loss: 1.935994E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.455 | TFLOPs: 39.08 | 15: iteration 86170/ 125429 | consumed samples: 22059520 | consumed tokens: 45177896960 | elapsed time per iteration (s): 1.07 | learning rate: 6.086E-05 | global batch size: 256 | lm loss: 1.923738E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 240.130 | TFLOPs: 39.68 | 15: iteration 86180/ 125429 | consumed samples: 22062080 | consumed tokens: 45183139840 | elapsed time per iteration (s): 1.06 | learning rate: 6.084E-05 | global batch size: 256 | lm loss: 1.939940E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.750 | TFLOPs: 39.95 | 15: iteration 86190/ 125429 | consumed samples: 22064640 | consumed tokens: 45188382720 | elapsed time per iteration (s): 1.05 | learning rate: 6.082E-05 | global batch size: 256 | lm loss: 1.935616E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.344 | TFLOPs: 40.21 | 15: iteration 86200/ 125429 | consumed samples: 22067200 | consumed tokens: 45193625600 | elapsed time per iteration (s): 1.05 | learning rate: 6.081E-05 | global batch size: 256 | lm loss: 1.935043E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.215 | TFLOPs: 40.36 | 15: iteration 86210/ 125429 | consumed samples: 22069760 | consumed tokens: 45198868480 | elapsed time per iteration (s): 1.05 | learning rate: 6.079E-05 | global batch size: 256 | lm loss: 1.937836E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.578 | TFLOPs: 40.25 | 15: iteration 86220/ 125429 | consumed samples: 22072320 | consumed tokens: 45204111360 | elapsed time per iteration (s): 1.36 | learning rate: 6.077E-05 | global batch size: 256 | lm loss: 1.923424E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 187.552 | TFLOPs: 30.99 | 15: iteration 86230/ 125429 | consumed samples: 22074880 | consumed tokens: 45209354240 | elapsed time per iteration (s): 1.02 | learning rate: 
6.075E-05 | global batch size: 256 | lm loss: 1.912013E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.871 | TFLOPs: 41.46 | 15: iteration 86240/ 125429 | consumed samples: 22077440 | consumed tokens: 45214597120 | elapsed time per iteration (s): 1.07 | learning rate: 6.073E-05 | global batch size: 256 | lm loss: 1.913399E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.106 | TFLOPs: 39.51 | 15: iteration 86250/ 125429 | consumed samples: 22080000 | consumed tokens: 45219840000 | elapsed time per iteration (s): 1.03 | learning rate: 6.071E-05 | global batch size: 256 | lm loss: 1.951875E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.644 | TFLOPs: 41.09 | 15: iteration 86260/ 125429 | consumed samples: 22082560 | consumed tokens: 45225082880 | elapsed time per iteration (s): 1.08 | learning rate: 6.069E-05 | global batch size: 256 | lm loss: 1.914721E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.952 | TFLOPs: 39.32 | 15: iteration 86270/ 125429 | consumed samples: 22085120 | consumed tokens: 45230325760 | elapsed time per iteration (s): 1.06 | learning rate: 6.067E-05 | global batch size: 256 | lm loss: 1.946408E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.321 | TFLOPs: 39.88 | 15: iteration 86280/ 125429 | consumed samples: 22087680 | consumed tokens: 45235568640 | elapsed time per iteration (s): 1.04 | learning rate: 6.065E-05 | global batch size: 256 | lm loss: 1.955143E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.030 | TFLOPs: 40.82 | 15: iteration 86290/ 125429 | 
consumed samples: 22090240 | consumed tokens: 45240811520 | elapsed time per iteration (s): 1.04 | learning rate: 6.063E-05 | global batch size: 256 | lm loss: 1.966768E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.778 | TFLOPs: 40.78 | 15: iteration 86300/ 125429 | consumed samples: 22092800 | consumed tokens: 45246054400 | elapsed time per iteration (s): 1.03 | learning rate: 6.061E-05 | global batch size: 256 | lm loss: 1.903758E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.947 | TFLOPs: 40.98 | 15: iteration 86310/ 125429 | consumed samples: 22095360 | consumed tokens: 45251297280 | elapsed time per iteration (s): 1.05 | learning rate: 6.060E-05 | global batch size: 256 | lm loss: 1.943593E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.492 | TFLOPs: 40.24 | 15: iteration 86320/ 125429 | consumed samples: 22097920 | consumed tokens: 45256540160 | elapsed time per iteration (s): 1.03 | learning rate: 6.058E-05 | global batch size: 256 | lm loss: 1.921490E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.963 | TFLOPs: 40.98 | 15: iteration 86330/ 125429 | consumed samples: 22100480 | consumed tokens: 45261783040 | elapsed time per iteration (s): 1.05 | learning rate: 6.056E-05 | global batch size: 256 | lm loss: 1.946588E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.829 | TFLOPs: 40.29 | 15: iteration 86340/ 125429 | consumed samples: 22103040 | consumed tokens: 45267025920 | elapsed time per iteration (s): 1.04 | learning rate: 6.054E-05 | global batch size: 256 | lm loss: 1.936268E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 247.025 | TFLOPs: 40.82 | 15: iteration 86350/ 125429 | consumed samples: 22105600 | consumed tokens: 45272268800 | elapsed time per iteration (s): 1.05 | learning rate: 6.052E-05 | global batch size: 256 | lm loss: 1.920526E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.512 | TFLOPs: 40.24 | 15: iteration 86360/ 125429 | consumed samples: 22108160 | consumed tokens: 45277511680 | elapsed time per iteration (s): 1.04 | learning rate: 6.050E-05 | global batch size: 256 | lm loss: 1.930759E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.899 | TFLOPs: 40.80 | 15: iteration 86370/ 125429 | consumed samples: 22110720 | consumed tokens: 45282754560 | elapsed time per iteration (s): 1.04 | learning rate: 6.048E-05 | global batch size: 256 | lm loss: 1.933515E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.651 | TFLOPs: 40.60 | 15: iteration 86380/ 125429 | consumed samples: 22113280 | consumed tokens: 45287997440 | elapsed time per iteration (s): 1.03 | learning rate: 6.046E-05 | global batch size: 256 | lm loss: 1.926940E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.988 | TFLOPs: 40.98 | 15: iteration 86390/ 125429 | consumed samples: 22115840 | consumed tokens: 45293240320 | elapsed time per iteration (s): 1.06 | learning rate: 6.044E-05 | global batch size: 256 | lm loss: 1.921727E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.054 | TFLOPs: 39.84 | 15: iteration 86400/ 125429 | consumed samples: 22118400 | consumed tokens: 45298483200 | elapsed time per iteration (s): 1.08 | learning rate: 6.042E-05 | global batch 
size: 256 | lm loss: 1.959200E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.085 | TFLOPs: 39.18 | 15: iteration 86410/ 125429 | consumed samples: 22120960 | consumed tokens: 45303726080 | elapsed time per iteration (s): 1.02 | learning rate: 6.041E-05 | global batch size: 256 | lm loss: 1.951866E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.546 | TFLOPs: 41.57 | 15: iteration 86420/ 125429 | consumed samples: 22123520 | consumed tokens: 45308968960 | elapsed time per iteration (s): 1.04 | learning rate: 6.039E-05 | global batch size: 256 | lm loss: 1.932561E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.664 | TFLOPs: 40.60 | 15: iteration 86430/ 125429 | consumed samples: 22126080 | consumed tokens: 45314211840 | elapsed time per iteration (s): 1.07 | learning rate: 6.037E-05 | global batch size: 256 | lm loss: 1.944964E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.322 | TFLOPs: 39.72 | 15: iteration 86440/ 125429 | consumed samples: 22128640 | consumed tokens: 45319454720 | elapsed time per iteration (s): 1.04 | learning rate: 6.035E-05 | global batch size: 256 | lm loss: 1.955103E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.288 | TFLOPs: 40.54 | 15: iteration 86450/ 125429 | consumed samples: 22131200 | consumed tokens: 45324697600 | elapsed time per iteration (s): 1.06 | learning rate: 6.033E-05 | global batch size: 256 | lm loss: 1.892367E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.853 | TFLOPs: 39.97 | 15: iteration 86460/ 125429 | consumed samples: 22133760 | 
consumed tokens: 45329940480 | elapsed time per iteration (s): 1.05 | learning rate: 6.031E-05 | global batch size: 256 | lm loss: 1.910396E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.844 | TFLOPs: 40.30 | 15: iteration 86470/ 125429 | consumed samples: 22136320 | consumed tokens: 45335183360 | elapsed time per iteration (s): 1.05 | learning rate: 6.029E-05 | global batch size: 256 | lm loss: 1.915914E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.937 | TFLOPs: 40.48 | 15: iteration 86480/ 125429 | consumed samples: 22138880 | consumed tokens: 45340426240 | elapsed time per iteration (s): 1.04 | learning rate: 6.027E-05 | global batch size: 256 | lm loss: 1.951090E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.855 | TFLOPs: 40.63 | 15: iteration 86490/ 125429 | consumed samples: 22141440 | consumed tokens: 45345669120 | elapsed time per iteration (s): 1.05 | learning rate: 6.025E-05 | global batch size: 256 | lm loss: 1.936729E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.913 | TFLOPs: 40.31 | 15: iteration 86500/ 125429 | consumed samples: 22144000 | consumed tokens: 45350912000 | elapsed time per iteration (s): 1.03 | learning rate: 6.023E-05 | global batch size: 256 | lm loss: 1.941033E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.386 | TFLOPs: 40.88 | 15: iteration 86510/ 125429 | consumed samples: 22146560 | consumed tokens: 45356154880 | elapsed time per iteration (s): 1.05 | learning rate: 6.022E-05 | global batch size: 256 | lm loss: 1.928600E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 244.624 | TFLOPs: 40.43 | 15: iteration 86520/ 125429 | consumed samples: 22149120 | consumed tokens: 45361397760 | elapsed time per iteration (s): 1.05 | learning rate: 6.020E-05 | global batch size: 256 | lm loss: 1.946265E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.185 | TFLOPs: 40.19 | 15: iteration 86530/ 125429 | consumed samples: 22151680 | consumed tokens: 45366640640 | elapsed time per iteration (s): 1.05 | learning rate: 6.018E-05 | global batch size: 256 | lm loss: 1.933267E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.222 | TFLOPs: 40.19 | 15: iteration 86540/ 125429 | consumed samples: 22154240 | consumed tokens: 45371883520 | elapsed time per iteration (s): 1.03 | learning rate: 6.016E-05 | global batch size: 256 | lm loss: 1.928812E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.655 | TFLOPs: 41.09 | 15: iteration 86550/ 125429 | consumed samples: 22156800 | consumed tokens: 45377126400 | elapsed time per iteration (s): 1.04 | learning rate: 6.014E-05 | global batch size: 256 | lm loss: 1.941635E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.978 | TFLOPs: 40.81 | 15: iteration 86560/ 125429 | consumed samples: 22159360 | consumed tokens: 45382369280 | elapsed time per iteration (s): 1.04 | learning rate: 6.012E-05 | global batch size: 256 | lm loss: 1.938064E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.066 | TFLOPs: 40.66 | 15: iteration 86570/ 125429 | consumed samples: 22161920 | consumed tokens: 45387612160 | elapsed time per iteration (s): 1.03 | learning rate: 6.010E-05 | global batch size: 256 | lm loss: 
1.936252E+00 | grad norm: 0.137 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.361 | TFLOPs: 41.21 | 15: iteration 86580/ 125429 | consumed samples: 22164480 | consumed tokens: 45392855040 | elapsed time per iteration (s): 1.03 | learning rate: 6.008E-05 | global batch size: 256 | lm loss: 1.912803E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.871 | TFLOPs: 41.13 | 15: iteration 86590/ 125429 | consumed samples: 22167040 | consumed tokens: 45398097920 | elapsed time per iteration (s): 1.05 | learning rate: 6.006E-05 | global batch size: 256 | lm loss: 1.931777E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.016 | TFLOPs: 40.16 | 15: iteration 86600/ 125429 | consumed samples: 22169600 | consumed tokens: 45403340800 | elapsed time per iteration (s): 1.02 | learning rate: 6.004E-05 | global batch size: 256 | lm loss: 1.955286E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.902 | TFLOPs: 41.46 | 15: iteration 86610/ 125429 | consumed samples: 22172160 | consumed tokens: 45408583680 | elapsed time per iteration (s): 1.07 | learning rate: 6.003E-05 | global batch size: 256 | lm loss: 1.924581E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.365 | TFLOPs: 39.39 | 15: iteration 86620/ 125429 | consumed samples: 22174720 | consumed tokens: 45413826560 | elapsed time per iteration (s): 1.02 | learning rate: 6.001E-05 | global batch size: 256 | lm loss: 1.957558E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.845 | TFLOPs: 41.29 | 15: iteration 86630/ 125429 | consumed samples: 22177280 | consumed tokens: 
45419069440 | elapsed time per iteration (s): 1.05 | learning rate: 5.999E-05 | global batch size: 256 | lm loss: 1.932418E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.694 | TFLOPs: 40.11 | 15: iteration 86640/ 125429 | consumed samples: 22179840 | consumed tokens: 45424312320 | elapsed time per iteration (s): 1.06 | learning rate: 5.997E-05 | global batch size: 256 | lm loss: 1.921563E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.944 | TFLOPs: 39.82 | 15: iteration 86650/ 125429 | consumed samples: 22182400 | consumed tokens: 45429555200 | elapsed time per iteration (s): 1.04 | learning rate: 5.995E-05 | global batch size: 256 | lm loss: 1.935501E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.015 | TFLOPs: 40.66 | 15: iteration 86660/ 125429 | consumed samples: 22184960 | consumed tokens: 45434798080 | elapsed time per iteration (s): 1.04 | learning rate: 5.993E-05 | global batch size: 256 | lm loss: 1.945124E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.828 | TFLOPs: 40.63 | 15: iteration 86670/ 125429 | consumed samples: 22187520 | consumed tokens: 45440040960 | elapsed time per iteration (s): 1.07 | learning rate: 5.991E-05 | global batch size: 256 | lm loss: 1.936063E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.996 | TFLOPs: 39.66 | 15: iteration 86680/ 125429 | consumed samples: 22190080 | consumed tokens: 45445283840 | elapsed time per iteration (s): 1.04 | learning rate: 5.989E-05 | global batch size: 256 | lm loss: 1.903417E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 246.970 | TFLOPs: 40.81 | 15: iteration 86690/ 125429 | consumed samples: 22192640 | consumed tokens: 45450526720 | elapsed time per iteration (s): 1.11 | learning rate: 5.987E-05 | global batch size: 256 | lm loss: 1.945206E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.278 | TFLOPs: 38.22 | 15: iteration 86700/ 125429 | consumed samples: 22195200 | consumed tokens: 45455769600 | elapsed time per iteration (s): 1.05 | learning rate: 5.986E-05 | global batch size: 256 | lm loss: 1.957615E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.026 | TFLOPs: 40.16 | 15: iteration 86710/ 125429 | consumed samples: 22197760 | consumed tokens: 45461012480 | elapsed time per iteration (s): 1.05 | learning rate: 5.984E-05 | global batch size: 256 | lm loss: 1.951924E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.785 | TFLOPs: 40.45 | 15: iteration 86720/ 125429 | consumed samples: 22200320 | consumed tokens: 45466255360 | elapsed time per iteration (s): 1.20 | learning rate: 5.982E-05 | global batch size: 256 | lm loss: 1.937994E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.069 | TFLOPs: 35.38 | 15: iteration 86730/ 125429 | consumed samples: 22202880 | consumed tokens: 45471498240 | elapsed time per iteration (s): 1.06 | learning rate: 5.980E-05 | global batch size: 256 | lm loss: 1.933895E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.966 | TFLOPs: 39.82 | 15: iteration 86740/ 125429 | consumed samples: 22205440 | consumed tokens: 45476741120 | elapsed time per iteration (s): 1.05 | learning rate: 5.978E-05 | global batch size: 256 | lm loss: 1.941268E+00 | grad 
norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.901 | TFLOPs: 40.14 | 15: iteration 86750/ 125429 | consumed samples: 22208000 | consumed tokens: 45481984000 | elapsed time per iteration (s): 1.04 | learning rate: 5.976E-05 | global batch size: 256 | lm loss: 1.901757E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.615 | TFLOPs: 40.59 | 15: iteration 86760/ 125429 | consumed samples: 22210560 | consumed tokens: 45487226880 | elapsed time per iteration (s): 1.03 | learning rate: 5.974E-05 | global batch size: 256 | lm loss: 1.938957E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.458 | TFLOPs: 40.89 | 15: iteration 86770/ 125429 | consumed samples: 22213120 | consumed tokens: 45492469760 | elapsed time per iteration (s): 1.04 | learning rate: 5.972E-05 | global batch size: 256 | lm loss: 1.926978E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.980 | TFLOPs: 40.48 | 15: iteration 86780/ 125429 | consumed samples: 22215680 | consumed tokens: 45497712640 | elapsed time per iteration (s): 1.04 | learning rate: 5.970E-05 | global batch size: 256 | lm loss: 1.967687E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.712 | TFLOPs: 40.61 | 15: iteration 86790/ 125429 | consumed samples: 22218240 | consumed tokens: 45502955520 | elapsed time per iteration (s): 1.04 | learning rate: 5.969E-05 | global batch size: 256 | lm loss: 1.942595E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.890 | TFLOPs: 40.80 | 15: iteration 86800/ 125429 | consumed samples: 22220800 | consumed tokens: 45508198400 | elapsed time 
per iteration (s): 1.06 | learning rate: 5.967E-05 | global batch size: 256 | lm loss: 1.909679E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.823 | TFLOPs: 39.80 | 15: iteration 86810/ 125429 | consumed samples: 22223360 | consumed tokens: 45513441280 | elapsed time per iteration (s): 1.07 | learning rate: 5.965E-05 | global batch size: 256 | lm loss: 1.944834E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.438 | TFLOPs: 39.57 | 15: iteration 86820/ 125429 | consumed samples: 22225920 | consumed tokens: 45518684160 | elapsed time per iteration (s): 1.05 | learning rate: 5.963E-05 | global batch size: 256 | lm loss: 1.939768E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.725 | TFLOPs: 40.28 | 15: iteration 86830/ 125429 | consumed samples: 22228480 | consumed tokens: 45523927040 | elapsed time per iteration (s): 1.04 | learning rate: 5.961E-05 | global batch size: 256 | lm loss: 1.929978E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.180 | TFLOPs: 40.68 | 15: iteration 86840/ 125429 | consumed samples: 22231040 | consumed tokens: 45529169920 | elapsed time per iteration (s): 1.04 | learning rate: 5.959E-05 | global batch size: 256 | lm loss: 1.906427E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.335 | TFLOPs: 40.54 | 15: iteration 86850/ 125429 | consumed samples: 22233600 | consumed tokens: 45534412800 | elapsed time per iteration (s): 1.05 | learning rate: 5.957E-05 | global batch size: 256 | lm loss: 1.936182E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.469 | TFLOPs: 
40.24 | 15: iteration 86860/ 125429 | consumed samples: 22236160 | consumed tokens: 45539655680 | elapsed time per iteration (s): 1.05 | learning rate: 5.955E-05 | global batch size: 256 | lm loss: 1.929819E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.834 | TFLOPs: 40.46 | 15: iteration 86870/ 125429 | consumed samples: 22238720 | consumed tokens: 45544898560 | elapsed time per iteration (s): 1.03 | learning rate: 5.953E-05 | global batch size: 256 | lm loss: 1.933487E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.435 | TFLOPs: 41.22 | 15: iteration 86880/ 125429 | consumed samples: 22241280 | consumed tokens: 45550141440 | elapsed time per iteration (s): 1.04 | learning rate: 5.952E-05 | global batch size: 256 | lm loss: 1.908516E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.804 | TFLOPs: 40.79 | 15: iteration 86890/ 125429 | consumed samples: 22243840 | consumed tokens: 45555384320 | elapsed time per iteration (s): 1.06 | learning rate: 5.950E-05 | global batch size: 256 | lm loss: 1.931869E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.505 | TFLOPs: 39.75 | 15: iteration 86900/ 125429 | consumed samples: 22246400 | consumed tokens: 45560627200 | elapsed time per iteration (s): 1.06 | learning rate: 5.948E-05 | global batch size: 256 | lm loss: 1.932216E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.896 | TFLOPs: 39.98 | 15: iteration 86910/ 125429 | consumed samples: 22248960 | consumed tokens: 45565870080 | elapsed time per iteration (s): 1.05 | learning rate: 5.946E-05 | global batch size: 256 | lm loss: 1.941494E+00 | grad norm: 0.141 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.634 | TFLOPs: 40.43 | 15: iteration 86920/ 125429 | consumed samples: 22251520 | consumed tokens: 45571112960 | elapsed time per iteration (s): 1.03 | learning rate: 5.944E-05 | global batch size: 256 | lm loss: 1.932394E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.788 | TFLOPs: 40.95 | 15: iteration 86930/ 125429 | consumed samples: 22254080 | consumed tokens: 45576355840 | elapsed time per iteration (s): 1.06 | learning rate: 5.942E-05 | global batch size: 256 | lm loss: 1.926091E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.908 | TFLOPs: 39.98 | 15: iteration 86940/ 125429 | consumed samples: 22256640 | consumed tokens: 45581598720 | elapsed time per iteration (s): 1.03 | learning rate: 5.940E-05 | global batch size: 256 | lm loss: 1.919141E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.270 | TFLOPs: 41.03 | 15: iteration 86950/ 125429 | consumed samples: 22259200 | consumed tokens: 45586841600 | elapsed time per iteration (s): 1.06 | learning rate: 5.938E-05 | global batch size: 256 | lm loss: 1.923268E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.193 | TFLOPs: 39.86 | 15: iteration 86960/ 125429 | consumed samples: 22261760 | consumed tokens: 45592084480 | elapsed time per iteration (s): 1.04 | learning rate: 5.937E-05 | global batch size: 256 | lm loss: 1.909427E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.572 | TFLOPs: 40.58 | 15: iteration 86970/ 125429 | consumed samples: 22264320 | consumed tokens: 45597327360 | elapsed time per iteration (s): 1.02 | 
learning rate: 5.935E-05 | global batch size: 256 | lm loss: 1.950469E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.288 | TFLOPs: 41.53 |
15: iteration 86980/ 125429 | consumed samples: 22266880 | consumed tokens: 45602570240 | elapsed time per iteration (s): 1.04 | learning rate: 5.933E-05 | global batch size: 256 | lm loss: 1.944886E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.908 | TFLOPs: 40.80 |
15: iteration 86990/ 125429 | consumed samples: 22269440 | consumed tokens: 45607813120 | elapsed time per iteration (s): 1.05 | learning rate: 5.931E-05 | global batch size: 256 | lm loss: 1.937169E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.139 | TFLOPs: 40.35 |
15: iteration 87000/ 125429 | consumed samples: 22272000 | consumed tokens: 45613056000 | elapsed time per iteration (s): 1.05 | learning rate: 5.929E-05 | global batch size: 256 | lm loss: 1.931073E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.736 | TFLOPs: 40.28 |
15: -------------------------------------------------------------------------------------------
15: valid loss at iteration 87000 | lm loss value: 1.867766E+00 | lm loss PPL: 6.473819E+00 |
15: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 87000 to checkpoints_1b5
0: [2022-11-26 21:48:56,644] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step87000 is begin to save!
0: [2022-11-26 21:48:56,652] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_01-model_00-model_states.pt...
0: [2022-11-26 21:48:56,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_01-model_00-model_states.pt. 0: [2022-11-26 21:48:56,887] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_03-model_00-model_states.pt... 0: [2022-11-26 21:48:56,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_03-model_00-model_states.pt. 0: [2022-11-26 21:48:56,990] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_04-model_00-model_states.pt... 0: [2022-11-26 21:48:57,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_04-model_00-model_states.pt. 0: [2022-11-26 21:48:57,098] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_05-model_00-model_states.pt... 0: [2022-11-26 21:48:57,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_05-model_00-model_states.pt. 0: [2022-11-26 21:48:57,206] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_06-model_00-model_states.pt... 0: [2022-11-26 21:48:57,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_06-model_00-model_states.pt. 0: [2022-11-26 21:48:57,315] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_07-model_00-model_states.pt... 0: [2022-11-26 21:48:57,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_07-model_00-model_states.pt. 0: [2022-11-26 21:48:57,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 21:48:57,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_08-model_00-model_states.pt. 0: [2022-11-26 21:48:57,531] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_09-model_00-model_states.pt... 0: [2022-11-26 21:48:57,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_09-model_00-model_states.pt. 0: [2022-11-26 21:48:57,639] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_10-model_00-model_states.pt... 0: [2022-11-26 21:48:57,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_10-model_00-model_states.pt. 0: [2022-11-26 21:48:57,745] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_11-model_00-model_states.pt... 0: [2022-11-26 21:48:57,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_11-model_00-model_states.pt. 0: [2022-11-26 21:48:57,852] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_12-model_00-model_states.pt... 0: [2022-11-26 21:48:57,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_12-model_00-model_states.pt. 0: [2022-11-26 21:48:57,961] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_13-model_00-model_states.pt... 0: [2022-11-26 21:48:58,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_13-model_00-model_states.pt. 0: [2022-11-26 21:48:58,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 21:48:58,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_14-model_00-model_states.pt. 0: [2022-11-26 21:48:58,174] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_15-model_00-model_states.pt... 0: [2022-11-26 21:48:58,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_15-model_00-model_states.pt. 0: [2022-11-26 21:48:58,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_16-model_00-model_states.pt... 0: [2022-11-26 21:48:58,388] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_16-model_00-model_states.pt. 0: [2022-11-26 21:48:58,388] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_17-model_00-model_states.pt... 0: [2022-11-26 21:48:58,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_17-model_00-model_states.pt. 0: [2022-11-26 21:48:58,496] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_18-model_00-model_states.pt... 0: [2022-11-26 21:48:58,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_18-model_00-model_states.pt. 0: [2022-11-26 21:48:58,604] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_19-model_00-model_states.pt... 0: [2022-11-26 21:48:58,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_19-model_00-model_states.pt. 0: [2022-11-26 21:48:58,711] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 21:48:58,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_20-model_00-model_states.pt. 0: [2022-11-26 21:48:58,818] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_21-model_00-model_states.pt... 0: [2022-11-26 21:48:58,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_21-model_00-model_states.pt. 0: [2022-11-26 21:48:58,926] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_22-model_00-model_states.pt... 0: [2022-11-26 21:48:59,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_22-model_00-model_states.pt. 0: [2022-11-26 21:48:59,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_23-model_00-model_states.pt... 0: [2022-11-26 21:48:59,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_23-model_00-model_states.pt. 0: [2022-11-26 21:48:59,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_24-model_00-model_states.pt... 0: [2022-11-26 21:48:59,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_24-model_00-model_states.pt. 0: [2022-11-26 21:48:59,244] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_25-model_00-model_states.pt... 0: [2022-11-26 21:48:59,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_25-model_00-model_states.pt. 0: [2022-11-26 21:48:59,347] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 21:48:59,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_26-model_00-model_states.pt. 0: [2022-11-26 21:48:59,448] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_27-model_00-model_states.pt... 0: [2022-11-26 21:48:59,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_27-model_00-model_states.pt. 0: [2022-11-26 21:48:59,555] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_28-model_00-model_states.pt... 0: [2022-11-26 21:48:59,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_28-model_00-model_states.pt. 0: [2022-11-26 21:48:59,654] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_29-model_00-model_states.pt... 0: [2022-11-26 21:48:59,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_29-model_00-model_states.pt. 0: [2022-11-26 21:48:59,760] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_30-model_00-model_states.pt... 0: [2022-11-26 21:48:59,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_30-model_00-model_states.pt. 0: [2022-11-26 21:48:59,864] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/layer_32-model_00-model_states.pt... 0: [2022-11-26 21:48:59,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 21:48:59,871] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step87000/mp_rank_00_model_states.pt 0: [2022-11-26 21:48:59,871] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/mp_rank_00_model_states.pt... 0: [2022-11-26 21:48:59,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/mp_rank_00_model_states.pt. 0: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
5: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
15: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
7: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
13: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
11: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
8: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
14: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
3: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 14: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 8: [2022-11-26 21:48:59,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step87000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 7: [2022-11-26 21:49:00,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 21:49:00,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 21:49:00,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 11: [2022-11-26 21:49:00,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:49:00,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:49:00,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 21:49:00,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 7: [2022-11-26 21:49:00,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:49:00,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 21:49:00,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 7: [2022-11-26 21:49:00,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:49:00,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 21:49:00,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
7: [2022-11-26 21:49:00,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:49:00,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 21:49:00,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 7: [2022-11-26 21:49:00,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:49:00,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 21:49:00,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 12: [2022-11-26 21:49:00,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:49:00,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 21:49:00,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 11: [2022-11-26 21:49:00,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 21:49:00,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 11: [2022-11-26 21:49:00,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 21:49:00,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 21:49:00,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 11: [2022-11-26 21:49:00,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:49:00,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 21:49:00,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 11: [2022-11-26 21:49:00,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:49:00,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 21:49:00,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 11: [2022-11-26 21:49:00,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:49:00,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 21:49:00,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 5: [2022-11-26 21:49:00,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-26 21:49:00,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 21:49:00,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 11: [2022-11-26 21:49:00,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:49:00,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:49:00,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 21:49:00,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 8: [2022-11-26 21:49:00,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:49:00,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 21:49:00,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 5: [2022-11-26 21:49:00,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:49:00,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 21:49:00,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
5: [2022-11-26 21:49:00,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:49:00,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 21:49:00,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 1: [2022-11-26 21:49:00,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:49:00,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:49:00,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 21:49:00,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 4: [2022-11-26 21:49:00,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 21:49:00,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 10: [2022-11-26 21:49:00,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:49:00,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 21:49:00,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
0: [2022-11-26 21:49:00,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:49:00,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:49:00,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:49:00,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 7: [2022-11-26 21:49:00,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 0: [2022-11-26 21:49:00,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 0: [2022-11-26 21:49:00,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 7: [2022-11-26 21:49:00,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 0: [2022-11-26 21:49:00,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 0: [2022-11-26 21:49:00,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:49:00,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 21:49:00,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
0: [2022-11-26 21:49:00,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:49:00,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 21:49:00,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 0: [2022-11-26 21:49:00,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:49:00,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 21:49:00,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 5: [2022-11-26 21:49:00,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:49:00,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 21:49:00,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 12: [2022-11-26 21:49:00,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:49:00,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 21:49:00,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 21:49:00,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 21:49:00,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 12: [2022-11-26 21:49:00,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 12: [2022-11-26 21:49:00,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:49:00,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 21:49:00,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 15: [2022-11-26 21:49:00,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:49:00,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 21:49:00,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 15: [2022-11-26 21:49:00,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:49:00,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 21:49:00,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 21:49:00,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 4: [2022-11-26 21:49:00,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:49:00,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 21:49:00,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 5: [2022-11-26 21:49:00,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:49:00,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 21:49:00,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 15: [2022-11-26 21:49:00,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 21:49:00,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 1: [2022-11-26 21:49:00,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-26 21:49:00,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 21:49:00,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 5: [2022-11-26 21:49:00,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:49:00,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 21:49:00,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 5: [2022-11-26 21:49:00,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 21:49:00,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 21:49:00,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 7: [2022-11-26 21:49:00,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 21:49:00,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 21:49:00,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 12: [2022-11-26 21:49:00,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-26 21:49:00,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 21:49:00,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 10: [2022-11-26 21:49:00,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:49:00,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:49:00,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 21:49:00,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 3: [2022-11-26 21:49:00,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:49:00,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:49:00,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 21:49:00,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 21:49:00,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 3: [2022-11-26 21:49:00,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
10: [2022-11-26 21:49:00,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 10: [2022-11-26 21:49:00,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 3: [2022-11-26 21:49:00,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:49:00,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 21:49:00,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 11: [2022-11-26 21:49:00,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 21:49:00,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 11: [2022-11-26 21:49:00,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:49:00,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 21:49:00,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 11: [2022-11-26 21:49:00,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 21:49:00,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 21:49:00,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
10: [2022-11-26 21:49:00,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:49:00,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:49:00,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 21:49:00,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 21:49:00,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 10: [2022-11-26 21:49:00,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 10: [2022-11-26 21:49:00,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:49:00,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 21:49:00,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 15: [2022-11-26 21:49:00,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:49:00,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 21:49:00,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
12: [2022-11-26 21:49:00,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:49:00,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 21:49:00,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 3: [2022-11-26 21:49:00,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:49:00,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:49:00,129] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:49:00,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 21:49:00,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 21:49:00,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 21:49:00,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 3: [2022-11-26 21:49:00,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 3: [2022-11-26 21:49:00,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
10: [2022-11-26 21:49:00,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:49:00,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 21:49:00,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 15: [2022-11-26 21:49:00,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:49:00,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 21:49:00,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 13: [2022-11-26 21:49:00,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:49:00,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:49:00,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 21:49:00,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 8: [2022-11-26 21:49:00,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 21:49:00,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 21:49:00,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:49:00,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 21:49:00,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 8: [2022-11-26 21:49:00,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:49:00,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 8: [2022-11-26 21:49:00,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 21:49:00,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 8: [2022-11-26 21:49:00,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 21:49:00,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-26 21:49:00,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 21:49:00,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 21:49:00,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 8: [2022-11-26 21:49:00,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 4: [2022-11-26 21:49:00,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:49:00,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 21:49:00,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 1: [2022-11-26 21:49:00,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:49:00,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:49:00,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 21:49:00,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 21:49:00,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 21:49:00,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 21:49:00,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 1: [2022-11-26 21:49:00,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 1: [2022-11-26 21:49:00,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 1: [2022-11-26 21:49:00,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:49:00,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 21:49:00,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 10: [2022-11-26 21:49:00,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 21:49:00,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 21:49:00,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
1: [2022-11-26 21:49:00,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:49:00,140] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 21:49:00,140] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 1: [2022-11-26 21:49:00,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 21:49:00,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 21:49:00,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 13: [2022-11-26 21:49:00,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 21:49:00,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 13: [2022-11-26 21:49:00,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:49:00,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 21:49:00,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 13: [2022-11-26 21:49:00,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 21:49:00,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 21:49:00,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 13: [2022-11-26 21:49:00,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:49:00,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 21:49:00,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 13: [2022-11-26 21:49:00,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:49:00,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 21:49:00,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 15: [2022-11-26 21:49:00,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:49:00,145] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 21:49:00,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 12: [2022-11-26 21:49:00,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 21:49:00,147] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 21:49:00,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 21:49:00,147] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 21:49:00,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 12: [2022-11-26 21:49:00,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 15: [2022-11-26 21:49:00,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:49:00,140] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:49:00,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 4: [2022-11-26 21:49:00,141] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 21:49:00,141] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 4: [2022-11-26 21:49:00,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:49:00,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
4: [2022-11-26 21:49:00,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 15: [2022-11-26 21:49:00,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:49:00,144] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 4: [2022-11-26 21:49:00,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:49:00,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 4: [2022-11-26 21:49:00,146] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 21:49:00,147] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 4: [2022-11-26 21:49:00,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:49:00,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 15: [2022-11-26 21:49:00,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 4: [2022-11-26 21:49:00,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 13: [2022-11-26 21:49:00,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-26 21:49:00,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 21:49:00,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 13: [2022-11-26 21:49:00,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:49:00,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 21:49:00,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 13: [2022-11-26 21:49:00,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 21:49:00,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 21:49:00,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 15: [2022-11-26 21:49:00,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 21:49:00,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 21:49:00,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 3: [2022-11-26 21:49:00,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 21:49:00,160] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 21:49:00,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 3: [2022-11-26 21:49:00,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 21:49:00,161] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 21:49:00,161] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 2: [2022-11-26 21:49:00,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:49:00,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:49:00,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:49:00,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:49:00,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:49:00,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:49:00,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 21:49:00,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 21:49:00,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 21:49:00,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 21:49:00,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 21:49:00,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 21:49:00,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 21:49:00,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 21:49:00,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 21:49:00,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 21:49:00,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 2: [2022-11-26 21:49:00,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
2: [2022-11-26 21:49:00,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 2: [2022-11-26 21:49:00,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 2: [2022-11-26 21:49:00,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 2: [2022-11-26 21:49:00,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 2: [2022-11-26 21:49:00,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 2: [2022-11-26 21:49:00,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 4: [2022-11-26 21:49:00,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 21:49:00,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 21:49:00,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 14: [2022-11-26 21:49:00,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:49:00,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 21:49:00,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 21:49:00,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 21:49:00,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 14: [2022-11-26 21:49:00,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 14: [2022-11-26 21:49:00,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:49:00,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:49:00,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 21:49:00,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 21:49:00,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 14: [2022-11-26 21:49:00,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 14: [2022-11-26 21:49:00,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:49:00,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 21:49:00,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 21:49:00,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 14: [2022-11-26 21:49:00,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 21:49:00,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 14: [2022-11-26 21:49:00,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:49:00,148] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 21:49:00,148] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 14: [2022-11-26 21:49:00,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 21:49:00,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 21:49:00,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 0: [2022-11-26 21:49:00,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:49:00,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 21:49:00,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 21:49:00,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 0: [2022-11-26 21:49:00,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 21:49:00,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 21:49:00,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 6: [2022-11-26 21:49:00,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:49:00,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:49:00,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:49:00,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:49:00,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:49:00,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:49:00,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 21:49:00,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 21:49:00,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 21:49:00,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 21:49:00,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 21:49:00,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 6: [2022-11-26 21:49:00,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 6: [2022-11-26 21:49:00,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 21:49:00,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 21:49:00,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 21:49:00,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 21:49:00,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 21:49:00,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is 
ready now! 6: [2022-11-26 21:49:00,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 6: [2022-11-26 21:49:00,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 6: [2022-11-26 21:49:00,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 6: [2022-11-26 21:49:00,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 6: [2022-11-26 21:49:00,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 0: [2022-11-26 21:49:00,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 21:49:00,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 9: [2022-11-26 21:49:00,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:49:00,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:49:00,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 21:49:00,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:49:00,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
9: [2022-11-26 21:49:00,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 21:49:00,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 21:49:00,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 9: [2022-11-26 21:49:00,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 9: [2022-11-26 21:49:00,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:49:00,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 21:49:00,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 9: [2022-11-26 21:49:00,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:49:00,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:49:00,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 21:49:00,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 21:49:00,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 21:49:00,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 9: [2022-11-26 21:49:00,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 9: [2022-11-26 21:49:00,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 21:49:00,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 9: [2022-11-26 21:49:00,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 21:49:00,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step87000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 21:49:00,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step87000 is ready now! 
0: successfully saved checkpoint at iteration 87000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3744.59 15: iteration 87010/ 125429 | consumed samples: 22274560 | consumed tokens: 45618298880 | elapsed time per iteration (s): 1.44 | learning rate: 5.927E-05 | global batch size: 256 | lm loss: 1.918935E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.551 | TFLOPs: 29.34 | 15: iteration 87020/ 125429 | consumed samples: 22277120 | consumed tokens: 45623541760 | elapsed time per iteration (s): 1.02 | learning rate: 5.925E-05 | global batch size: 256 | lm loss: 1.934215E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.109 | TFLOPs: 41.33 | 15: iteration 87030/ 125429 | consumed samples: 22279680 | consumed tokens: 45628784640 | elapsed time per iteration (s): 1.03 | learning rate: 5.923E-05 | global batch size: 256 | lm loss: 1.942985E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.549 | TFLOPs: 41.24 | 15: iteration 87040/ 125429 | consumed samples: 22282240 | consumed tokens: 45634027520 | elapsed time per iteration (s): 1.07 | learning rate: 5.921E-05 | global batch size: 256 | lm loss: 1.936784E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.190 | TFLOPs: 39.69 | 15: iteration 87050/ 125429 | consumed samples: 22284800 | consumed tokens: 45639270400 | elapsed time per iteration (s): 1.06 | learning rate: 5.920E-05 | global batch size: 256 | lm loss: 1.970841E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.159 | TFLOPs: 40.02 | 15: iteration 87060/ 125429 | consumed samples: 22287360 | consumed tokens: 45644513280 | elapsed time per iteration (s): 1.03 | 
learning rate: 5.918E-05 | global batch size: 256 | lm loss: 1.918560E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.614 | TFLOPs: 41.09 | 15: iteration 87070/ 125429 | consumed samples: 22289920 | consumed tokens: 45649756160 | elapsed time per iteration (s): 1.05 | learning rate: 5.916E-05 | global batch size: 256 | lm loss: 1.917949E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.612 | TFLOPs: 40.42 | 15: iteration 87080/ 125429 | consumed samples: 22292480 | consumed tokens: 45654999040 | elapsed time per iteration (s): 1.05 | learning rate: 5.914E-05 | global batch size: 256 | lm loss: 1.953180E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.603 | TFLOPs: 40.26 | 15: iteration 87090/ 125429 | consumed samples: 22295040 | consumed tokens: 45660241920 | elapsed time per iteration (s): 1.13 | learning rate: 5.912E-05 | global batch size: 256 | lm loss: 1.908809E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.463 | TFLOPs: 37.59 | 15: iteration 87100/ 125429 | consumed samples: 22297600 | consumed tokens: 45665484800 | elapsed time per iteration (s): 1.07 | learning rate: 5.910E-05 | global batch size: 256 | lm loss: 1.944896E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.330 | TFLOPs: 39.55 | 15: iteration 87110/ 125429 | consumed samples: 22300160 | consumed tokens: 45670727680 | elapsed time per iteration (s): 1.07 | learning rate: 5.908E-05 | global batch size: 256 | lm loss: 1.919527E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.675 | TFLOPs: 39.61 | 15: iteration 87120/ 
125429 | consumed samples: 22302720 | consumed tokens: 45675970560 | elapsed time per iteration (s): 1.02 | learning rate: 5.906E-05 | global batch size: 256 | lm loss: 1.938432E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.427 | TFLOPs: 41.38 | 15: iteration 87130/ 125429 | consumed samples: 22305280 | consumed tokens: 45681213440 | elapsed time per iteration (s): 1.02 | learning rate: 5.905E-05 | global batch size: 256 | lm loss: 1.921484E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.557 | TFLOPs: 41.41 | 15: iteration 87140/ 125429 | consumed samples: 22307840 | consumed tokens: 45686456320 | elapsed time per iteration (s): 1.07 | learning rate: 5.903E-05 | global batch size: 256 | lm loss: 1.927999E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.157 | TFLOPs: 39.69 | 15: iteration 87150/ 125429 | consumed samples: 22310400 | consumed tokens: 45691699200 | elapsed time per iteration (s): 1.08 | learning rate: 5.901E-05 | global batch size: 256 | lm loss: 1.923590E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.001 | TFLOPs: 39.00 | 15: iteration 87160/ 125429 | consumed samples: 22312960 | consumed tokens: 45696942080 | elapsed time per iteration (s): 1.07 | learning rate: 5.899E-05 | global batch size: 256 | lm loss: 1.925720E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.052 | TFLOPs: 39.67 | 15: iteration 87170/ 125429 | consumed samples: 22315520 | consumed tokens: 45702184960 | elapsed time per iteration (s): 1.05 | learning rate: 5.897E-05 | global batch size: 256 | lm loss: 1.926322E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 244.205 | TFLOPs: 40.36 | 15: iteration 87180/ 125429 | consumed samples: 22318080 | consumed tokens: 45707427840 | elapsed time per iteration (s): 1.05 | learning rate: 5.895E-05 | global batch size: 256 | lm loss: 1.937736E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.614 | TFLOPs: 40.26 | 15: iteration 87190/ 125429 | consumed samples: 22320640 | consumed tokens: 45712670720 | elapsed time per iteration (s): 1.03 | learning rate: 5.893E-05 | global batch size: 256 | lm loss: 1.945205E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.082 | TFLOPs: 41.00 | 15: iteration 87200/ 125429 | consumed samples: 22323200 | consumed tokens: 45717913600 | elapsed time per iteration (s): 1.04 | learning rate: 5.891E-05 | global batch size: 256 | lm loss: 1.909922E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.198 | TFLOPs: 40.85 | 15: iteration 87210/ 125429 | consumed samples: 22325760 | consumed tokens: 45723156480 | elapsed time per iteration (s): 1.09 | learning rate: 5.890E-05 | global batch size: 256 | lm loss: 1.933358E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.759 | TFLOPs: 38.80 | 15: iteration 87220/ 125429 | consumed samples: 22328320 | consumed tokens: 45728399360 | elapsed time per iteration (s): 1.07 | learning rate: 5.888E-05 | global batch size: 256 | lm loss: 1.944880E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.026 | TFLOPs: 39.50 | 15: iteration 87230/ 125429 | consumed samples: 22330880 | consumed tokens: 45733642240 | elapsed time per iteration (s): 1.03 | learning rate: 
5.886E-05 | global batch size: 256 | lm loss: 1.957238E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.103 | TFLOPs: 41.00 | 15: iteration 87240/ 125429 | consumed samples: 22333440 | consumed tokens: 45738885120 | elapsed time per iteration (s): 1.05 | learning rate: 5.884E-05 | global batch size: 256 | lm loss: 1.912666E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.962 | TFLOPs: 40.32 | 15: iteration 87250/ 125429 | consumed samples: 22336000 | consumed tokens: 45744128000 | elapsed time per iteration (s): 1.05 | learning rate: 5.882E-05 | global batch size: 256 | lm loss: 1.934758E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.185 | TFLOPs: 40.35 | 15: iteration 87260/ 125429 | consumed samples: 22338560 | consumed tokens: 45749370880 | elapsed time per iteration (s): 1.07 | learning rate: 5.880E-05 | global batch size: 256 | lm loss: 1.940289E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.705 | TFLOPs: 39.61 | 15: iteration 87270/ 125429 | consumed samples: 22341120 | consumed tokens: 45754613760 | elapsed time per iteration (s): 1.03 | learning rate: 5.878E-05 | global batch size: 256 | lm loss: 1.926889E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.728 | TFLOPs: 40.94 | 15: iteration 87280/ 125429 | consumed samples: 22343680 | consumed tokens: 45759856640 | elapsed time per iteration (s): 1.07 | learning rate: 5.876E-05 | global batch size: 256 | lm loss: 1.942790E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.989 | TFLOPs: 39.66 | 15: iteration 87290/ 125429 | 
consumed samples: 22346240 | consumed tokens: 45765099520 | elapsed time per iteration (s): 1.06 | learning rate: 5.875E-05 | global batch size: 256 | lm loss: 1.906120E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.326 | TFLOPs: 39.88 | 15: iteration 87300/ 125429 | consumed samples: 22348800 | consumed tokens: 45770342400 | elapsed time per iteration (s): 1.08 | learning rate: 5.873E-05 | global batch size: 256 | lm loss: 1.925707E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.812 | TFLOPs: 39.30 | 15: iteration 87310/ 125429 | consumed samples: 22351360 | consumed tokens: 45775585280 | elapsed time per iteration (s): 1.04 | learning rate: 5.871E-05 | global batch size: 256 | lm loss: 1.929787E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.275 | TFLOPs: 40.70 | 15: iteration 87320/ 125429 | consumed samples: 22353920 | consumed tokens: 45780828160 | elapsed time per iteration (s): 1.05 | learning rate: 5.869E-05 | global batch size: 256 | lm loss: 1.946362E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.693 | TFLOPs: 40.11 | 15: iteration 87330/ 125429 | consumed samples: 22356480 | consumed tokens: 45786071040 | elapsed time per iteration (s): 1.16 | learning rate: 5.867E-05 | global batch size: 256 | lm loss: 1.913390E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 220.480 | TFLOPs: 36.44 | 15: iteration 87340/ 125429 | consumed samples: 22359040 | consumed tokens: 45791313920 | elapsed time per iteration (s): 1.08 | learning rate: 5.865E-05 | global batch size: 256 | lm loss: 1.907705E+00 | grad norm: 0.139 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 237.977 | TFLOPs: 39.33 | 15: iteration 87350/ 125429 | consumed samples: 22361600 | consumed tokens: 45796556800 | elapsed time per iteration (s): 1.03 | learning rate: 5.863E-05 | global batch size: 256 | lm loss: 1.946882E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.211 | TFLOPs: 41.18 | 15: iteration 87360/ 125429 | consumed samples: 22364160 | consumed tokens: 45801799680 | elapsed time per iteration (s): 1.05 | learning rate: 5.861E-05 | global batch size: 256 | lm loss: 1.924627E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.290 | TFLOPs: 40.37 | 15: iteration 87370/ 125429 | consumed samples: 22366720 | consumed tokens: 45807042560 | elapsed time per iteration (s): 1.05 | learning rate: 5.860E-05 | global batch size: 256 | lm loss: 1.932610E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.904 | TFLOPs: 40.31 | 15: iteration 87380/ 125429 | consumed samples: 22369280 | consumed tokens: 45812285440 | elapsed time per iteration (s): 1.03 | learning rate: 5.858E-05 | global batch size: 256 | lm loss: 1.903458E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.842 | TFLOPs: 41.12 | 15: iteration 87390/ 125429 | consumed samples: 22371840 | consumed tokens: 45817528320 | elapsed time per iteration (s): 1.03 | learning rate: 5.856E-05 | global batch size: 256 | lm loss: 1.910881E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.577 | TFLOPs: 41.08 | 15: iteration 87400/ 125429 | consumed samples: 22374400 | consumed tokens: 45822771200 | elapsed time per iteration (s): 1.05 | learning rate: 5.854E-05 | global batch 
size: 256 | lm loss: 1.904538E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.399 | TFLOPs: 40.39 | 15: iteration 87410/ 125429 | consumed samples: 22376960 | consumed tokens: 45828014080 | elapsed time per iteration (s): 1.04 | learning rate: 5.852E-05 | global batch size: 256 | lm loss: 1.953069E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.033 | TFLOPs: 40.66 | 15: iteration 87420/ 125429 | consumed samples: 22379520 | consumed tokens: 45833256960 | elapsed time per iteration (s): 1.17 | learning rate: 5.850E-05 | global batch size: 256 | lm loss: 1.949048E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 219.479 | TFLOPs: 36.27 | 15: iteration 87430/ 125429 | consumed samples: 22382080 | consumed tokens: 45838499840 | elapsed time per iteration (s): 1.04 | learning rate: 5.848E-05 | global batch size: 256 | lm loss: 1.935454E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.944 | TFLOPs: 40.81 | 15: iteration 87440/ 125429 | consumed samples: 22384640 | consumed tokens: 45843742720 | elapsed time per iteration (s): 1.05 | learning rate: 5.847E-05 | global batch size: 256 | lm loss: 1.913807E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.881 | TFLOPs: 40.30 | 15: iteration 87450/ 125429 | consumed samples: 22387200 | consumed tokens: 45848985600 | elapsed time per iteration (s): 1.03 | learning rate: 5.845E-05 | global batch size: 256 | lm loss: 1.925626E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.648 | TFLOPs: 41.26 | 15: iteration 87460/ 125429 | consumed samples: 22389760 | 
consumed tokens: 45854228480 | elapsed time per iteration (s): 1.05 | learning rate: 5.843E-05 | global batch size: 256 | lm loss: 1.934916E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.775 | TFLOPs: 40.45 | 15: iteration 87470/ 125429 | consumed samples: 22392320 | consumed tokens: 45859471360 | elapsed time per iteration (s): 1.04 | learning rate: 5.841E-05 | global batch size: 256 | lm loss: 1.937432E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.243 | TFLOPs: 40.69 | 15: iteration 87480/ 125429 | consumed samples: 22394880 | consumed tokens: 45864714240 | elapsed time per iteration (s): 1.07 | learning rate: 5.839E-05 | global batch size: 256 | lm loss: 1.916284E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.490 | TFLOPs: 39.58 | 15: iteration 87490/ 125429 | consumed samples: 22397440 | consumed tokens: 45869957120 | elapsed time per iteration (s): 1.05 | learning rate: 5.837E-05 | global batch size: 256 | lm loss: 1.966252E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.863 | TFLOPs: 40.13 | 15: iteration 87500/ 125429 | consumed samples: 22400000 | consumed tokens: 45875200000 | elapsed time per iteration (s): 1.04 | learning rate: 5.835E-05 | global batch size: 256 | lm loss: 1.965101E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.488 | TFLOPs: 40.73 | 15: iteration 87510/ 125429 | consumed samples: 22402560 | consumed tokens: 45880442880 | elapsed time per iteration (s): 1.05 | learning rate: 5.833E-05 | global batch size: 256 | lm loss: 1.947380E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 244.833 | TFLOPs: 40.46 | 15: iteration 87520/ 125429 | consumed samples: 22405120 | consumed tokens: 45885685760 | elapsed time per iteration (s): 1.03 | learning rate: 5.832E-05 | global batch size: 256 | lm loss: 1.926142E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.434 | TFLOPs: 41.22 | 15: iteration 87530/ 125429 | consumed samples: 22407680 | consumed tokens: 45890928640 | elapsed time per iteration (s): 1.03 | learning rate: 5.830E-05 | global batch size: 256 | lm loss: 1.945552E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.759 | TFLOPs: 41.11 | 15: iteration 87540/ 125429 | consumed samples: 22410240 | consumed tokens: 45896171520 | elapsed time per iteration (s): 1.04 | learning rate: 5.828E-05 | global batch size: 256 | lm loss: 1.961201E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.128 | TFLOPs: 40.84 | 15: iteration 87550/ 125429 | consumed samples: 22412800 | consumed tokens: 45901414400 | elapsed time per iteration (s): 1.04 | learning rate: 5.826E-05 | global batch size: 256 | lm loss: 1.920079E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.450 | TFLOPs: 40.73 | 15: iteration 87560/ 125429 | consumed samples: 22415360 | consumed tokens: 45906657280 | elapsed time per iteration (s): 1.04 | learning rate: 5.824E-05 | global batch size: 256 | lm loss: 1.923329E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.184 | TFLOPs: 40.85 | 15: iteration 87570/ 125429 | consumed samples: 22417920 | consumed tokens: 45911900160 | elapsed time per iteration (s): 1.07 | learning rate: 5.822E-05 | global batch size: 256 | lm loss: 
1.953431E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.367 | TFLOPs: 39.72 | 15: iteration 87580/ 125429 | consumed samples: 22420480 | consumed tokens: 45917143040 | elapsed time per iteration (s): 1.03 | learning rate: 5.820E-05 | global batch size: 256 | lm loss: 1.921515E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.893 | TFLOPs: 40.97 | 15: iteration 87590/ 125429 | consumed samples: 22423040 | consumed tokens: 45922385920 | elapsed time per iteration (s): 1.04 | learning rate: 5.819E-05 | global batch size: 256 | lm loss: 1.941573E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.049 | TFLOPs: 40.83 | 15: iteration 87600/ 125429 | consumed samples: 22425600 | consumed tokens: 45927628800 | elapsed time per iteration (s): 1.06 | learning rate: 5.817E-05 | global batch size: 256 | lm loss: 1.949747E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.077 | TFLOPs: 40.01 | 15: iteration 87610/ 125429 | consumed samples: 22428160 | consumed tokens: 45932871680 | elapsed time per iteration (s): 1.05 | learning rate: 5.815E-05 | global batch size: 256 | lm loss: 1.935812E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.089 | TFLOPs: 40.34 | 15: iteration 87620/ 125429 | consumed samples: 22430720 | consumed tokens: 45938114560 | elapsed time per iteration (s): 1.06 | learning rate: 5.813E-05 | global batch size: 256 | lm loss: 1.907863E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.262 | TFLOPs: 39.87 | 15: iteration 87630/ 125429 | consumed samples: 22433280 | consumed tokens: 
45943357440 | elapsed time per iteration (s): 1.05 | learning rate: 5.811E-05 | global batch size: 256 | lm loss: 1.943731E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.008 | TFLOPs: 40.16 | 15: iteration 87640/ 125429 | consumed samples: 22435840 | consumed tokens: 45948600320 | elapsed time per iteration (s): 1.05 | learning rate: 5.809E-05 | global batch size: 256 | lm loss: 1.949717E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.718 | TFLOPs: 40.44 | 15: iteration 87650/ 125429 | consumed samples: 22438400 | consumed tokens: 45953843200 | elapsed time per iteration (s): 1.04 | learning rate: 5.807E-05 | global batch size: 256 | lm loss: 1.948492E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.692 | TFLOPs: 40.60 | 15: iteration 87660/ 125429 | consumed samples: 22440960 | consumed tokens: 45959086080 | elapsed time per iteration (s): 1.09 | learning rate: 5.806E-05 | global batch size: 256 | lm loss: 1.927865E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.568 | TFLOPs: 38.93 | 15: iteration 87670/ 125429 | consumed samples: 22443520 | consumed tokens: 45964328960 | elapsed time per iteration (s): 1.05 | learning rate: 5.804E-05 | global batch size: 256 | lm loss: 1.949376E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.687 | TFLOPs: 40.44 | 15: iteration 87680/ 125429 | consumed samples: 22446080 | consumed tokens: 45969571840 | elapsed time per iteration (s): 1.04 | learning rate: 5.802E-05 | global batch size: 256 | lm loss: 1.923682E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 247.162 | TFLOPs: 40.85 | 15: iteration 87690/ 125429 | consumed samples: 22448640 | consumed tokens: 45974814720 | elapsed time per iteration (s): 1.05 | learning rate: 5.800E-05 | global batch size: 256 | lm loss: 1.940091E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.562 | TFLOPs: 40.42 | 15: iteration 87700/ 125429 | consumed samples: 22451200 | consumed tokens: 45980057600 | elapsed time per iteration (s): 1.02 | learning rate: 5.798E-05 | global batch size: 256 | lm loss: 1.922920E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.868 | TFLOPs: 41.29 | 15: iteration 87710/ 125429 | consumed samples: 22453760 | consumed tokens: 45985300480 | elapsed time per iteration (s): 1.03 | learning rate: 5.796E-05 | global batch size: 256 | lm loss: 1.935808E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.419 | TFLOPs: 41.05 | 15: iteration 87720/ 125429 | consumed samples: 22456320 | consumed tokens: 45990543360 | elapsed time per iteration (s): 1.08 | learning rate: 5.794E-05 | global batch size: 256 | lm loss: 1.914542E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.418 | TFLOPs: 39.07 | 15: iteration 87730/ 125429 | consumed samples: 22458880 | consumed tokens: 45995786240 | elapsed time per iteration (s): 1.06 | learning rate: 5.793E-05 | global batch size: 256 | lm loss: 1.929073E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.375 | TFLOPs: 40.05 | 15: iteration 87740/ 125429 | consumed samples: 22461440 | consumed tokens: 46001029120 | elapsed time per iteration (s): 1.04 | learning rate: 5.791E-05 | global batch size: 256 | lm loss: 1.942569E+00 | grad 
norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.272 | TFLOPs: 40.70 | 15: iteration 87750/ 125429 | consumed samples: 22464000 | consumed tokens: 46006272000 | elapsed time per iteration (s): 1.05 | learning rate: 5.789E-05 | global batch size: 256 | lm loss: 1.915389E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.112 | TFLOPs: 40.18 | 15: iteration 87760/ 125429 | consumed samples: 22466560 | consumed tokens: 46011514880 | elapsed time per iteration (s): 1.18 | learning rate: 5.787E-05 | global batch size: 256 | lm loss: 1.946927E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.546 | TFLOPs: 35.79 | 15: iteration 87770/ 125429 | consumed samples: 22469120 | consumed tokens: 46016757760 | elapsed time per iteration (s): 1.06 | learning rate: 5.785E-05 | global batch size: 256 | lm loss: 1.930759E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.479 | TFLOPs: 40.07 | 15: iteration 87780/ 125429 | consumed samples: 22471680 | consumed tokens: 46022000640 | elapsed time per iteration (s): 1.05 | learning rate: 5.783E-05 | global batch size: 256 | lm loss: 1.958007E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.157 | TFLOPs: 40.35 | 15: iteration 87790/ 125429 | consumed samples: 22474240 | consumed tokens: 46027243520 | elapsed time per iteration (s): 1.08 | learning rate: 5.781E-05 | global batch size: 256 | lm loss: 1.936374E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.233 | TFLOPs: 39.20 | 15: iteration 87800/ 125429 | consumed samples: 22476800 | consumed tokens: 46032486400 | elapsed time 
per iteration (s): 1.12 | learning rate: 5.780E-05 | global batch size: 256 | lm loss: 1.948768E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.439 | TFLOPs: 37.92 | 15: iteration 87810/ 125429 | consumed samples: 22479360 | consumed tokens: 46037729280 | elapsed time per iteration (s): 1.06 | learning rate: 5.778E-05 | global batch size: 256 | lm loss: 1.951736E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.037 | TFLOPs: 39.83 | 15: iteration 87820/ 125429 | consumed samples: 22481920 | consumed tokens: 46042972160 | elapsed time per iteration (s): 1.04 | learning rate: 5.776E-05 | global batch size: 256 | lm loss: 1.934398E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.092 | TFLOPs: 40.50 | 15: iteration 87830/ 125429 | consumed samples: 22484480 | consumed tokens: 46048215040 | elapsed time per iteration (s): 1.05 | learning rate: 5.774E-05 | global batch size: 256 | lm loss: 1.932306E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.259 | TFLOPs: 40.37 | 15: iteration 87840/ 125429 | consumed samples: 22487040 | consumed tokens: 46053457920 | elapsed time per iteration (s): 1.05 | learning rate: 5.772E-05 | global batch size: 256 | lm loss: 1.900895E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.273 | TFLOPs: 40.37 | 15: iteration 87850/ 125429 | consumed samples: 22489600 | consumed tokens: 46058700800 | elapsed time per iteration (s): 1.09 | learning rate: 5.770E-05 | global batch size: 256 | lm loss: 1.932669E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.271 | TFLOPs: 
38.88 | 15: iteration 87860/ 125429 | consumed samples: 22492160 | consumed tokens: 46063943680 | elapsed time per iteration (s): 1.03 | learning rate: 5.768E-05 | global batch size: 256 | lm loss: 1.925903E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.494 | TFLOPs: 40.90 | 15: iteration 87870/ 125429 | consumed samples: 22494720 | consumed tokens: 46069186560 | elapsed time per iteration (s): 1.07 | learning rate: 5.767E-05 | global batch size: 256 | lm loss: 1.930270E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.112 | TFLOPs: 39.68 | 15: iteration 87880/ 125429 | consumed samples: 22497280 | consumed tokens: 46074429440 | elapsed time per iteration (s): 1.07 | learning rate: 5.765E-05 | global batch size: 256 | lm loss: 1.916989E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.019 | TFLOPs: 39.50 | 15: iteration 87890/ 125429 | consumed samples: 22499840 | consumed tokens: 46079672320 | elapsed time per iteration (s): 1.04 | learning rate: 5.763E-05 | global batch size: 256 | lm loss: 1.912740E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.677 | TFLOPs: 40.77 | 15: iteration 87900/ 125429 | consumed samples: 22502400 | consumed tokens: 46084915200 | elapsed time per iteration (s): 1.07 | learning rate: 5.761E-05 | global batch size: 256 | lm loss: 1.909734E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.191 | TFLOPs: 39.53 | 15: iteration 87910/ 125429 | consumed samples: 22504960 | consumed tokens: 46090158080 | elapsed time per iteration (s): 1.06 | learning rate: 5.759E-05 | global batch size: 256 | lm loss: 1.942450E+00 | grad norm: 0.147 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.438 | TFLOPs: 39.73 | 15: iteration 87920/ 125429 | consumed samples: 22507520 | consumed tokens: 46095400960 | elapsed time per iteration (s): 1.04 | learning rate: 5.757E-05 | global batch size: 256 | lm loss: 1.949863E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.981 | TFLOPs: 40.49 | 15: iteration 87930/ 125429 | consumed samples: 22510080 | consumed tokens: 46100643840 | elapsed time per iteration (s): 1.05 | learning rate: 5.755E-05 | global batch size: 256 | lm loss: 1.929156E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.329 | TFLOPs: 40.38 | 15: iteration 87940/ 125429 | consumed samples: 22512640 | consumed tokens: 46105886720 | elapsed time per iteration (s): 1.22 | learning rate: 5.754E-05 | global batch size: 256 | lm loss: 1.900740E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 209.547 | TFLOPs: 34.63 | 15: iteration 87950/ 125429 | consumed samples: 22515200 | consumed tokens: 46111129600 | elapsed time per iteration (s): 1.08 | learning rate: 5.752E-05 | global batch size: 256 | lm loss: 1.905406E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.970 | TFLOPs: 39.33 | 15: iteration 87960/ 125429 | consumed samples: 22517760 | consumed tokens: 46116372480 | elapsed time per iteration (s): 1.05 | learning rate: 5.750E-05 | global batch size: 256 | lm loss: 1.913997E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.426 | TFLOPs: 40.23 | 15: iteration 87970/ 125429 | consumed samples: 22520320 | consumed tokens: 46121615360 | elapsed time per iteration (s): 1.03 | 
learning rate: 5.748E-05 | global batch size: 256 | lm loss: 1.944276E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.896 | TFLOPs: 41.13 | 15: iteration 87980/ 125429 | consumed samples: 22522880 | consumed tokens: 46126858240 | elapsed time per iteration (s): 1.04 | learning rate: 5.746E-05 | global batch size: 256 | lm loss: 1.940541E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.296 | TFLOPs: 40.54 | 15: iteration 87990/ 125429 | consumed samples: 22525440 | consumed tokens: 46132101120 | elapsed time per iteration (s): 1.04 | learning rate: 5.744E-05 | global batch size: 256 | lm loss: 1.928323E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.338 | TFLOPs: 40.54 | 0: [2022-11-26 22:06:37,063] [INFO] [logging.py:68:log_dist] [Rank 0] step=88000, skipped=0, lr=[5.74252229328452e-05, 5.74252229328452e-05, 5.74252229328452e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 88000/ 125429 | consumed samples: 22528000 | consumed tokens: 46137344000 | elapsed time per iteration (s): 1.09 | learning rate: 5.743E-05 | global batch size: 256 | lm loss: 1.940840E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.456 | TFLOPs: 38.75 | 0: steps: 88000 loss: 1.9329 iter time (s): 1.052 samples/sec: 243.260 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 88000 | lm loss value: 1.913851E+00 | lm loss PPL: 6.779142E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 88000 to checkpoints_1b5 0: [2022-11-26 22:06:37,436] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step88000 is begin to save! 0: [2022-11-26 22:06:37,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_01-model_00-model_states.pt... 0: [2022-11-26 22:06:37,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_01-model_00-model_states.pt. 0: [2022-11-26 22:06:37,706] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_03-model_00-model_states.pt... 0: [2022-11-26 22:06:37,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_03-model_00-model_states.pt. 0: [2022-11-26 22:06:37,819] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_04-model_00-model_states.pt... 0: [2022-11-26 22:06:37,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_04-model_00-model_states.pt. 0: [2022-11-26 22:06:37,935] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_05-model_00-model_states.pt... 0: [2022-11-26 22:06:38,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_05-model_00-model_states.pt. 0: [2022-11-26 22:06:38,053] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_06-model_00-model_states.pt... 0: [2022-11-26 22:06:38,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_06-model_00-model_states.pt. 0: [2022-11-26 22:06:38,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_07-model_00-model_states.pt... 0: [2022-11-26 22:06:38,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 22:06:38,277] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_08-model_00-model_states.pt... 0: [2022-11-26 22:06:38,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_08-model_00-model_states.pt. 0: [2022-11-26 22:06:38,387] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_09-model_00-model_states.pt... 0: [2022-11-26 22:06:38,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_09-model_00-model_states.pt. 0: [2022-11-26 22:06:38,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_10-model_00-model_states.pt... 0: [2022-11-26 22:06:38,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_10-model_00-model_states.pt. 0: [2022-11-26 22:06:38,609] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_11-model_00-model_states.pt... 0: [2022-11-26 22:06:38,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_11-model_00-model_states.pt. 0: [2022-11-26 22:06:38,722] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_12-model_00-model_states.pt... 0: [2022-11-26 22:06:38,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_12-model_00-model_states.pt. 0: [2022-11-26 22:06:38,838] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_13-model_00-model_states.pt... 0: [2022-11-26 22:06:38,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 22:06:38,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_14-model_00-model_states.pt... 0: [2022-11-26 22:06:39,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_14-model_00-model_states.pt. 0: [2022-11-26 22:06:39,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_15-model_00-model_states.pt... 0: [2022-11-26 22:06:39,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_15-model_00-model_states.pt. 0: [2022-11-26 22:06:39,184] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_16-model_00-model_states.pt... 0: [2022-11-26 22:06:39,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_16-model_00-model_states.pt. 0: [2022-11-26 22:06:39,299] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_17-model_00-model_states.pt... 0: [2022-11-26 22:06:39,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_17-model_00-model_states.pt. 0: [2022-11-26 22:06:39,405] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_18-model_00-model_states.pt... 0: [2022-11-26 22:06:39,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_18-model_00-model_states.pt. 0: [2022-11-26 22:06:39,515] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_19-model_00-model_states.pt... 0: [2022-11-26 22:06:39,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 22:06:39,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_20-model_00-model_states.pt... 0: [2022-11-26 22:06:39,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_20-model_00-model_states.pt. 0: [2022-11-26 22:06:39,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_21-model_00-model_states.pt... 0: [2022-11-26 22:06:39,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_21-model_00-model_states.pt. 0: [2022-11-26 22:06:39,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_22-model_00-model_states.pt... 0: [2022-11-26 22:06:39,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_22-model_00-model_states.pt. 0: [2022-11-26 22:06:39,958] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_23-model_00-model_states.pt... 0: [2022-11-26 22:06:40,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_23-model_00-model_states.pt. 0: [2022-11-26 22:06:40,059] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_24-model_00-model_states.pt... 0: [2022-11-26 22:06:40,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_24-model_00-model_states.pt. 0: [2022-11-26 22:06:40,174] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_25-model_00-model_states.pt... 0: [2022-11-26 22:06:40,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 22:06:40,286] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_26-model_00-model_states.pt... 0: [2022-11-26 22:06:40,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_26-model_00-model_states.pt. 0: [2022-11-26 22:06:40,399] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_27-model_00-model_states.pt... 0: [2022-11-26 22:06:40,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_27-model_00-model_states.pt. 0: [2022-11-26 22:06:40,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_28-model_00-model_states.pt... 0: [2022-11-26 22:06:40,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_28-model_00-model_states.pt. 0: [2022-11-26 22:06:40,604] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_29-model_00-model_states.pt... 0: [2022-11-26 22:06:40,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_29-model_00-model_states.pt. 0: [2022-11-26 22:06:40,718] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_30-model_00-model_states.pt... 0: [2022-11-26 22:06:40,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_30-model_00-model_states.pt. 0: [2022-11-26 22:06:40,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/layer_32-model_00-model_states.pt... 0: [2022-11-26 22:06:40,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 22:06:40,836] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step88000/mp_rank_00_model_states.pt 0: [2022-11-26 22:06:40,836] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/mp_rank_00_model_states.pt... 0: [2022-11-26 22:06:40,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/mp_rank_00_model_states.pt. 0: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
13: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
1: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
5: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
9: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
7: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
6: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
12: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
0: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
9: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
3: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:06:40,877] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step88000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:06:41,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-26 22:06:41,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 22:06:41,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 6: [2022-11-26 22:06:41,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:06:41,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 22:06:41,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 0: [2022-11-26 22:06:41,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:06:41,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 22:06:41,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 1: [2022-11-26 22:06:41,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:06:41,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 22:06:41,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 14: [2022-11-26 22:06:41,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 22:06:41,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 22:06:41,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 11: [2022-11-26 22:06:41,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:06:41,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:06:41,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 22:06:41,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 14: [2022-11-26 22:06:41,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:06:41,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:06:41,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 10: [2022-11-26 22:06:41,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:06:41,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 14: [2022-11-26 22:06:41,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
10: [2022-11-26 22:06:41,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:06:41,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 10: [2022-11-26 22:06:41,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 22:06:41,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 22:06:41,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 10: [2022-11-26 22:06:41,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 3: [2022-11-26 22:06:41,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:06:41,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:06:41,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 22:06:41,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 22:06:41,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 3: [2022-11-26 22:06:41,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
11: [2022-11-26 22:06:41,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 22:06:41,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 3: [2022-11-26 22:06:41,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:06:41,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 22:06:41,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 14: [2022-11-26 22:06:41,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:06:41,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 22:06:41,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 14: [2022-11-26 22:06:41,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:06:41,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 22:06:41,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:06:41,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
14: [2022-11-26 22:06:41,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 22:06:41,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 1: [2022-11-26 22:06:41,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:06:41,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 13: [2022-11-26 22:06:41,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:06:41,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:06:41,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 8: [2022-11-26 22:06:41,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 22:06:41,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 8: [2022-11-26 22:06:41,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:06:41,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 22:06:41,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
8: [2022-11-26 22:06:41,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:06:41,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 22:06:41,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 0: [2022-11-26 22:06:41,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:06:41,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:06:41,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 22:06:41,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 11: [2022-11-26 22:06:41,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:06:41,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 22:06:41,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 22:06:41,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 22:06:41,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 11: [2022-11-26 22:06:41,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 8: [2022-11-26 22:06:41,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:06:41,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 0: [2022-11-26 22:06:41,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 11: [2022-11-26 22:06:41,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:06:41,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 22:06:41,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 8: [2022-11-26 22:06:41,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 22:06:41,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 11: [2022-11-26 22:06:41,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 22:06:41,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:06:41,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 22:06:41,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 22:06:41,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 11: [2022-11-26 22:06:41,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 2: [2022-11-26 22:06:41,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:06:41,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 22:06:41,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 2: [2022-11-26 22:06:41,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:06:41,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 22:06:41,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 22:06:41,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 22:06:41,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 2: [2022-11-26 22:06:41,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 6: [2022-11-26 22:06:41,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:06:41,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 22:06:41,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 2: [2022-11-26 22:06:41,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:06:41,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 22:06:41,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 3: [2022-11-26 22:06:41,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 22:06:41,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 22:06:41,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 10: [2022-11-26 22:06:41,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:06:41,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 22:06:41,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 3: [2022-11-26 22:06:41,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:06:41,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 22:06:41,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 3: [2022-11-26 22:06:41,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:06:41,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 22:06:41,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 7: [2022-11-26 22:06:41,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 22:06:41,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:06:41,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:06:41,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:06:41,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:06:41,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 22:06:41,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 22:06:41,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 22:06:41,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 22:06:41,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 22:06:41,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 7: [2022-11-26 22:06:41,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 7: [2022-11-26 22:06:41,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
7: [2022-11-26 22:06:41,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 7: [2022-11-26 22:06:41,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 14: [2022-11-26 22:06:41,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:06:41,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 22:06:41,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 6: [2022-11-26 22:06:41,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:06:41,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 22:06:41,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 6: [2022-11-26 22:06:41,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:06:41,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 22:06:41,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 0: [2022-11-26 22:06:41,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-26 22:06:41,068] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 22:06:41,068] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 1: [2022-11-26 22:06:41,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:06:41,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 22:06:41,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 3: [2022-11-26 22:06:41,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:06:41,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 22:06:41,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 13: [2022-11-26 22:06:41,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 8: [2022-11-26 22:06:41,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:06:41,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 13: [2022-11-26 22:06:41,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-26 22:06:41,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 22:06:41,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 13: [2022-11-26 22:06:41,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:06:41,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 8: [2022-11-26 22:06:41,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 13: [2022-11-26 22:06:41,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 8: [2022-11-26 22:06:41,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 8: [2022-11-26 22:06:41,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:06:41,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 22:06:41,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 10: [2022-11-26 22:06:41,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-26 22:06:41,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 22:06:41,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 10: [2022-11-26 22:06:41,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:06:41,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 22:06:41,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 10: [2022-11-26 22:06:41,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:06:41,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 22:06:41,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 11: [2022-11-26 22:06:41,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:06:41,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 22:06:41,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 1: [2022-11-26 22:06:41,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 22:06:41,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 22:06:41,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 1: [2022-11-26 22:06:41,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:06:41,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 22:06:41,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 7: [2022-11-26 22:06:41,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:06:41,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 22:06:41,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 2: [2022-11-26 22:06:41,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:06:41,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 22:06:41,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 2: [2022-11-26 22:06:41,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 22:06:41,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 22:06:41,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 13: [2022-11-26 22:06:41,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:06:41,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:06:41,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 22:06:41,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 13: [2022-11-26 22:06:41,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:06:41,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 13: [2022-11-26 22:06:41,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 8: [2022-11-26 22:06:41,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 13: [2022-11-26 22:06:41,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 10: [2022-11-26 22:06:41,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-26 22:06:41,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 22:06:41,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 6: [2022-11-26 22:06:41,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:06:41,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 22:06:41,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 6: [2022-11-26 22:06:41,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:06:41,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:06:41,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 22:06:41,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 22:06:41,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 6: [2022-11-26 22:06:41,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 1: [2022-11-26 22:06:41,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 22:06:41,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 22:06:41,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 14: [2022-11-26 22:06:41,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:06:41,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 22:06:41,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 3: [2022-11-26 22:06:41,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:06:41,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 22:06:41,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 13: [2022-11-26 22:06:41,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:06:41,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 22:06:41,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 0: [2022-11-26 22:06:41,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 22:06:41,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:06:41,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 22:06:41,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 0: [2022-11-26 22:06:41,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:06:41,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 22:06:41,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 11: [2022-11-26 22:06:41,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:06:41,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 22:06:41,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 8: [2022-11-26 22:06:41,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:06:41,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 22:06:41,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
7: [2022-11-26 22:06:41,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:06:41,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 22:06:41,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 14: [2022-11-26 22:06:41,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:06:41,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 22:06:41,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 7: [2022-11-26 22:06:41,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:06:41,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 22:06:41,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 0: [2022-11-26 22:06:41,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:06:41,121] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 22:06:41,121] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
6: [2022-11-26 22:06:41,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:06:41,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 22:06:41,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 2: [2022-11-26 22:06:41,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:06:41,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:06:41,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 22:06:41,125] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 22:06:41,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 2: [2022-11-26 22:06:41,125] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 9: [2022-11-26 22:06:41,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:06:41,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:06:41,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-26 22:06:41,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:06:41,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:06:41,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:06:41,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:06:41,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:06:41,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 22:06:41,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 22:06:41,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 22:06:41,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 22:06:41,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 22:06:41,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt
9: [2022-11-26 22:06:41,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 22:06:41,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 22:06:41,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 9: [2022-11-26 22:06:41,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 9: [2022-11-26 22:06:41,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 9: [2022-11-26 22:06:41,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 9: [2022-11-26 22:06:41,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 9: [2022-11-26 22:06:41,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 9: [2022-11-26 22:06:41,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 9: [2022-11-26 22:06:41,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 5: [2022-11-26 22:06:41,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:06:41,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:06:41,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt.
5: [2022-11-26 22:06:41,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:06:41,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 22:06:41,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 22:06:41,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 22:06:41,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 22:06:41,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 5: [2022-11-26 22:06:41,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 5: [2022-11-26 22:06:41,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 5: [2022-11-26 22:06:41,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 5: [2022-11-26 22:06:41,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:06:41,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 22:06:41,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
5: [2022-11-26 22:06:41,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:06:41,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 22:06:41,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 13: [2022-11-26 22:06:41,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:06:41,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 22:06:41,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 5: [2022-11-26 22:06:41,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:06:41,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 22:06:41,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 15: [2022-11-26 22:06:41,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:06:41,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:06:41,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 22:06:41,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 22:06:41,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 22:06:41,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 22:06:41,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 15: [2022-11-26 22:06:41,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 15: [2022-11-26 22:06:41,155] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 1: [2022-11-26 22:06:41,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:06:41,156] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 22:06:41,156] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 5: [2022-11-26 22:06:41,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:06:41,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 22:06:41,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
15: [2022-11-26 22:06:41,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:06:41,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 22:06:41,170] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 15: [2022-11-26 22:06:41,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:06:41,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:06:41,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:06:41,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 22:06:41,170] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:06:41,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 22:06:41,170] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 22:06:41,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
15: [2022-11-26 22:06:41,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 22:06:41,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 15: [2022-11-26 22:06:41,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 15: [2022-11-26 22:06:41,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 12: [2022-11-26 22:06:41,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:06:41,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:06:41,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:06:41,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:06:41,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:06:41,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:06:41,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 22:06:41,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 22:06:41,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:06:41,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 22:06:41,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 22:06:41,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 22:06:41,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 22:06:41,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 22:06:41,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 12: [2022-11-26 22:06:41,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 12: [2022-11-26 22:06:41,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 12: [2022-11-26 22:06:41,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
12: [2022-11-26 22:06:41,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 22:06:41,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 22:06:41,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 12: [2022-11-26 22:06:41,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 12: [2022-11-26 22:06:41,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 12: [2022-11-26 22:06:41,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 0: [2022-11-26 22:06:41,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 22:06:41,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 4: [2022-11-26 22:06:41,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:06:41,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:06:41,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:06:41,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 22:06:41,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 22:06:41,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:06:41,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:06:41,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 22:06:41,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:06:41,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 22:06:41,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 4: [2022-11-26 22:06:41,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 22:06:41,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 22:06:41,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 22:06:41,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 4: [2022-11-26 22:06:41,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
4: [2022-11-26 22:06:41,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 22:06:41,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 4: [2022-11-26 22:06:41,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 4: [2022-11-26 22:06:41,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 4: [2022-11-26 22:06:41,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:06:41,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 4: [2022-11-26 22:06:41,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step88000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 22:06:41,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step88000 is ready now! 
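Note on reading the per-iteration records in this log: the counters in each record are related by simple arithmetic that can be checked mechanically. The sketch below is a hypothetical helper, not part of Megatron-DeepSpeed. It assumes the field labels exactly as they are printed in this log, takes the sequence length from the `--seq-length 2048` flag in the launch command, and uses one record copied verbatim from this log as sample input. Because the capture is hard-wrapped mid-record, whitespace is collapsed before matching.

```python
import re

# Assumption: taken from --seq-length 2048 in the launch command of this log.
SEQ_LENGTH = 2048

# Hypothetical pattern built from the field labels as printed in this log.
RECORD_RE = re.compile(
    r"iteration\s+(?P<iteration>\d+)/\s*(?P<total>\d+)\s*\|"
    r"\s*consumed samples:\s*(?P<samples>\d+)\s*\|"
    r"\s*consumed tokens:\s*(?P<tokens>\d+)\s*\|"
    r"\s*elapsed time per iteration \(s\):\s*(?P<elapsed>[\d.]+)\s*\|"
    r"\s*learning rate:\s*(?P<lr>[\d.Ee+-]+)\s*\|"
    r"\s*global batch size:\s*(?P<batch>\d+)\s*\|"
    r"\s*lm loss:\s*(?P<loss>[\d.Ee+-]+)\s*\|"
    r".*?samples per second:\s*(?P<sps>[\d.]+)\s*\|"
    r"\s*TFLOPs:\s*(?P<tflops>[\d.]+)"
)

INT_FIELDS = ("iteration", "total", "samples", "tokens", "batch")


def parse_record(text):
    """Return the first iteration record found in text as a dict, or None."""
    # Collapse line wraps and repeated spaces so a wrapped record still matches.
    flat = " ".join(text.split())
    match = RECORD_RE.search(flat)
    if match is None:
        return None
    return {
        name: int(value) if name in INT_FIELDS else float(value)
        for name, value in match.groupdict().items()
    }


# One record copied verbatim from this log.
SAMPLE = (
    "15: iteration 88020/ 125429 | consumed samples: 22533120 | "
    "consumed tokens: 46147829760 | elapsed time per iteration (s): 1.04 | "
    "learning rate: 5.739E-05 | global batch size: 256 | "
    "lm loss: 1.940846E+00 | grad norm: 0.152 | num zeros: 0.0 | "
    "number of skipped iterations: 0 | number of nan iterations: 0 | "
    "samples per second: 246.536 | TFLOPs: 40.74 |"
)

rec = parse_record(SAMPLE)

# consumed samples = iteration * global batch size
samples_ok = rec["samples"] == rec["iteration"] * rec["batch"]
# consumed tokens = consumed samples * sequence length
tokens_ok = rec["tokens"] == rec["samples"] * SEQ_LENGTH
# elapsed time is printed to two decimals, so compare with a rounding tolerance
throughput_ok = abs(rec["batch"] / rec["sps"] - rec["elapsed"]) < 0.006

print(samples_ok, tokens_ok, throughput_ok)
```

The throughput check is deliberately tolerant: the log prints elapsed time rounded to two decimals, so `global batch size / samples per second` only agrees with it to within that rounding.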
0: successfully saved checkpoint at iteration 88000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3890.34 15: iteration 88010/ 125429 | consumed samples: 22530560 | consumed tokens: 46142586880 | elapsed time per iteration (s): 1.63 | learning rate: 5.741E-05 | global batch size: 256 | lm loss: 1.897254E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 156.873 | TFLOPs: 25.92 | 15: iteration 88020/ 125429 | consumed samples: 22533120 | consumed tokens: 46147829760 | elapsed time per iteration (s): 1.04 | learning rate: 5.739E-05 | global batch size: 256 | lm loss: 1.940846E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.536 | TFLOPs: 40.74 | 15: iteration 88030/ 125429 | consumed samples: 22535680 | consumed tokens: 46153072640 | elapsed time per iteration (s): 1.03 | learning rate: 5.737E-05 | global batch size: 256 | lm loss: 1.921420E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.761 | TFLOPs: 40.94 | 15: iteration 88040/ 125429 | consumed samples: 22538240 | consumed tokens: 46158315520 | elapsed time per iteration (s): 1.02 | learning rate: 5.735E-05 | global batch size: 256 | lm loss: 1.910178E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.193 | TFLOPs: 41.35 | 15: iteration 88050/ 125429 | consumed samples: 22540800 | consumed tokens: 46163558400 | elapsed time per iteration (s): 1.05 | learning rate: 5.733E-05 | global batch size: 256 | lm loss: 1.942630E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.506 | TFLOPs: 40.24 | 15: iteration 88060/ 125429 | consumed samples: 22543360 | consumed tokens: 46168801280 | elapsed time per iteration (s): 1.18 | 
learning rate: 5.731E-05 | global batch size: 256 | lm loss: 1.905654E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.991 | TFLOPs: 35.86 | 15: iteration 88070/ 125429 | consumed samples: 22545920 | consumed tokens: 46174044160 | elapsed time per iteration (s): 1.05 | learning rate: 5.730E-05 | global batch size: 256 | lm loss: 1.937743E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.297 | TFLOPs: 40.21 | 15: iteration 88080/ 125429 | consumed samples: 22548480 | consumed tokens: 46179287040 | elapsed time per iteration (s): 1.03 | learning rate: 5.728E-05 | global batch size: 256 | lm loss: 1.919658E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.173 | TFLOPs: 41.18 | 15: iteration 88090/ 125429 | consumed samples: 22551040 | consumed tokens: 46184529920 | elapsed time per iteration (s): 1.03 | learning rate: 5.726E-05 | global batch size: 256 | lm loss: 1.939561E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.607 | TFLOPs: 41.08 | 15: iteration 88100/ 125429 | consumed samples: 22553600 | consumed tokens: 46189772800 | elapsed time per iteration (s): 1.03 | learning rate: 5.724E-05 | global batch size: 256 | lm loss: 1.907514E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.650 | TFLOPs: 41.09 | 15: iteration 88110/ 125429 | consumed samples: 22556160 | consumed tokens: 46195015680 | elapsed time per iteration (s): 1.04 | learning rate: 5.722E-05 | global batch size: 256 | lm loss: 1.918433E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.199 | TFLOPs: 40.52 | 15: iteration 88120/ 
125429 | consumed samples: 22558720 | consumed tokens: 46200258560 | elapsed time per iteration (s): 1.06 | learning rate: 5.720E-05 | global batch size: 256 | lm loss: 1.916140E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.458 | TFLOPs: 39.90 | 15: iteration 88130/ 125429 | consumed samples: 22561280 | consumed tokens: 46205501440 | elapsed time per iteration (s): 1.16 | learning rate: 5.719E-05 | global batch size: 256 | lm loss: 1.927281E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 219.743 | TFLOPs: 36.31 | 15: iteration 88140/ 125429 | consumed samples: 22563840 | consumed tokens: 46210744320 | elapsed time per iteration (s): 1.05 | learning rate: 5.717E-05 | global batch size: 256 | lm loss: 1.913950E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.970 | TFLOPs: 40.15 | 15: iteration 88150/ 125429 | consumed samples: 22566400 | consumed tokens: 46215987200 | elapsed time per iteration (s): 1.02 | learning rate: 5.715E-05 | global batch size: 256 | lm loss: 1.918702E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.613 | TFLOPs: 41.42 | 15: iteration 88160/ 125429 | consumed samples: 22568960 | consumed tokens: 46221230080 | elapsed time per iteration (s): 1.05 | learning rate: 5.713E-05 | global batch size: 256 | lm loss: 1.909737E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.326 | TFLOPs: 40.38 | 15: iteration 88170/ 125429 | consumed samples: 22571520 | consumed tokens: 46226472960 | elapsed time per iteration (s): 1.03 | learning rate: 5.711E-05 | global batch size: 256 | lm loss: 1.921199E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 248.102 | TFLOPs: 41.00 | 15: iteration 88180/ 125429 | consumed samples: 22574080 | consumed tokens: 46231715840 | elapsed time per iteration (s): 1.08 | learning rate: 5.709E-05 | global batch size: 256 | lm loss: 1.932845E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.453 | TFLOPs: 39.24 | 15: iteration 88190/ 125429 | consumed samples: 22576640 | consumed tokens: 46236958720 | elapsed time per iteration (s): 1.04 | learning rate: 5.707E-05 | global batch size: 256 | lm loss: 1.937719E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.927 | TFLOPs: 40.81 | 15: iteration 88200/ 125429 | consumed samples: 22579200 | consumed tokens: 46242201600 | elapsed time per iteration (s): 1.03 | learning rate: 5.706E-05 | global batch size: 256 | lm loss: 1.944050E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.410 | TFLOPs: 41.05 | 15: iteration 88210/ 125429 | consumed samples: 22581760 | consumed tokens: 46247444480 | elapsed time per iteration (s): 1.02 | learning rate: 5.704E-05 | global batch size: 256 | lm loss: 1.925221E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.394 | TFLOPs: 41.38 | 15: iteration 88220/ 125429 | consumed samples: 22584320 | consumed tokens: 46252687360 | elapsed time per iteration (s): 1.07 | learning rate: 5.702E-05 | global batch size: 256 | lm loss: 1.932582E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.988 | TFLOPs: 39.66 | 15: iteration 88230/ 125429 | consumed samples: 22586880 | consumed tokens: 46257930240 | elapsed time per iteration (s): 1.04 | learning rate: 
5.700E-05 | global batch size: 256 | lm loss: 1.963193E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.184 | TFLOPs: 40.52 | 15: iteration 88240/ 125429 | consumed samples: 22589440 | consumed tokens: 46263173120 | elapsed time per iteration (s): 1.04 | learning rate: 5.698E-05 | global batch size: 256 | lm loss: 1.908304E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.930 | TFLOPs: 40.64 | 15: iteration 88250/ 125429 | consumed samples: 22592000 | consumed tokens: 46268416000 | elapsed time per iteration (s): 1.07 | learning rate: 5.696E-05 | global batch size: 256 | lm loss: 1.967206E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.760 | TFLOPs: 39.46 | 15: iteration 88260/ 125429 | consumed samples: 22594560 | consumed tokens: 46273658880 | elapsed time per iteration (s): 1.06 | learning rate: 5.695E-05 | global batch size: 256 | lm loss: 1.928236E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.596 | TFLOPs: 39.93 | 15: iteration 88270/ 125429 | consumed samples: 22597120 | consumed tokens: 46278901760 | elapsed time per iteration (s): 1.17 | learning rate: 5.693E-05 | global batch size: 256 | lm loss: 1.920135E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 219.635 | TFLOPs: 36.30 | 15: iteration 88280/ 125429 | consumed samples: 22599680 | consumed tokens: 46284144640 | elapsed time per iteration (s): 1.06 | learning rate: 5.691E-05 | global batch size: 256 | lm loss: 1.941848E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.631 | TFLOPs: 39.77 | 15: iteration 88290/ 125429 | 
consumed samples: 22602240 | consumed tokens: 46289387520 | elapsed time per iteration (s): 1.05 | learning rate: 5.689E-05 | global batch size: 256 | lm loss: 1.948808E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.230 | TFLOPs: 40.20 | 15: iteration 88300/ 125429 | consumed samples: 22604800 | consumed tokens: 46294630400 | elapsed time per iteration (s): 1.16 | learning rate: 5.687E-05 | global batch size: 256 | lm loss: 1.949073E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 220.591 | TFLOPs: 36.45 | 15: iteration 88310/ 125429 | consumed samples: 22607360 | consumed tokens: 46299873280 | elapsed time per iteration (s): 1.09 | learning rate: 5.685E-05 | global batch size: 256 | lm loss: 1.916991E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.836 | TFLOPs: 38.64 | 15: iteration 88320/ 125429 | consumed samples: 22609920 | consumed tokens: 46305116160 | elapsed time per iteration (s): 1.18 | learning rate: 5.684E-05 | global batch size: 256 | lm loss: 1.893258E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.882 | TFLOPs: 35.84 | 15: iteration 88330/ 125429 | consumed samples: 22612480 | consumed tokens: 46310359040 | elapsed time per iteration (s): 1.08 | learning rate: 5.682E-05 | global batch size: 256 | lm loss: 1.934881E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.933 | TFLOPs: 39.32 | 15: iteration 88340/ 125429 | consumed samples: 22615040 | consumed tokens: 46315601920 | elapsed time per iteration (s): 1.07 | learning rate: 5.680E-05 | global batch size: 256 | lm loss: 1.900080E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 239.728 | TFLOPs: 39.62 | 15: iteration 88350/ 125429 | consumed samples: 22617600 | consumed tokens: 46320844800 | elapsed time per iteration (s): 1.08 | learning rate: 5.678E-05 | global batch size: 256 | lm loss: 1.899603E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.898 | TFLOPs: 39.31 | 15: iteration 88360/ 125429 | consumed samples: 22620160 | consumed tokens: 46326087680 | elapsed time per iteration (s): 1.04 | learning rate: 5.676E-05 | global batch size: 256 | lm loss: 1.931623E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.266 | TFLOPs: 40.70 | 15: iteration 88370/ 125429 | consumed samples: 22622720 | consumed tokens: 46331330560 | elapsed time per iteration (s): 1.02 | learning rate: 5.674E-05 | global batch size: 256 | lm loss: 1.944509E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.118 | TFLOPs: 41.33 | 15: iteration 88380/ 125429 | consumed samples: 22625280 | consumed tokens: 46336573440 | elapsed time per iteration (s): 1.04 | learning rate: 5.673E-05 | global batch size: 256 | lm loss: 1.919429E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.094 | TFLOPs: 40.83 | 15: iteration 88390/ 125429 | consumed samples: 22627840 | consumed tokens: 46341816320 | elapsed time per iteration (s): 1.04 | learning rate: 5.671E-05 | global batch size: 256 | lm loss: 1.951238E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.252 | TFLOPs: 40.86 | 15: iteration 88400/ 125429 | consumed samples: 22630400 | consumed tokens: 46347059200 | elapsed time per iteration (s): 1.59 | learning rate: 5.669E-05 | global batch 
size: 256 | lm loss: 1.901805E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 161.424 | TFLOPs: 26.68 | 15: iteration 88410/ 125429 | consumed samples: 22632960 | consumed tokens: 46352302080 | elapsed time per iteration (s): 1.17 | learning rate: 5.667E-05 | global batch size: 256 | lm loss: 1.906263E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 218.094 | TFLOPs: 36.04 | 15: iteration 88420/ 125429 | consumed samples: 22635520 | consumed tokens: 46357544960 | elapsed time per iteration (s): 1.05 | learning rate: 5.665E-05 | global batch size: 256 | lm loss: 1.933408E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.746 | TFLOPs: 40.45 | 15: iteration 88430/ 125429 | consumed samples: 22638080 | consumed tokens: 46362787840 | elapsed time per iteration (s): 1.07 | learning rate: 5.663E-05 | global batch size: 256 | lm loss: 1.922926E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.346 | TFLOPs: 39.39 | 15: iteration 88440/ 125429 | consumed samples: 22640640 | consumed tokens: 46368030720 | elapsed time per iteration (s): 1.11 | learning rate: 5.662E-05 | global batch size: 256 | lm loss: 1.901670E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.538 | TFLOPs: 38.26 | 15: iteration 88450/ 125429 | consumed samples: 22643200 | consumed tokens: 46373273600 | elapsed time per iteration (s): 1.11 | learning rate: 5.660E-05 | global batch size: 256 | lm loss: 1.937149E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.940 | TFLOPs: 38.00 | 15: iteration 88460/ 125429 | consumed samples: 22645760 | 
consumed tokens: 46378516480 | elapsed time per iteration (s): 1.04 | learning rate: 5.658E-05 | global batch size: 256 | lm loss: 1.942538E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.495 | TFLOPs: 40.57 | 15: iteration 88470/ 125429 | consumed samples: 22648320 | consumed tokens: 46383759360 | elapsed time per iteration (s): 1.03 | learning rate: 5.656E-05 | global batch size: 256 | lm loss: 1.965054E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.732 | TFLOPs: 41.10 | 15: iteration 88480/ 125429 | consumed samples: 22650880 | consumed tokens: 46389002240 | elapsed time per iteration (s): 1.07 | learning rate: 5.654E-05 | global batch size: 256 | lm loss: 1.936371E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.292 | TFLOPs: 39.71 | 15: iteration 88490/ 125429 | consumed samples: 22653440 | consumed tokens: 46394245120 | elapsed time per iteration (s): 1.05 | learning rate: 5.652E-05 | global batch size: 256 | lm loss: 1.928083E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.953 | TFLOPs: 40.15 | 15: iteration 88500/ 125429 | consumed samples: 22656000 | consumed tokens: 46399488000 | elapsed time per iteration (s): 1.04 | learning rate: 5.651E-05 | global batch size: 256 | lm loss: 1.969404E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.421 | TFLOPs: 40.56 | 15: iteration 88510/ 125429 | consumed samples: 22658560 | consumed tokens: 46404730880 | elapsed time per iteration (s): 1.05 | learning rate: 5.649E-05 | global batch size: 256 | lm loss: 1.929172E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 243.911 | TFLOPs: 40.31 | 15: iteration 88520/ 125429 | consumed samples: 22661120 | consumed tokens: 46409973760 | elapsed time per iteration (s): 1.03 | learning rate: 5.647E-05 | global batch size: 256 | lm loss: 1.921518E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.665 | TFLOPs: 41.09 | 15: iteration 88530/ 125429 | consumed samples: 22663680 | consumed tokens: 46415216640 | elapsed time per iteration (s): 1.04 | learning rate: 5.645E-05 | global batch size: 256 | lm loss: 1.910504E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.802 | TFLOPs: 40.79 | 15: iteration 88540/ 125429 | consumed samples: 22666240 | consumed tokens: 46420459520 | elapsed time per iteration (s): 1.05 | learning rate: 5.643E-05 | global batch size: 256 | lm loss: 1.941263E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.798 | TFLOPs: 40.29 | 15: iteration 88550/ 125429 | consumed samples: 22668800 | consumed tokens: 46425702400 | elapsed time per iteration (s): 1.07 | learning rate: 5.641E-05 | global batch size: 256 | lm loss: 1.959752E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.840 | TFLOPs: 39.47 | 15: iteration 88560/ 125429 | consumed samples: 22671360 | consumed tokens: 46430945280 | elapsed time per iteration (s): 1.04 | learning rate: 5.640E-05 | global batch size: 256 | lm loss: 1.921104E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.563 | TFLOPs: 40.58 | 15: iteration 88570/ 125429 | consumed samples: 22673920 | consumed tokens: 46436188160 | elapsed time per iteration (s): 1.05 | learning rate: 5.638E-05 | global batch size: 256 | lm loss: 
1.920688E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.296 | TFLOPs: 40.37 | 15: iteration 88580/ 125429 | consumed samples: 22676480 | consumed tokens: 46441431040 | elapsed time per iteration (s): 1.06 | learning rate: 5.636E-05 | global batch size: 256 | lm loss: 1.949712E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.437 | TFLOPs: 40.06 | 15: iteration 88590/ 125429 | consumed samples: 22679040 | consumed tokens: 46446673920 | elapsed time per iteration (s): 1.03 | learning rate: 5.634E-05 | global batch size: 256 | lm loss: 1.909017E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.744 | TFLOPs: 40.94 | 15: iteration 88600/ 125429 | consumed samples: 22681600 | consumed tokens: 46451916800 | elapsed time per iteration (s): 1.04 | learning rate: 5.632E-05 | global batch size: 256 | lm loss: 1.906713E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.887 | TFLOPs: 40.63 | 15: iteration 88610/ 125429 | consumed samples: 22684160 | consumed tokens: 46457159680 | elapsed time per iteration (s): 1.06 | learning rate: 5.630E-05 | global batch size: 256 | lm loss: 1.922870E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.574 | TFLOPs: 39.76 | 15: iteration 88620/ 125429 | consumed samples: 22686720 | consumed tokens: 46462402560 | elapsed time per iteration (s): 1.04 | learning rate: 5.629E-05 | global batch size: 256 | lm loss: 1.935162E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.377 | TFLOPs: 40.55 | 15: iteration 88630/ 125429 | consumed samples: 22689280 | consumed tokens: 
46467645440 | elapsed time per iteration (s): 1.04 | learning rate: 5.627E-05 | global batch size: 256 | lm loss: 1.945528E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.642 | TFLOPs: 40.59 | 15: iteration 88640/ 125429 | consumed samples: 22691840 | consumed tokens: 46472888320 | elapsed time per iteration (s): 1.07 | learning rate: 5.625E-05 | global batch size: 256 | lm loss: 1.932312E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.035 | TFLOPs: 39.50 | 15: iteration 88650/ 125429 | consumed samples: 22694400 | consumed tokens: 46478131200 | elapsed time per iteration (s): 1.03 | learning rate: 5.623E-05 | global batch size: 256 | lm loss: 1.914568E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.215 | TFLOPs: 41.18 | 15: iteration 88660/ 125429 | consumed samples: 22696960 | consumed tokens: 46483374080 | elapsed time per iteration (s): 1.04 | learning rate: 5.621E-05 | global batch size: 256 | lm loss: 1.933567E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.272 | TFLOPs: 40.53 | 15: iteration 88670/ 125429 | consumed samples: 22699520 | consumed tokens: 46488616960 | elapsed time per iteration (s): 1.04 | learning rate: 5.619E-05 | global batch size: 256 | lm loss: 1.937294E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.003 | TFLOPs: 40.49 | 15: iteration 88680/ 125429 | consumed samples: 22702080 | consumed tokens: 46493859840 | elapsed time per iteration (s): 1.06 | learning rate: 5.618E-05 | global batch size: 256 | lm loss: 1.938075E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 240.815 | TFLOPs: 39.80 | 15: iteration 88690/ 125429 | consumed samples: 22704640 | consumed tokens: 46499102720 | elapsed time per iteration (s): 1.12 | learning rate: 5.616E-05 | global batch size: 256 | lm loss: 1.907737E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.374 | TFLOPs: 37.74 | 15: iteration 88700/ 125429 | consumed samples: 22707200 | consumed tokens: 46504345600 | elapsed time per iteration (s): 1.08 | learning rate: 5.614E-05 | global batch size: 256 | lm loss: 1.932070E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.287 | TFLOPs: 39.21 | 15: iteration 88710/ 125429 | consumed samples: 22709760 | consumed tokens: 46509588480 | elapsed time per iteration (s): 1.08 | learning rate: 5.612E-05 | global batch size: 256 | lm loss: 1.923743E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.500 | TFLOPs: 39.25 | 15: iteration 88720/ 125429 | consumed samples: 22712320 | consumed tokens: 46514831360 | elapsed time per iteration (s): 1.07 | learning rate: 5.610E-05 | global batch size: 256 | lm loss: 1.902535E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.458 | TFLOPs: 39.41 | 15: iteration 88730/ 125429 | consumed samples: 22714880 | consumed tokens: 46520074240 | elapsed time per iteration (s): 1.03 | learning rate: 5.609E-05 | global batch size: 256 | lm loss: 1.952076E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.885 | TFLOPs: 40.96 | 15: iteration 88740/ 125429 | consumed samples: 22717440 | consumed tokens: 46525317120 | elapsed time per iteration (s): 1.05 | learning rate: 5.607E-05 | global batch size: 256 | lm loss: 1.933196E+00 | grad 
norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.491 | TFLOPs: 40.40 | 15: iteration 88750/ 125429 | consumed samples: 22720000 | consumed tokens: 46530560000 | elapsed time per iteration (s): 1.12 | learning rate: 5.605E-05 | global batch size: 256 | lm loss: 1.927959E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.407 | TFLOPs: 37.75 | 15: iteration 88760/ 125429 | consumed samples: 22722560 | consumed tokens: 46535802880 | elapsed time per iteration (s): 1.05 | learning rate: 5.603E-05 | global batch size: 256 | lm loss: 1.931554E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.311 | TFLOPs: 40.21 | 15: iteration 88770/ 125429 | consumed samples: 22725120 | consumed tokens: 46541045760 | elapsed time per iteration (s): 1.09 | learning rate: 5.601E-05 | global batch size: 256 | lm loss: 1.931355E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.816 | TFLOPs: 38.64 | 15: iteration 88780/ 125429 | consumed samples: 22727680 | consumed tokens: 46546288640 | elapsed time per iteration (s): 1.06 | learning rate: 5.599E-05 | global batch size: 256 | lm loss: 1.928018E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.455 | TFLOPs: 39.90 | 15: iteration 88790/ 125429 | consumed samples: 22730240 | consumed tokens: 46551531520 | elapsed time per iteration (s): 1.06 | learning rate: 5.598E-05 | global batch size: 256 | lm loss: 1.905800E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.772 | TFLOPs: 39.95 | 15: iteration 88800/ 125429 | consumed samples: 22732800 | consumed tokens: 46556774400 | elapsed time 
per iteration (s): 1.14 | learning rate: 5.596E-05 | global batch size: 256 | lm loss: 1.958062E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 224.822 | TFLOPs: 37.15 | 15: iteration 88810/ 125429 | consumed samples: 22735360 | consumed tokens: 46562017280 | elapsed time per iteration (s): 1.12 | learning rate: 5.594E-05 | global batch size: 256 | lm loss: 1.923784E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.436 | TFLOPs: 37.92 | 15: iteration 88820/ 125429 | consumed samples: 22737920 | consumed tokens: 46567260160 | elapsed time per iteration (s): 1.05 | learning rate: 5.592E-05 | global batch size: 256 | lm loss: 1.950200E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.818 | TFLOPs: 40.13 | 15: iteration 88830/ 125429 | consumed samples: 22740480 | consumed tokens: 46572503040 | elapsed time per iteration (s): 1.04 | learning rate: 5.590E-05 | global batch size: 256 | lm loss: 1.919725E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.527 | TFLOPs: 40.74 | 15: iteration 88840/ 125429 | consumed samples: 22743040 | consumed tokens: 46577745920 | elapsed time per iteration (s): 1.05 | learning rate: 5.588E-05 | global batch size: 256 | lm loss: 1.951572E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.761 | TFLOPs: 40.12 | 15: iteration 88850/ 125429 | consumed samples: 22745600 | consumed tokens: 46582988800 | elapsed time per iteration (s): 1.13 | learning rate: 5.587E-05 | global batch size: 256 | lm loss: 1.925725E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.873 | TFLOPs: 
37.33 | 15: iteration 88860/ 125429 | consumed samples: 22748160 | consumed tokens: 46588231680 | elapsed time per iteration (s): 1.03 | learning rate: 5.585E-05 | global batch size: 256 | lm loss: 1.909567E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.808 | TFLOPs: 41.12 | 15: iteration 88870/ 125429 | consumed samples: 22750720 | consumed tokens: 46593474560 | elapsed time per iteration (s): 1.10 | learning rate: 5.583E-05 | global batch size: 256 | lm loss: 1.925199E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.826 | TFLOPs: 38.31 | 15: iteration 88880/ 125429 | consumed samples: 22753280 | consumed tokens: 46598717440 | elapsed time per iteration (s): 1.03 | learning rate: 5.581E-05 | global batch size: 256 | lm loss: 1.911659E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.256 | TFLOPs: 41.19 | 15: iteration 88890/ 125429 | consumed samples: 22755840 | consumed tokens: 46603960320 | elapsed time per iteration (s): 1.09 | learning rate: 5.579E-05 | global batch size: 256 | lm loss: 1.948512E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.814 | TFLOPs: 38.97 | 15: iteration 88900/ 125429 | consumed samples: 22758400 | consumed tokens: 46609203200 | elapsed time per iteration (s): 1.04 | learning rate: 5.578E-05 | global batch size: 256 | lm loss: 1.947225E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.079 | TFLOPs: 40.83 | 15: iteration 88910/ 125429 | consumed samples: 22760960 | consumed tokens: 46614446080 | elapsed time per iteration (s): 1.04 | learning rate: 5.576E-05 | global batch size: 256 | lm loss: 1.918671E+00 | grad norm: 0.146 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.495 | TFLOPs: 40.57 | 15: iteration 88920/ 125429 | consumed samples: 22763520 | consumed tokens: 46619688960 | elapsed time per iteration (s): 1.06 | learning rate: 5.574E-05 | global batch size: 256 | lm loss: 1.930315E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.456 | TFLOPs: 40.07 | 15: iteration 88930/ 125429 | consumed samples: 22766080 | consumed tokens: 46624931840 | elapsed time per iteration (s): 1.04 | learning rate: 5.572E-05 | global batch size: 256 | lm loss: 1.919316E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.352 | TFLOPs: 40.71 | 15: iteration 88940/ 125429 | consumed samples: 22768640 | consumed tokens: 46630174720 | elapsed time per iteration (s): 1.19 | learning rate: 5.570E-05 | global batch size: 256 | lm loss: 1.919969E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.054 | TFLOPs: 35.54 | 15: iteration 88950/ 125429 | consumed samples: 22771200 | consumed tokens: 46635417600 | elapsed time per iteration (s): 1.07 | learning rate: 5.568E-05 | global batch size: 256 | lm loss: 1.947864E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.177 | TFLOPs: 39.69 | 15: iteration 88960/ 125429 | consumed samples: 22773760 | consumed tokens: 46640660480 | elapsed time per iteration (s): 1.06 | learning rate: 5.567E-05 | global batch size: 256 | lm loss: 1.916872E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.658 | TFLOPs: 39.94 | 15: iteration 88970/ 125429 | consumed samples: 22776320 | consumed tokens: 46645903360 | elapsed time per iteration (s): 1.07 | 
learning rate: 5.565E-05 | global batch size: 256 | lm loss: 1.923365E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.351 | TFLOPs: 39.72 | 15: iteration 88980/ 125429 | consumed samples: 22778880 | consumed tokens: 46651146240 | elapsed time per iteration (s): 1.06 | learning rate: 5.563E-05 | global batch size: 256 | lm loss: 1.906017E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.850 | TFLOPs: 39.80 | 15: iteration 88990/ 125429 | consumed samples: 22781440 | consumed tokens: 46656389120 | elapsed time per iteration (s): 1.06 | learning rate: 5.561E-05 | global batch size: 256 | lm loss: 1.925963E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.704 | TFLOPs: 39.78 | 15: iteration 89000/ 125429 | consumed samples: 22784000 | consumed tokens: 46661632000 | elapsed time per iteration (s): 1.06 | learning rate: 5.559E-05 | global batch size: 256 | lm loss: 1.894232E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.044 | TFLOPs: 39.83 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 89000 | lm loss value: 1.802145E+00 | lm loss PPL: 6.062638E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 89000 to checkpoints_1b5 0: [2022-11-26 22:24:33,201] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step89000 is begin to save! 0: [2022-11-26 22:24:33,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_01-model_00-model_states.pt... 
0: [2022-11-26 22:24:33,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_01-model_00-model_states.pt. 0: [2022-11-26 22:24:33,551] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_03-model_00-model_states.pt... 0: [2022-11-26 22:24:33,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_03-model_00-model_states.pt. 0: [2022-11-26 22:24:33,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_04-model_00-model_states.pt... 0: [2022-11-26 22:24:33,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_04-model_00-model_states.pt. 0: [2022-11-26 22:24:33,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_05-model_00-model_states.pt... 0: [2022-11-26 22:24:33,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_05-model_00-model_states.pt. 0: [2022-11-26 22:24:33,890] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_06-model_00-model_states.pt... 0: [2022-11-26 22:24:34,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_06-model_00-model_states.pt. 0: [2022-11-26 22:24:34,005] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_07-model_00-model_states.pt... 0: [2022-11-26 22:24:34,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_07-model_00-model_states.pt. 0: [2022-11-26 22:24:34,120] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_08-model_00-model_states.pt... 
0: [2022-11-26 22:24:34,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_08-model_00-model_states.pt. 0: [2022-11-26 22:24:34,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_09-model_00-model_states.pt... 0: [2022-11-26 22:24:34,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_09-model_00-model_states.pt. 0: [2022-11-26 22:24:34,336] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_10-model_00-model_states.pt... 0: [2022-11-26 22:24:34,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_10-model_00-model_states.pt. 0: [2022-11-26 22:24:34,441] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_11-model_00-model_states.pt... 0: [2022-11-26 22:24:34,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_11-model_00-model_states.pt. 0: [2022-11-26 22:24:34,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_12-model_00-model_states.pt... 0: [2022-11-26 22:24:34,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_12-model_00-model_states.pt. 0: [2022-11-26 22:24:34,653] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_13-model_00-model_states.pt... 0: [2022-11-26 22:24:34,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_13-model_00-model_states.pt. 0: [2022-11-26 22:24:34,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_14-model_00-model_states.pt... 
0: [2022-11-26 22:24:34,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_14-model_00-model_states.pt. 0: [2022-11-26 22:24:34,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_15-model_00-model_states.pt... 0: [2022-11-26 22:24:34,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_15-model_00-model_states.pt. 0: [2022-11-26 22:24:34,970] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_16-model_00-model_states.pt... 0: [2022-11-26 22:24:35,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_16-model_00-model_states.pt. 0: [2022-11-26 22:24:35,075] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_17-model_00-model_states.pt... 0: [2022-11-26 22:24:35,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_17-model_00-model_states.pt. 0: [2022-11-26 22:24:35,183] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_18-model_00-model_states.pt... 0: [2022-11-26 22:24:35,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_18-model_00-model_states.pt. 0: [2022-11-26 22:24:35,286] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_19-model_00-model_states.pt... 0: [2022-11-26 22:24:35,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_19-model_00-model_states.pt. 0: [2022-11-26 22:24:35,387] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_20-model_00-model_states.pt... 
0: [2022-11-26 22:24:35,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_20-model_00-model_states.pt. 0: [2022-11-26 22:24:35,495] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_21-model_00-model_states.pt... 0: [2022-11-26 22:24:35,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_21-model_00-model_states.pt. 0: [2022-11-26 22:24:35,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_22-model_00-model_states.pt... 0: [2022-11-26 22:24:35,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_22-model_00-model_states.pt. 0: [2022-11-26 22:24:35,704] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_23-model_00-model_states.pt... 0: [2022-11-26 22:24:35,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_23-model_00-model_states.pt. 0: [2022-11-26 22:24:35,809] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_24-model_00-model_states.pt... 0: [2022-11-26 22:24:35,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_24-model_00-model_states.pt. 0: [2022-11-26 22:24:35,914] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_25-model_00-model_states.pt... 0: [2022-11-26 22:24:36,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_25-model_00-model_states.pt. 0: [2022-11-26 22:24:36,019] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_26-model_00-model_states.pt... 
0: [2022-11-26 22:24:36,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_26-model_00-model_states.pt. 0: [2022-11-26 22:24:36,125] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_27-model_00-model_states.pt... 0: [2022-11-26 22:24:36,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_27-model_00-model_states.pt. 0: [2022-11-26 22:24:36,228] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_28-model_00-model_states.pt... 0: [2022-11-26 22:24:36,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_28-model_00-model_states.pt. 0: [2022-11-26 22:24:36,334] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_29-model_00-model_states.pt... 0: [2022-11-26 22:24:36,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_29-model_00-model_states.pt. 0: [2022-11-26 22:24:36,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_30-model_00-model_states.pt... 0: [2022-11-26 22:24:36,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_30-model_00-model_states.pt. 0: [2022-11-26 22:24:36,552] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/layer_32-model_00-model_states.pt... 0: [2022-11-26 22:24:36,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 22:24:36,557] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step89000/mp_rank_00_model_states.pt 0: [2022-11-26 22:24:36,557] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/mp_rank_00_model_states.pt... 0: [2022-11-26 22:24:36,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/mp_rank_00_model_states.pt. 0: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
14: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
4: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
6: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
8: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
5: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
13: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
12: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
10: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:24:36,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step89000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:24:36,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-26 22:24:36,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 22:24:36,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 6: [2022-11-26 22:24:36,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:24:36,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 22:24:36,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 9: [2022-11-26 22:24:36,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:24:36,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 22:24:36,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 2: [2022-11-26 22:24:36,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:24:36,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 22:24:36,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 2: [2022-11-26 22:24:36,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-26 22:24:36,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 22:24:36,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 9: [2022-11-26 22:24:36,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:24:36,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 22:24:36,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 5: [2022-11-26 22:24:36,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:24:36,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:24:36,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 22:24:36,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 22:24:36,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 5: [2022-11-26 22:24:36,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 9: [2022-11-26 22:24:36,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-26 22:24:36,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 22:24:36,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 11: [2022-11-26 22:24:36,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:24:36,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:24:36,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 22:24:36,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 22:24:36,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 11: [2022-11-26 22:24:36,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 2: [2022-11-26 22:24:36,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:24:36,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 22:24:36,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 12: [2022-11-26 22:24:36,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-26 22:24:36,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 22:24:36,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 0: [2022-11-26 22:24:36,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:24:36,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 22:24:36,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 2: [2022-11-26 22:24:36,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:24:36,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 22:24:36,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 0: [2022-11-26 22:24:36,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:24:36,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 22:24:36,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 8: [2022-11-26 22:24:36,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 22:24:36,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:24:36,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:24:36,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 22:24:36,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 11: [2022-11-26 22:24:36,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:24:36,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 22:24:36,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 6: [2022-11-26 22:24:36,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:24:36,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
6: [2022-11-26 22:24:36,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 2: [2022-11-26 22:24:36,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 6: [2022-11-26 22:24:36,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 2: [2022-11-26 22:24:36,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 1: [2022-11-26 22:24:36,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:24:36,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 22:24:36,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 6: [2022-11-26 22:24:36,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:24:36,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 22:24:36,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 5: [2022-11-26 22:24:36,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 22:24:36,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 22:24:36,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 8: [2022-11-26 22:24:36,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 22:24:36,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 8: [2022-11-26 22:24:36,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 8: [2022-11-26 22:24:36,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 9: [2022-11-26 22:24:36,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:24:36,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 22:24:36,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 1: [2022-11-26 22:24:36,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:24:36,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 22:24:36,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 1: [2022-11-26 22:24:36,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 22:24:36,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 22:24:36,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 11: [2022-11-26 22:24:36,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:24:36,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 22:24:36,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 0: [2022-11-26 22:24:36,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:24:36,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 22:24:36,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 0: [2022-11-26 22:24:36,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:24:36,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 22:24:36,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 5: [2022-11-26 22:24:36,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 22:24:36,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:24:36,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 22:24:36,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 22:24:36,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:24:36,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:24:36,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 5: [2022-11-26 22:24:36,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 5: [2022-11-26 22:24:36,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 0: [2022-11-26 22:24:36,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 5: [2022-11-26 22:24:36,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 0: [2022-11-26 22:24:36,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 6: [2022-11-26 22:24:36,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 22:24:36,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 8: [2022-11-26 22:24:36,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:24:36,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 8: [2022-11-26 22:24:36,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 22:24:36,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 2: [2022-11-26 22:24:36,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:24:36,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 22:24:36,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 12: [2022-11-26 22:24:36,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:24:36,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 22:24:36,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 1: [2022-11-26 22:24:36,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-26 22:24:36,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:24:36,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 22:24:36,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 22:24:36,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 1: [2022-11-26 22:24:36,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 12: [2022-11-26 22:24:36,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:24:36,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 22:24:36,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 12: [2022-11-26 22:24:36,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:24:36,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 22:24:36,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 9: [2022-11-26 22:24:36,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-26 22:24:36,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 22:24:36,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 6: [2022-11-26 22:24:36,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:24:36,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 22:24:36,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 11: [2022-11-26 22:24:36,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:24:36,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 22:24:36,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 1: [2022-11-26 22:24:36,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:24:36,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 22:24:36,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 9: [2022-11-26 22:24:36,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 22:24:36,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 22:24:36,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 11: [2022-11-26 22:24:36,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:24:36,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:24:36,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 22:24:36,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 22:24:36,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 11: [2022-11-26 22:24:36,799] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 12: [2022-11-26 22:24:36,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:24:36,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 22:24:36,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 12: [2022-11-26 22:24:36,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 22:24:36,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 22:24:36,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 9: [2022-11-26 22:24:36,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:24:36,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 22:24:36,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 12: [2022-11-26 22:24:36,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:24:36,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 22:24:36,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 12: [2022-11-26 22:24:36,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:24:36,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 22:24:36,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 6: [2022-11-26 22:24:36,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 22:24:36,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 22:24:36,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 5: [2022-11-26 22:24:36,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:24:36,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 22:24:36,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 6: [2022-11-26 22:24:36,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:24:36,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 22:24:36,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 8: [2022-11-26 22:24:36,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:24:36,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-26 22:24:36,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 22:24:36,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 5: [2022-11-26 22:24:36,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:24:36,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 8: [2022-11-26 22:24:36,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 5: [2022-11-26 22:24:36,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 22:24:36,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 6: [2022-11-26 22:24:36,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:24:36,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 22:24:36,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 4: [2022-11-26 22:24:36,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 22:24:36,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 22:24:36,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 4: [2022-11-26 22:24:36,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:24:36,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:24:36,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:24:36,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 22:24:36,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 4: [2022-11-26 22:24:36,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 22:24:36,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 22:24:36,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 4: [2022-11-26 22:24:36,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 4: [2022-11-26 22:24:36,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-26 22:24:36,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:24:36,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 22:24:36,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 22:24:36,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 4: [2022-11-26 22:24:36,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 4: [2022-11-26 22:24:36,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:24:36,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 22:24:36,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 4: [2022-11-26 22:24:36,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:24:36,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 22:24:36,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 13: [2022-11-26 22:24:36,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-26 22:24:36,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:24:36,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:24:36,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:24:36,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:24:36,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:24:36,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:24:36,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-26 22:24:36,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 22:24:36,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 22:24:36,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 22:24:36,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 22:24:36,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 22:24:36,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 13: [2022-11-26 22:24:36,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 22:24:36,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 22:24:36,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 22:24:36,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 13: [2022-11-26 22:24:36,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 13: [2022-11-26 22:24:36,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
13: [2022-11-26 22:24:36,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 13: [2022-11-26 22:24:36,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 13: [2022-11-26 22:24:36,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 13: [2022-11-26 22:24:36,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 7: [2022-11-26 22:24:36,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:24:36,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:24:36,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:24:36,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 22:24:36,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 22:24:36,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 22:24:36,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 22:24:36,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 22:24:36,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 7: [2022-11-26 22:24:36,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 7: [2022-11-26 22:24:36,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 7: [2022-11-26 22:24:36,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 7: [2022-11-26 22:24:36,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:24:36,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:24:36,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:24:36,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 22:24:36,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 22:24:36,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 22:24:36,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 22:24:36,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 22:24:36,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 7: [2022-11-26 22:24:36,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 7: [2022-11-26 22:24:36,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 7: [2022-11-26 22:24:36,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 8: [2022-11-26 22:24:36,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:24:36,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 22:24:36,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 8: [2022-11-26 22:24:36,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 22:24:36,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 9: [2022-11-26 22:24:36,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:24:36,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 9: [2022-11-26 22:24:36,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 22:24:36,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 3: [2022-11-26 22:24:36,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:24:36,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:24:36,881] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:24:36,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 22:24:36,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 22:24:36,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 3: [2022-11-26 22:24:36,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
3: [2022-11-26 22:24:36,881] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 22:24:36,881] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 1: [2022-11-26 22:24:36,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:24:36,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:24:36,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:24:36,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:24:36,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:24:36,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:24:36,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:24:36,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:24:36,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-26 22:24:36,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 22:24:36,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 22:24:36,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 22:24:36,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 22:24:36,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 22:24:36,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 22:24:36,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 22:24:36,892] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 22:24:36,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 10: [2022-11-26 22:24:36,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 10: [2022-11-26 22:24:36,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 10: [2022-11-26 22:24:36,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
10: [2022-11-26 22:24:36,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 10: [2022-11-26 22:24:36,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 10: [2022-11-26 22:24:36,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 10: [2022-11-26 22:24:36,892] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 2: [2022-11-26 22:24:36,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:24:36,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 22:24:36,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 2: [2022-11-26 22:24:36,893] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:24:36,893] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 22:24:36,893] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 3: [2022-11-26 22:24:36,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:24:36,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 22:24:36,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
1: [2022-11-26 22:24:36,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 22:24:36,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 1: [2022-11-26 22:24:36,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:24:36,890] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 3: [2022-11-26 22:24:36,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:24:36,890] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 3: [2022-11-26 22:24:36,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 22:24:36,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 3: [2022-11-26 22:24:36,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:24:36,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 22:24:36,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 3: [2022-11-26 22:24:36,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-26 22:24:36,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 22:24:36,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 3: [2022-11-26 22:24:36,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:24:36,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 22:24:36,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 0: [2022-11-26 22:24:36,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:24:36,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:24:36,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 22:24:36,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 0: [2022-11-26 22:24:36,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:24:36,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 22:24:36,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
14: [2022-11-26 22:24:36,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:24:36,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:24:36,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:24:36,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:24:36,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:24:36,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:24:36,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 22:24:36,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 22:24:36,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 22:24:36,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 22:24:36,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
14: [2022-11-26 22:24:36,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 14: [2022-11-26 22:24:36,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 14: [2022-11-26 22:24:36,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:24:36,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 14: [2022-11-26 22:24:36,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 22:24:36,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 22:24:36,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 22:24:36,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 14: [2022-11-26 22:24:36,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 14: [2022-11-26 22:24:36,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:24:36,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 14: [2022-11-26 22:24:36,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 22:24:36,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
0: [2022-11-26 22:24:37,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 22:24:37,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 15: [2022-11-26 22:24:37,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:24:37,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:24:37,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:24:37,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:24:37,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:24:37,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 22:24:37,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:24:37,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:24:37,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 22:24:37,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 22:24:37,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 22:24:37,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 22:24:37,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 22:24:37,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 15: [2022-11-26 22:24:37,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 22:24:37,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 15: [2022-11-26 22:24:37,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 22:24:37,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step89000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 22:24:37,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 15: [2022-11-26 22:24:37,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 15: [2022-11-26 22:24:37,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 
15: [2022-11-26 22:24:37,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 15: [2022-11-26 22:24:37,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 15: [2022-11-26 22:24:37,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step89000 is ready now! 0: successfully saved checkpoint at iteration 89000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3873.32 15: iteration 89010/ 125429 | consumed samples: 22786560 | consumed tokens: 46666874880 | elapsed time per iteration (s): 1.50 | learning rate: 5.558E-05 | global batch size: 256 | lm loss: 1.916829E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 170.613 | TFLOPs: 28.20 | 15: iteration 89020/ 125429 | consumed samples: 22789120 | consumed tokens: 46672117760 | elapsed time per iteration (s): 1.04 | learning rate: 5.556E-05 | global batch size: 256 | lm loss: 1.924686E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.946 | TFLOPs: 40.64 | 15: iteration 89030/ 125429 | consumed samples: 22791680 | consumed tokens: 46677360640 | elapsed time per iteration (s): 1.08 | learning rate: 5.554E-05 | global batch size: 256 | lm loss: 1.934031E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.375 | TFLOPs: 39.06 | 15: iteration 89040/ 125429 | consumed samples: 22794240 | consumed tokens: 46682603520 | elapsed time per iteration (s): 1.05 | learning rate: 5.552E-05 | global batch size: 256 | lm loss: 1.915140E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.051 | TFLOPs: 40.33 | 15: iteration 89050/ 125429 | consumed samples: 22796800 | consumed tokens: 46687846400 | elapsed time 
per iteration (s): 1.05 | learning rate: 5.550E-05 | global batch size: 256 | lm loss: 1.933557E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.155 | TFLOPs: 40.18 | 15: iteration 89060/ 125429 | consumed samples: 22799360 | consumed tokens: 46693089280 | elapsed time per iteration (s): 1.05 | learning rate: 5.549E-05 | global batch size: 256 | lm loss: 1.911145E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.530 | TFLOPs: 40.41 | 15: iteration 89070/ 125429 | consumed samples: 22801920 | consumed tokens: 46698332160 | elapsed time per iteration (s): 1.05 | learning rate: 5.547E-05 | global batch size: 256 | lm loss: 1.933843E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.991 | TFLOPs: 40.32 | 15: iteration 89080/ 125429 | consumed samples: 22804480 | consumed tokens: 46703575040 | elapsed time per iteration (s): 1.09 | learning rate: 5.545E-05 | global batch size: 256 | lm loss: 1.937151E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.940 | TFLOPs: 38.83 | 15: iteration 89090/ 125429 | consumed samples: 22807040 | consumed tokens: 46708817920 | elapsed time per iteration (s): 1.05 | learning rate: 5.543E-05 | global batch size: 256 | lm loss: 1.917457E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.972 | TFLOPs: 40.32 | 15: iteration 89100/ 125429 | consumed samples: 22809600 | consumed tokens: 46714060800 | elapsed time per iteration (s): 1.07 | learning rate: 5.541E-05 | global batch size: 256 | lm loss: 1.930284E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.918 | TFLOPs: 
39.48 | 15: iteration 89110/ 125429 | consumed samples: 22812160 | consumed tokens: 46719303680 | elapsed time per iteration (s): 1.05 | learning rate: 5.539E-05 | global batch size: 256 | lm loss: 1.955594E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.475 | TFLOPs: 40.40 | 15: iteration 89120/ 125429 | consumed samples: 22814720 | consumed tokens: 46724546560 | elapsed time per iteration (s): 1.05 | learning rate: 5.538E-05 | global batch size: 256 | lm loss: 1.918164E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.619 | TFLOPs: 40.43 | 15: iteration 89130/ 125429 | consumed samples: 22817280 | consumed tokens: 46729789440 | elapsed time per iteration (s): 1.07 | learning rate: 5.536E-05 | global batch size: 256 | lm loss: 1.911559E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.119 | TFLOPs: 39.68 | 15: iteration 89140/ 125429 | consumed samples: 22819840 | consumed tokens: 46735032320 | elapsed time per iteration (s): 1.05 | learning rate: 5.534E-05 | global batch size: 256 | lm loss: 1.925169E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.230 | TFLOPs: 40.20 | 15: iteration 89150/ 125429 | consumed samples: 22822400 | consumed tokens: 46740275200 | elapsed time per iteration (s): 1.03 | learning rate: 5.532E-05 | global batch size: 256 | lm loss: 1.940870E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.026 | TFLOPs: 40.99 | 15: iteration 89160/ 125429 | consumed samples: 22824960 | consumed tokens: 46745518080 | elapsed time per iteration (s): 1.04 | learning rate: 5.530E-05 | global batch size: 256 | lm loss: 1.953695E+00 | grad norm: 0.150 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.968 | TFLOPs: 40.65 | 15: iteration 89170/ 125429 | consumed samples: 22827520 | consumed tokens: 46750760960 | elapsed time per iteration (s): 1.05 | learning rate: 5.529E-05 | global batch size: 256 | lm loss: 1.922776E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.161 | TFLOPs: 40.18 | 15: iteration 89180/ 125429 | consumed samples: 22830080 | consumed tokens: 46756003840 | elapsed time per iteration (s): 1.06 | learning rate: 5.527E-05 | global batch size: 256 | lm loss: 1.957652E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.570 | TFLOPs: 39.76 | 15: iteration 89190/ 125429 | consumed samples: 22832640 | consumed tokens: 46761246720 | elapsed time per iteration (s): 1.08 | learning rate: 5.525E-05 | global batch size: 256 | lm loss: 1.910164E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.373 | TFLOPs: 39.06 | 15: iteration 89200/ 125429 | consumed samples: 22835200 | consumed tokens: 46766489600 | elapsed time per iteration (s): 1.08 | learning rate: 5.523E-05 | global batch size: 256 | lm loss: 1.934718E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.074 | TFLOPs: 39.18 | 15: iteration 89210/ 125429 | consumed samples: 22837760 | consumed tokens: 46771732480 | elapsed time per iteration (s): 1.05 | learning rate: 5.521E-05 | global batch size: 256 | lm loss: 1.887780E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.244 | TFLOPs: 40.20 | 15: iteration 89220/ 125429 | consumed samples: 22840320 | consumed tokens: 46776975360 | elapsed time per iteration (s): 1.04 | 
learning rate: 5.520E-05 | global batch size: 256 | lm loss: 1.922992E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.251 | TFLOPs: 40.86 | 15: iteration 89230/ 125429 | consumed samples: 22842880 | consumed tokens: 46782218240 | elapsed time per iteration (s): 1.09 | learning rate: 5.518E-05 | global batch size: 256 | lm loss: 1.935184E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.812 | TFLOPs: 38.97 | 15: iteration 89240/ 125429 | consumed samples: 22845440 | consumed tokens: 46787461120 | elapsed time per iteration (s): 1.06 | learning rate: 5.516E-05 | global batch size: 256 | lm loss: 1.928748E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.677 | TFLOPs: 39.77 | 15: iteration 89250/ 125429 | consumed samples: 22848000 | consumed tokens: 46792704000 | elapsed time per iteration (s): 1.04 | learning rate: 5.514E-05 | global batch size: 256 | lm loss: 1.927086E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.752 | TFLOPs: 40.61 | 15: iteration 89260/ 125429 | consumed samples: 22850560 | consumed tokens: 46797946880 | elapsed time per iteration (s): 1.04 | learning rate: 5.512E-05 | global batch size: 256 | lm loss: 1.916466E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.125 | TFLOPs: 40.67 | 15: iteration 89270/ 125429 | consumed samples: 22853120 | consumed tokens: 46803189760 | elapsed time per iteration (s): 1.08 | learning rate: 5.511E-05 | global batch size: 256 | lm loss: 1.932268E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.141 | TFLOPs: 39.19 | 15: iteration 89280/ 
125429 | consumed samples: 22855680 | consumed tokens: 46808432640 | elapsed time per iteration (s): 2.24 | learning rate: 5.509E-05 | global batch size: 256 | lm loss: 1.905564E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 114.170 | TFLOPs: 18.87 | 15: iteration 89290/ 125429 | consumed samples: 22858240 | consumed tokens: 46813675520 | elapsed time per iteration (s): 1.04 | learning rate: 5.507E-05 | global batch size: 256 | lm loss: 1.954047E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.832 | TFLOPs: 40.79 | 15: iteration 89300/ 125429 | consumed samples: 22860800 | consumed tokens: 46818918400 | elapsed time per iteration (s): 1.05 | learning rate: 5.505E-05 | global batch size: 256 | lm loss: 1.952449E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.353 | TFLOPs: 40.38 | 15: iteration 89310/ 125429 | consumed samples: 22863360 | consumed tokens: 46824161280 | elapsed time per iteration (s): 1.04 | learning rate: 5.503E-05 | global batch size: 256 | lm loss: 1.920182E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.789 | TFLOPs: 40.62 | 15: iteration 89320/ 125429 | consumed samples: 22865920 | consumed tokens: 46829404160 | elapsed time per iteration (s): 1.03 | learning rate: 5.502E-05 | global batch size: 256 | lm loss: 1.969240E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.411 | TFLOPs: 40.89 | 15: iteration 89330/ 125429 | consumed samples: 22868480 | consumed tokens: 46834647040 | elapsed time per iteration (s): 1.03 | learning rate: 5.500E-05 | global batch size: 256 | lm loss: 1.939516E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 248.427 | TFLOPs: 41.05 | 15: iteration 89340/ 125429 | consumed samples: 22871040 | consumed tokens: 46839889920 | elapsed time per iteration (s): 1.03 | learning rate: 5.498E-05 | global batch size: 256 | lm loss: 1.922074E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.513 | TFLOPs: 41.23 | 15: iteration 89350/ 125429 | consumed samples: 22873600 | consumed tokens: 46845132800 | elapsed time per iteration (s): 1.08 | learning rate: 5.496E-05 | global batch size: 256 | lm loss: 1.942050E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.708 | TFLOPs: 39.28 | 15: iteration 89360/ 125429 | consumed samples: 22876160 | consumed tokens: 46850375680 | elapsed time per iteration (s): 1.05 | learning rate: 5.494E-05 | global batch size: 256 | lm loss: 1.904118E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.016 | TFLOPs: 40.16 | 15: iteration 89370/ 125429 | consumed samples: 22878720 | consumed tokens: 46855618560 | elapsed time per iteration (s): 1.06 | learning rate: 5.493E-05 | global batch size: 256 | lm loss: 1.926473E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.325 | TFLOPs: 40.05 | 15: iteration 89380/ 125429 | consumed samples: 22881280 | consumed tokens: 46860861440 | elapsed time per iteration (s): 1.03 | learning rate: 5.491E-05 | global batch size: 256 | lm loss: 1.906014E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.481 | TFLOPs: 40.90 | 15: iteration 89390/ 125429 | consumed samples: 22883840 | consumed tokens: 46866104320 | elapsed time per iteration (s): 1.03 | learning rate: 
5.489E-05 | global batch size: 256 | lm loss: 1.933612E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.261 | TFLOPs: 41.19 | 15: iteration 89400/ 125429 | consumed samples: 22886400 | consumed tokens: 46871347200 | elapsed time per iteration (s): 1.03 | learning rate: 5.487E-05 | global batch size: 256 | lm loss: 1.943854E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.190 | TFLOPs: 41.02 | 15: iteration 89410/ 125429 | consumed samples: 22888960 | consumed tokens: 46876590080 | elapsed time per iteration (s): 1.06 | learning rate: 5.485E-05 | global batch size: 256 | lm loss: 1.916150E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.163 | TFLOPs: 40.02 | 15: iteration 89420/ 125429 | consumed samples: 22891520 | consumed tokens: 46881832960 | elapsed time per iteration (s): 1.02 | learning rate: 5.484E-05 | global batch size: 256 | lm loss: 1.949569E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.556 | TFLOPs: 41.41 | 15: iteration 89430/ 125429 | consumed samples: 22894080 | consumed tokens: 46887075840 | elapsed time per iteration (s): 1.05 | learning rate: 5.482E-05 | global batch size: 256 | lm loss: 1.950529E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.031 | TFLOPs: 40.16 | 15: iteration 89440/ 125429 | consumed samples: 22896640 | consumed tokens: 46892318720 | elapsed time per iteration (s): 1.05 | learning rate: 5.480E-05 | global batch size: 256 | lm loss: 1.951506E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.950 | TFLOPs: 40.15 | 15: iteration 89450/ 125429 | 
consumed samples: 22899200 | consumed tokens: 46897561600 | elapsed time per iteration (s): 1.06 | learning rate: 5.478E-05 | global batch size: 256 | lm loss: 1.920588E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.860 | TFLOPs: 39.97 | 15: iteration 89460/ 125429 | consumed samples: 22901760 | consumed tokens: 46902804480 | elapsed time per iteration (s): 1.04 | learning rate: 5.476E-05 | global batch size: 256 | lm loss: 1.909372E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.140 | TFLOPs: 40.84 | 15: iteration 89470/ 125429 | consumed samples: 22904320 | consumed tokens: 46908047360 | elapsed time per iteration (s): 1.05 | learning rate: 5.475E-05 | global batch size: 256 | lm loss: 1.935328E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.879 | TFLOPs: 40.14 | 15: iteration 89480/ 125429 | consumed samples: 22906880 | consumed tokens: 46913290240 | elapsed time per iteration (s): 1.06 | learning rate: 5.473E-05 | global batch size: 256 | lm loss: 1.920362E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.865 | TFLOPs: 39.97 | 15: iteration 89490/ 125429 | consumed samples: 22909440 | consumed tokens: 46918533120 | elapsed time per iteration (s): 1.04 | learning rate: 5.471E-05 | global batch size: 256 | lm loss: 1.953456E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.658 | TFLOPs: 40.60 | 15: iteration 89500/ 125429 | consumed samples: 22912000 | consumed tokens: 46923776000 | elapsed time per iteration (s): 1.06 | learning rate: 5.469E-05 | global batch size: 256 | lm loss: 1.911719E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 241.785 | TFLOPs: 39.96 | 15: iteration 89510/ 125429 | consumed samples: 22914560 | consumed tokens: 46929018880 | elapsed time per iteration (s): 1.03 | learning rate: 5.467E-05 | global batch size: 256 | lm loss: 1.944937E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.776 | TFLOPs: 41.11 | 15: iteration 89520/ 125429 | consumed samples: 22917120 | consumed tokens: 46934261760 | elapsed time per iteration (s): 1.04 | learning rate: 5.466E-05 | global batch size: 256 | lm loss: 1.951424E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.140 | TFLOPs: 40.51 | 15: iteration 89530/ 125429 | consumed samples: 22919680 | consumed tokens: 46939504640 | elapsed time per iteration (s): 1.06 | learning rate: 5.464E-05 | global batch size: 256 | lm loss: 1.962175E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.574 | TFLOPs: 39.76 | 15: iteration 89540/ 125429 | consumed samples: 22922240 | consumed tokens: 46944747520 | elapsed time per iteration (s): 1.07 | learning rate: 5.462E-05 | global batch size: 256 | lm loss: 1.955379E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.177 | TFLOPs: 39.69 | 15: iteration 89550/ 125429 | consumed samples: 22924800 | consumed tokens: 46949990400 | elapsed time per iteration (s): 1.04 | learning rate: 5.460E-05 | global batch size: 256 | lm loss: 1.925676E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.199 | TFLOPs: 40.69 | 15: iteration 89560/ 125429 | consumed samples: 22927360 | consumed tokens: 46955233280 | elapsed time per iteration (s): 1.09 | learning rate: 5.458E-05 | global batch 
size: 256 | lm loss: 1.921404E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.862 | TFLOPs: 38.81 | 15: iteration 89570/ 125429 | consumed samples: 22929920 | consumed tokens: 46960476160 | elapsed time per iteration (s): 1.02 | learning rate: 5.457E-05 | global batch size: 256 | lm loss: 1.941036E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.934 | TFLOPs: 41.47 | 15: iteration 89580/ 125429 | consumed samples: 22932480 | consumed tokens: 46965719040 | elapsed time per iteration (s): 1.05 | learning rate: 5.455E-05 | global batch size: 256 | lm loss: 1.954289E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.374 | TFLOPs: 40.22 | 15: iteration 89590/ 125429 | consumed samples: 22935040 | consumed tokens: 46970961920 | elapsed time per iteration (s): 1.03 | learning rate: 5.453E-05 | global batch size: 256 | lm loss: 1.933487E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.540 | TFLOPs: 41.24 | 15: iteration 89600/ 125429 | consumed samples: 22937600 | consumed tokens: 46976204800 | elapsed time per iteration (s): 1.06 | learning rate: 5.451E-05 | global batch size: 256 | lm loss: 1.952477E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.289 | TFLOPs: 40.04 | 15: iteration 89610/ 125429 | consumed samples: 22940160 | consumed tokens: 46981447680 | elapsed time per iteration (s): 1.04 | learning rate: 5.449E-05 | global batch size: 256 | lm loss: 1.938380E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.098 | TFLOPs: 40.67 | 15: iteration 89620/ 125429 | consumed samples: 22942720 | 
consumed tokens: 46986690560 | elapsed time per iteration (s): 1.07 | learning rate: 5.448E-05 | global batch size: 256 | lm loss: 1.950081E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.234 | TFLOPs: 39.37 | 15: iteration 89630/ 125429 | consumed samples: 22945280 | consumed tokens: 46991933440 | elapsed time per iteration (s): 1.16 | learning rate: 5.446E-05 | global batch size: 256 | lm loss: 1.929246E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.391 | TFLOPs: 36.59 | 15: iteration 89640/ 125429 | consumed samples: 22947840 | consumed tokens: 46997176320 | elapsed time per iteration (s): 1.05 | learning rate: 5.444E-05 | global batch size: 256 | lm loss: 1.905831E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.899 | TFLOPs: 40.14 | 15: iteration 89650/ 125429 | consumed samples: 22950400 | consumed tokens: 47002419200 | elapsed time per iteration (s): 1.03 | learning rate: 5.442E-05 | global batch size: 256 | lm loss: 1.924071E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.854 | TFLOPs: 41.13 | 15: iteration 89660/ 125429 | consumed samples: 22952960 | consumed tokens: 47007662080 | elapsed time per iteration (s): 1.05 | learning rate: 5.440E-05 | global batch size: 256 | lm loss: 1.924228E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.272 | TFLOPs: 40.37 | 15: iteration 89670/ 125429 | consumed samples: 22955520 | consumed tokens: 47012904960 | elapsed time per iteration (s): 1.05 | learning rate: 5.439E-05 | global batch size: 256 | lm loss: 1.895487E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 243.589 | TFLOPs: 40.26 | 15: iteration 89680/ 125429 | consumed samples: 22958080 | consumed tokens: 47018147840 | elapsed time per iteration (s): 1.04 | learning rate: 5.437E-05 | global batch size: 256 | lm loss: 1.938363E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.369 | TFLOPs: 40.71 | 15: iteration 89690/ 125429 | consumed samples: 22960640 | consumed tokens: 47023390720 | elapsed time per iteration (s): 1.03 | learning rate: 5.435E-05 | global batch size: 256 | lm loss: 1.939716E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.486 | TFLOPs: 41.06 | 15: iteration 89700/ 125429 | consumed samples: 22963200 | consumed tokens: 47028633600 | elapsed time per iteration (s): 1.06 | learning rate: 5.433E-05 | global batch size: 256 | lm loss: 1.922923E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.564 | TFLOPs: 39.76 | 15: iteration 89710/ 125429 | consumed samples: 22965760 | consumed tokens: 47033876480 | elapsed time per iteration (s): 1.03 | learning rate: 5.432E-05 | global batch size: 256 | lm loss: 1.936420E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.667 | TFLOPs: 41.26 | 15: iteration 89720/ 125429 | consumed samples: 22968320 | consumed tokens: 47039119360 | elapsed time per iteration (s): 1.03 | learning rate: 5.430E-05 | global batch size: 256 | lm loss: 1.913170E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.012 | TFLOPs: 40.99 | 15: iteration 89730/ 125429 | consumed samples: 22970880 | consumed tokens: 47044362240 | elapsed time per iteration (s): 1.06 | learning rate: 5.428E-05 | global batch size: 256 | lm loss: 
1.922689E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.715 | TFLOPs: 39.95 | 15: iteration 89740/ 125429 | consumed samples: 22973440 | consumed tokens: 47049605120 | elapsed time per iteration (s): 1.03 | learning rate: 5.426E-05 | global batch size: 256 | lm loss: 1.959486E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.246 | TFLOPs: 41.19 | 15: iteration 89750/ 125429 | consumed samples: 22976000 | consumed tokens: 47054848000 | elapsed time per iteration (s): 1.05 | learning rate: 5.424E-05 | global batch size: 256 | lm loss: 1.913473E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.878 | TFLOPs: 40.47 | 15: iteration 89760/ 125429 | consumed samples: 22978560 | consumed tokens: 47060090880 | elapsed time per iteration (s): 1.02 | learning rate: 5.423E-05 | global batch size: 256 | lm loss: 1.944674E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.413 | TFLOPs: 41.38 | 15: iteration 89770/ 125429 | consumed samples: 22981120 | consumed tokens: 47065333760 | elapsed time per iteration (s): 1.06 | learning rate: 5.421E-05 | global batch size: 256 | lm loss: 1.929613E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.308 | TFLOPs: 39.88 | 15: iteration 89780/ 125429 | consumed samples: 22983680 | consumed tokens: 47070576640 | elapsed time per iteration (s): 1.03 | learning rate: 5.419E-05 | global batch size: 256 | lm loss: 1.900816E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.111 | TFLOPs: 41.00 | 15: iteration 89790/ 125429 | consumed samples: 22986240 | consumed tokens: 
47075819520 | elapsed time per iteration (s): 1.06 | learning rate: 5.417E-05 | global batch size: 256 | lm loss: 1.910558E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.789 | TFLOPs: 39.96 | 15: iteration 89800/ 125429 | consumed samples: 22988800 | consumed tokens: 47081062400 | elapsed time per iteration (s): 1.11 | learning rate: 5.415E-05 | global batch size: 256 | lm loss: 1.934653E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.159 | TFLOPs: 38.20 | 15: iteration 89810/ 125429 | consumed samples: 22991360 | consumed tokens: 47086305280 | elapsed time per iteration (s): 1.04 | learning rate: 5.414E-05 | global batch size: 256 | lm loss: 1.938375E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.194 | TFLOPs: 40.52 | 15: iteration 89820/ 125429 | consumed samples: 22993920 | consumed tokens: 47091548160 | elapsed time per iteration (s): 1.06 | learning rate: 5.412E-05 | global batch size: 256 | lm loss: 1.886635E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.260 | TFLOPs: 39.87 | 15: iteration 89830/ 125429 | consumed samples: 22996480 | consumed tokens: 47096791040 | elapsed time per iteration (s): 1.07 | learning rate: 5.410E-05 | global batch size: 256 | lm loss: 1.949575E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.940 | TFLOPs: 39.65 | 15: iteration 89840/ 125429 | consumed samples: 22999040 | consumed tokens: 47102033920 | elapsed time per iteration (s): 1.04 | learning rate: 5.408E-05 | global batch size: 256 | lm loss: 1.934748E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 247.105 | TFLOPs: 40.84 | 15: iteration 89850/ 125429 | consumed samples: 23001600 | consumed tokens: 47107276800 | elapsed time per iteration (s): 1.05 | learning rate: 5.407E-05 | global batch size: 256 | lm loss: 1.915504E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.820 | TFLOPs: 40.46 | 15: iteration 89860/ 125429 | consumed samples: 23004160 | consumed tokens: 47112519680 | elapsed time per iteration (s): 1.03 | learning rate: 5.405E-05 | global batch size: 256 | lm loss: 1.905628E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.745 | TFLOPs: 40.94 | 15: iteration 89870/ 125429 | consumed samples: 23006720 | consumed tokens: 47117762560 | elapsed time per iteration (s): 1.05 | learning rate: 5.403E-05 | global batch size: 256 | lm loss: 1.916126E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.656 | TFLOPs: 40.43 | 15: iteration 89880/ 125429 | consumed samples: 23009280 | consumed tokens: 47123005440 | elapsed time per iteration (s): 1.04 | learning rate: 5.401E-05 | global batch size: 256 | lm loss: 1.909925E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.584 | TFLOPs: 40.58 | 15: iteration 89890/ 125429 | consumed samples: 23011840 | consumed tokens: 47128248320 | elapsed time per iteration (s): 1.05 | learning rate: 5.399E-05 | global batch size: 256 | lm loss: 1.879429E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.947 | TFLOPs: 40.31 | 15: iteration 89900/ 125429 | consumed samples: 23014400 | consumed tokens: 47133491200 | elapsed time per iteration (s): 1.04 | learning rate: 5.398E-05 | global batch size: 256 | lm loss: 1.892868E+00 | grad 
norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.115 | TFLOPs: 40.51 | 15: iteration 89910/ 125429 | consumed samples: 23016960 | consumed tokens: 47138734080 | elapsed time per iteration (s): 1.05 | learning rate: 5.396E-05 | global batch size: 256 | lm loss: 1.928543E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.632 | TFLOPs: 40.43 | 15: iteration 89920/ 125429 | consumed samples: 23019520 | consumed tokens: 47143976960 | elapsed time per iteration (s): 1.06 | learning rate: 5.394E-05 | global batch size: 256 | lm loss: 1.928238E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.572 | TFLOPs: 40.09 | 15: iteration 89930/ 125429 | consumed samples: 23022080 | consumed tokens: 47149219840 | elapsed time per iteration (s): 1.05 | learning rate: 5.392E-05 | global batch size: 256 | lm loss: 1.946713E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.143 | TFLOPs: 40.35 | 15: iteration 89940/ 125429 | consumed samples: 23024640 | consumed tokens: 47154462720 | elapsed time per iteration (s): 1.04 | learning rate: 5.390E-05 | global batch size: 256 | lm loss: 1.926173E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.351 | TFLOPs: 40.71 | 15: iteration 89950/ 125429 | consumed samples: 23027200 | consumed tokens: 47159705600 | elapsed time per iteration (s): 1.05 | learning rate: 5.389E-05 | global batch size: 256 | lm loss: 1.936291E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.851 | TFLOPs: 40.46 | 15: iteration 89960/ 125429 | consumed samples: 23029760 | consumed tokens: 47164948480 | elapsed time 
per iteration (s): 1.04 | learning rate: 5.387E-05 | global batch size: 256 | lm loss: 1.929349E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.979 | TFLOPs: 40.48 | 15: iteration 89970/ 125429 | consumed samples: 23032320 | consumed tokens: 47170191360 | elapsed time per iteration (s): 1.05 | learning rate: 5.385E-05 | global batch size: 256 | lm loss: 1.929091E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.784 | TFLOPs: 40.45 | 15: iteration 89980/ 125429 | consumed samples: 23034880 | consumed tokens: 47175434240 | elapsed time per iteration (s): 1.04 | learning rate: 5.383E-05 | global batch size: 256 | lm loss: 1.924121E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.000 | TFLOPs: 40.49 | 15: iteration 89990/ 125429 | consumed samples: 23037440 | consumed tokens: 47180677120 | elapsed time per iteration (s): 1.05 | learning rate: 5.382E-05 | global batch size: 256 | lm loss: 1.918972E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.521 | TFLOPs: 40.41 | 0: [2022-11-26 22:42:19,750] [INFO] [logging.py:68:log_dist] [Rank 0] step=90000, skipped=0, lr=[5.379793555468545e-05, 5.379793555468545e-05, 5.379793555468545e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 90000/ 125429 | consumed samples: 23040000 | consumed tokens: 47185920000 | elapsed time per iteration (s): 1.03 | learning rate: 5.380E-05 | global batch size: 256 | lm loss: 1.929300E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.775 | TFLOPs: 41.11 | 0: steps: 90000 loss: 1.8831 iter time (s): 1.065 samples/sec: 240.473 15: 
------------------------------------------------------------------------------------------- 15: valid loss at iteration 90000 | lm loss value: 1.838693E+00 | lm loss PPL: 6.288317E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 90000 to checkpoints_1b5 0: [2022-11-26 22:42:20,108] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step90000 is begin to save! 0: [2022-11-26 22:42:20,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_01-model_00-model_states.pt... 0: [2022-11-26 22:42:20,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_01-model_00-model_states.pt. 0: [2022-11-26 22:42:20,375] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_03-model_00-model_states.pt... 0: [2022-11-26 22:42:20,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_03-model_00-model_states.pt. 0: [2022-11-26 22:42:20,480] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_04-model_00-model_states.pt... 0: [2022-11-26 22:42:20,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_04-model_00-model_states.pt. 0: [2022-11-26 22:42:20,583] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_05-model_00-model_states.pt... 0: [2022-11-26 22:42:20,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_05-model_00-model_states.pt. 0: [2022-11-26 22:42:20,692] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_06-model_00-model_states.pt... 
0: [2022-11-26 22:42:20,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_06-model_00-model_states.pt. 0: [2022-11-26 22:42:20,802] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_07-model_00-model_states.pt... 0: [2022-11-26 22:42:20,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_07-model_00-model_states.pt. 0: [2022-11-26 22:42:20,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_08-model_00-model_states.pt... 0: [2022-11-26 22:42:21,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_08-model_00-model_states.pt. 0: [2022-11-26 22:42:21,020] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_09-model_00-model_states.pt... 0: [2022-11-26 22:42:21,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_09-model_00-model_states.pt. 0: [2022-11-26 22:42:21,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_10-model_00-model_states.pt... 0: [2022-11-26 22:42:21,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_10-model_00-model_states.pt. 0: [2022-11-26 22:42:21,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_11-model_00-model_states.pt... 0: [2022-11-26 22:42:21,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_11-model_00-model_states.pt. 0: [2022-11-26 22:42:21,335] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_12-model_00-model_states.pt... 
0: [2022-11-26 22:42:21,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_12-model_00-model_states.pt. 0: [2022-11-26 22:42:21,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_13-model_00-model_states.pt... 0: [2022-11-26 22:42:21,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_13-model_00-model_states.pt. 0: [2022-11-26 22:42:21,548] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_14-model_00-model_states.pt... 0: [2022-11-26 22:42:21,650] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_14-model_00-model_states.pt. 0: [2022-11-26 22:42:21,651] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_15-model_00-model_states.pt... 0: [2022-11-26 22:42:21,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_15-model_00-model_states.pt. 0: [2022-11-26 22:42:21,749] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_16-model_00-model_states.pt... 0: [2022-11-26 22:42:21,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_16-model_00-model_states.pt. 0: [2022-11-26 22:42:21,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_17-model_00-model_states.pt... 0: [2022-11-26 22:42:21,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_17-model_00-model_states.pt. 0: [2022-11-26 22:42:21,959] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_18-model_00-model_states.pt... 
0: [2022-11-26 22:42:22,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_18-model_00-model_states.pt. 0: [2022-11-26 22:42:22,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_19-model_00-model_states.pt... 0: [2022-11-26 22:42:22,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_19-model_00-model_states.pt. 0: [2022-11-26 22:42:22,173] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_20-model_00-model_states.pt... 0: [2022-11-26 22:42:22,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_20-model_00-model_states.pt. 0: [2022-11-26 22:42:22,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_21-model_00-model_states.pt... 0: [2022-11-26 22:42:22,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_21-model_00-model_states.pt. 0: [2022-11-26 22:42:22,393] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_22-model_00-model_states.pt... 0: [2022-11-26 22:42:22,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_22-model_00-model_states.pt. 0: [2022-11-26 22:42:22,504] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_23-model_00-model_states.pt... 0: [2022-11-26 22:42:22,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_23-model_00-model_states.pt. 0: [2022-11-26 22:42:22,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_24-model_00-model_states.pt... 
0: [2022-11-26 22:42:22,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_24-model_00-model_states.pt. 0: [2022-11-26 22:42:22,723] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_25-model_00-model_states.pt... 0: [2022-11-26 22:42:22,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_25-model_00-model_states.pt. 0: [2022-11-26 22:42:22,831] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_26-model_00-model_states.pt... 0: [2022-11-26 22:42:22,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_26-model_00-model_states.pt. 0: [2022-11-26 22:42:22,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_27-model_00-model_states.pt... 0: [2022-11-26 22:42:23,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_27-model_00-model_states.pt. 0: [2022-11-26 22:42:23,049] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_28-model_00-model_states.pt... 0: [2022-11-26 22:42:23,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_28-model_00-model_states.pt. 0: [2022-11-26 22:42:23,156] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_29-model_00-model_states.pt... 0: [2022-11-26 22:42:23,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_29-model_00-model_states.pt. 0: [2022-11-26 22:42:23,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_30-model_00-model_states.pt... 
0: [2022-11-26 22:42:23,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_30-model_00-model_states.pt. 0: [2022-11-26 22:42:23,374] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/layer_32-model_00-model_states.pt... 0: [2022-11-26 22:42:23,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/layer_32-model_00-model_states.pt. 0: [2022-11-26 22:42:23,380] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step90000/mp_rank_00_model_states.pt 0: [2022-11-26 22:42:23,380] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/mp_rank_00_model_states.pt... 0: [2022-11-26 22:42:23,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/mp_rank_00_model_states.pt. 0: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
1: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 
5: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
4: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
7: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
13: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
3: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
11: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 
15: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
2: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 11: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 9: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 7: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 12: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
11: [2022-11-26 22:42:23,421] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step90000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 3: [2022-11-26 22:42:23,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:42:23,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 22:42:23,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 11: [2022-11-26 22:42:23,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:42:23,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 22:42:23,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 7: [2022-11-26 22:42:23,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:42:23,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 22:42:23,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 2: [2022-11-26 22:42:23,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-26 22:42:23,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 22:42:23,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 15: [2022-11-26 22:42:23,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:42:23,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 22:42:23,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 14: [2022-11-26 22:42:23,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:42:23,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 22:42:23,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 15: [2022-11-26 22:42:23,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:42:23,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:42:23,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 22:42:23,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
4: [2022-11-26 22:42:23,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 22:42:23,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 8: [2022-11-26 22:42:23,587] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:42:23,587] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 22:42:23,587] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 13: [2022-11-26 22:42:23,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:42:23,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 22:42:23,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 14: [2022-11-26 22:42:23,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:42:23,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 22:42:23,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 5: [2022-11-26 22:42:23,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-26 22:42:23,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 22:42:23,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 15: [2022-11-26 22:42:23,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:42:23,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:42:23,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 22:42:23,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 8: [2022-11-26 22:42:23,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:42:23,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:42:23,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 22:42:23,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 14: [2022-11-26 22:42:23,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 22:42:23,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 22:42:23,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 2: [2022-11-26 22:42:23,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:42:23,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 22:42:23,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 8: [2022-11-26 22:42:23,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 22:42:23,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 8: [2022-11-26 22:42:23,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:42:23,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 22:42:23,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 11: [2022-11-26 22:42:23,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
15: [2022-11-26 22:42:23,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 11: [2022-11-26 22:42:23,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 15: [2022-11-26 22:42:23,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 11: [2022-11-26 22:42:23,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 15: [2022-11-26 22:42:23,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:42:23,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 12: [2022-11-26 22:42:23,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:42:23,595] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 15: [2022-11-26 22:42:23,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 12: [2022-11-26 22:42:23,595] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 0: [2022-11-26 22:42:23,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:42:23,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-26 22:42:23,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 22:42:23,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 22:42:23,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 0: [2022-11-26 22:42:23,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 7: [2022-11-26 22:42:23,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:42:23,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 22:42:23,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 3: [2022-11-26 22:42:23,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:42:23,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 22:42:23,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 3: [2022-11-26 22:42:23,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 22:42:23,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 22:42:23,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 14: [2022-11-26 22:42:23,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:42:23,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 22:42:23,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 9: [2022-11-26 22:42:23,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:42:23,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 22:42:23,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 1: [2022-11-26 22:42:23,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:42:23,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:42:23,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-26 22:42:23,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 2: [2022-11-26 22:42:23,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 22:42:23,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 11: [2022-11-26 22:42:23,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 5: [2022-11-26 22:42:23,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:42:23,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 22:42:23,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 5: [2022-11-26 22:42:23,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:42:23,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 11: [2022-11-26 22:42:23,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:42:23,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
11: [2022-11-26 22:42:23,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 22:42:23,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 5: [2022-11-26 22:42:23,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:42:23,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 22:42:23,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 7: [2022-11-26 22:42:23,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:42:23,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 22:42:23,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 9: [2022-11-26 22:42:23,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:42:23,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:42:23,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 22:42:23,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
9: [2022-11-26 22:42:23,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 22:42:23,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 12: [2022-11-26 22:42:23,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:42:23,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:42:23,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 2: [2022-11-26 22:42:23,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 12: [2022-11-26 22:42:23,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 2: [2022-11-26 22:42:23,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 7: [2022-11-26 22:42:23,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:42:23,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 22:42:23,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 15: [2022-11-26 22:42:23,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
8: [2022-11-26 22:42:23,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:42:23,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 22:42:23,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 5: [2022-11-26 22:42:23,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:42:23,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 22:42:23,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 5: [2022-11-26 22:42:23,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:42:23,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 22:42:23,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 14: [2022-11-26 22:42:23,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:42:23,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 22:42:23,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
14: [2022-11-26 22:42:23,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:42:23,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 22:42:23,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 4: [2022-11-26 22:42:23,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:42:23,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 22:42:23,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 4: [2022-11-26 22:42:23,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:42:23,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 22:42:23,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 4: [2022-11-26 22:42:23,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:42:23,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 22:42:23,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
4: [2022-11-26 22:42:23,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:42:23,606] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 22:42:23,606] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 11: [2022-11-26 22:42:23,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:42:23,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:42:23,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 22:42:23,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 10: [2022-11-26 22:42:23,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:42:23,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 22:42:23,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:42:23,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:42:23,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
10: [2022-11-26 22:42:23,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 22:42:23,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 22:42:23,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 10: [2022-11-26 22:42:23,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 11: [2022-11-26 22:42:23,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 22:42:23,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 12: [2022-11-26 22:42:23,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:42:23,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 22:42:23,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 12: [2022-11-26 22:42:23,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:42:23,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 22:42:23,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
12: [2022-11-26 22:42:23,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:42:23,614] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 22:42:23,614] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 1: [2022-11-26 22:42:23,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 22:42:23,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 1: [2022-11-26 22:42:23,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:42:23,608] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 22:42:23,608] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 1: [2022-11-26 22:42:23,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:42:23,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 22:42:23,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 1: [2022-11-26 22:42:23,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 22:42:23,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 22:42:23,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 12: [2022-11-26 22:42:23,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:42:23,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 22:42:23,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 0: [2022-11-26 22:42:23,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:42:23,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:42:23,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:42:23,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 22:42:23,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 1: [2022-11-26 22:42:23,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:42:23,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
2: [2022-11-26 22:42:23,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 0: [2022-11-26 22:42:23,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 2: [2022-11-26 22:42:23,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 0: [2022-11-26 22:42:23,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:42:23,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 22:42:23,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 1: [2022-11-26 22:42:23,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 22:42:23,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 22:42:23,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 22:42:23,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 1: [2022-11-26 22:42:23,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 7: [2022-11-26 22:42:23,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 
7: [2022-11-26 22:42:23,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 22:42:23,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 7: [2022-11-26 22:42:23,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:42:23,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 22:42:23,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 9: [2022-11-26 22:42:23,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:42:23,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 22:42:23,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 9: [2022-11-26 22:42:23,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:42:23,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 15: [2022-11-26 22:42:23,605] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 9: [2022-11-26 22:42:23,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
15: [2022-11-26 22:42:23,605] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 15: [2022-11-26 22:42:23,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:42:23,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 22:42:23,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 0: [2022-11-26 22:42:23,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:42:23,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:42:23,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 0: [2022-11-26 22:42:23,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 15: [2022-11-26 22:42:23,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 0: [2022-11-26 22:42:23,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 7: [2022-11-26 22:42:23,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-26 22:42:23,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 22:42:23,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 8: [2022-11-26 22:42:23,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:42:23,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:42:23,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 22:42:23,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 0: [2022-11-26 22:42:23,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 22:42:23,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 22:42:23,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 11: [2022-11-26 22:42:23,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:42:23,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 22:42:23,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
4: [2022-11-26 22:42:23,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:42:23,611] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 22:42:23,611] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 7: [2022-11-26 22:42:23,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 22:42:23,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 22:42:23,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 5: [2022-11-26 22:42:23,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:42:23,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 22:42:23,623] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 8: [2022-11-26 22:42:23,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:42:23,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 10: [2022-11-26 22:42:23,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
8: [2022-11-26 22:42:23,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 10: [2022-11-26 22:42:23,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:42:23,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 22:42:23,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 22:42:23,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 10: [2022-11-26 22:42:23,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 9: [2022-11-26 22:42:23,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:42:23,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:42:23,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 22:42:23,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 22:42:23,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 9: [2022-11-26 22:42:23,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
9: [2022-11-26 22:42:23,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 22:42:23,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 22:42:23,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 11: [2022-11-26 22:42:23,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:42:23,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:42:23,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:42:23,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 22:42:23,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 22:42:23,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 22:42:23,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 22:42:23,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 22:42:23,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 22:42:23,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 3: [2022-11-26 22:42:23,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 3: [2022-11-26 22:42:23,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 3: [2022-11-26 22:42:23,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 2: [2022-11-26 22:42:23,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:42:23,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 15: [2022-11-26 22:42:23,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:42:23,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
4: [2022-11-26 22:42:23,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:42:23,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 22:42:23,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 11: [2022-11-26 22:42:23,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 22:42:23,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 10: [2022-11-26 22:42:23,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:42:23,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:42:23,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 22:42:23,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 22:42:23,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 22:42:23,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 22:42:23,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
10: [2022-11-26 22:42:23,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 10: [2022-11-26 22:42:23,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 4: [2022-11-26 22:42:23,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 22:42:23,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 22:42:23,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 11: [2022-11-26 22:42:23,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 22:42:23,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 22:42:23,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 13: [2022-11-26 22:42:23,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:42:23,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:42:23,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:42:23,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
13: [2022-11-26 22:42:23,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 22:42:23,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 8: [2022-11-26 22:42:23,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 13: [2022-11-26 22:42:23,635] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 8: [2022-11-26 22:42:23,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 13: [2022-11-26 22:42:23,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 13: [2022-11-26 22:42:23,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 8: [2022-11-26 22:42:23,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:42:23,635] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 8: [2022-11-26 22:42:23,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 13: [2022-11-26 22:42:23,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 8: [2022-11-26 22:42:23,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
13: [2022-11-26 22:42:23,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 22:42:23,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 13: [2022-11-26 22:42:23,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:42:23,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 22:42:23,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 13: [2022-11-26 22:42:23,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 22:42:23,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 22:42:23,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 12: [2022-11-26 22:42:23,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 22:42:23,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 22:42:23,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 13: [2022-11-26 22:42:23,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-26 22:42:23,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 22:42:23,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 1: [2022-11-26 22:42:23,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 15: [2022-11-26 22:42:23,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 22:42:23,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 14: [2022-11-26 22:42:23,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:42:23,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 22:42:23,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 14: [2022-11-26 22:42:23,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 22:42:23,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 22:42:23,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 12: [2022-11-26 22:42:23,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 22:42:23,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 22:42:23,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 1: [2022-11-26 22:42:23,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 22:42:23,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 5: [2022-11-26 22:42:23,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 22:42:23,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 22:42:23,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 6: [2022-11-26 22:42:23,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:42:23,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 22:42:23,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 6: [2022-11-26 22:42:23,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 22:42:23,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 22:42:23,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 6: [2022-11-26 22:42:23,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:42:23,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 22:42:23,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 6: [2022-11-26 22:42:23,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:42:23,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 22:42:23,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 6: [2022-11-26 22:42:23,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:42:23,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:42:23,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 22:42:23,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 22:42:23,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 22:42:23,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 6: [2022-11-26 22:42:23,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 22:42:23,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 6: [2022-11-26 22:42:23,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 6: [2022-11-26 22:42:23,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 22:42:23,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 22:42:23,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 2: [2022-11-26 22:42:23,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 22:42:23,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 22:42:23,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 
0: [2022-11-26 22:42:23,763] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step90000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 22:42:23,763] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step90000 is ready now! 0: successfully saved checkpoint at iteration 90000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3675.33 15: iteration 90010/ 125429 | consumed samples: 23042560 | consumed tokens: 47191162880 | elapsed time per iteration (s): 1.49 | learning rate: 5.378E-05 | global batch size: 256 | lm loss: 1.881673E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 171.903 | TFLOPs: 28.41 | 15: iteration 90020/ 125429 | consumed samples: 23045120 | consumed tokens: 47196405760 | elapsed time per iteration (s): 1.04 | learning rate: 5.376E-05 | global batch size: 256 | lm loss: 1.914041E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.647 | TFLOPs: 40.76 | 15: iteration 90030/ 125429 | consumed samples: 23047680 | consumed tokens: 47201648640 | elapsed time per iteration (s): 1.05 | learning rate: 5.374E-05 | global batch size: 256 | lm loss: 1.941265E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.395 | TFLOPs: 40.39 | 15: iteration 90040/ 125429 | consumed samples: 23050240 | consumed tokens: 47206891520 | elapsed time per iteration (s): 1.02 | learning rate: 5.373E-05 | global batch size: 256 | lm loss: 1.914764E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.375 | TFLOPs: 41.38 | 15: iteration 90050/ 125429 | consumed samples: 23052800 | consumed tokens: 47212134400 | elapsed time per iteration (s): 1.04 | learning rate: 5.371E-05 | global batch size: 
256 | lm loss: 1.910982E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.881 | TFLOPs: 40.80 | 15: iteration 90060/ 125429 | consumed samples: 23055360 | consumed tokens: 47217377280 | elapsed time per iteration (s): 1.03 | learning rate: 5.369E-05 | global batch size: 256 | lm loss: 1.920428E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.621 | TFLOPs: 41.09 | 15: iteration 90070/ 125429 | consumed samples: 23057920 | consumed tokens: 47222620160 | elapsed time per iteration (s): 1.03 | learning rate: 5.367E-05 | global batch size: 256 | lm loss: 1.927227E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.476 | TFLOPs: 41.06 | 15: iteration 90080/ 125429 | consumed samples: 23060480 | consumed tokens: 47227863040 | elapsed time per iteration (s): 1.07 | learning rate: 5.366E-05 | global batch size: 256 | lm loss: 1.912742E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.908 | TFLOPs: 39.65 | 15: iteration 90090/ 125429 | consumed samples: 23063040 | consumed tokens: 47233105920 | elapsed time per iteration (s): 1.03 | learning rate: 5.364E-05 | global batch size: 256 | lm loss: 1.930612E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.508 | TFLOPs: 41.07 | 15: iteration 90100/ 125429 | consumed samples: 23065600 | consumed tokens: 47238348800 | elapsed time per iteration (s): 1.07 | learning rate: 5.362E-05 | global batch size: 256 | lm loss: 1.899699E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.018 | TFLOPs: 39.50 | 15: iteration 90110/ 125429 | consumed samples: 23068160 | consumed 
tokens: 47243591680 | elapsed time per iteration (s): 1.06 | learning rate: 5.360E-05 | global batch size: 256 | lm loss: 1.930073E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.208 | TFLOPs: 40.03 | 15: iteration 90120/ 125429 | consumed samples: 23070720 | consumed tokens: 47248834560 | elapsed time per iteration (s): 1.05 | learning rate: 5.358E-05 | global batch size: 256 | lm loss: 1.954878E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.249 | TFLOPs: 40.36 | 15: iteration 90130/ 125429 | consumed samples: 23073280 | consumed tokens: 47254077440 | elapsed time per iteration (s): 1.04 | learning rate: 5.357E-05 | global batch size: 256 | lm loss: 1.911788E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.272 | TFLOPs: 40.53 | 15: iteration 90140/ 125429 | consumed samples: 23075840 | consumed tokens: 47259320320 | elapsed time per iteration (s): 1.03 | learning rate: 5.355E-05 | global batch size: 256 | lm loss: 1.933425E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.030 | TFLOPs: 40.99 | 15: iteration 90150/ 125429 | consumed samples: 23078400 | consumed tokens: 47264563200 | elapsed time per iteration (s): 1.08 | learning rate: 5.353E-05 | global batch size: 256 | lm loss: 1.908011E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.028 | TFLOPs: 39.01 | 15: iteration 90160/ 125429 | consumed samples: 23080960 | consumed tokens: 47269806080 | elapsed time per iteration (s): 1.04 | learning rate: 5.351E-05 | global batch size: 256 | lm loss: 1.923901E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 245.004 | TFLOPs: 40.49 | 15: iteration 90170/ 125429 | consumed samples: 23083520 | consumed tokens: 47275048960 | elapsed time per iteration (s): 1.02 | learning rate: 5.350E-05 | global batch size: 256 | lm loss: 1.932839E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.302 | TFLOPs: 41.36 | 15: iteration 90180/ 125429 | consumed samples: 23086080 | consumed tokens: 47280291840 | elapsed time per iteration (s): 1.11 | learning rate: 5.348E-05 | global batch size: 256 | lm loss: 1.922679E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.661 | TFLOPs: 38.28 | 15: iteration 90190/ 125429 | consumed samples: 23088640 | consumed tokens: 47285534720 | elapsed time per iteration (s): 1.06 | learning rate: 5.346E-05 | global batch size: 256 | lm loss: 1.912597E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.915 | TFLOPs: 39.81 | 15: iteration 90200/ 125429 | consumed samples: 23091200 | consumed tokens: 47290777600 | elapsed time per iteration (s): 1.04 | learning rate: 5.344E-05 | global batch size: 256 | lm loss: 1.892584E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.618 | TFLOPs: 40.59 | 15: iteration 90210/ 125429 | consumed samples: 23093760 | consumed tokens: 47296020480 | elapsed time per iteration (s): 1.04 | learning rate: 5.343E-05 | global batch size: 256 | lm loss: 1.920441E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.030 | TFLOPs: 40.82 | 15: iteration 90220/ 125429 | consumed samples: 23096320 | consumed tokens: 47301263360 | elapsed time per iteration (s): 1.05 | learning rate: 5.341E-05 | global batch size: 256 | lm loss: 1.930199E+00 | 
grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.277 | TFLOPs: 40.37 | 15: iteration 90230/ 125429 | consumed samples: 23098880 | consumed tokens: 47306506240 | elapsed time per iteration (s): 1.05 | learning rate: 5.339E-05 | global batch size: 256 | lm loss: 1.940551E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.392 | TFLOPs: 40.22 | 15: iteration 90240/ 125429 | consumed samples: 23101440 | consumed tokens: 47311749120 | elapsed time per iteration (s): 1.04 | learning rate: 5.337E-05 | global batch size: 256 | lm loss: 1.916331E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.453 | TFLOPs: 40.56 | 15: iteration 90250/ 125429 | consumed samples: 23104000 | consumed tokens: 47316992000 | elapsed time per iteration (s): 1.04 | learning rate: 5.335E-05 | global batch size: 256 | lm loss: 1.914706E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.523 | TFLOPs: 40.74 | 15: iteration 90260/ 125429 | consumed samples: 23106560 | consumed tokens: 47322234880 | elapsed time per iteration (s): 1.04 | learning rate: 5.334E-05 | global batch size: 256 | lm loss: 1.922771E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.250 | TFLOPs: 40.69 | 15: iteration 90270/ 125429 | consumed samples: 23109120 | consumed tokens: 47327477760 | elapsed time per iteration (s): 1.06 | learning rate: 5.332E-05 | global batch size: 256 | lm loss: 1.925325E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.749 | TFLOPs: 39.95 | 15: iteration 90280/ 125429 | consumed samples: 23111680 | consumed tokens: 47332720640 | elapsed 
time per iteration (s): 1.04 | learning rate: 5.330E-05 | global batch size: 256 | lm loss: 1.913229E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.624 | TFLOPs: 40.76 | 15: iteration 90290/ 125429 | consumed samples: 23114240 | consumed tokens: 47337963520 | elapsed time per iteration (s): 1.07 | learning rate: 5.328E-05 | global batch size: 256 | lm loss: 1.911418E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.348 | TFLOPs: 39.72 | 15: iteration 90300/ 125429 | consumed samples: 23116800 | consumed tokens: 47343206400 | elapsed time per iteration (s): 1.22 | learning rate: 5.327E-05 | global batch size: 256 | lm loss: 1.931386E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 209.950 | TFLOPs: 34.70 | 15: iteration 90310/ 125429 | consumed samples: 23119360 | consumed tokens: 47348449280 | elapsed time per iteration (s): 1.03 | learning rate: 5.325E-05 | global batch size: 256 | lm loss: 1.916300E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.001 | TFLOPs: 41.15 | 15: iteration 90320/ 125429 | consumed samples: 23121920 | consumed tokens: 47353692160 | elapsed time per iteration (s): 1.03 | learning rate: 5.323E-05 | global batch size: 256 | lm loss: 1.899734E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.576 | TFLOPs: 41.24 | 15: iteration 90330/ 125429 | consumed samples: 23124480 | consumed tokens: 47358935040 | elapsed time per iteration (s): 1.13 | learning rate: 5.321E-05 | global batch size: 256 | lm loss: 1.901079E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.345 | TFLOPs: 
37.57 | 15: iteration 90340/ 125429 | consumed samples: 23127040 | consumed tokens: 47364177920 | elapsed time per iteration (s): 1.05 | learning rate: 5.320E-05 | global batch size: 256 | lm loss: 1.943835E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.603 | TFLOPs: 40.26 | 15: iteration 90350/ 125429 | consumed samples: 23129600 | consumed tokens: 47369420800 | elapsed time per iteration (s): 1.02 | learning rate: 5.318E-05 | global batch size: 256 | lm loss: 1.896398E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.776 | TFLOPs: 41.28 | 15: iteration 90360/ 125429 | consumed samples: 23132160 | consumed tokens: 47374663680 | elapsed time per iteration (s): 1.03 | learning rate: 5.316E-05 | global batch size: 256 | lm loss: 1.941284E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.367 | TFLOPs: 41.21 | 15: iteration 90370/ 125429 | consumed samples: 23134720 | consumed tokens: 47379906560 | elapsed time per iteration (s): 1.03 | learning rate: 5.314E-05 | global batch size: 256 | lm loss: 1.917067E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.418 | TFLOPs: 41.22 | 15: iteration 90380/ 125429 | consumed samples: 23137280 | consumed tokens: 47385149440 | elapsed time per iteration (s): 1.17 | learning rate: 5.312E-05 | global batch size: 256 | lm loss: 1.910774E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 219.002 | TFLOPs: 36.19 | 15: iteration 90390/ 125429 | consumed samples: 23139840 | consumed tokens: 47390392320 | elapsed time per iteration (s): 1.04 | learning rate: 5.311E-05 | global batch size: 256 | lm loss: 1.943886E+00 | grad norm: 0.142 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.606 | TFLOPs: 40.75 | 15: iteration 90400/ 125429 | consumed samples: 23142400 | consumed tokens: 47395635200 | elapsed time per iteration (s): 1.04 | learning rate: 5.309E-05 | global batch size: 256 | lm loss: 1.928868E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.516 | TFLOPs: 40.57 | 15: iteration 90410/ 125429 | consumed samples: 23144960 | consumed tokens: 47400878080 | elapsed time per iteration (s): 1.04 | learning rate: 5.307E-05 | global batch size: 256 | lm loss: 1.898402E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.636 | TFLOPs: 40.76 | 15: iteration 90420/ 125429 | consumed samples: 23147520 | consumed tokens: 47406120960 | elapsed time per iteration (s): 1.03 | learning rate: 5.305E-05 | global batch size: 256 | lm loss: 1.936025E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.212 | TFLOPs: 41.18 | 15: iteration 90430/ 125429 | consumed samples: 23150080 | consumed tokens: 47411363840 | elapsed time per iteration (s): 1.03 | learning rate: 5.304E-05 | global batch size: 256 | lm loss: 1.909860E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.516 | TFLOPs: 41.07 | 15: iteration 90440/ 125429 | consumed samples: 23152640 | consumed tokens: 47416606720 | elapsed time per iteration (s): 1.04 | learning rate: 5.302E-05 | global batch size: 256 | lm loss: 1.924869E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.245 | TFLOPs: 40.86 | 15: iteration 90450/ 125429 | consumed samples: 23155200 | consumed tokens: 47421849600 | elapsed time per iteration (s): 1.05 | 
learning rate: 5.300E-05 | global batch size: 256 | lm loss: 1.937829E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.341 | TFLOPs: 40.38 | 15: iteration 90460/ 125429 | consumed samples: 23157760 | consumed tokens: 47427092480 | elapsed time per iteration (s): 1.03 | learning rate: 5.298E-05 | global batch size: 256 | lm loss: 1.927266E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.837 | TFLOPs: 41.12 | 15: iteration 90470/ 125429 | consumed samples: 23160320 | consumed tokens: 47432335360 | elapsed time per iteration (s): 1.02 | learning rate: 5.297E-05 | global batch size: 256 | lm loss: 1.920117E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.172 | TFLOPs: 41.34 | 15: iteration 90480/ 125429 | consumed samples: 23162880 | consumed tokens: 47437578240 | elapsed time per iteration (s): 1.05 | learning rate: 5.295E-05 | global batch size: 256 | lm loss: 1.942813E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.695 | TFLOPs: 40.44 | 15: iteration 90490/ 125429 | consumed samples: 23165440 | consumed tokens: 47442821120 | elapsed time per iteration (s): 1.05 | learning rate: 5.293E-05 | global batch size: 256 | lm loss: 1.918481E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.808 | TFLOPs: 40.13 | 15: iteration 90500/ 125429 | consumed samples: 23168000 | consumed tokens: 47448064000 | elapsed time per iteration (s): 1.03 | learning rate: 5.291E-05 | global batch size: 256 | lm loss: 1.941773E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.465 | TFLOPs: 41.06 | 15: iteration 90510/ 
125429 | consumed samples: 23170560 | consumed tokens: 47453306880 | elapsed time per iteration (s): 1.03 | learning rate: 5.290E-05 | global batch size: 256 | lm loss: 1.902249E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.579 | TFLOPs: 41.24 | 15: iteration 90520/ 125429 | consumed samples: 23173120 | consumed tokens: 47458549760 | elapsed time per iteration (s): 1.04 | learning rate: 5.288E-05 | global batch size: 256 | lm loss: 1.925626E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.181 | TFLOPs: 40.68 | 15: iteration 90530/ 125429 | consumed samples: 23175680 | consumed tokens: 47463792640 | elapsed time per iteration (s): 1.04 | learning rate: 5.286E-05 | global batch size: 256 | lm loss: 1.898784E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.400 | TFLOPs: 40.55 | 15: iteration 90540/ 125429 | consumed samples: 23178240 | consumed tokens: 47469035520 | elapsed time per iteration (s): 1.04 | learning rate: 5.284E-05 | global batch size: 256 | lm loss: 1.932019E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.369 | TFLOPs: 40.55 | 15: iteration 90550/ 125429 | consumed samples: 23180800 | consumed tokens: 47474278400 | elapsed time per iteration (s): 1.05 | learning rate: 5.283E-05 | global batch size: 256 | lm loss: 1.922041E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.046 | TFLOPs: 40.33 | 15: iteration 90560/ 125429 | consumed samples: 23183360 | consumed tokens: 47479521280 | elapsed time per iteration (s): 1.03 | learning rate: 5.281E-05 | global batch size: 256 | lm loss: 1.948272E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 249.489 | TFLOPs: 41.23 | 15: iteration 90570/ 125429 | consumed samples: 23185920 | consumed tokens: 47484764160 | elapsed time per iteration (s): 1.04 | learning rate: 5.279E-05 | global batch size: 256 | lm loss: 1.951843E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.318 | TFLOPs: 40.87 | 15: iteration 90580/ 125429 | consumed samples: 23188480 | consumed tokens: 47490007040 | elapsed time per iteration (s): 1.07 | learning rate: 5.277E-05 | global batch size: 256 | lm loss: 1.897058E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.974 | TFLOPs: 39.66 | 15: iteration 90590/ 125429 | consumed samples: 23191040 | consumed tokens: 47495249920 | elapsed time per iteration (s): 1.03 | learning rate: 5.275E-05 | global batch size: 256 | lm loss: 1.943319E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.579 | TFLOPs: 41.24 | 15: iteration 90600/ 125429 | consumed samples: 23193600 | consumed tokens: 47500492800 | elapsed time per iteration (s): 1.03 | learning rate: 5.274E-05 | global batch size: 256 | lm loss: 1.928686E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.959 | TFLOPs: 41.14 | 15: iteration 90610/ 125429 | consumed samples: 23196160 | consumed tokens: 47505735680 | elapsed time per iteration (s): 1.03 | learning rate: 5.272E-05 | global batch size: 256 | lm loss: 1.916877E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.933 | TFLOPs: 40.97 | 15: iteration 90620/ 125429 | consumed samples: 23198720 | consumed tokens: 47510978560 | elapsed time per iteration (s): 1.04 | learning rate: 
5.270E-05 | global batch size: 256 | lm loss: 1.898924E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.809 | TFLOPs: 40.79 | 15: iteration 90630/ 125429 | consumed samples: 23201280 | consumed tokens: 47516221440 | elapsed time per iteration (s): 1.04 | learning rate: 5.268E-05 | global batch size: 256 | lm loss: 1.942902E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.152 | TFLOPs: 40.84 | 15: iteration 90640/ 125429 | consumed samples: 23203840 | consumed tokens: 47521464320 | elapsed time per iteration (s): 1.05 | learning rate: 5.267E-05 | global batch size: 256 | lm loss: 1.916944E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.308 | TFLOPs: 40.21 | 15: iteration 90650/ 125429 | consumed samples: 23206400 | consumed tokens: 47526707200 | elapsed time per iteration (s): 1.04 | learning rate: 5.265E-05 | global batch size: 256 | lm loss: 1.900504E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.192 | TFLOPs: 40.69 | 15: iteration 90660/ 125429 | consumed samples: 23208960 | consumed tokens: 47531950080 | elapsed time per iteration (s): 1.04 | learning rate: 5.263E-05 | global batch size: 256 | lm loss: 1.921395E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.323 | TFLOPs: 40.54 | 15: iteration 90670/ 125429 | consumed samples: 23211520 | consumed tokens: 47537192960 | elapsed time per iteration (s): 1.04 | learning rate: 5.261E-05 | global batch size: 256 | lm loss: 1.935734E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.271 | TFLOPs: 40.53 | 15: iteration 90680/ 125429 | 
consumed samples: 23214080 | consumed tokens: 47542435840 | elapsed time per iteration (s): 1.05 | learning rate: 5.260E-05 | global batch size: 256 | lm loss: 1.919772E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.239 | TFLOPs: 40.20 | 15: iteration 90690/ 125429 | consumed samples: 23216640 | consumed tokens: 47547678720 | elapsed time per iteration (s): 1.04 | learning rate: 5.258E-05 | global batch size: 256 | lm loss: 1.921611E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.634 | TFLOPs: 40.59 | 15: iteration 90700/ 125429 | consumed samples: 23219200 | consumed tokens: 47552921600 | elapsed time per iteration (s): 1.10 | learning rate: 5.256E-05 | global batch size: 256 | lm loss: 1.913862E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.386 | TFLOPs: 38.57 | 15: iteration 90710/ 125429 | consumed samples: 23221760 | consumed tokens: 47558164480 | elapsed time per iteration (s): 1.03 | learning rate: 5.254E-05 | global batch size: 256 | lm loss: 1.921607E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.132 | TFLOPs: 41.01 | 15: iteration 90720/ 125429 | consumed samples: 23224320 | consumed tokens: 47563407360 | elapsed time per iteration (s): 1.02 | learning rate: 5.253E-05 | global batch size: 256 | lm loss: 1.956955E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.411 | TFLOPs: 41.55 | 15: iteration 90730/ 125429 | consumed samples: 23226880 | consumed tokens: 47568650240 | elapsed time per iteration (s): 1.05 | learning rate: 5.251E-05 | global batch size: 256 | lm loss: 1.922702E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 242.802 | TFLOPs: 40.12 | 15: iteration 90740/ 125429 | consumed samples: 23229440 | consumed tokens: 47573893120 | elapsed time per iteration (s): 1.04 | learning rate: 5.249E-05 | global batch size: 256 | lm loss: 1.909745E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.134 | TFLOPs: 40.51 | 15: iteration 90750/ 125429 | consumed samples: 23232000 | consumed tokens: 47579136000 | elapsed time per iteration (s): 1.03 | learning rate: 5.247E-05 | global batch size: 256 | lm loss: 1.914316E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.589 | TFLOPs: 40.92 | 15: iteration 90760/ 125429 | consumed samples: 23234560 | consumed tokens: 47584378880 | elapsed time per iteration (s): 1.03 | learning rate: 5.246E-05 | global batch size: 256 | lm loss: 1.920597E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.229 | TFLOPs: 41.19 | 15: iteration 90770/ 125429 | consumed samples: 23237120 | consumed tokens: 47589621760 | elapsed time per iteration (s): 1.04 | learning rate: 5.244E-05 | global batch size: 256 | lm loss: 1.933214E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.044 | TFLOPs: 40.66 | 15: iteration 90780/ 125429 | consumed samples: 23239680 | consumed tokens: 47594864640 | elapsed time per iteration (s): 1.04 | learning rate: 5.242E-05 | global batch size: 256 | lm loss: 1.933852E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.494 | TFLOPs: 40.57 | 15: iteration 90790/ 125429 | consumed samples: 23242240 | consumed tokens: 47600107520 | elapsed time per iteration (s): 1.06 | learning rate: 5.240E-05 | global batch 
size: 256 | lm loss: 1.915165E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.647 | TFLOPs: 40.10 | 15: iteration 90800/ 125429 | consumed samples: 23244800 | consumed tokens: 47605350400 | elapsed time per iteration (s): 1.03 | learning rate: 5.239E-05 | global batch size: 256 | lm loss: 1.943952E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.541 | TFLOPs: 41.24 | 15: iteration 90810/ 125429 | consumed samples: 23247360 | consumed tokens: 47610593280 | elapsed time per iteration (s): 1.05 | learning rate: 5.237E-05 | global batch size: 256 | lm loss: 1.925806E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.163 | TFLOPs: 40.18 | 15: iteration 90820/ 125429 | consumed samples: 23249920 | consumed tokens: 47615836160 | elapsed time per iteration (s): 1.04 | learning rate: 5.235E-05 | global batch size: 256 | lm loss: 1.926861E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.279 | TFLOPs: 40.53 | 15: iteration 90830/ 125429 | consumed samples: 23252480 | consumed tokens: 47621079040 | elapsed time per iteration (s): 1.04 | learning rate: 5.233E-05 | global batch size: 256 | lm loss: 1.887204E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.288 | TFLOPs: 40.54 | 15: iteration 90840/ 125429 | consumed samples: 23255040 | consumed tokens: 47626321920 | elapsed time per iteration (s): 1.05 | learning rate: 5.232E-05 | global batch size: 256 | lm loss: 1.898507E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.165 | TFLOPs: 40.35 | 15: iteration 90850/ 125429 | consumed samples: 23257600 | 
consumed tokens: 47631564800 | elapsed time per iteration (s): 1.06 | learning rate: 5.230E-05 | global batch size: 256 | lm loss: 1.940233E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.898 | TFLOPs: 39.98 | 15: iteration 90860/ 125429 | consumed samples: 23260160 | consumed tokens: 47636807680 | elapsed time per iteration (s): 1.03 | learning rate: 5.228E-05 | global batch size: 256 | lm loss: 1.899957E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.036 | TFLOPs: 41.16 | 15: iteration 90870/ 125429 | consumed samples: 23262720 | consumed tokens: 47642050560 | elapsed time per iteration (s): 1.03 | learning rate: 5.226E-05 | global batch size: 256 | lm loss: 1.921605E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.380 | TFLOPs: 41.05 | 15: iteration 90880/ 125429 | consumed samples: 23265280 | consumed tokens: 47647293440 | elapsed time per iteration (s): 1.03 | learning rate: 5.225E-05 | global batch size: 256 | lm loss: 1.937069E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.893 | TFLOPs: 40.97 | 15: iteration 90890/ 125429 | consumed samples: 23267840 | consumed tokens: 47652536320 | elapsed time per iteration (s): 1.08 | learning rate: 5.223E-05 | global batch size: 256 | lm loss: 1.958530E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.006 | TFLOPs: 39.17 | 15: iteration 90900/ 125429 | consumed samples: 23270400 | consumed tokens: 47657779200 | elapsed time per iteration (s): 1.07 | learning rate: 5.221E-05 | global batch size: 256 | lm loss: 1.911627E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 238.563 | TFLOPs: 39.42 | 15: iteration 90910/ 125429 | consumed samples: 23272960 | consumed tokens: 47663022080 | elapsed time per iteration (s): 1.06 | learning rate: 5.219E-05 | global batch size: 256 | lm loss: 1.937174E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.150 | TFLOPs: 39.85 | 15: iteration 90920/ 125429 | consumed samples: 23275520 | consumed tokens: 47668264960 | elapsed time per iteration (s): 1.05 | learning rate: 5.218E-05 | global batch size: 256 | lm loss: 1.926502E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.154 | TFLOPs: 40.35 | 15: iteration 90930/ 125429 | consumed samples: 23278080 | consumed tokens: 47673507840 | elapsed time per iteration (s): 1.08 | learning rate: 5.216E-05 | global batch size: 256 | lm loss: 1.925291E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.770 | TFLOPs: 39.13 | 15: iteration 90940/ 125429 | consumed samples: 23280640 | consumed tokens: 47678750720 | elapsed time per iteration (s): 1.03 | learning rate: 5.214E-05 | global batch size: 256 | lm loss: 1.907196E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.711 | TFLOPs: 41.10 | 15: iteration 90950/ 125429 | consumed samples: 23283200 | consumed tokens: 47683993600 | elapsed time per iteration (s): 1.10 | learning rate: 5.212E-05 | global batch size: 256 | lm loss: 1.914784E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.467 | TFLOPs: 38.42 |
15: iteration 90960/ 125429 | consumed samples: 23285760 | consumed tokens: 47689236480 | elapsed time per iteration (s): 1.02 | learning rate: 5.211E-05 | global batch size: 256 | lm loss: 1.938013E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.284 | TFLOPs: 41.53 | 15: iteration 90970/ 125429 | consumed samples: 23288320 | consumed tokens: 47694479360 | elapsed time per iteration (s): 1.09 | learning rate: 5.209E-05 | global batch size: 256 | lm loss: 1.936914E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.134 | TFLOPs: 38.86 | 15: iteration 90980/ 125429 | consumed samples: 23290880 | consumed tokens: 47699722240 | elapsed time per iteration (s): 2.66 | learning rate: 5.207E-05 | global batch size: 256 | lm loss: 1.920856E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 96.243 | TFLOPs: 15.90 | 15: iteration 90990/ 125429 | consumed samples: 23293440 | consumed tokens: 47704965120 | elapsed time per iteration (s): 1.04 | learning rate: 5.206E-05 | global batch size: 256 | lm loss: 1.909878E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.030 | TFLOPs: 40.66 | 15: iteration 91000/ 125429 | consumed samples: 23296000 | consumed tokens: 47710208000 | elapsed time per iteration (s): 1.03 | learning rate: 5.204E-05 | global batch size: 256 | lm loss: 1.931121E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.659 | TFLOPs: 40.93 |
15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 91000 | lm loss value: 1.882493E+00 | lm loss PPL: 6.569862E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 91000 to checkpoints_1b5 0: [2022-11-26 23:00:08,652] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step91000 is begin to save!
0: [2022-11-26 23:00:08,662] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_01-model_00-model_states.pt... 0: [2022-11-26 23:00:08,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_01-model_00-model_states.pt. 0: [2022-11-26 23:00:08,888] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_03-model_00-model_states.pt... 0: [2022-11-26 23:00:08,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_03-model_00-model_states.pt. 0: [2022-11-26 23:00:08,986] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_04-model_00-model_states.pt... 0: [2022-11-26 23:00:09,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_04-model_00-model_states.pt. 0: [2022-11-26 23:00:09,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_05-model_00-model_states.pt... 0: [2022-11-26 23:00:09,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_05-model_00-model_states.pt. 0: [2022-11-26 23:00:09,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_06-model_00-model_states.pt... 0: [2022-11-26 23:00:09,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_06-model_00-model_states.pt. 0: [2022-11-26 23:00:09,305] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_07-model_00-model_states.pt... 0: [2022-11-26 23:00:09,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_07-model_00-model_states.pt.
0: [2022-11-26 23:00:09,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_08-model_00-model_states.pt... 0: [2022-11-26 23:00:09,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_08-model_00-model_states.pt. 0: [2022-11-26 23:00:09,516] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_09-model_00-model_states.pt... 0: [2022-11-26 23:00:09,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_09-model_00-model_states.pt. 0: [2022-11-26 23:00:09,628] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_10-model_00-model_states.pt... 0: [2022-11-26 23:00:09,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_10-model_00-model_states.pt. 0: [2022-11-26 23:00:09,732] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_11-model_00-model_states.pt... 0: [2022-11-26 23:00:09,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_11-model_00-model_states.pt. 0: [2022-11-26 23:00:09,834] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_12-model_00-model_states.pt... 0: [2022-11-26 23:00:09,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_12-model_00-model_states.pt. 0: [2022-11-26 23:00:09,939] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_13-model_00-model_states.pt... 0: [2022-11-26 23:00:10,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 23:00:10,042] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_14-model_00-model_states.pt... 0: [2022-11-26 23:00:10,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_14-model_00-model_states.pt. 0: [2022-11-26 23:00:10,153] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_15-model_00-model_states.pt... 0: [2022-11-26 23:00:10,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_15-model_00-model_states.pt. 0: [2022-11-26 23:00:10,260] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_16-model_00-model_states.pt... 0: [2022-11-26 23:00:10,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_16-model_00-model_states.pt. 0: [2022-11-26 23:00:10,366] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_17-model_00-model_states.pt... 0: [2022-11-26 23:00:10,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_17-model_00-model_states.pt. 0: [2022-11-26 23:00:10,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_18-model_00-model_states.pt... 0: [2022-11-26 23:00:10,584] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_18-model_00-model_states.pt. 0: [2022-11-26 23:00:10,584] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_19-model_00-model_states.pt... 0: [2022-11-26 23:00:10,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 23:00:10,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_20-model_00-model_states.pt... 0: [2022-11-26 23:00:10,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_20-model_00-model_states.pt. 0: [2022-11-26 23:00:10,796] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_21-model_00-model_states.pt... 0: [2022-11-26 23:00:10,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_21-model_00-model_states.pt. 0: [2022-11-26 23:00:10,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_22-model_00-model_states.pt... 0: [2022-11-26 23:00:11,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_22-model_00-model_states.pt. 0: [2022-11-26 23:00:11,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_23-model_00-model_states.pt... 0: [2022-11-26 23:00:11,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_23-model_00-model_states.pt. 0: [2022-11-26 23:00:11,117] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_24-model_00-model_states.pt... 0: [2022-11-26 23:00:11,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_24-model_00-model_states.pt. 0: [2022-11-26 23:00:11,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_25-model_00-model_states.pt... 0: [2022-11-26 23:00:11,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 23:00:11,330] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_26-model_00-model_states.pt... 0: [2022-11-26 23:00:11,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_26-model_00-model_states.pt. 0: [2022-11-26 23:00:11,438] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_27-model_00-model_states.pt... 0: [2022-11-26 23:00:11,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_27-model_00-model_states.pt. 0: [2022-11-26 23:00:11,548] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_28-model_00-model_states.pt... 0: [2022-11-26 23:00:11,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_28-model_00-model_states.pt. 0: [2022-11-26 23:00:11,653] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_29-model_00-model_states.pt... 0: [2022-11-26 23:00:11,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_29-model_00-model_states.pt. 0: [2022-11-26 23:00:11,759] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_30-model_00-model_states.pt... 0: [2022-11-26 23:00:11,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_30-model_00-model_states.pt. 0: [2022-11-26 23:00:11,864] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/layer_32-model_00-model_states.pt... 0: [2022-11-26 23:00:11,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 23:00:11,872] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step91000/mp_rank_00_model_states.pt 0: [2022-11-26 23:00:11,872] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/mp_rank_00_model_states.pt... 0: [2022-11-26 23:00:11,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/mp_rank_00_model_states.pt. 0: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 
10: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
7: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
13: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
5: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
9: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
3: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
8: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:00:11,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:00:11,915] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step91000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:00:12,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:00:12,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
0: [2022-11-26 23:00:12,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 23:00:12,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 0: [2022-11-26 23:00:12,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:00:12,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 23:00:12,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 13: [2022-11-26 23:00:12,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:00:12,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 23:00:12,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 12: [2022-11-26 23:00:12,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:00:12,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 23:00:12,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 7: [2022-11-26 23:00:12,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-26 23:00:12,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 23:00:12,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 8: [2022-11-26 23:00:12,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:00:12,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 23:00:12,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 15: [2022-11-26 23:00:12,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:00:12,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:00:12,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 23:00:12,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 13: [2022-11-26 23:00:12,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:00:12,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 23:00:12,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
9: [2022-11-26 23:00:12,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:00:12,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:00:12,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:00:12,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 7: [2022-11-26 23:00:12,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:00:12,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 7: [2022-11-26 23:00:12,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 13: [2022-11-26 23:00:12,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 13: [2022-11-26 23:00:12,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 7: [2022-11-26 23:00:12,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 7: [2022-11-26 23:00:12,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-26 23:00:12,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 23:00:12,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 12: [2022-11-26 23:00:12,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:00:12,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 23:00:12,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 5: [2022-11-26 23:00:12,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:00:12,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 23:00:12,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 1: [2022-11-26 23:00:12,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:00:12,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 23:00:12,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:00:12,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
1: [2022-11-26 23:00:12,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 10: [2022-11-26 23:00:12,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:00:12,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 10: [2022-11-26 23:00:12,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 23:00:12,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 5: [2022-11-26 23:00:12,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:00:12,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 23:00:12,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 9: [2022-11-26 23:00:12,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 5: [2022-11-26 23:00:12,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:00:12,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
5: [2022-11-26 23:00:12,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 23:00:12,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 2: [2022-11-26 23:00:12,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:00:12,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 23:00:12,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 4: [2022-11-26 23:00:12,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:00:12,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 23:00:12,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 4: [2022-11-26 23:00:12,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:00:12,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 23:00:12,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 4: [2022-11-26 23:00:12,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-26 23:00:12,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 23:00:12,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 4: [2022-11-26 23:00:12,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:00:12,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 23:00:12,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 4: [2022-11-26 23:00:12,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:00:12,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 23:00:12,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 2: [2022-11-26 23:00:12,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:00:12,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 23:00:12,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 2: [2022-11-26 23:00:12,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-26 23:00:12,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 23:00:12,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 11: [2022-11-26 23:00:12,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:00:12,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 23:00:12,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 5: [2022-11-26 23:00:12,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:00:12,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 23:00:12,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 11: [2022-11-26 23:00:12,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:00:12,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 23:00:12,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 0: [2022-11-26 23:00:12,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-26 23:00:12,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 23:00:12,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 1: [2022-11-26 23:00:12,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:00:12,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:00:12,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 23:00:12,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 9: [2022-11-26 23:00:12,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:00:12,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 23:00:12,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 11: [2022-11-26 23:00:12,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:00:12,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-26 23:00:12,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 23:00:12,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 23:00:12,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 11: [2022-11-26 23:00:12,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 12: [2022-11-26 23:00:12,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:00:12,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 23:00:12,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 10: [2022-11-26 23:00:12,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:00:12,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 23:00:12,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 0: [2022-11-26 23:00:12,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-26 23:00:12,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 23:00:12,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 5: [2022-11-26 23:00:12,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:00:12,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 23:00:12,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 11: [2022-11-26 23:00:12,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:00:12,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 23:00:12,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 5: [2022-11-26 23:00:12,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:00:12,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 23:00:12,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 7: [2022-11-26 23:00:12,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-26 23:00:12,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:00:12,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 23:00:12,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 23:00:12,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 7: [2022-11-26 23:00:12,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 7: [2022-11-26 23:00:12,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:00:12,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 23:00:12,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 5: [2022-11-26 23:00:12,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:00:12,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 23:00:12,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 11: [2022-11-26 23:00:12,105] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-26 23:00:12,105] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 23:00:12,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 2: [2022-11-26 23:00:12,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:00:12,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 23:00:12,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 1: [2022-11-26 23:00:12,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 23:00:12,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 1: [2022-11-26 23:00:12,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:00:12,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 23:00:12,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 14: [2022-11-26 23:00:12,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-26 23:00:12,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 23:00:12,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 14: [2022-11-26 23:00:12,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:00:12,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 23:00:12,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 14: [2022-11-26 23:00:12,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:00:12,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 23:00:12,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 14: [2022-11-26 23:00:12,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:00:12,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 23:00:12,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 14: [2022-11-26 23:00:12,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 23:00:12,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 23:00:12,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 14: [2022-11-26 23:00:12,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:00:12,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 23:00:12,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 14: [2022-11-26 23:00:12,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:00:12,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 23:00:12,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 14: [2022-11-26 23:00:12,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:00:12,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 23:00:12,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 4: [2022-11-26 23:00:12,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-26 23:00:12,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 23:00:12,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 1: [2022-11-26 23:00:12,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:00:12,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 23:00:12,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 4: [2022-11-26 23:00:12,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:00:12,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 23:00:12,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 12: [2022-11-26 23:00:12,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:00:12,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 23:00:12,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 2: [2022-11-26 23:00:12,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 23:00:12,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 23:00:12,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 10: [2022-11-26 23:00:12,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:00:12,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 23:00:12,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 10: [2022-11-26 23:00:12,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:00:12,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 23:00:12,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 10: [2022-11-26 23:00:12,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:00:12,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 23:00:12,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 13: [2022-11-26 23:00:12,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
8: [2022-11-26 23:00:12,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:00:12,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 8: [2022-11-26 23:00:12,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:00:12,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 8: [2022-11-26 23:00:12,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 23:00:12,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 13: [2022-11-26 23:00:12,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:00:12,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 13: [2022-11-26 23:00:12,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 8: [2022-11-26 23:00:12,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 13: [2022-11-26 23:00:12,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 13: [2022-11-26 23:00:12,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-26 23:00:12,110] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 23:00:12,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 7: [2022-11-26 23:00:12,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:00:12,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 23:00:12,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 12: [2022-11-26 23:00:12,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:00:12,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:00:12,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:00:12,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 23:00:12,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 23:00:12,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 23:00:12,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
12: [2022-11-26 23:00:12,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 12: [2022-11-26 23:00:12,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 2: [2022-11-26 23:00:12,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:00:12,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 23:00:12,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 2: [2022-11-26 23:00:12,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:00:12,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 23:00:12,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 2: [2022-11-26 23:00:12,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:00:12,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 23:00:12,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 8: [2022-11-26 23:00:12,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-26 23:00:12,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:00:12,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 23:00:12,120] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 23:00:12,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 8: [2022-11-26 23:00:12,120] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 7: [2022-11-26 23:00:12,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:00:12,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 23:00:12,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 5: [2022-11-26 23:00:12,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:00:12,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 23:00:12,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 1: [2022-11-26 23:00:12,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-26 23:00:12,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 23:00:12,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 10: [2022-11-26 23:00:12,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:00:12,130] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 23:00:12,130] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 1: [2022-11-26 23:00:12,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:00:12,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 23:00:12,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 9: [2022-11-26 23:00:12,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:00:12,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:00:12,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 23:00:12,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 23:00:12,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 23:00:12,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 23:00:12,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 9: [2022-11-26 23:00:12,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 9: [2022-11-26 23:00:12,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 10: [2022-11-26 23:00:12,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:00:12,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 23:00:12,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 9: [2022-11-26 23:00:12,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:00:12,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 10: [2022-11-26 23:00:12,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
9: [2022-11-26 23:00:12,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 10: [2022-11-26 23:00:12,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 9: [2022-11-26 23:00:12,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:00:12,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 9: [2022-11-26 23:00:12,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 23:00:12,137] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 9: [2022-11-26 23:00:12,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:00:12,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 11: [2022-11-26 23:00:12,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:00:12,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 11: [2022-11-26 23:00:12,137] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 23:00:12,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
0: [2022-11-26 23:00:12,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:00:12,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:00:12,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 23:00:12,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 23:00:12,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 0: [2022-11-26 23:00:12,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 12: [2022-11-26 23:00:12,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:00:12,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 23:00:12,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 8: [2022-11-26 23:00:12,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:00:12,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 23:00:12,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
8: [2022-11-26 23:00:12,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:00:12,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 23:00:12,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 8: [2022-11-26 23:00:12,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:00:12,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 23:00:12,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 4: [2022-11-26 23:00:12,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:00:12,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 23:00:12,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 15: [2022-11-26 23:00:12,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 23:00:12,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 15: [2022-11-26 23:00:12,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-26 23:00:12,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:00:12,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 23:00:12,093] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 23:00:12,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 15: [2022-11-26 23:00:12,093] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 15: [2022-11-26 23:00:12,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:00:12,095] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 23:00:12,095] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 15: [2022-11-26 23:00:12,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:00:12,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 23:00:12,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 15: [2022-11-26 23:00:12,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-26 23:00:12,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 23:00:12,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 15: [2022-11-26 23:00:12,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:00:12,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:00:12,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 23:00:12,117] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 23:00:12,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 15: [2022-11-26 23:00:12,117] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 1: [2022-11-26 23:00:12,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:00:12,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 23:00:12,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 6: [2022-11-26 23:00:12,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 23:00:12,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:00:12,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:00:12,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:00:12,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 23:00:12,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 23:00:12,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 23:00:12,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 6: [2022-11-26 23:00:12,169] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 23:00:12,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 6: [2022-11-26 23:00:12,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 6: [2022-11-26 23:00:12,169] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 6: [2022-11-26 23:00:12,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-26 23:00:12,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 23:00:12,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 6: [2022-11-26 23:00:12,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:00:12,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 23:00:12,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 6: [2022-11-26 23:00:12,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:00:12,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 23:00:12,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:00:12,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 6: [2022-11-26 23:00:12,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 23:00:12,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 0: [2022-11-26 23:00:12,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-26 23:00:12,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 23:00:12,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 0: [2022-11-26 23:00:12,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 23:00:12,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 3: [2022-11-26 23:00:12,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:00:12,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:00:12,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:00:12,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:00:12,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 23:00:12,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:00:12,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:00:12,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 23:00:12,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 23:00:12,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 23:00:12,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:00:12,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 3: [2022-11-26 23:00:12,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 23:00:12,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 3: [2022-11-26 23:00:12,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 23:00:12,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 23:00:12,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 3: [2022-11-26 23:00:12,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 23:00:12,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step91000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 23:00:12,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now! 
3: [2022-11-26 23:00:12,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now!
3: [2022-11-26 23:00:12,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now!
3: [2022-11-26 23:00:12,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now!
3: [2022-11-26 23:00:12,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step91000 is ready now!
0: successfully saved checkpoint at iteration 91000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3780.78
15: iteration 91010/ 125429 | consumed samples: 23298560 | consumed tokens: 47715450880 | elapsed time per iteration (s): 1.45 | learning rate: 5.202E-05 | global batch size: 256 | lm loss: 1.915542E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 175.966 | TFLOPs: 29.08 |
15: iteration 91020/ 125429 | consumed samples: 23301120 | consumed tokens: 47720693760 | elapsed time per iteration (s): 1.05 | learning rate: 5.200E-05 | global batch size: 256 | lm loss: 1.947663E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.887 | TFLOPs: 40.14 |
15: iteration 91030/ 125429 | consumed samples: 23303680 | consumed tokens: 47725936640 | elapsed time per iteration (s): 1.04 | learning rate: 5.199E-05 | global batch size: 256 | lm loss: 1.878646E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.383 | TFLOPs: 40.55 |
15: iteration 91040/ 125429 | consumed samples: 23306240 | consumed tokens: 47731179520 | elapsed time per iteration (s): 1.05 | learning rate: 5.197E-05 | global batch size: 256 | lm loss: 1.927487E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.917 | TFLOPs: 40.47 |
15: iteration 91050/ 125429 | consumed samples: 23308800 | consumed tokens: 47736422400 | elapsed time per iteration (s): 1.03 | learning rate: 5.195E-05 | global batch size: 256 | lm loss: 1.961262E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.572 | TFLOPs: 41.08 |
15: iteration 91060/ 125429 | consumed samples: 23311360 | consumed tokens: 47741665280 | elapsed time per iteration (s): 1.21 | learning rate: 5.193E-05 | global batch size: 256 | lm loss: 1.909168E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 210.775 | TFLOPs: 34.83 |
15: iteration 91070/ 125429 | consumed samples: 23313920 | consumed tokens: 47746908160 | elapsed time per iteration (s): 1.05 | learning rate: 5.192E-05 | global batch size: 256 | lm loss: 1.953201E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.707 | TFLOPs: 40.44 |
15: iteration 91080/ 125429 | consumed samples: 23316480 | consumed tokens: 47752151040 | elapsed time per iteration (s): 1.08 | learning rate: 5.190E-05 | global batch size: 256 | lm loss: 1.930413E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.129 | TFLOPs: 39.19 |
15: iteration 91090/ 125429 | consumed samples: 23319040 | consumed tokens: 47757393920 | elapsed time per iteration (s): 1.04 | learning rate: 5.188E-05 | global batch size: 256 | lm loss: 1.900913E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.099 | TFLOPs: 40.50 |
15: iteration 91100/ 125429 | consumed samples: 23321600 | consumed tokens: 47762636800 | elapsed time per iteration (s): 1.03 | learning rate: 5.186E-05 | global batch size: 256 | lm loss: 1.911031E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.984 | TFLOPs: 40.98 |
15: iteration 91110/ 125429 | consumed samples: 23324160 | consumed tokens: 47767879680 | elapsed time per iteration (s): 1.08 | learning rate: 5.185E-05 | global batch size: 256 | lm loss: 1.924757E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.071 | TFLOPs: 39.34 |
15: iteration 91120/ 125429 | consumed samples: 23326720 | consumed tokens: 47773122560 | elapsed time per iteration (s): 1.04 | learning rate: 5.183E-05 | global batch size: 256 | lm loss: 1.963740E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.577 | TFLOPs: 40.75 |
15: iteration 91130/ 125429 | consumed samples: 23329280 | consumed tokens: 47778365440 | elapsed time per iteration (s): 1.05 | learning rate: 5.181E-05 | global batch size: 256 | lm loss: 1.919792E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.215 | TFLOPs: 40.36 |
15: iteration 91140/ 125429 | consumed samples: 23331840 | consumed tokens: 47783608320 | elapsed time per iteration (s): 1.04 | learning rate: 5.179E-05 | global batch size: 256 | lm loss: 1.929683E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.613 | TFLOPs: 40.75 |
15: iteration 91150/ 125429 | consumed samples: 23334400 | consumed tokens: 47788851200 | elapsed time per iteration (s): 1.02 | learning rate: 5.178E-05 | global batch size: 256 | lm loss: 1.933621E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.837 | TFLOPs: 41.29 |
15: iteration 91160/ 125429 | consumed samples: 23336960 | consumed tokens: 47794094080 | elapsed time per iteration (s): 1.06 | learning rate: 5.176E-05 | global batch size: 256 | lm loss: 1.940332E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.733 | TFLOPs: 39.95 |
15: iteration 91170/ 125429 | consumed samples: 23339520 | consumed tokens: 47799336960 | elapsed time per iteration (s): 1.05 | learning rate: 5.174E-05 | global batch size: 256 | lm loss: 1.926674E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.818 | TFLOPs: 40.13 |
15: iteration 91180/ 125429 | consumed samples: 23342080 | consumed tokens: 47804579840 | elapsed time per iteration (s): 1.08 | learning rate: 5.172E-05 | global batch size: 256 | lm loss: 1.917808E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.110 | TFLOPs: 39.35 |
15: iteration 91190/ 125429 | consumed samples: 23344640 | consumed tokens: 47809822720 | elapsed time per iteration (s): 1.06 | learning rate: 5.171E-05 | global batch size: 256 | lm loss: 1.943642E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.372 | TFLOPs: 40.05 |
15: iteration 91200/ 125429 | consumed samples: 23347200 | consumed tokens: 47815065600 | elapsed time per iteration (s): 1.04 | learning rate: 5.169E-05 | global batch size: 256 | lm loss: 1.939947E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.034 | TFLOPs: 40.66 |
15: iteration 91210/ 125429 | consumed samples: 23349760 | consumed tokens: 47820308480 | elapsed time per iteration (s): 1.04 | learning rate: 5.167E-05 | global batch size: 256 | lm loss: 1.957505E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.087 | TFLOPs: 40.50 |
15: iteration 91220/ 125429 | consumed samples: 23352320 | consumed tokens: 47825551360 | elapsed time per iteration (s): 1.05 | learning rate: 5.166E-05 | global batch size: 256 | lm loss: 1.927909E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.111 | TFLOPs: 40.34 |
15: iteration 91230/ 125429 | consumed samples: 23354880 | consumed tokens: 47830794240 | elapsed time per iteration (s): 1.05 | learning rate: 5.164E-05 | global batch size: 256 | lm loss: 1.923884E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.024 | TFLOPs: 40.16 |
15: iteration 91240/ 125429 | consumed samples: 23357440 | consumed tokens: 47836037120 | elapsed time per iteration (s): 1.09 | learning rate: 5.162E-05 | global batch size: 256 | lm loss: 1.916364E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.787 | TFLOPs: 38.97 |
15: iteration 91250/ 125429 | consumed samples: 23360000 | consumed tokens: 47841280000 | elapsed time per iteration (s): 1.06 | learning rate: 5.160E-05 | global batch size: 256 | lm loss: 1.920302E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.474 | TFLOPs: 39.74 |
15: iteration 91260/ 125429 | consumed samples: 23362560 | consumed tokens: 47846522880 | elapsed time per iteration (s): 1.03 | learning rate: 5.159E-05 | global batch size: 256 | lm loss: 1.927303E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.797 | TFLOPs: 40.95 |
15: iteration 91270/ 125429 | consumed samples: 23365120 | consumed tokens: 47851765760 | elapsed time per iteration (s): 1.06 | learning rate: 5.157E-05 | global batch size: 256 | lm loss: 1.957964E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped
iterations: 0 | number of nan iterations: 0 | samples per second: 240.719 | TFLOPs: 39.78 | 15: iteration 91280/ 125429 | consumed samples: 23367680 | consumed tokens: 47857008640 | elapsed time per iteration (s): 1.03 | learning rate: 5.155E-05 | global batch size: 256 | lm loss: 1.894165E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.967 | TFLOPs: 40.98 | 15: iteration 91290/ 125429 | consumed samples: 23370240 | consumed tokens: 47862251520 | elapsed time per iteration (s): 1.04 | learning rate: 5.153E-05 | global batch size: 256 | lm loss: 1.926519E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.438 | TFLOPs: 40.56 | 15: iteration 91300/ 125429 | consumed samples: 23372800 | consumed tokens: 47867494400 | elapsed time per iteration (s): 1.04 | learning rate: 5.152E-05 | global batch size: 256 | lm loss: 1.916120E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.062 | TFLOPs: 40.66 | 15: iteration 91310/ 125429 | consumed samples: 23375360 | consumed tokens: 47872737280 | elapsed time per iteration (s): 1.03 | learning rate: 5.150E-05 | global batch size: 256 | lm loss: 1.907858E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.062 | TFLOPs: 40.99 | 15: iteration 91320/ 125429 | consumed samples: 23377920 | consumed tokens: 47877980160 | elapsed time per iteration (s): 1.09 | learning rate: 5.148E-05 | global batch size: 256 | lm loss: 1.922597E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.620 | TFLOPs: 38.94 | 15: iteration 91330/ 125429 | consumed samples: 23380480 | consumed tokens: 47883223040 | elapsed time per iteration (s): 1.03 | learning rate: 
5.146E-05 | global batch size: 256 | lm loss: 1.903353E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.309 | TFLOPs: 41.20 | 15: iteration 91340/ 125429 | consumed samples: 23383040 | consumed tokens: 47888465920 | elapsed time per iteration (s): 1.09 | learning rate: 5.145E-05 | global batch size: 256 | lm loss: 1.929231E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.729 | TFLOPs: 38.96 | 15: iteration 91350/ 125429 | consumed samples: 23385600 | consumed tokens: 47893708800 | elapsed time per iteration (s): 1.06 | learning rate: 5.143E-05 | global batch size: 256 | lm loss: 1.923424E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.811 | TFLOPs: 39.96 | 15: iteration 91360/ 125429 | consumed samples: 23388160 | consumed tokens: 47898951680 | elapsed time per iteration (s): 1.07 | learning rate: 5.141E-05 | global batch size: 256 | lm loss: 1.901056E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.055 | TFLOPs: 39.51 | 15: iteration 91370/ 125429 | consumed samples: 23390720 | consumed tokens: 47904194560 | elapsed time per iteration (s): 1.03 | learning rate: 5.140E-05 | global batch size: 256 | lm loss: 1.881417E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.885 | TFLOPs: 41.13 | 15: iteration 91380/ 125429 | consumed samples: 23393280 | consumed tokens: 47909437440 | elapsed time per iteration (s): 1.10 | learning rate: 5.138E-05 | global batch size: 256 | lm loss: 1.912584E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.591 | TFLOPs: 38.60 | 15: iteration 91390/ 125429 | 
consumed samples: 23395840 | consumed tokens: 47914680320 | elapsed time per iteration (s): 1.07 | learning rate: 5.136E-05 | global batch size: 256 | lm loss: 1.920207E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.830 | TFLOPs: 39.63 | 15: iteration 91400/ 125429 | consumed samples: 23398400 | consumed tokens: 47919923200 | elapsed time per iteration (s): 1.05 | learning rate: 5.134E-05 | global batch size: 256 | lm loss: 1.910802E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.515 | TFLOPs: 40.41 | 15: iteration 91410/ 125429 | consumed samples: 23400960 | consumed tokens: 47925166080 | elapsed time per iteration (s): 1.14 | learning rate: 5.133E-05 | global batch size: 256 | lm loss: 1.911564E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.335 | TFLOPs: 37.24 | 15: iteration 91420/ 125429 | consumed samples: 23403520 | consumed tokens: 47930408960 | elapsed time per iteration (s): 1.04 | learning rate: 5.131E-05 | global batch size: 256 | lm loss: 1.936648E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.160 | TFLOPs: 40.51 | 15: iteration 91430/ 125429 | consumed samples: 23406080 | consumed tokens: 47935651840 | elapsed time per iteration (s): 1.04 | learning rate: 5.129E-05 | global batch size: 256 | lm loss: 1.927254E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.583 | TFLOPs: 40.58 | 15: iteration 91440/ 125429 | consumed samples: 23408640 | consumed tokens: 47940894720 | elapsed time per iteration (s): 1.03 | learning rate: 5.127E-05 | global batch size: 256 | lm loss: 1.948518E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 248.101 | TFLOPs: 41.00 | 15: iteration 91450/ 125429 | consumed samples: 23411200 | consumed tokens: 47946137600 | elapsed time per iteration (s): 1.04 | learning rate: 5.126E-05 | global batch size: 256 | lm loss: 1.930450E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.066 | TFLOPs: 40.66 | 15: iteration 91460/ 125429 | consumed samples: 23413760 | consumed tokens: 47951380480 | elapsed time per iteration (s): 1.05 | learning rate: 5.124E-05 | global batch size: 256 | lm loss: 1.928314E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.699 | TFLOPs: 40.11 | 15: iteration 91470/ 125429 | consumed samples: 23416320 | consumed tokens: 47956623360 | elapsed time per iteration (s): 1.05 | learning rate: 5.122E-05 | global batch size: 256 | lm loss: 1.937070E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.408 | TFLOPs: 40.23 | 15: iteration 91480/ 125429 | consumed samples: 23418880 | consumed tokens: 47961866240 | elapsed time per iteration (s): 1.06 | learning rate: 5.121E-05 | global batch size: 256 | lm loss: 1.917513E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.209 | TFLOPs: 40.03 | 15: iteration 91490/ 125429 | consumed samples: 23421440 | consumed tokens: 47967109120 | elapsed time per iteration (s): 1.03 | learning rate: 5.119E-05 | global batch size: 256 | lm loss: 1.936178E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.780 | TFLOPs: 41.11 | 15: iteration 91500/ 125429 | consumed samples: 23424000 | consumed tokens: 47972352000 | elapsed time per iteration (s): 1.05 | learning rate: 5.117E-05 | global batch 
size: 256 | lm loss: 1.934915E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.208 | TFLOPs: 40.36 | 15: iteration 91510/ 125429 | consumed samples: 23426560 | consumed tokens: 47977594880 | elapsed time per iteration (s): 1.05 | learning rate: 5.115E-05 | global batch size: 256 | lm loss: 1.888662E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.302 | TFLOPs: 40.37 | 15: iteration 91520/ 125429 | consumed samples: 23429120 | consumed tokens: 47982837760 | elapsed time per iteration (s): 1.05 | learning rate: 5.114E-05 | global batch size: 256 | lm loss: 1.938338E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.264 | TFLOPs: 40.37 | 15: iteration 91530/ 125429 | consumed samples: 23431680 | consumed tokens: 47988080640 | elapsed time per iteration (s): 1.07 | learning rate: 5.112E-05 | global batch size: 256 | lm loss: 1.965981E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.691 | TFLOPs: 39.45 | 15: iteration 91540/ 125429 | consumed samples: 23434240 | consumed tokens: 47993323520 | elapsed time per iteration (s): 1.03 | learning rate: 5.110E-05 | global batch size: 256 | lm loss: 1.935855E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.355 | TFLOPs: 41.04 | 15: iteration 91550/ 125429 | consumed samples: 23436800 | consumed tokens: 47998566400 | elapsed time per iteration (s): 1.06 | learning rate: 5.109E-05 | global batch size: 256 | lm loss: 1.910769E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.361 | TFLOPs: 40.05 | 15: iteration 91560/ 125429 | consumed samples: 23439360 | 
consumed tokens: 48003809280 | elapsed time per iteration (s): 1.03 | learning rate: 5.107E-05 | global batch size: 256 | lm loss: 1.950965E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.487 | TFLOPs: 41.23 | 15: iteration 91570/ 125429 | consumed samples: 23441920 | consumed tokens: 48009052160 | elapsed time per iteration (s): 1.06 | learning rate: 5.105E-05 | global batch size: 256 | lm loss: 1.937974E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.536 | TFLOPs: 39.92 | 15: iteration 91580/ 125429 | consumed samples: 23444480 | consumed tokens: 48014295040 | elapsed time per iteration (s): 1.02 | learning rate: 5.103E-05 | global batch size: 256 | lm loss: 1.924396E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.268 | TFLOPs: 41.36 | 15: iteration 91590/ 125429 | consumed samples: 23447040 | consumed tokens: 48019537920 | elapsed time per iteration (s): 1.07 | learning rate: 5.102E-05 | global batch size: 256 | lm loss: 1.924851E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.707 | TFLOPs: 39.61 | 15: iteration 91600/ 125429 | consumed samples: 23449600 | consumed tokens: 48024780800 | elapsed time per iteration (s): 1.06 | learning rate: 5.100E-05 | global batch size: 256 | lm loss: 1.934071E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.073 | TFLOPs: 40.00 | 15: iteration 91610/ 125429 | consumed samples: 23452160 | consumed tokens: 48030023680 | elapsed time per iteration (s): 1.04 | learning rate: 5.098E-05 | global batch size: 256 | lm loss: 1.912802E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 246.519 | TFLOPs: 40.74 | 15: iteration 91620/ 125429 | consumed samples: 23454720 | consumed tokens: 48035266560 | elapsed time per iteration (s): 1.06 | learning rate: 5.096E-05 | global batch size: 256 | lm loss: 1.948436E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.374 | TFLOPs: 39.89 | 15: iteration 91630/ 125429 | consumed samples: 23457280 | consumed tokens: 48040509440 | elapsed time per iteration (s): 1.05 | learning rate: 5.095E-05 | global batch size: 256 | lm loss: 1.918105E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.967 | TFLOPs: 40.48 | 15: iteration 91640/ 125429 | consumed samples: 23459840 | consumed tokens: 48045752320 | elapsed time per iteration (s): 1.06 | learning rate: 5.093E-05 | global batch size: 256 | lm loss: 1.943305E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.921 | TFLOPs: 39.81 | 15: iteration 91650/ 125429 | consumed samples: 23462400 | consumed tokens: 48050995200 | elapsed time per iteration (s): 1.03 | learning rate: 5.091E-05 | global batch size: 256 | lm loss: 1.903234E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.084 | TFLOPs: 41.16 | 15: iteration 91660/ 125429 | consumed samples: 23464960 | consumed tokens: 48056238080 | elapsed time per iteration (s): 1.05 | learning rate: 5.090E-05 | global batch size: 256 | lm loss: 1.884309E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.968 | TFLOPs: 40.48 | 15: iteration 91670/ 125429 | consumed samples: 23467520 | consumed tokens: 48061480960 | elapsed time per iteration (s): 1.20 | learning rate: 5.088E-05 | global batch size: 256 | lm loss: 
1.915335E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.043 | TFLOPs: 35.37 | 15: iteration 91680/ 125429 | consumed samples: 23470080 | consumed tokens: 48066723840 | elapsed time per iteration (s): 1.07 | learning rate: 5.086E-05 | global batch size: 256 | lm loss: 1.937923E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.081 | TFLOPs: 39.68 | 15: iteration 91690/ 125429 | consumed samples: 23472640 | consumed tokens: 48071966720 | elapsed time per iteration (s): 1.10 | learning rate: 5.084E-05 | global batch size: 256 | lm loss: 1.896598E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.316 | TFLOPs: 38.56 | 15: iteration 91700/ 125429 | consumed samples: 23475200 | consumed tokens: 48077209600 | elapsed time per iteration (s): 1.03 | learning rate: 5.083E-05 | global batch size: 256 | lm loss: 1.926064E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.160 | TFLOPs: 41.01 | 15: iteration 91710/ 125429 | consumed samples: 23477760 | consumed tokens: 48082452480 | elapsed time per iteration (s): 1.05 | learning rate: 5.081E-05 | global batch size: 256 | lm loss: 1.920034E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.357 | TFLOPs: 40.38 | 15: iteration 91720/ 125429 | consumed samples: 23480320 | consumed tokens: 48087695360 | elapsed time per iteration (s): 1.17 | learning rate: 5.079E-05 | global batch size: 256 | lm loss: 1.924292E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 218.063 | TFLOPs: 36.04 | 15: iteration 91730/ 125429 | consumed samples: 23482880 | consumed tokens: 
48092938240 | elapsed time per iteration (s): 1.03 | learning rate: 5.078E-05 | global batch size: 256 | lm loss: 1.912753E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.375 | TFLOPs: 41.05 | 15: iteration 91740/ 125429 | consumed samples: 23485440 | consumed tokens: 48098181120 | elapsed time per iteration (s): 1.06 | learning rate: 5.076E-05 | global batch size: 256 | lm loss: 1.960567E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.382 | TFLOPs: 39.72 | 15: iteration 91750/ 125429 | consumed samples: 23488000 | consumed tokens: 48103424000 | elapsed time per iteration (s): 1.09 | learning rate: 5.074E-05 | global batch size: 256 | lm loss: 1.899289E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.242 | TFLOPs: 38.71 | 15: iteration 91760/ 125429 | consumed samples: 23490560 | consumed tokens: 48108666880 | elapsed time per iteration (s): 1.05 | learning rate: 5.072E-05 | global batch size: 256 | lm loss: 1.920399E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.115 | TFLOPs: 40.34 | 15: iteration 91770/ 125429 | consumed samples: 23493120 | consumed tokens: 48113909760 | elapsed time per iteration (s): 1.04 | learning rate: 5.071E-05 | global batch size: 256 | lm loss: 1.922563E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.528 | TFLOPs: 40.58 | 15: iteration 91780/ 125429 | consumed samples: 23495680 | consumed tokens: 48119152640 | elapsed time per iteration (s): 1.07 | learning rate: 5.069E-05 | global batch size: 256 | lm loss: 1.904727E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 238.475 | TFLOPs: 39.41 | 15: iteration 91790/ 125429 | consumed samples: 23498240 | consumed tokens: 48124395520 | elapsed time per iteration (s): 1.04 | learning rate: 5.067E-05 | global batch size: 256 | lm loss: 1.914245E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.300 | TFLOPs: 40.54 | 15: iteration 91800/ 125429 | consumed samples: 23500800 | consumed tokens: 48129638400 | elapsed time per iteration (s): 1.05 | learning rate: 5.066E-05 | global batch size: 256 | lm loss: 1.915757E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.503 | TFLOPs: 40.41 | 15: iteration 91810/ 125429 | consumed samples: 23503360 | consumed tokens: 48134881280 | elapsed time per iteration (s): 1.05 | learning rate: 5.064E-05 | global batch size: 256 | lm loss: 1.906780E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.546 | TFLOPs: 40.25 | 15: iteration 91820/ 125429 | consumed samples: 23505920 | consumed tokens: 48140124160 | elapsed time per iteration (s): 1.04 | learning rate: 5.062E-05 | global batch size: 256 | lm loss: 1.917006E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.750 | TFLOPs: 40.78 | 15: iteration 91830/ 125429 | consumed samples: 23508480 | consumed tokens: 48145367040 | elapsed time per iteration (s): 1.03 | learning rate: 5.060E-05 | global batch size: 256 | lm loss: 1.892541E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.586 | TFLOPs: 41.08 | 15: iteration 91840/ 125429 | consumed samples: 23511040 | consumed tokens: 48150609920 | elapsed time per iteration (s): 1.03 | learning rate: 5.059E-05 | global batch size: 256 | lm loss: 1.913045E+00 | grad 
norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.297 | TFLOPs: 41.03 | 15: iteration 91850/ 125429 | consumed samples: 23513600 | consumed tokens: 48155852800 | elapsed time per iteration (s): 1.05 | learning rate: 5.057E-05 | global batch size: 256 | lm loss: 1.911280E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.382 | TFLOPs: 40.39 | 15: iteration 91860/ 125429 | consumed samples: 23516160 | consumed tokens: 48161095680 | elapsed time per iteration (s): 1.07 | learning rate: 5.055E-05 | global batch size: 256 | lm loss: 1.920468E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.841 | TFLOPs: 39.64 | 15: iteration 91870/ 125429 | consumed samples: 23518720 | consumed tokens: 48166338560 | elapsed time per iteration (s): 1.03 | learning rate: 5.054E-05 | global batch size: 256 | lm loss: 1.915420E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.522 | TFLOPs: 41.07 | 15: iteration 91880/ 125429 | consumed samples: 23521280 | consumed tokens: 48171581440 | elapsed time per iteration (s): 1.05 | learning rate: 5.052E-05 | global batch size: 256 | lm loss: 1.937642E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.552 | TFLOPs: 40.41 | 15: iteration 91890/ 125429 | consumed samples: 23523840 | consumed tokens: 48176824320 | elapsed time per iteration (s): 1.07 | learning rate: 5.050E-05 | global batch size: 256 | lm loss: 1.937888E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.909 | TFLOPs: 39.65 | 15: iteration 91900/ 125429 | consumed samples: 23526400 | consumed tokens: 48182067200 | elapsed time 
per iteration (s): 1.03 | learning rate: 5.049E-05 | global batch size: 256 | lm loss: 1.934362E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.881 | TFLOPs: 40.96 | 15: iteration 91910/ 125429 | consumed samples: 23528960 | consumed tokens: 48187310080 | elapsed time per iteration (s): 1.08 | learning rate: 5.047E-05 | global batch size: 256 | lm loss: 1.927424E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.157 | TFLOPs: 39.03 | 15: iteration 91920/ 125429 | consumed samples: 23531520 | consumed tokens: 48192552960 | elapsed time per iteration (s): 1.07 | learning rate: 5.045E-05 | global batch size: 256 | lm loss: 1.934188E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.899 | TFLOPs: 39.48 | 15: iteration 91930/ 125429 | consumed samples: 23534080 | consumed tokens: 48197795840 | elapsed time per iteration (s): 1.06 | learning rate: 5.043E-05 | global batch size: 256 | lm loss: 1.931341E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.076 | TFLOPs: 39.84 | 15: iteration 91940/ 125429 | consumed samples: 23536640 | consumed tokens: 48203038720 | elapsed time per iteration (s): 1.06 | learning rate: 5.042E-05 | global batch size: 256 | lm loss: 1.895077E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.479 | TFLOPs: 40.07 | 15: iteration 91950/ 125429 | consumed samples: 23539200 | consumed tokens: 48208281600 | elapsed time per iteration (s): 1.19 | learning rate: 5.040E-05 | global batch size: 256 | lm loss: 1.941311E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.752 | TFLOPs: 
35.49 | 15: iteration 91960/ 125429 | consumed samples: 23541760 | consumed tokens: 48213524480 | elapsed time per iteration (s): 1.03 | learning rate: 5.038E-05 | global batch size: 256 | lm loss: 1.890726E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.419 | TFLOPs: 40.89 | 15: iteration 91970/ 125429 | consumed samples: 23544320 | consumed tokens: 48218767360 | elapsed time per iteration (s): 1.03 | learning rate: 5.037E-05 | global batch size: 256 | lm loss: 1.904319E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.752 | TFLOPs: 40.94 | 15: iteration 91980/ 125429 | consumed samples: 23546880 | consumed tokens: 48224010240 | elapsed time per iteration (s): 1.05 | learning rate: 5.035E-05 | global batch size: 256 | lm loss: 1.927823E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.887 | TFLOPs: 40.14 | 15: iteration 91990/ 125429 | consumed samples: 23549440 | consumed tokens: 48229253120 | elapsed time per iteration (s): 1.09 | learning rate: 5.033E-05 | global batch size: 256 | lm loss: 1.908041E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.355 | TFLOPs: 38.89 | 0: [2022-11-26 23:17:50,369] [INFO] [logging.py:68:log_dist] [Rank 0] step=92000, skipped=0, lr=[5.031451233752927e-05, 5.031451233752927e-05, 5.031451233752927e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 92000/ 125429 | consumed samples: 23552000 | consumed tokens: 48234496000 | elapsed time per iteration (s): 1.08 | learning rate: 5.031E-05 | global batch size: 256 | lm loss: 1.935931E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.477 | TFLOPs: 39.24 | 0: steps: 92000 loss: 
1.9476 iter time (s): 1.059 samples/sec: 241.817
15: -------------------------------------------------------------------------------------------
15: valid loss at iteration 92000 | lm loss value: 1.792201E+00 | lm loss PPL: 6.002651E+00 |
15: -------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 92000 to checkpoints_1b5
0: [2022-11-26 23:17:50,834] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step92000 is begin to save!
0: [2022-11-26 23:17:50,844] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_01-model_00-model_states.pt...
0: [2022-11-26 23:17:51,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_01-model_00-model_states.pt.
0: [2022-11-26 23:17:51,109] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_03-model_00-model_states.pt...
0: [2022-11-26 23:17:51,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_03-model_00-model_states.pt.
0: [2022-11-26 23:17:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_04-model_00-model_states.pt...
0: [2022-11-26 23:17:51,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_04-model_00-model_states.pt.
0: [2022-11-26 23:17:51,332] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_05-model_00-model_states.pt...
0: [2022-11-26 23:17:51,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_05-model_00-model_states.pt.
0: [2022-11-26 23:17:51,448] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_06-model_00-model_states.pt...
0: [2022-11-26 23:17:51,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_06-model_00-model_states.pt.
0: [2022-11-26 23:17:51,558] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_07-model_00-model_states.pt...
0: [2022-11-26 23:17:51,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_07-model_00-model_states.pt.
0: [2022-11-26 23:17:51,673] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_08-model_00-model_states.pt...
0: [2022-11-26 23:17:51,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_08-model_00-model_states.pt.
0: [2022-11-26 23:17:51,781] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_09-model_00-model_states.pt...
0: [2022-11-26 23:17:51,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_09-model_00-model_states.pt.
0: [2022-11-26 23:17:51,889] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_10-model_00-model_states.pt...
0: [2022-11-26 23:17:51,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_10-model_00-model_states.pt.
0: [2022-11-26 23:17:51,995] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_11-model_00-model_states.pt...
0: [2022-11-26 23:17:52,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_11-model_00-model_states.pt.
0: [2022-11-26 23:17:52,098] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_12-model_00-model_states.pt...
0: [2022-11-26 23:17:52,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_12-model_00-model_states.pt.
0: [2022-11-26 23:17:52,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_13-model_00-model_states.pt...
0: [2022-11-26 23:17:52,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_13-model_00-model_states.pt.
0: [2022-11-26 23:17:52,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_14-model_00-model_states.pt...
0: [2022-11-26 23:17:52,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_14-model_00-model_states.pt.
0: [2022-11-26 23:17:52,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_15-model_00-model_states.pt...
0: [2022-11-26 23:17:52,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_15-model_00-model_states.pt.
0: [2022-11-26 23:17:52,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_16-model_00-model_states.pt...
0: [2022-11-26 23:17:52,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_16-model_00-model_states.pt.
0: [2022-11-26 23:17:52,626] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_17-model_00-model_states.pt...
0: [2022-11-26 23:17:52,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_17-model_00-model_states.pt.
0: [2022-11-26 23:17:52,731] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_18-model_00-model_states.pt...
0: [2022-11-26 23:17:52,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_18-model_00-model_states.pt.
0: [2022-11-26 23:17:52,836] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_19-model_00-model_states.pt...
0: [2022-11-26 23:17:52,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_19-model_00-model_states.pt.
0: [2022-11-26 23:17:52,942] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_20-model_00-model_states.pt...
0: [2022-11-26 23:17:53,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_20-model_00-model_states.pt.
0: [2022-11-26 23:17:53,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_21-model_00-model_states.pt...
0: [2022-11-26 23:17:53,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_21-model_00-model_states.pt.
0: [2022-11-26 23:17:53,152] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_22-model_00-model_states.pt...
0: [2022-11-26 23:17:53,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_22-model_00-model_states.pt.
0: [2022-11-26 23:17:53,254] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_23-model_00-model_states.pt...
0: [2022-11-26 23:17:53,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_23-model_00-model_states.pt.
0: [2022-11-26 23:17:53,363] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_24-model_00-model_states.pt...
0: [2022-11-26 23:17:53,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_24-model_00-model_states.pt. 0: [2022-11-26 23:17:53,467] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_25-model_00-model_states.pt... 0: [2022-11-26 23:17:53,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_25-model_00-model_states.pt. 0: [2022-11-26 23:17:53,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_26-model_00-model_states.pt... 0: [2022-11-26 23:17:53,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_26-model_00-model_states.pt. 0: [2022-11-26 23:17:53,675] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_27-model_00-model_states.pt... 0: [2022-11-26 23:17:53,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_27-model_00-model_states.pt. 0: [2022-11-26 23:17:53,776] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_28-model_00-model_states.pt... 0: [2022-11-26 23:17:53,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_28-model_00-model_states.pt. 0: [2022-11-26 23:17:53,880] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_29-model_00-model_states.pt... 0: [2022-11-26 23:17:53,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_29-model_00-model_states.pt. 0: [2022-11-26 23:17:53,986] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_30-model_00-model_states.pt... 
0: [2022-11-26 23:17:54,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_30-model_00-model_states.pt. 0: [2022-11-26 23:17:54,091] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/layer_32-model_00-model_states.pt... 0: [2022-11-26 23:17:54,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/layer_32-model_00-model_states.pt. 0: [2022-11-26 23:17:54,096] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step92000/mp_rank_00_model_states.pt 0: [2022-11-26 23:17:54,096] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/mp_rank_00_model_states.pt... 0: [2022-11-26 23:17:54,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/mp_rank_00_model_states.pt. 0: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
12: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
0: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
10: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
15: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
13: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 
15: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
13: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
11: [2022-11-26 23:17:54,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step92000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:17:54,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:17:54,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 23:17:54,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 3: [2022-11-26 23:17:54,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:17:54,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 23:17:54,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 1: [2022-11-26 23:17:54,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:17:54,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:17:54,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 23:17:54,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 3: [2022-11-26 23:17:54,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-26 23:17:54,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 23:17:54,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 2: [2022-11-26 23:17:54,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:17:54,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 23:17:54,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 6: [2022-11-26 23:17:54,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:17:54,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 23:17:54,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 6: [2022-11-26 23:17:54,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:17:54,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 23:17:54,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
1: [2022-11-26 23:17:54,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 23:17:54,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 1: [2022-11-26 23:17:54,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:17:54,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:17:54,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 23:17:54,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 23:17:54,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 1: [2022-11-26 23:17:54,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 10: [2022-11-26 23:17:54,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:17:54,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:17:54,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 23:17:54,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
12: [2022-11-26 23:17:54,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 23:17:54,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: [2022-11-26 23:17:54,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:17:54,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:17:54,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 12: [2022-11-26 23:17:54,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:17:54,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:17:54,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 10: [2022-11-26 23:17:54,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: [2022-11-26 23:17:54,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 12: [2022-11-26 23:17:54,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: [2022-11-26 23:17:54,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
11: [2022-11-26 23:17:54,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:17:54,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:17:54,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 23:17:54,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 3: [2022-11-26 23:17:54,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:17:54,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 23:17:54,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 7: [2022-11-26 23:17:54,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:17:54,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:17:54,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 2: [2022-11-26 23:17:54,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 7: [2022-11-26 23:17:54,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
2: [2022-11-26 23:17:54,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 7: [2022-11-26 23:17:54,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:17:54,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 23:17:54,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 12: [2022-11-26 23:17:54,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:17:54,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 0: [2022-11-26 23:17:54,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:17:54,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 12: [2022-11-26 23:17:54,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: [2022-11-26 23:17:54,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 12: [2022-11-26 23:17:54,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-26 23:17:54,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 23:17:54,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 1: [2022-11-26 23:17:54,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:17:54,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 23:17:54,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 13: [2022-11-26 23:17:54,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:17:54,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:17:54,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:17:54,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 23:17:54,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 23:17:54,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: [2022-11-26 23:17:54,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
1: [2022-11-26 23:17:54,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:17:54,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 23:17:54,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 10: [2022-11-26 23:17:54,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:17:54,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 23:17:54,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 2: [2022-11-26 23:17:54,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:17:54,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:17:54,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:17:54,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:17:54,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 23:17:54,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
14: [2022-11-26 23:17:54,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:17:54,307] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 23:17:54,307] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 3: [2022-11-26 23:17:54,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 14: [2022-11-26 23:17:54,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:17:54,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 14: [2022-11-26 23:17:54,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 23:17:54,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 14: [2022-11-26 23:17:54,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:17:54,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-26 23:17:54,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 23:17:54,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 23:17:54,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 14: [2022-11-26 23:17:54,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 14: [2022-11-26 23:17:54,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:17:54,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 23:17:54,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 2: [2022-11-26 23:17:54,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 23:17:54,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 2: [2022-11-26 23:17:54,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:17:54,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
2: [2022-11-26 23:17:54,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 23:17:54,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: [2022-11-26 23:17:54,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 23:17:54,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 9: [2022-11-26 23:17:54,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 23:17:54,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 9: [2022-11-26 23:17:54,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:17:54,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 23:17:54,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 9: [2022-11-26 23:17:54,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:17:54,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 23:17:54,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
9: [2022-11-26 23:17:54,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:17:54,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 23:17:54,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 9: [2022-11-26 23:17:54,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:17:54,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:17:54,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 23:17:54,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 23:17:54,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 9: [2022-11-26 23:17:54,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: [2022-11-26 23:17:54,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:17:54,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 23:17:54,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
7: [2022-11-26 23:17:54,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:17:54,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 23:17:54,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 2: [2022-11-26 23:17:54,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:17:54,325] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-26 23:17:54,325] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 7: [2022-11-26 23:17:54,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:17:54,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 23:17:54,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 2: [2022-11-26 23:17:54,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:17:54,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
2: [2022-11-26 23:17:54,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 12: [2022-11-26 23:17:54,326] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 2: [2022-11-26 23:17:54,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 12: [2022-11-26 23:17:54,326] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 7: [2022-11-26 23:17:54,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:17:54,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 23:17:54,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 14: [2022-11-26 23:17:54,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:17:54,328] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 23:17:54,328] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 12: [2022-11-26 23:17:54,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-26 23:17:54,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 23:17:54,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 9: [2022-11-26 23:17:54,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:17:54,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 23:17:54,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 15: [2022-11-26 23:17:54,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:17:54,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:17:54,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:17:54,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 23:17:54,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 23:17:54,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 23:17:54,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
15: [2022-11-26 23:17:54,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 15: [2022-11-26 23:17:54,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 15: [2022-11-26 23:17:54,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:17:54,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:17:54,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 23:17:54,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 6: [2022-11-26 23:17:54,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:17:54,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 23:17:54,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 6: [2022-11-26 23:17:54,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:17:54,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 23:17:54,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
1: [2022-11-26 23:17:54,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:17:54,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 23:17:54,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 7: [2022-11-26 23:17:54,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:17:54,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 23:17:54,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 1: [2022-11-26 23:17:54,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:17:54,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 23:17:54,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 15: [2022-11-26 23:17:54,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:17:54,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 23:17:54,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
15: [2022-11-26 23:17:54,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:17:54,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 23:17:54,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 2: [2022-11-26 23:17:54,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:17:54,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 23:17:54,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 12: [2022-11-26 23:17:54,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:17:54,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 23:17:54,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 12: [2022-11-26 23:17:54,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:17:54,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
12: [2022-11-26 23:17:54,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 7: [2022-11-26 23:17:54,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 12: [2022-11-26 23:17:54,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 7: [2022-11-26 23:17:54,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 11: [2022-11-26 23:17:54,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 23:17:54,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 11: [2022-11-26 23:17:54,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:17:54,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 23:17:54,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 11: [2022-11-26 23:17:54,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:17:54,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 23:17:54,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
11: [2022-11-26 23:17:54,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:17:54,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 23:17:54,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 11: [2022-11-26 23:17:54,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:17:54,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 23:17:54,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 11: [2022-11-26 23:17:54,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:17:54,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 23:17:54,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:17:54,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 11: [2022-11-26 23:17:54,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 23:17:54,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
10: [2022-11-26 23:17:54,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:17:54,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:17:54,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 23:17:54,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 23:17:54,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 10: [2022-11-26 23:17:54,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 7: [2022-11-26 23:17:54,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:17:54,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 23:17:54,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 2: [2022-11-26 23:17:54,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:17:54,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 23:17:54,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
3: [2022-11-26 23:17:54,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:17:54,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 23:17:54,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 1: [2022-11-26 23:17:54,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 23:17:54,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: [2022-11-26 23:17:54,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:17:54,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 23:17:54,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 1: [2022-11-26 23:17:54,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:17:54,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 23:17:54,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 15: [2022-11-26 23:17:54,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 23:17:54,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:17:54,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 23:17:54,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 15: [2022-11-26 23:17:54,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 23:17:54,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 14: [2022-11-26 23:17:54,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:17:54,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 23:17:54,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 10: [2022-11-26 23:17:54,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:17:54,358] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 23:17:54,358] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
13: [2022-11-26 23:17:54,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 23:17:54,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 13: [2022-11-26 23:17:54,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:17:54,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 23:17:54,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 13: [2022-11-26 23:17:54,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:17:54,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 23:17:54,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 13: [2022-11-26 23:17:54,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:17:54,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 23:17:54,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 13: [2022-11-26 23:17:54,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-26 23:17:54,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 23:17:54,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 13: [2022-11-26 23:17:54,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:17:54,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 23:17:54,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 13: [2022-11-26 23:17:54,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:17:54,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 23:17:54,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 13: [2022-11-26 23:17:54,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:17:54,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 23:17:54,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 9: [2022-11-26 23:17:54,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 23:17:54,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 23:17:54,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: [2022-11-26 23:17:54,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 23:17:54,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 11: [2022-11-26 23:17:54,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:17:54,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 23:17:54,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 6: [2022-11-26 23:17:54,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:17:54,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 23:17:54,387] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:17:54,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
6: [2022-11-26 23:17:54,387] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 23:17:54,387] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 10: [2022-11-26 23:17:54,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:17:54,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:17:54,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 23:17:54,389] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 23:17:54,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 10: [2022-11-26 23:17:54,389] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 4: [2022-11-26 23:17:54,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:17:54,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:17:54,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:17:54,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 23:17:54,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:17:54,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:17:54,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:17:54,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 23:17:54,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 23:17:54,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 23:17:54,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 23:17:54,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 23:17:54,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 23:17:54,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 4: [2022-11-26 23:17:54,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 
4: [2022-11-26 23:17:54,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:17:54,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 23:17:54,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 4: [2022-11-26 23:17:54,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 4: [2022-11-26 23:17:54,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 4: [2022-11-26 23:17:54,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 4: [2022-11-26 23:17:54,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 23:17:54,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 4: [2022-11-26 23:17:54,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 8: [2022-11-26 23:17:54,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:17:54,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:17:54,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 23:17:54,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 23:17:54,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:17:54,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 23:17:54,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 23:17:54,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 8: [2022-11-26 23:17:54,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 8: [2022-11-26 23:17:54,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 8: [2022-11-26 23:17:54,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 23:17:54,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:17:54,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:17:54,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-26 23:17:54,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 23:17:54,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-26 23:17:54,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 8: [2022-11-26 23:17:54,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 8: [2022-11-26 23:17:54,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 8: [2022-11-26 23:17:54,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 23:17:54,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:17:54,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 8: [2022-11-26 23:17:54,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 23:17:54,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 6: [2022-11-26 23:17:54,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-26 23:17:54,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 23:17:54,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 5: [2022-11-26 23:17:54,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:17:54,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 23:17:54,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:17:54,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 5: [2022-11-26 23:17:54,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 23:17:54,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:17:54,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 5: [2022-11-26 23:17:54,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:17:54,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-26 23:17:54,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 23:17:54,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:17:54,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 23:17:54,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 5: [2022-11-26 23:17:54,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 23:17:54,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 23:17:54,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 5: [2022-11-26 23:17:54,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 5: [2022-11-26 23:17:54,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:17:54,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 5: [2022-11-26 23:17:54,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-26 23:17:54,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 23:17:54,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step92000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 23:17:54,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 5: [2022-11-26 23:17:54,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step92000 is ready now! 0: successfully saved checkpoint at iteration 92000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3788.09 15: iteration 92010/ 125429 | consumed samples: 23554560 | consumed tokens: 48239738880 | elapsed time per iteration (s): 1.47 | learning rate: 5.030E-05 | global batch size: 256 | lm loss: 1.910277E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 173.887 | TFLOPs: 28.74 | 15: iteration 92020/ 125429 | consumed samples: 23557120 | consumed tokens: 48244981760 | elapsed time per iteration (s): 1.04 | learning rate: 5.028E-05 | global batch size: 256 | lm loss: 1.902660E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.624 | TFLOPs: 40.59 | 15: iteration 92030/ 125429 | consumed samples: 23559680 | consumed tokens: 48250224640 | elapsed time per iteration (s): 1.06 | learning rate: 5.026E-05 | global batch size: 256 | lm loss: 1.942866E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.965 | TFLOPs: 39.99 | 15: iteration 92040/ 125429 | consumed samples: 23562240 | consumed tokens: 48255467520 | elapsed time per iteration (s): 1.04 | learning rate: 5.025E-05 | global batch size: 256 | lm loss: 1.891452E+00 | grad norm: 
0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.029 | TFLOPs: 40.82 | 15: iteration 92050/ 125429 | consumed samples: 23564800 | consumed tokens: 48260710400 | elapsed time per iteration (s): 1.48 | learning rate: 5.023E-05 | global batch size: 256 | lm loss: 1.910069E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 172.416 | TFLOPs: 28.49 | 15: iteration 92060/ 125429 | consumed samples: 23567360 | consumed tokens: 48265953280 | elapsed time per iteration (s): 1.03 | learning rate: 5.021E-05 | global batch size: 256 | lm loss: 1.915605E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.369 | TFLOPs: 41.04 | 15: iteration 92070/ 125429 | consumed samples: 23569920 | consumed tokens: 48271196160 | elapsed time per iteration (s): 1.20 | learning rate: 5.020E-05 | global batch size: 256 | lm loss: 1.932281E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 212.862 | TFLOPs: 35.18 | 15: iteration 92080/ 125429 | consumed samples: 23572480 | consumed tokens: 48276439040 | elapsed time per iteration (s): 1.04 | learning rate: 5.018E-05 | global batch size: 256 | lm loss: 1.898402E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.571 | TFLOPs: 40.58 | 15: iteration 92090/ 125429 | consumed samples: 23575040 | consumed tokens: 48281681920 | elapsed time per iteration (s): 1.08 | learning rate: 5.016E-05 | global batch size: 256 | lm loss: 1.921160E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.369 | TFLOPs: 39.23 | 15: iteration 92100/ 125429 | consumed samples: 23577600 | consumed tokens: 48286924800 | elapsed time per 
iteration (s): 1.05 | learning rate: 5.014E-05 | global batch size: 256 | lm loss: 1.924603E+00 | grad norm: 0.138 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.927 | TFLOPs: 40.48 | 15: iteration 92110/ 125429 | consumed samples: 23580160 | consumed tokens: 48292167680 | elapsed time per iteration (s): 1.05 | learning rate: 5.013E-05 | global batch size: 256 | lm loss: 1.948580E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.741 | TFLOPs: 40.45 | 15: iteration 92120/ 125429 | consumed samples: 23582720 | consumed tokens: 48297410560 | elapsed time per iteration (s): 1.04 | learning rate: 5.011E-05 | global batch size: 256 | lm loss: 1.951194E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.848 | TFLOPs: 40.63 | 15: iteration 92130/ 125429 | consumed samples: 23585280 | consumed tokens: 48302653440 | elapsed time per iteration (s): 1.04 | learning rate: 5.009E-05 | global batch size: 256 | lm loss: 1.887156E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.333 | TFLOPs: 40.87 | 15: iteration 92140/ 125429 | consumed samples: 23587840 | consumed tokens: 48307896320 | elapsed time per iteration (s): 1.03 | learning rate: 5.008E-05 | global batch size: 256 | lm loss: 1.927952E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.749 | TFLOPs: 41.27 | 15: iteration 92150/ 125429 | consumed samples: 23590400 | consumed tokens: 48313139200 | elapsed time per iteration (s): 1.02 | learning rate: 5.006E-05 | global batch size: 256 | lm loss: 1.912613E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.590 | TFLOPs: 41.58 | 
15: iteration 92160/ 125429 | consumed samples: 23592960 | consumed tokens: 48318382080 | elapsed time per iteration (s): 1.04 | learning rate: 5.004E-05 | global batch size: 256 | lm loss: 1.940796E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.886 | TFLOPs: 40.80 | 15: iteration 92170/ 125429 | consumed samples: 23595520 | consumed tokens: 48323624960 | elapsed time per iteration (s): 1.10 | learning rate: 5.003E-05 | global batch size: 256 | lm loss: 1.936267E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.471 | TFLOPs: 38.42 | 15: iteration 92180/ 125429 | consumed samples: 23598080 | consumed tokens: 48328867840 | elapsed time per iteration (s): 1.05 | learning rate: 5.001E-05 | global batch size: 256 | lm loss: 1.902147E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.390 | TFLOPs: 40.39 | 15: iteration 92190/ 125429 | consumed samples: 23600640 | consumed tokens: 48334110720 | elapsed time per iteration (s): 1.06 | learning rate: 4.999E-05 | global batch size: 256 | lm loss: 1.899343E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.242 | TFLOPs: 39.87 | 15: iteration 92200/ 125429 | consumed samples: 23603200 | consumed tokens: 48339353600 | elapsed time per iteration (s): 1.06 | learning rate: 4.997E-05 | global batch size: 256 | lm loss: 1.937118E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.485 | TFLOPs: 39.74 | 15: iteration 92210/ 125429 | consumed samples: 23605760 | consumed tokens: 48344596480 | elapsed time per iteration (s): 1.04 | learning rate: 4.996E-05 | global batch size: 256 | lm loss: 1.923124E+00 | grad norm: 0.144 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.818 | TFLOPs: 40.79 | 15: iteration 92220/ 125429 | consumed samples: 23608320 | consumed tokens: 48349839360 | elapsed time per iteration (s): 1.06 | learning rate: 4.994E-05 | global batch size: 256 | lm loss: 1.901548E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.409 | TFLOPs: 39.73 | 15: iteration 92230/ 125429 | consumed samples: 23610880 | consumed tokens: 48355082240 | elapsed time per iteration (s): 1.04 | learning rate: 4.992E-05 | global batch size: 256 | lm loss: 1.941972E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.633 | TFLOPs: 40.76 | 15: iteration 92240/ 125429 | consumed samples: 23613440 | consumed tokens: 48360325120 | elapsed time per iteration (s): 1.06 | learning rate: 4.991E-05 | global batch size: 256 | lm loss: 1.921807E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.247 | TFLOPs: 40.03 | 15: iteration 92250/ 125429 | consumed samples: 23616000 | consumed tokens: 48365568000 | elapsed time per iteration (s): 1.07 | learning rate: 4.989E-05 | global batch size: 256 | lm loss: 1.912432E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.030 | TFLOPs: 39.67 | 15: iteration 92260/ 125429 | consumed samples: 23618560 | consumed tokens: 48370810880 | elapsed time per iteration (s): 1.06 | learning rate: 4.987E-05 | global batch size: 256 | lm loss: 1.971620E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.617 | TFLOPs: 40.09 | 15: iteration 92270/ 125429 | consumed samples: 23621120 | consumed tokens: 48376053760 | elapsed time per iteration (s): 1.02 | 
learning rate: 4.986E-05 | global batch size: 256 | lm loss: 1.945157E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.751 | TFLOPs: 41.44 | 15: iteration 92280/ 125429 | consumed samples: 23623680 | consumed tokens: 48381296640 | elapsed time per iteration (s): 1.03 | learning rate: 4.984E-05 | global batch size: 256 | lm loss: 1.933038E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.986 | TFLOPs: 40.98 | 15: iteration 92290/ 125429 | consumed samples: 23626240 | consumed tokens: 48386539520 | elapsed time per iteration (s): 1.05 | learning rate: 4.982E-05 | global batch size: 256 | lm loss: 1.933413E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.696 | TFLOPs: 40.27 | 15: iteration 92300/ 125429 | consumed samples: 23628800 | consumed tokens: 48391782400 | elapsed time per iteration (s): 1.05 | learning rate: 4.980E-05 | global batch size: 256 | lm loss: 1.926903E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.976 | TFLOPs: 40.48 | 15: iteration 92310/ 125429 | consumed samples: 23631360 | consumed tokens: 48397025280 | elapsed time per iteration (s): 1.07 | learning rate: 4.979E-05 | global batch size: 256 | lm loss: 1.942065E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.715 | TFLOPs: 39.61 | 15: iteration 92320/ 125429 | consumed samples: 23633920 | consumed tokens: 48402268160 | elapsed time per iteration (s): 1.05 | learning rate: 4.977E-05 | global batch size: 256 | lm loss: 1.901358E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.113 | TFLOPs: 40.18 | 15: iteration 92330/ 
125429 | consumed samples: 23636480 | consumed tokens: 48407511040 | elapsed time per iteration (s): 1.12 | learning rate: 4.975E-05 | global batch size: 256 | lm loss: 1.897092E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.866 | TFLOPs: 37.82 | 15: iteration 92340/ 125429 | consumed samples: 23639040 | consumed tokens: 48412753920 | elapsed time per iteration (s): 1.05 | learning rate: 4.974E-05 | global batch size: 256 | lm loss: 1.918805E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.675 | TFLOPs: 40.10 | 15: iteration 92350/ 125429 | consumed samples: 23641600 | consumed tokens: 48417996800 | elapsed time per iteration (s): 1.05 | learning rate: 4.972E-05 | global batch size: 256 | lm loss: 1.917367E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.228 | TFLOPs: 40.20 | 15: iteration 92360/ 125429 | consumed samples: 23644160 | consumed tokens: 48423239680 | elapsed time per iteration (s): 1.10 | learning rate: 4.970E-05 | global batch size: 256 | lm loss: 1.903519E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.971 | TFLOPs: 38.50 | 15: iteration 92370/ 125429 | consumed samples: 23646720 | consumed tokens: 48428482560 | elapsed time per iteration (s): 1.03 | learning rate: 4.969E-05 | global batch size: 256 | lm loss: 1.924052E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.259 | TFLOPs: 41.19 | 15: iteration 92380/ 125429 | consumed samples: 23649280 | consumed tokens: 48433725440 | elapsed time per iteration (s): 1.07 | learning rate: 4.967E-05 | global batch size: 256 | lm loss: 1.918993E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 238.512 | TFLOPs: 39.42 | 15: iteration 92390/ 125429 | consumed samples: 23651840 | consumed tokens: 48438968320 | elapsed time per iteration (s): 1.03 | learning rate: 4.965E-05 | global batch size: 256 | lm loss: 1.916740E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.876 | TFLOPs: 40.96 | 15: iteration 92400/ 125429 | consumed samples: 23654400 | consumed tokens: 48444211200 | elapsed time per iteration (s): 1.05 | learning rate: 4.964E-05 | global batch size: 256 | lm loss: 1.907866E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.335 | TFLOPs: 40.21 | 15: iteration 92410/ 125429 | consumed samples: 23656960 | consumed tokens: 48449454080 | elapsed time per iteration (s): 1.05 | learning rate: 4.962E-05 | global batch size: 256 | lm loss: 1.912798E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.654 | TFLOPs: 40.10 | 15: iteration 92420/ 125429 | consumed samples: 23659520 | consumed tokens: 48454696960 | elapsed time per iteration (s): 1.05 | learning rate: 4.960E-05 | global batch size: 256 | lm loss: 1.900906E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.753 | TFLOPs: 40.28 | 15: iteration 92430/ 125429 | consumed samples: 23662080 | consumed tokens: 48459939840 | elapsed time per iteration (s): 1.04 | learning rate: 4.959E-05 | global batch size: 256 | lm loss: 1.950810E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.011 | TFLOPs: 40.66 | 15: iteration 92440/ 125429 | consumed samples: 23664640 | consumed tokens: 48465182720 | elapsed time per iteration (s): 1.06 | learning rate: 
4.957E-05 | global batch size: 256 | lm loss: 1.925041E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.878 | TFLOPs: 39.81 | 15: iteration 92450/ 125429 | consumed samples: 23667200 | consumed tokens: 48470425600 | elapsed time per iteration (s): 1.07 | learning rate: 4.955E-05 | global batch size: 256 | lm loss: 1.922044E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.233 | TFLOPs: 39.70 | 15: iteration 92460/ 125429 | consumed samples: 23669760 | consumed tokens: 48475668480 | elapsed time per iteration (s): 1.08 | learning rate: 4.953E-05 | global batch size: 256 | lm loss: 1.924275E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.975 | TFLOPs: 39.33 | 15: iteration 92470/ 125429 | consumed samples: 23672320 | consumed tokens: 48480911360 | elapsed time per iteration (s): 1.03 | learning rate: 4.952E-05 | global batch size: 256 | lm loss: 1.947142E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.277 | TFLOPs: 41.19 | 15: iteration 92480/ 125429 | consumed samples: 23674880 | consumed tokens: 48486154240 | elapsed time per iteration (s): 1.03 | learning rate: 4.950E-05 | global batch size: 256 | lm loss: 1.936874E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.776 | TFLOPs: 40.95 | 15: iteration 92490/ 125429 | consumed samples: 23677440 | consumed tokens: 48491397120 | elapsed time per iteration (s): 1.04 | learning rate: 4.948E-05 | global batch size: 256 | lm loss: 1.906237E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.386 | TFLOPs: 40.72 | 15: iteration 92500/ 125429 | 
consumed samples: 23680000 | consumed tokens: 48496640000 | elapsed time per iteration (s): 1.08 | learning rate: 4.947E-05 | global batch size: 256 | lm loss: 1.934296E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.358 | TFLOPs: 39.06 | 15: iteration 92510/ 125429 | consumed samples: 23682560 | consumed tokens: 48501882880 | elapsed time per iteration (s): 1.03 | learning rate: 4.945E-05 | global batch size: 256 | lm loss: 1.931987E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.191 | TFLOPs: 41.02 | 15: iteration 92520/ 125429 | consumed samples: 23685120 | consumed tokens: 48507125760 | elapsed time per iteration (s): 1.06 | learning rate: 4.943E-05 | global batch size: 256 | lm loss: 1.912478E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.749 | TFLOPs: 39.79 | 15: iteration 92530/ 125429 | consumed samples: 23687680 | consumed tokens: 48512368640 | elapsed time per iteration (s): 1.05 | learning rate: 4.942E-05 | global batch size: 256 | lm loss: 1.917033E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.220 | TFLOPs: 40.19 | 15: iteration 92540/ 125429 | consumed samples: 23690240 | consumed tokens: 48517611520 | elapsed time per iteration (s): 1.05 | learning rate: 4.940E-05 | global batch size: 256 | lm loss: 1.923582E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.245 | TFLOPs: 40.20 | 15: iteration 92550/ 125429 | consumed samples: 23692800 | consumed tokens: 48522854400 | elapsed time per iteration (s): 1.06 | learning rate: 4.938E-05 | global batch size: 256 | lm loss: 1.921272E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 241.489 | TFLOPs: 39.91 | 15: iteration 92560/ 125429 | consumed samples: 23695360 | consumed tokens: 48528097280 | elapsed time per iteration (s): 1.03 | learning rate: 4.937E-05 | global batch size: 256 | lm loss: 1.944852E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.345 | TFLOPs: 40.88 | 15: iteration 92570/ 125429 | consumed samples: 23697920 | consumed tokens: 48533340160 | elapsed time per iteration (s): 1.03 | learning rate: 4.935E-05 | global batch size: 256 | lm loss: 1.906120E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.105 | TFLOPs: 41.17 | 15: iteration 92580/ 125429 | consumed samples: 23700480 | consumed tokens: 48538583040 | elapsed time per iteration (s): 1.05 | learning rate: 4.933E-05 | global batch size: 256 | lm loss: 1.882537E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.872 | TFLOPs: 40.30 | 15: iteration 92590/ 125429 | consumed samples: 23703040 | consumed tokens: 48543825920 | elapsed time per iteration (s): 1.07 | learning rate: 4.932E-05 | global batch size: 256 | lm loss: 1.924617E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.574 | TFLOPs: 39.59 | 15: iteration 92600/ 125429 | consumed samples: 23705600 | consumed tokens: 48549068800 | elapsed time per iteration (s): 1.08 | learning rate: 4.930E-05 | global batch size: 256 | lm loss: 1.933499E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.169 | TFLOPs: 39.19 | 15: iteration 92610/ 125429 | consumed samples: 23708160 | consumed tokens: 48554311680 | elapsed time per iteration (s): 1.04 | learning rate: 4.928E-05 | global batch 
size: 256 | lm loss: 1.932888E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.194 | TFLOPs: 40.85 | 15: iteration 92620/ 125429 | consumed samples: 23710720 | consumed tokens: 48559554560 | elapsed time per iteration (s): 1.06 | learning rate: 4.927E-05 | global batch size: 256 | lm loss: 1.923327E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.521 | TFLOPs: 39.91 | 15: iteration 92630/ 125429 | consumed samples: 23713280 | consumed tokens: 48564797440 | elapsed time per iteration (s): 1.04 | learning rate: 4.925E-05 | global batch size: 256 | lm loss: 1.927560E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.000 | TFLOPs: 40.49 | 15: iteration 92640/ 125429 | consumed samples: 23715840 | consumed tokens: 48570040320 | elapsed time per iteration (s): 1.05 | learning rate: 4.923E-05 | global batch size: 256 | lm loss: 1.875277E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.479 | TFLOPs: 40.40 | 15: iteration 92650/ 125429 | consumed samples: 23718400 | consumed tokens: 48575283200 | elapsed time per iteration (s): 1.08 | learning rate: 4.921E-05 | global batch size: 256 | lm loss: 1.910874E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.264 | TFLOPs: 39.04 | 15: iteration 92660/ 125429 | consumed samples: 23720960 | consumed tokens: 48580526080 | elapsed time per iteration (s): 1.06 | learning rate: 4.920E-05 | global batch size: 256 | lm loss: 1.939312E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.431 | TFLOPs: 39.90 | 15: iteration 92670/ 125429 | consumed samples: 23723520 | 
consumed tokens: 48585768960 | elapsed time per iteration (s): 1.45 | learning rate: 4.918E-05 | global batch size: 256 | lm loss: 1.936772E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 176.190 | TFLOPs: 29.12 | 15: iteration 92680/ 125429 | consumed samples: 23726080 | consumed tokens: 48591011840 | elapsed time per iteration (s): 1.02 | learning rate: 4.916E-05 | global batch size: 256 | lm loss: 1.906736E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.929 | TFLOPs: 41.47 | 15: iteration 92690/ 125429 | consumed samples: 23728640 | consumed tokens: 48596254720 | elapsed time per iteration (s): 1.04 | learning rate: 4.915E-05 | global batch size: 256 | lm loss: 1.912540E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.779 | TFLOPs: 40.78 | 15: iteration 92700/ 125429 | consumed samples: 23731200 | consumed tokens: 48601497600 | elapsed time per iteration (s): 1.03 | learning rate: 4.913E-05 | global batch size: 256 | lm loss: 1.931526E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.368 | TFLOPs: 41.21 | 15: iteration 92710/ 125429 | consumed samples: 23733760 | consumed tokens: 48606740480 | elapsed time per iteration (s): 1.05 | learning rate: 4.911E-05 | global batch size: 256 | lm loss: 1.934809E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.891 | TFLOPs: 40.30 | 15: iteration 92720/ 125429 | consumed samples: 23736320 | consumed tokens: 48611983360 | elapsed time per iteration (s): 1.04 | learning rate: 4.910E-05 | global batch size: 256 | lm loss: 1.893303E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 246.572 | TFLOPs: 40.75 | 15: iteration 92730/ 125429 | consumed samples: 23738880 | consumed tokens: 48617226240 | elapsed time per iteration (s): 1.03 | learning rate: 4.908E-05 | global batch size: 256 | lm loss: 1.888097E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.411 | TFLOPs: 41.22 | 15: iteration 92740/ 125429 | consumed samples: 23741440 | consumed tokens: 48622469120 | elapsed time per iteration (s): 1.04 | learning rate: 4.906E-05 | global batch size: 256 | lm loss: 1.908432E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.810 | TFLOPs: 40.79 | 15: iteration 92750/ 125429 | consumed samples: 23744000 | consumed tokens: 48627712000 | elapsed time per iteration (s): 1.04 | learning rate: 4.905E-05 | global batch size: 256 | lm loss: 1.938604E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.980 | TFLOPs: 40.48 | 15: iteration 92760/ 125429 | consumed samples: 23746560 | consumed tokens: 48632954880 | elapsed time per iteration (s): 1.06 | learning rate: 4.903E-05 | global batch size: 256 | lm loss: 1.922383E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.158 | TFLOPs: 40.02 | 15: iteration 92770/ 125429 | consumed samples: 23749120 | consumed tokens: 48638197760 | elapsed time per iteration (s): 1.04 | learning rate: 4.901E-05 | global batch size: 256 | lm loss: 1.918089E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.473 | TFLOPs: 40.73 | 15: iteration 92780/ 125429 | consumed samples: 23751680 | consumed tokens: 48643440640 | elapsed time per iteration (s): 1.04 | learning rate: 4.900E-05 | global batch size: 256 | lm loss: 
1.924822E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.460 | TFLOPs: 40.73 | 15: iteration 92790/ 125429 | consumed samples: 23754240 | consumed tokens: 48648683520 | elapsed time per iteration (s): 1.03 | learning rate: 4.898E-05 | global batch size: 256 | lm loss: 1.920565E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.422 | TFLOPs: 41.22 | 15: iteration 92800/ 125429 | consumed samples: 23756800 | consumed tokens: 48653926400 | elapsed time per iteration (s): 1.05 | learning rate: 4.896E-05 | global batch size: 256 | lm loss: 1.920699E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.380 | TFLOPs: 40.22 | 15: iteration 92810/ 125429 | consumed samples: 23759360 | consumed tokens: 48659169280 | elapsed time per iteration (s): 1.09 | learning rate: 4.895E-05 | global batch size: 256 | lm loss: 1.929779E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.119 | TFLOPs: 38.86 | 15: iteration 92820/ 125429 | consumed samples: 23761920 | consumed tokens: 48664412160 | elapsed time per iteration (s): 1.03 | learning rate: 4.893E-05 | global batch size: 256 | lm loss: 1.925578E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.628 | TFLOPs: 41.09 | 15: iteration 92830/ 125429 | consumed samples: 23764480 | consumed tokens: 48669655040 | elapsed time per iteration (s): 1.12 | learning rate: 4.891E-05 | global batch size: 256 | lm loss: 1.908685E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.255 | TFLOPs: 37.89 | 15: iteration 92840/ 125429 | consumed samples: 23767040 | consumed tokens: 
48674897920 | elapsed time per iteration (s): 1.06 | learning rate: 4.890E-05 | global batch size: 256 | lm loss: 1.878927E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.058 | TFLOPs: 40.00 | 15: iteration 92850/ 125429 | consumed samples: 23769600 | consumed tokens: 48680140800 | elapsed time per iteration (s): 1.04 | learning rate: 4.888E-05 | global batch size: 256 | lm loss: 1.943378E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.147 | TFLOPs: 40.84 | 15: iteration 92860/ 125429 | consumed samples: 23772160 | consumed tokens: 48685383680 | elapsed time per iteration (s): 1.07 | learning rate: 4.886E-05 | global batch size: 256 | lm loss: 1.937081E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.883 | TFLOPs: 39.64 | 15: iteration 92870/ 125429 | consumed samples: 23774720 | consumed tokens: 48690626560 | elapsed time per iteration (s): 1.08 | learning rate: 4.885E-05 | global batch size: 256 | lm loss: 1.906493E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.488 | TFLOPs: 39.08 | 15: iteration 92880/ 125429 | consumed samples: 23777280 | consumed tokens: 48695869440 | elapsed time per iteration (s): 1.05 | learning rate: 4.883E-05 | global batch size: 256 | lm loss: 1.924147E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.239 | TFLOPs: 40.36 | 15: iteration 92890/ 125429 | consumed samples: 23779840 | consumed tokens: 48701112320 | elapsed time per iteration (s): 1.04 | learning rate: 4.881E-05 | global batch size: 256 | lm loss: 1.896449E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 246.299 | TFLOPs: 40.70 | 15: iteration 92900/ 125429 | consumed samples: 23782400 | consumed tokens: 48706355200 | elapsed time per iteration (s): 1.04 | learning rate: 4.880E-05 | global batch size: 256 | lm loss: 1.915665E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.698 | TFLOPs: 40.77 | 15: iteration 92910/ 125429 | consumed samples: 23784960 | consumed tokens: 48711598080 | elapsed time per iteration (s): 1.05 | learning rate: 4.878E-05 | global batch size: 256 | lm loss: 1.919582E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.522 | TFLOPs: 40.41 | 15: iteration 92920/ 125429 | consumed samples: 23787520 | consumed tokens: 48716840960 | elapsed time per iteration (s): 1.06 | learning rate: 4.876E-05 | global batch size: 256 | lm loss: 1.919463E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.163 | TFLOPs: 40.02 | 15: iteration 92930/ 125429 | consumed samples: 23790080 | consumed tokens: 48722083840 | elapsed time per iteration (s): 1.07 | learning rate: 4.875E-05 | global batch size: 256 | lm loss: 1.937415E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.162 | TFLOPs: 39.52 | 15: iteration 92940/ 125429 | consumed samples: 23792640 | consumed tokens: 48727326720 | elapsed time per iteration (s): 1.04 | learning rate: 4.873E-05 | global batch size: 256 | lm loss: 1.938231E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.995 | TFLOPs: 40.49 | 15: iteration 92950/ 125429 | consumed samples: 23795200 | consumed tokens: 48732569600 | elapsed time per iteration (s): 1.08 | learning rate: 4.871E-05 | global batch size: 256 | lm loss: 1.927610E+00 | grad 
norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.130 | TFLOPs: 39.19 | 15: iteration 92960/ 125429 | consumed samples: 23797760 | consumed tokens: 48737812480 | elapsed time per iteration (s): 1.03 | learning rate: 4.870E-05 | global batch size: 256 | lm loss: 1.919609E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.765 | TFLOPs: 40.95 | 15: iteration 92970/ 125429 | consumed samples: 23800320 | consumed tokens: 48743055360 | elapsed time per iteration (s): 1.05 | learning rate: 4.868E-05 | global batch size: 256 | lm loss: 1.904495E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.557 | TFLOPs: 40.41 | 15: iteration 92980/ 125429 | consumed samples: 23802880 | consumed tokens: 48748298240 | elapsed time per iteration (s): 1.05 | learning rate: 4.866E-05 | global batch size: 256 | lm loss: 1.946529E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.325 | TFLOPs: 40.38 | 15: iteration 92990/ 125429 | consumed samples: 23805440 | consumed tokens: 48753541120 | elapsed time per iteration (s): 1.03 | learning rate: 4.865E-05 | global batch size: 256 | lm loss: 1.940183E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.411 | TFLOPs: 41.22 | 15: iteration 93000/ 125429 | consumed samples: 23808000 | consumed tokens: 48758784000 | elapsed time per iteration (s): 1.08 | learning rate: 4.863E-05 | global batch size: 256 | lm loss: 1.916785E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.547 | TFLOPs: 39.09 | 15: ------------------------------------------------------------------------------------------- 15: 
valid loss at iteration 93000 | lm loss value: 1.904331E+00 | lm loss PPL: 6.714912E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 93000 to checkpoints_1b5 0: [2022-11-26 23:35:36,201] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step93000 is begin to save! 0: [2022-11-26 23:35:36,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_01-model_00-model_states.pt... 0: [2022-11-26 23:35:36,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_01-model_00-model_states.pt. 0: [2022-11-26 23:35:36,487] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_03-model_00-model_states.pt... 0: [2022-11-26 23:35:36,595] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_03-model_00-model_states.pt. 0: [2022-11-26 23:35:36,595] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_04-model_00-model_states.pt... 0: [2022-11-26 23:35:36,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_04-model_00-model_states.pt. 0: [2022-11-26 23:35:36,704] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_05-model_00-model_states.pt... 0: [2022-11-26 23:35:36,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_05-model_00-model_states.pt. 0: [2022-11-26 23:35:36,814] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_06-model_00-model_states.pt... 0: [2022-11-26 23:35:36,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_06-model_00-model_states.pt. 
0: [2022-11-26 23:35:36,926] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_07-model_00-model_states.pt... 0: [2022-11-26 23:35:37,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_07-model_00-model_states.pt. 0: [2022-11-26 23:35:37,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_08-model_00-model_states.pt... 0: [2022-11-26 23:35:37,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_08-model_00-model_states.pt. 0: [2022-11-26 23:35:37,142] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_09-model_00-model_states.pt... 0: [2022-11-26 23:35:37,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_09-model_00-model_states.pt. 0: [2022-11-26 23:35:37,262] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_10-model_00-model_states.pt... 0: [2022-11-26 23:35:37,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_10-model_00-model_states.pt. 0: [2022-11-26 23:35:37,369] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_11-model_00-model_states.pt... 0: [2022-11-26 23:35:37,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_11-model_00-model_states.pt. 0: [2022-11-26 23:35:37,477] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_12-model_00-model_states.pt... 0: [2022-11-26 23:35:37,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_12-model_00-model_states.pt. 
0: [2022-11-26 23:35:37,594] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_13-model_00-model_states.pt... 0: [2022-11-26 23:35:37,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_13-model_00-model_states.pt. 0: [2022-11-26 23:35:37,704] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_14-model_00-model_states.pt... 0: [2022-11-26 23:35:37,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_14-model_00-model_states.pt. 0: [2022-11-26 23:35:37,812] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_15-model_00-model_states.pt... 0: [2022-11-26 23:35:37,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_15-model_00-model_states.pt. 0: [2022-11-26 23:35:37,921] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_16-model_00-model_states.pt... 0: [2022-11-26 23:35:38,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_16-model_00-model_states.pt. 0: [2022-11-26 23:35:38,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_17-model_00-model_states.pt... 0: [2022-11-26 23:35:38,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_17-model_00-model_states.pt. 0: [2022-11-26 23:35:38,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_18-model_00-model_states.pt... 0: [2022-11-26 23:35:38,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_18-model_00-model_states.pt. 
0: [2022-11-26 23:35:38,240] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_19-model_00-model_states.pt... 0: [2022-11-26 23:35:38,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_19-model_00-model_states.pt. 0: [2022-11-26 23:35:38,349] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_20-model_00-model_states.pt... 0: [2022-11-26 23:35:38,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_20-model_00-model_states.pt. 0: [2022-11-26 23:35:38,454] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_21-model_00-model_states.pt... 0: [2022-11-26 23:35:38,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_21-model_00-model_states.pt. 0: [2022-11-26 23:35:38,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_22-model_00-model_states.pt... 0: [2022-11-26 23:35:38,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_22-model_00-model_states.pt. 0: [2022-11-26 23:35:38,673] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_23-model_00-model_states.pt... 0: [2022-11-26 23:35:38,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_23-model_00-model_states.pt. 0: [2022-11-26 23:35:38,777] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_24-model_00-model_states.pt... 0: [2022-11-26 23:35:38,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_24-model_00-model_states.pt. 
0: [2022-11-26 23:35:38,883] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_25-model_00-model_states.pt... 0: [2022-11-26 23:35:38,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_25-model_00-model_states.pt. 0: [2022-11-26 23:35:38,990] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_26-model_00-model_states.pt... 0: [2022-11-26 23:35:39,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_26-model_00-model_states.pt. 0: [2022-11-26 23:35:39,097] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_27-model_00-model_states.pt... 0: [2022-11-26 23:35:39,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_27-model_00-model_states.pt. 0: [2022-11-26 23:35:39,203] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_28-model_00-model_states.pt... 0: [2022-11-26 23:35:39,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_28-model_00-model_states.pt. 0: [2022-11-26 23:35:39,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_29-model_00-model_states.pt... 0: [2022-11-26 23:35:39,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_29-model_00-model_states.pt. 0: [2022-11-26 23:35:39,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_30-model_00-model_states.pt... 0: [2022-11-26 23:35:39,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_30-model_00-model_states.pt. 
0: [2022-11-26 23:35:39,527] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/layer_32-model_00-model_states.pt... 0: [2022-11-26 23:35:39,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/layer_32-model_00-model_states.pt. 0: [2022-11-26 23:35:39,535] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step93000/mp_rank_00_model_states.pt 0: [2022-11-26 23:35:39,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/mp_rank_00_model_states.pt... 0: [2022-11-26 23:35:39,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/mp_rank_00_model_states.pt. 0: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
1: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
14: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
7: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
12: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
8: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 
15: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
6: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 
3: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
5: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:35:39,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:35:39,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step93000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
0: [2022-11-26 23:35:39,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:35:39,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:35:39,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 23:35:39,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 0: [2022-11-26 23:35:39,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:35:39,746] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-26 23:35:39,746] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 12: [2022-11-26 23:35:39,747] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:35:39,747] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 23:35:39,747] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 8: [2022-11-26 23:35:39,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:35:39,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-26 23:35:39,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 23:35:39,751] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 23:35:39,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 8: [2022-11-26 23:35:39,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 0: [2022-11-26 23:35:39,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:35:39,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-26 23:35:39,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 2: [2022-11-26 23:35:39,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:35:39,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 23:35:39,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 14: [2022-11-26 23:35:39,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 23:35:39,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 23:35:39,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 12: [2022-11-26 23:35:39,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:35:39,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 8: [2022-11-26 23:35:39,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:35:39,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 8: [2022-11-26 23:35:39,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 23:35:39,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 14: [2022-11-26 23:35:39,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:35:39,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 23:35:39,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 8: [2022-11-26 23:35:39,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-26 23:35:39,754] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-26 23:35:39,754] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 8: [2022-11-26 23:35:39,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:35:39,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:35:39,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 12: [2022-11-26 23:35:39,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 8: [2022-11-26 23:35:39,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 12: [2022-11-26 23:35:39,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 0: [2022-11-26 23:35:39,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:35:39,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 23:35:39,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 4: [2022-11-26 23:35:39,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-26 23:35:39,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:35:39,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 23:35:39,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 23:35:39,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 4: [2022-11-26 23:35:39,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 2: [2022-11-26 23:35:39,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:35:39,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 11: [2022-11-26 23:35:39,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:35:39,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 2: [2022-11-26 23:35:39,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:35:39,757] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 23:35:39,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
7: [2022-11-26 23:35:39,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:35:39,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 23:35:39,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 3: [2022-11-26 23:35:39,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:35:39,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:35:39,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:35:39,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 23:35:39,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-26 23:35:39,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 3: [2022-11-26 23:35:39,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 23:35:39,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 3: [2022-11-26 23:35:39,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
4: [2022-11-26 23:35:39,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:35:39,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 23:35:39,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 11: [2022-11-26 23:35:39,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 23:35:39,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 11: [2022-11-26 23:35:39,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:35:39,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:35:39,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 2: [2022-11-26 23:35:39,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:35:39,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 23:35:39,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 11: [2022-11-26 23:35:39,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
2: [2022-11-26 23:35:39,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-26 23:35:39,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 0: [2022-11-26 23:35:39,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:35:39,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:35:39,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-26 23:35:39,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 23:35:39,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 0: [2022-11-26 23:35:39,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 6: [2022-11-26 23:35:39,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:35:39,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 23:35:39,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 2: [2022-11-26 23:35:39,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-26 23:35:39,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 23:35:39,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 3: [2022-11-26 23:35:39,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:35:39,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 23:35:39,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 5: [2022-11-26 23:35:39,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:35:39,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 23:35:39,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 4: [2022-11-26 23:35:39,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:35:39,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 23:35:39,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 0: [2022-11-26 23:35:39,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
6: [2022-11-26 23:35:39,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:35:39,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 6: [2022-11-26 23:35:39,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 0: [2022-11-26 23:35:39,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 6: [2022-11-26 23:35:39,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 8: [2022-11-26 23:35:39,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:35:39,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 23:35:39,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 6: [2022-11-26 23:35:39,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:35:39,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 23:35:39,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 9: [2022-11-26 23:35:39,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-26 23:35:39,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 23:35:39,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 12: [2022-11-26 23:35:39,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:35:39,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 23:35:39,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 13: [2022-11-26 23:35:39,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:35:39,755] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-26 23:35:39,755] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 13: [2022-11-26 23:35:39,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:35:39,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 23:35:39,756] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 13: [2022-11-26 23:35:39,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-26 23:35:39,766] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-26 23:35:39,766] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 13: [2022-11-26 23:35:39,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:35:39,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:35:39,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 23:35:39,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 13: [2022-11-26 23:35:39,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 23:35:39,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 6: [2022-11-26 23:35:39,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:35:39,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 23:35:39,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 11: [2022-11-26 23:35:39,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-26 23:35:39,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-26 23:35:39,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 8: [2022-11-26 23:35:39,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:35:39,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-26 23:35:39,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 11: [2022-11-26 23:35:39,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:35:39,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:35:39,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 7: [2022-11-26 23:35:39,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:35:39,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 11: [2022-11-26 23:35:39,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
7: [2022-11-26 23:35:39,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 23:35:39,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 7: [2022-11-26 23:35:39,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 12: [2022-11-26 23:35:39,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:35:39,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:35:39,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 23:35:39,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 7: [2022-11-26 23:35:39,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:35:39,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 7: [2022-11-26 23:35:39,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 12: [2022-11-26 23:35:39,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 7: [2022-11-26 23:35:39,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
13: [2022-11-26 23:35:39,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:35:39,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 23:35:39,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 13: [2022-11-26 23:35:39,771] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:35:39,771] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 23:35:39,771] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 13: [2022-11-26 23:35:39,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:35:39,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 13: [2022-11-26 23:35:39,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 23:35:39,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 0: [2022-11-26 23:35:39,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 9: [2022-11-26 23:35:39,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-26 23:35:39,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 23:35:39,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 1: [2022-11-26 23:35:39,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:35:39,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 23:35:39,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 8: [2022-11-26 23:35:39,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:35:39,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 23:35:39,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 12: [2022-11-26 23:35:39,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:35:39,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 23:35:39,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 12: [2022-11-26 23:35:39,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-26 23:35:39,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 23:35:39,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 12: [2022-11-26 23:35:39,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:35:39,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 23:35:39,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 7: [2022-11-26 23:35:39,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:35:39,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 23:35:39,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 4: [2022-11-26 23:35:39,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:35:39,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 23:35:39,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 4: [2022-11-26 23:35:39,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-26 23:35:39,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 23:35:39,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 4: [2022-11-26 23:35:39,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:35:39,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 23:35:39,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 1: [2022-11-26 23:35:39,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:35:39,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 23:35:39,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 7: [2022-11-26 23:35:39,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:35:39,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-26 23:35:39,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 11: [2022-11-26 23:35:39,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-26 23:35:39,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 23:35:39,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 1: [2022-11-26 23:35:39,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:35:39,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 23:35:39,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 1: [2022-11-26 23:35:39,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:35:39,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:35:39,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 23:35:39,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 2: [2022-11-26 23:35:39,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:35:39,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 23:35:39,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
5: [2022-11-26 23:35:39,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:35:39,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 23:35:39,782] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 11: [2022-11-26 23:35:39,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:35:39,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:35:39,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 23:35:39,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 2: [2022-11-26 23:35:39,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:35:39,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-26 23:35:39,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 4: [2022-11-26 23:35:39,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-26 23:35:39,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 23:35:39,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 9: [2022-11-26 23:35:39,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:35:39,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:35:39,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:35:39,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 23:35:39,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 23:35:39,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 23:35:39,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 9: [2022-11-26 23:35:39,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 9: [2022-11-26 23:35:39,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 6: [2022-11-26 23:35:39,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 23:35:39,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 23:35:39,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 7: [2022-11-26 23:35:39,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:35:39,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:35:39,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 23:35:39,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 14: [2022-11-26 23:35:39,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-26 23:35:39,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 14: [2022-11-26 23:35:39,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:35:39,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 23:35:39,764] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 14: [2022-11-26 23:35:39,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-26 23:35:39,767] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 23:35:39,767] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 14: [2022-11-26 23:35:39,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:35:39,768] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 23:35:39,768] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 14: [2022-11-26 23:35:39,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:35:39,780] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 23:35:39,780] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 14: [2022-11-26 23:35:39,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:35:39,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 23:35:39,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
1: [2022-11-26 23:35:39,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 23:35:39,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 1: [2022-11-26 23:35:39,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:35:39,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-26 23:35:39,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 1: [2022-11-26 23:35:39,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:35:39,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 23:35:39,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 1: [2022-11-26 23:35:39,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:35:39,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 23:35:39,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 1: [2022-11-26 23:35:39,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-26 23:35:39,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 23:35:39,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 3: [2022-11-26 23:35:39,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:35:39,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-26 23:35:39,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 3: [2022-11-26 23:35:39,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:35:39,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 23:35:39,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 3: [2022-11-26 23:35:39,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:35:39,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 23:35:39,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 3: [2022-11-26 23:35:39,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-26 23:35:39,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 23:35:39,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 11: [2022-11-26 23:35:39,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-26 23:35:39,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 11: [2022-11-26 23:35:39,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:35:39,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 23:35:39,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 6: [2022-11-26 23:35:39,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:35:39,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 23:35:39,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 5: [2022-11-26 23:35:39,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 23:35:39,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 23:35:39,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 5: [2022-11-26 23:35:39,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:35:39,803] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 23:35:39,803] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 5: [2022-11-26 23:35:39,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:35:39,806] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 23:35:39,806] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 5: [2022-11-26 23:35:39,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:35:39,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-26 23:35:39,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 5: [2022-11-26 23:35:39,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-26 23:35:39,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 23:35:39,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 9: [2022-11-26 23:35:39,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:35:39,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 23:35:39,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 9: [2022-11-26 23:35:39,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:35:39,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 23:35:39,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 9: [2022-11-26 23:35:39,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:35:39,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 23:35:39,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 6: [2022-11-26 23:35:39,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 23:35:39,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:35:39,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 23:35:39,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 23:35:39,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 6: [2022-11-26 23:35:39,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 15: [2022-11-26 23:35:39,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:35:39,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:35:39,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:35:39,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-26 23:35:39,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-26 23:35:39,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 23:35:39,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:35:39,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 15: [2022-11-26 23:35:39,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 23:35:39,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-26 23:35:39,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 15: [2022-11-26 23:35:39,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 23:35:39,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 15: [2022-11-26 23:35:39,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 15: [2022-11-26 23:35:39,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 15: [2022-11-26 23:35:39,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-26 23:35:39,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 23:35:39,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 15: [2022-11-26 23:35:39,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:35:39,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 23:35:39,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 15: [2022-11-26 23:35:39,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:35:39,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 23:35:39,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 10: [2022-11-26 23:35:40,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:35:40,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:35:40,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:35:40,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-26 23:35:40,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:35:40,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:35:40,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:35:40,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:35:40,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 23:35:40,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 23:35:40,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 23:35:40,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 23:35:40,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 23:35:40,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 23:35:40,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 23:35:40,065] 
[INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step93000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 23:35:40,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 10: [2022-11-26 23:35:40,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 10: [2022-11-26 23:35:40,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 10: [2022-11-26 23:35:40,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 10: [2022-11-26 23:35:40,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 10: [2022-11-26 23:35:40,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 10: [2022-11-26 23:35:40,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 10: [2022-11-26 23:35:40,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step93000 is ready now! 
0: successfully saved checkpoint at iteration 93000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3892.23 15: iteration 93010/ 125429 | consumed samples: 23810560 | consumed tokens: 48764026880 | elapsed time per iteration (s): 1.49 | learning rate: 4.861E-05 | global batch size: 256 | lm loss: 1.913332E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 171.921 | TFLOPs: 28.41 | 15: iteration 93020/ 125429 | consumed samples: 23813120 | consumed tokens: 48769269760 | elapsed time per iteration (s): 1.04 | learning rate: 4.860E-05 | global batch size: 256 | lm loss: 1.926392E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.549 | TFLOPs: 40.58 | 15: iteration 93030/ 125429 | consumed samples: 23815680 | consumed tokens: 48774512640 | elapsed time per iteration (s): 1.04 | learning rate: 4.858E-05 | global batch size: 256 | lm loss: 1.950745E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.146 | TFLOPs: 40.84 | 15: iteration 93040/ 125429 | consumed samples: 23818240 | consumed tokens: 48779755520 | elapsed time per iteration (s): 1.04 | learning rate: 4.856E-05 | global batch size: 256 | lm loss: 1.935129E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.602 | TFLOPs: 40.75 | 15: iteration 93050/ 125429 | consumed samples: 23820800 | consumed tokens: 48784998400 | elapsed time per iteration (s): 1.02 | learning rate: 4.855E-05 | global batch size: 256 | lm loss: 1.893260E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.101 | TFLOPs: 41.50 | 15: iteration 93060/ 125429 | consumed samples: 23823360 | consumed tokens: 48790241280 | elapsed time per iteration (s): 1.10 | 
learning rate: 4.853E-05 | global batch size: 256 | lm loss: 1.916383E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.555 | TFLOPs: 38.60 | 15: iteration 93070/ 125429 | consumed samples: 23825920 | consumed tokens: 48795484160 | elapsed time per iteration (s): 1.10 | learning rate: 4.851E-05 | global batch size: 256 | lm loss: 1.949525E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.363 | TFLOPs: 38.57 | 15: iteration 93080/ 125429 | consumed samples: 23828480 | consumed tokens: 48800727040 | elapsed time per iteration (s): 1.05 | learning rate: 4.850E-05 | global batch size: 256 | lm loss: 1.961472E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.439 | TFLOPs: 40.23 | 15: iteration 93090/ 125429 | consumed samples: 23831040 | consumed tokens: 48805969920 | elapsed time per iteration (s): 1.05 | learning rate: 4.848E-05 | global batch size: 256 | lm loss: 1.942767E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.774 | TFLOPs: 40.12 | 15: iteration 93100/ 125429 | consumed samples: 23833600 | consumed tokens: 48811212800 | elapsed time per iteration (s): 1.03 | learning rate: 4.846E-05 | global batch size: 256 | lm loss: 1.900402E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.118 | TFLOPs: 41.17 | 15: iteration 93110/ 125429 | consumed samples: 23836160 | consumed tokens: 48816455680 | elapsed time per iteration (s): 1.03 | learning rate: 4.845E-05 | global batch size: 256 | lm loss: 1.929087E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.314 | TFLOPs: 41.20 | 15: iteration 93120/ 
125429 | consumed samples: 23838720 | consumed tokens: 48821698560 | elapsed time per iteration (s): 1.08 | learning rate: 4.843E-05 | global batch size: 256 | lm loss: 1.909753E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.296 | TFLOPs: 39.05 | 15: iteration 93130/ 125429 | consumed samples: 23841280 | consumed tokens: 48826941440 | elapsed time per iteration (s): 1.03 | learning rate: 4.841E-05 | global batch size: 256 | lm loss: 1.898463E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.993 | TFLOPs: 40.98 | 15: iteration 93140/ 125429 | consumed samples: 23843840 | consumed tokens: 48832184320 | elapsed time per iteration (s): 1.04 | learning rate: 4.840E-05 | global batch size: 256 | lm loss: 1.910937E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.215 | TFLOPs: 40.85 | 15: iteration 93150/ 125429 | consumed samples: 23846400 | consumed tokens: 48837427200 | elapsed time per iteration (s): 1.09 | learning rate: 4.838E-05 | global batch size: 256 | lm loss: 1.923371E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.671 | TFLOPs: 38.78 | 15: iteration 93160/ 125429 | consumed samples: 23848960 | consumed tokens: 48842670080 | elapsed time per iteration (s): 1.04 | learning rate: 4.836E-05 | global batch size: 256 | lm loss: 1.887667E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.321 | TFLOPs: 40.71 | 15: iteration 93170/ 125429 | consumed samples: 23851520 | consumed tokens: 48847912960 | elapsed time per iteration (s): 1.07 | learning rate: 4.835E-05 | global batch size: 256 | lm loss: 1.892412E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 239.923 | TFLOPs: 39.65 | 15: iteration 93180/ 125429 | consumed samples: 23854080 | consumed tokens: 48853155840 | elapsed time per iteration (s): 1.03 | learning rate: 4.833E-05 | global batch size: 256 | lm loss: 1.932017E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.172 | TFLOPs: 41.18 | 15: iteration 93190/ 125429 | consumed samples: 23856640 | consumed tokens: 48858398720 | elapsed time per iteration (s): 1.04 | learning rate: 4.831E-05 | global batch size: 256 | lm loss: 1.929409E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.325 | TFLOPs: 40.87 | 15: iteration 93200/ 125429 | consumed samples: 23859200 | consumed tokens: 48863641600 | elapsed time per iteration (s): 1.04 | learning rate: 4.830E-05 | global batch size: 256 | lm loss: 1.923092E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.240 | TFLOPs: 40.53 | 15: iteration 93210/ 125429 | consumed samples: 23861760 | consumed tokens: 48868884480 | elapsed time per iteration (s): 1.03 | learning rate: 4.828E-05 | global batch size: 256 | lm loss: 1.940114E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.056 | TFLOPs: 40.99 | 15: iteration 93220/ 125429 | consumed samples: 23864320 | consumed tokens: 48874127360 | elapsed time per iteration (s): 1.03 | learning rate: 4.826E-05 | global batch size: 256 | lm loss: 1.933609E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.704 | TFLOPs: 41.27 | 15: iteration 93230/ 125429 | consumed samples: 23866880 | consumed tokens: 48879370240 | elapsed time per iteration (s): 1.05 | learning rate: 
4.825E-05 | global batch size: 256 | lm loss: 1.930729E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.602 | TFLOPs: 40.42 | 15: iteration 93240/ 125429 | consumed samples: 23869440 | consumed tokens: 48884613120 | elapsed time per iteration (s): 1.06 | learning rate: 4.823E-05 | global batch size: 256 | lm loss: 1.921604E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.580 | TFLOPs: 39.92 | 15: iteration 93250/ 125429 | consumed samples: 23872000 | consumed tokens: 48889856000 | elapsed time per iteration (s): 1.02 | learning rate: 4.821E-05 | global batch size: 256 | lm loss: 1.913038E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.797 | TFLOPs: 41.28 | 15: iteration 93260/ 125429 | consumed samples: 23874560 | consumed tokens: 48895098880 | elapsed time per iteration (s): 1.03 | learning rate: 4.820E-05 | global batch size: 256 | lm loss: 1.924761E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.332 | TFLOPs: 41.20 | 15: iteration 93270/ 125429 | consumed samples: 23877120 | consumed tokens: 48900341760 | elapsed time per iteration (s): 1.02 | learning rate: 4.818E-05 | global batch size: 256 | lm loss: 1.945609E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.539 | TFLOPs: 41.57 | 15: iteration 93280/ 125429 | consumed samples: 23879680 | consumed tokens: 48905584640 | elapsed time per iteration (s): 1.05 | learning rate: 4.816E-05 | global batch size: 256 | lm loss: 1.904568E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.834 | TFLOPs: 40.46 | 15: iteration 93290/ 125429 | 
consumed samples: 23882240 | consumed tokens: 48910827520 | elapsed time per iteration (s): 1.06 | learning rate: 4.815E-05 | global batch size: 256 | lm loss: 1.922727E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.696 | TFLOPs: 39.78 | 15: iteration 93300/ 125429 | consumed samples: 23884800 | consumed tokens: 48916070400 | elapsed time per iteration (s): 1.05 | learning rate: 4.813E-05 | global batch size: 256 | lm loss: 1.901579E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.841 | TFLOPs: 40.46 | 15: iteration 93310/ 125429 | consumed samples: 23887360 | consumed tokens: 48921313280 | elapsed time per iteration (s): 1.04 | learning rate: 4.812E-05 | global batch size: 256 | lm loss: 1.940750E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.142 | TFLOPs: 40.84 | 15: iteration 93320/ 125429 | consumed samples: 23889920 | consumed tokens: 48926556160 | elapsed time per iteration (s): 1.04 | learning rate: 4.810E-05 | global batch size: 256 | lm loss: 1.902782E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.630 | TFLOPs: 40.76 | 15: iteration 93330/ 125429 | consumed samples: 23892480 | consumed tokens: 48931799040 | elapsed time per iteration (s): 1.02 | learning rate: 4.808E-05 | global batch size: 256 | lm loss: 1.925646E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.989 | TFLOPs: 41.31 | 15: iteration 93340/ 125429 | consumed samples: 23895040 | consumed tokens: 48937041920 | elapsed time per iteration (s): 1.03 | learning rate: 4.807E-05 | global batch size: 256 | lm loss: 1.900615E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 247.902 | TFLOPs: 40.97 | 15: iteration 93350/ 125429 | consumed samples: 23897600 | consumed tokens: 48942284800 | elapsed time per iteration (s): 1.05 | learning rate: 4.805E-05 | global batch size: 256 | lm loss: 1.921064E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.819 | TFLOPs: 40.46 | 15: iteration 93360/ 125429 | consumed samples: 23900160 | consumed tokens: 48947527680 | elapsed time per iteration (s): 1.05 | learning rate: 4.803E-05 | global batch size: 256 | lm loss: 1.931055E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.854 | TFLOPs: 40.46 | 15: iteration 93370/ 125429 | consumed samples: 23902720 | consumed tokens: 48952770560 | elapsed time per iteration (s): 1.05 | learning rate: 4.802E-05 | global batch size: 256 | lm loss: 1.927872E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.607 | TFLOPs: 40.42 | 15: iteration 93380/ 125429 | consumed samples: 23905280 | consumed tokens: 48958013440 | elapsed time per iteration (s): 1.07 | learning rate: 4.800E-05 | global batch size: 256 | lm loss: 1.927670E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.145 | TFLOPs: 39.52 | 15: iteration 93390/ 125429 | consumed samples: 23907840 | consumed tokens: 48963256320 | elapsed time per iteration (s): 1.03 | learning rate: 4.798E-05 | global batch size: 256 | lm loss: 1.912083E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.191 | TFLOPs: 41.02 | 15: iteration 93400/ 125429 | consumed samples: 23910400 | consumed tokens: 48968499200 | elapsed time per iteration (s): 1.03 | learning rate: 4.797E-05 | global batch 
size: 256 | lm loss: 1.911718E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.003 | TFLOPs: 40.98 | 15: iteration 93410/ 125429 | consumed samples: 23912960 | consumed tokens: 48973742080 | elapsed time per iteration (s): 1.13 | learning rate: 4.795E-05 | global batch size: 256 | lm loss: 1.932024E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.542 | TFLOPs: 37.44 | 15: iteration 93420/ 125429 | consumed samples: 23915520 | consumed tokens: 48978984960 | elapsed time per iteration (s): 1.05 | learning rate: 4.793E-05 | global batch size: 256 | lm loss: 1.912119E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.179 | TFLOPs: 40.35 | 15: iteration 93430/ 125429 | consumed samples: 23918080 | consumed tokens: 48984227840 | elapsed time per iteration (s): 1.02 | learning rate: 4.792E-05 | global batch size: 256 | lm loss: 1.921189E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.754 | TFLOPs: 41.44 | 15: iteration 93440/ 125429 | consumed samples: 23920640 | consumed tokens: 48989470720 | elapsed time per iteration (s): 1.02 | learning rate: 4.790E-05 | global batch size: 256 | lm loss: 1.929906E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.132 | TFLOPs: 41.50 | 15: iteration 93450/ 125429 | consumed samples: 23923200 | consumed tokens: 48994713600 | elapsed time per iteration (s): 1.05 | learning rate: 4.788E-05 | global batch size: 256 | lm loss: 1.909369E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.803 | TFLOPs: 40.13 | 15: iteration 93460/ 125429 | consumed samples: 23925760 | 
consumed tokens: 48999956480 | elapsed time per iteration (s): 1.02 | learning rate: 4.787E-05 | global batch size: 256 | lm loss: 1.917401E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.496 | TFLOPs: 41.56 | 15: iteration 93470/ 125429 | consumed samples: 23928320 | consumed tokens: 49005199360 | elapsed time per iteration (s): 1.04 | learning rate: 4.785E-05 | global batch size: 256 | lm loss: 1.893313E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.264 | TFLOPs: 40.53 | 15: iteration 93480/ 125429 | consumed samples: 23930880 | consumed tokens: 49010442240 | elapsed time per iteration (s): 1.04 | learning rate: 4.783E-05 | global batch size: 256 | lm loss: 1.927239E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.289 | TFLOPs: 40.70 | 15: iteration 93490/ 125429 | consumed samples: 23933440 | consumed tokens: 49015685120 | elapsed time per iteration (s): 1.05 | learning rate: 4.782E-05 | global batch size: 256 | lm loss: 1.927471E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.563 | TFLOPs: 40.25 | 15: iteration 93500/ 125429 | consumed samples: 23936000 | consumed tokens: 49020928000 | elapsed time per iteration (s): 1.02 | learning rate: 4.780E-05 | global batch size: 256 | lm loss: 1.946565E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.886 | TFLOPs: 41.30 | 15: iteration 93510/ 125429 | consumed samples: 23938560 | consumed tokens: 49026170880 | elapsed time per iteration (s): 1.05 | learning rate: 4.779E-05 | global batch size: 256 | lm loss: 1.891610E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 243.824 | TFLOPs: 40.29 | 15: iteration 93520/ 125429 | consumed samples: 23941120 | consumed tokens: 49031413760 | elapsed time per iteration (s): 1.02 | learning rate: 4.777E-05 | global batch size: 256 | lm loss: 1.905755E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.368 | TFLOPs: 41.38 | 15: iteration 93530/ 125429 | consumed samples: 23943680 | consumed tokens: 49036656640 | elapsed time per iteration (s): 1.03 | learning rate: 4.775E-05 | global batch size: 256 | lm loss: 1.922104E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.648 | TFLOPs: 41.09 | 15: iteration 93540/ 125429 | consumed samples: 23946240 | consumed tokens: 49041899520 | elapsed time per iteration (s): 1.04 | learning rate: 4.774E-05 | global batch size: 256 | lm loss: 1.913813E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.956 | TFLOPs: 40.65 | 15: iteration 93550/ 125429 | consumed samples: 23948800 | consumed tokens: 49047142400 | elapsed time per iteration (s): 1.05 | learning rate: 4.772E-05 | global batch size: 256 | lm loss: 1.950928E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.713 | TFLOPs: 40.44 | 15: iteration 93560/ 125429 | consumed samples: 23951360 | consumed tokens: 49052385280 | elapsed time per iteration (s): 1.03 | learning rate: 4.770E-05 | global batch size: 256 | lm loss: 1.962034E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.337 | TFLOPs: 41.04 | 15: iteration 93570/ 125429 | consumed samples: 23953920 | consumed tokens: 49057628160 | elapsed time per iteration (s): 1.02 | learning rate: 4.769E-05 | global batch size: 256 | lm loss: 
1.924786E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.866 | TFLOPs: 41.29 | 15: iteration 93580/ 125429 | consumed samples: 23956480 | consumed tokens: 49062871040 | elapsed time per iteration (s): 1.07 | learning rate: 4.767E-05 | global batch size: 256 | lm loss: 1.934188E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.114 | TFLOPs: 39.52 | 15: iteration 93590/ 125429 | consumed samples: 23959040 | consumed tokens: 49068113920 | elapsed time per iteration (s): 1.08 | learning rate: 4.765E-05 | global batch size: 256 | lm loss: 1.952297E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.056 | TFLOPs: 39.34 | 15: iteration 93600/ 125429 | consumed samples: 23961600 | consumed tokens: 49073356800 | elapsed time per iteration (s): 1.03 | learning rate: 4.764E-05 | global batch size: 256 | lm loss: 1.921312E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.361 | TFLOPs: 41.04 | 15: iteration 93610/ 125429 | consumed samples: 23964160 | consumed tokens: 49078599680 | elapsed time per iteration (s): 1.03 | learning rate: 4.762E-05 | global batch size: 256 | lm loss: 1.897934E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.845 | TFLOPs: 40.96 | 15: iteration 93620/ 125429 | consumed samples: 23966720 | consumed tokens: 49083842560 | elapsed time per iteration (s): 1.04 | learning rate: 4.760E-05 | global batch size: 256 | lm loss: 1.901091E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.928 | TFLOPs: 40.81 | 15: iteration 93630/ 125429 | consumed samples: 23969280 | consumed tokens: 
49089085440 | elapsed time per iteration (s): 1.03 | learning rate: 4.759E-05 | global batch size: 256 | lm loss: 1.914279E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.324 | TFLOPs: 41.04 | 15: iteration 93640/ 125429 | consumed samples: 23971840 | consumed tokens: 49094328320 | elapsed time per iteration (s): 1.05 | learning rate: 4.757E-05 | global batch size: 256 | lm loss: 1.911189E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.906 | TFLOPs: 40.47 | 15: iteration 93650/ 125429 | consumed samples: 23974400 | consumed tokens: 49099571200 | elapsed time per iteration (s): 1.02 | learning rate: 4.756E-05 | global batch size: 256 | lm loss: 1.913091E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.619 | TFLOPs: 41.42 | 15: iteration 93660/ 125429 | consumed samples: 23976960 | consumed tokens: 49104814080 | elapsed time per iteration (s): 1.04 | learning rate: 4.754E-05 | global batch size: 256 | lm loss: 1.893321E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.231 | TFLOPs: 40.53 | 15: iteration 93670/ 125429 | consumed samples: 23979520 | consumed tokens: 49110056960 | elapsed time per iteration (s): 1.10 | learning rate: 4.752E-05 | global batch size: 256 | lm loss: 1.889088E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.374 | TFLOPs: 38.40 | 15: iteration 93680/ 125429 | consumed samples: 23982080 | consumed tokens: 49115299840 | elapsed time per iteration (s): 1.07 | learning rate: 4.751E-05 | global batch size: 256 | lm loss: 1.927072E+00 | grad norm: 3.676 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 239.520 | TFLOPs: 39.58 | 15: iteration 93690/ 125429 | consumed samples: 23984640 | consumed tokens: 49120542720 | elapsed time per iteration (s): 1.04 | learning rate: 4.749E-05 | global batch size: 256 | lm loss: 1.930644E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.527 | TFLOPs: 40.58 | 15: iteration 93700/ 125429 | consumed samples: 23987200 | consumed tokens: 49125785600 | elapsed time per iteration (s): 1.04 | learning rate: 4.747E-05 | global batch size: 256 | lm loss: 1.917929E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.648 | TFLOPs: 40.76 | 15: iteration 93710/ 125429 | consumed samples: 23989760 | consumed tokens: 49131028480 | elapsed time per iteration (s): 1.11 | learning rate: 4.746E-05 | global batch size: 256 | lm loss: 1.931654E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.294 | TFLOPs: 38.22 | 15: iteration 93720/ 125429 | consumed samples: 23992320 | consumed tokens: 49136271360 | elapsed time per iteration (s): 1.07 | learning rate: 4.744E-05 | global batch size: 256 | lm loss: 1.930337E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.976 | TFLOPs: 39.66 | 15: iteration 93730/ 125429 | consumed samples: 23994880 | consumed tokens: 49141514240 | elapsed time per iteration (s): 1.04 | learning rate: 4.742E-05 | global batch size: 256 | lm loss: 1.924346E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.713 | TFLOPs: 40.61 | 15: iteration 93740/ 125429 | consumed samples: 23997440 | consumed tokens: 49146757120 | elapsed time per iteration (s): 1.03 | learning rate: 4.741E-05 | global batch size: 256 | lm loss: 1.927860E+00 | grad 
norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.185 | TFLOPs: 41.01 | 15: iteration 93750/ 125429 | consumed samples: 24000000 | consumed tokens: 49152000000 | elapsed time per iteration (s): 1.03 | learning rate: 4.739E-05 | global batch size: 256 | lm loss: 1.911622E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.556 | TFLOPs: 41.08 | 15: iteration 93760/ 125429 | consumed samples: 24002560 | consumed tokens: 49157242880 | elapsed time per iteration (s): 1.04 | learning rate: 4.738E-05 | global batch size: 256 | lm loss: 1.912383E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.590 | TFLOPs: 40.75 | 15: iteration 93770/ 125429 | consumed samples: 24005120 | consumed tokens: 49162485760 | elapsed time per iteration (s): 1.06 | learning rate: 4.736E-05 | global batch size: 256 | lm loss: 1.889442E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.550 | TFLOPs: 40.08 | 15: iteration 93780/ 125429 | consumed samples: 24007680 | consumed tokens: 49167728640 | elapsed time per iteration (s): 1.04 | learning rate: 4.734E-05 | global batch size: 256 | lm loss: 1.935942E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.217 | TFLOPs: 40.69 | 15: iteration 93790/ 125429 | consumed samples: 24010240 | consumed tokens: 49172971520 | elapsed time per iteration (s): 1.05 | learning rate: 4.733E-05 | global batch size: 256 | lm loss: 1.923656E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.599 | TFLOPs: 40.42 | 15: iteration 93800/ 125429 | consumed samples: 24012800 | consumed tokens: 49178214400 | elapsed time 
per iteration (s): 1.03 | learning rate: 4.731E-05 | global batch size: 256 | lm loss: 1.907935E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.744 | TFLOPs: 40.94 | 15: iteration 93810/ 125429 | consumed samples: 24015360 | consumed tokens: 49183457280 | elapsed time per iteration (s): 1.04 | learning rate: 4.729E-05 | global batch size: 256 | lm loss: 1.918055E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.410 | TFLOPs: 40.72 | 15: iteration 93820/ 125429 | consumed samples: 24017920 | consumed tokens: 49188700160 | elapsed time per iteration (s): 1.03 | learning rate: 4.728E-05 | global batch size: 256 | lm loss: 1.931899E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.104 | TFLOPs: 41.17 | 15: iteration 93830/ 125429 | consumed samples: 24020480 | consumed tokens: 49193943040 | elapsed time per iteration (s): 1.05 | learning rate: 4.726E-05 | global batch size: 256 | lm loss: 1.955370E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.268 | TFLOPs: 40.20 | 15: iteration 93840/ 125429 | consumed samples: 24023040 | consumed tokens: 49199185920 | elapsed time per iteration (s): 1.05 | learning rate: 4.724E-05 | global batch size: 256 | lm loss: 1.902980E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.718 | TFLOPs: 40.28 | 15: iteration 93850/ 125429 | consumed samples: 24025600 | consumed tokens: 49204428800 | elapsed time per iteration (s): 1.04 | learning rate: 4.723E-05 | global batch size: 256 | lm loss: 1.910002E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.711 | TFLOPs: 
40.77 | 15: iteration 93860/ 125429 | consumed samples: 24028160 | consumed tokens: 49209671680 | elapsed time per iteration (s): 1.06 | learning rate: 4.721E-05 | global batch size: 256 | lm loss: 1.929689E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.835 | TFLOPs: 39.97 | 15: iteration 93870/ 125429 | consumed samples: 24030720 | consumed tokens: 49214914560 | elapsed time per iteration (s): 1.04 | learning rate: 4.720E-05 | global batch size: 256 | lm loss: 1.907917E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.201 | TFLOPs: 40.52 | 15: iteration 93880/ 125429 | consumed samples: 24033280 | consumed tokens: 49220157440 | elapsed time per iteration (s): 1.04 | learning rate: 4.718E-05 | global batch size: 256 | lm loss: 1.927097E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.320 | TFLOPs: 40.71 | 15: iteration 93890/ 125429 | consumed samples: 24035840 | consumed tokens: 49225400320 | elapsed time per iteration (s): 1.07 | learning rate: 4.716E-05 | global batch size: 256 | lm loss: 1.897205E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.015 | TFLOPs: 39.66 | 15: iteration 93900/ 125429 | consumed samples: 24038400 | consumed tokens: 49230643200 | elapsed time per iteration (s): 1.06 | learning rate: 4.715E-05 | global batch size: 256 | lm loss: 1.910966E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.608 | TFLOPs: 39.93 | 15: iteration 93910/ 125429 | consumed samples: 24040960 | consumed tokens: 49235886080 | elapsed time per iteration (s): 1.03 | learning rate: 4.713E-05 | global batch size: 256 | lm loss: 1.918651E+00 | grad norm: 0.148 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.541 | TFLOPs: 40.91 | 15: iteration 93920/ 125429 | consumed samples: 24043520 | consumed tokens: 49241128960 | elapsed time per iteration (s): 1.02 | learning rate: 4.711E-05 | global batch size: 256 | lm loss: 1.922253E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.176 | TFLOPs: 41.34 | 15: iteration 93930/ 125429 | consumed samples: 24046080 | consumed tokens: 49246371840 | elapsed time per iteration (s): 1.04 | learning rate: 4.710E-05 | global batch size: 256 | lm loss: 1.915033E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.315 | TFLOPs: 40.71 | 15: iteration 93940/ 125429 | consumed samples: 24048640 | consumed tokens: 49251614720 | elapsed time per iteration (s): 1.05 | learning rate: 4.708E-05 | global batch size: 256 | lm loss: 1.914906E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.232 | TFLOPs: 40.36 | 15: iteration 93950/ 125429 | consumed samples: 24051200 | consumed tokens: 49256857600 | elapsed time per iteration (s): 1.07 | learning rate: 4.707E-05 | global batch size: 256 | lm loss: 1.894311E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.510 | TFLOPs: 39.42 | 15: iteration 93960/ 125429 | consumed samples: 24053760 | consumed tokens: 49262100480 | elapsed time per iteration (s): 1.02 | learning rate: 4.705E-05 | global batch size: 256 | lm loss: 1.905403E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.903 | TFLOPs: 41.30 | 15: iteration 93970/ 125429 | consumed samples: 24056320 | consumed tokens: 49267343360 | elapsed time per iteration (s): 1.05 | 
learning rate: 4.703E-05 | global batch size: 256 | lm loss: 1.940162E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.090 | TFLOPs: 40.34 | 15: iteration 93980/ 125429 | consumed samples: 24058880 | consumed tokens: 49272586240 | elapsed time per iteration (s): 1.03 | learning rate: 4.702E-05 | global batch size: 256 | lm loss: 1.916114E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.152 | TFLOPs: 41.17 | 15: iteration 93990/ 125429 | consumed samples: 24061440 | consumed tokens: 49277829120 | elapsed time per iteration (s): 1.06 | learning rate: 4.700E-05 | global batch size: 256 | lm loss: 1.916330E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.886 | TFLOPs: 39.97 | 0: [2022-11-26 23:53:05,416] [INFO] [logging.py:68:log_dist] [Rank 0] step=94000, skipped=0, lr=[4.6983870031384015e-05, 4.6983870031384015e-05, 4.6983870031384015e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 94000/ 125429 | consumed samples: 24064000 | consumed tokens: 49283072000 | elapsed time per iteration (s): 1.06 | learning rate: 4.698E-05 | global batch size: 256 | lm loss: 1.926492E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.075 | TFLOPs: 39.84 | 0: steps: 94000 loss: 1.9449 iter time (s): 1.051 samples/sec: 243.629 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 94000 | lm loss value: 1.894215E+00 | lm loss PPL: 6.647326E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 94000 to checkpoints_1b5 0: [2022-11-26 23:53:05,799] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint 
global_step94000 is begin to save! 0: [2022-11-26 23:53:05,804] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_01-model_00-model_states.pt... 0: [2022-11-26 23:53:06,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_01-model_00-model_states.pt. 0: [2022-11-26 23:53:06,061] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_03-model_00-model_states.pt... 0: [2022-11-26 23:53:06,165] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_03-model_00-model_states.pt. 0: [2022-11-26 23:53:06,166] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_04-model_00-model_states.pt... 0: [2022-11-26 23:53:06,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_04-model_00-model_states.pt. 0: [2022-11-26 23:53:06,277] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_05-model_00-model_states.pt... 0: [2022-11-26 23:53:06,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_05-model_00-model_states.pt. 0: [2022-11-26 23:53:06,389] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_06-model_00-model_states.pt... 0: [2022-11-26 23:53:06,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_06-model_00-model_states.pt. 0: [2022-11-26 23:53:06,498] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_07-model_00-model_states.pt... 0: [2022-11-26 23:53:06,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_07-model_00-model_states.pt. 
0: [2022-11-26 23:53:06,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_08-model_00-model_states.pt... 0: [2022-11-26 23:53:06,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_08-model_00-model_states.pt. 0: [2022-11-26 23:53:06,721] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_09-model_00-model_states.pt... 0: [2022-11-26 23:53:06,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_09-model_00-model_states.pt. 0: [2022-11-26 23:53:06,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_10-model_00-model_states.pt... 0: [2022-11-26 23:53:06,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_10-model_00-model_states.pt. 0: [2022-11-26 23:53:06,955] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_11-model_00-model_states.pt... 0: [2022-11-26 23:53:07,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_11-model_00-model_states.pt. 0: [2022-11-26 23:53:07,073] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_12-model_00-model_states.pt... 0: [2022-11-26 23:53:07,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_12-model_00-model_states.pt. 0: [2022-11-26 23:53:07,186] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_13-model_00-model_states.pt... 0: [2022-11-26 23:53:07,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_13-model_00-model_states.pt. 
0: [2022-11-26 23:53:07,300] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_14-model_00-model_states.pt... 0: [2022-11-26 23:53:07,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_14-model_00-model_states.pt. 0: [2022-11-26 23:53:07,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_15-model_00-model_states.pt... 0: [2022-11-26 23:53:07,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_15-model_00-model_states.pt. 0: [2022-11-26 23:53:07,524] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_16-model_00-model_states.pt... 0: [2022-11-26 23:53:07,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_16-model_00-model_states.pt. 0: [2022-11-26 23:53:07,637] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_17-model_00-model_states.pt... 0: [2022-11-26 23:53:07,751] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_17-model_00-model_states.pt. 0: [2022-11-26 23:53:07,751] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_18-model_00-model_states.pt... 0: [2022-11-26 23:53:07,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_18-model_00-model_states.pt. 0: [2022-11-26 23:53:07,864] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_19-model_00-model_states.pt... 0: [2022-11-26 23:53:07,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_19-model_00-model_states.pt. 
0: [2022-11-26 23:53:07,976] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_20-model_00-model_states.pt... 0: [2022-11-26 23:53:08,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_20-model_00-model_states.pt. 0: [2022-11-26 23:53:08,091] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_21-model_00-model_states.pt... 0: [2022-11-26 23:53:08,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_21-model_00-model_states.pt. 0: [2022-11-26 23:53:08,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_22-model_00-model_states.pt... 0: [2022-11-26 23:53:08,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_22-model_00-model_states.pt. 0: [2022-11-26 23:53:08,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_23-model_00-model_states.pt... 0: [2022-11-26 23:53:08,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_23-model_00-model_states.pt. 0: [2022-11-26 23:53:08,428] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_24-model_00-model_states.pt... 0: [2022-11-26 23:53:08,538] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_24-model_00-model_states.pt. 0: [2022-11-26 23:53:08,538] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_25-model_00-model_states.pt... 0: [2022-11-26 23:53:08,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_25-model_00-model_states.pt. 
0: [2022-11-26 23:53:08,652] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_26-model_00-model_states.pt... 0: [2022-11-26 23:53:08,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_26-model_00-model_states.pt. 0: [2022-11-26 23:53:08,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_27-model_00-model_states.pt... 0: [2022-11-26 23:53:08,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_27-model_00-model_states.pt. 0: [2022-11-26 23:53:08,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_28-model_00-model_states.pt... 0: [2022-11-26 23:53:08,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_28-model_00-model_states.pt. 0: [2022-11-26 23:53:08,977] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_29-model_00-model_states.pt... 0: [2022-11-26 23:53:09,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_29-model_00-model_states.pt. 0: [2022-11-26 23:53:09,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_30-model_00-model_states.pt... 0: [2022-11-26 23:53:09,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_30-model_00-model_states.pt. 0: [2022-11-26 23:53:09,191] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/layer_32-model_00-model_states.pt... 0: [2022-11-26 23:53:09,195] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/layer_32-model_00-model_states.pt. 
0: [2022-11-26 23:53:09,197] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step94000/mp_rank_00_model_states.pt 0: [2022-11-26 23:53:09,197] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/mp_rank_00_model_states.pt... 0: [2022-11-26 23:53:09,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/mp_rank_00_model_states.pt. 0: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
12: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
8: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
5: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
9: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
6: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
2: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
3: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 0: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
5: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
6: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 11: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 1: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 4: [2022-11-26 23:53:09,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 13: [2022-11-26 23:53:09,237] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step94000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 14: [2022-11-26 23:53:09,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-26 23:53:09,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-26 23:53:09,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 14: [2022-11-26 23:53:09,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:53:09,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-26 23:53:09,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 0: [2022-11-26 23:53:09,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:53:09,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-26 23:53:09,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 15: [2022-11-26 23:53:09,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:53:09,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:53:09,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-26 23:53:09,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
11: [2022-11-26 23:53:09,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 8: [2022-11-26 23:53:09,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:53:09,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 8: [2022-11-26 23:53:09,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-26 23:53:09,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 0: [2022-11-26 23:53:09,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:53:09,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-26 23:53:09,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 8: [2022-11-26 23:53:09,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:53:09,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-26 23:53:09,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 6: [2022-11-26 23:53:09,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-26 23:53:09,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-26 23:53:09,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 6: [2022-11-26 23:53:09,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:53:09,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-26 23:53:09,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 10: [2022-11-26 23:53:09,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:53:09,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-26 23:53:09,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 3: [2022-11-26 23:53:09,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:53:09,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-26 23:53:09,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-26 23:53:09,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-26 23:53:09,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:53:09,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:53:09,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 3: [2022-11-26 23:53:09,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:53:09,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:53:09,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
3: [2022-11-26 23:53:09,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-26 23:53:09,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-26 23:53:09,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-26 23:53:09,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-26 23:53:09,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 3: [2022-11-26 23:53:09,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 3: [2022-11-26 23:53:09,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 3: [2022-11-26 23:53:09,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 7: [2022-11-26 23:53:09,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:53:09,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-26 23:53:09,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 6: [2022-11-26 23:53:09,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-26 23:53:09,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-26 23:53:09,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 2: [2022-11-26 23:53:09,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:53:09,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-26 23:53:09,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 2: [2022-11-26 23:53:09,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:53:09,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:53:09,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 10: [2022-11-26 23:53:09,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-26 23:53:09,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 2: [2022-11-26 23:53:09,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 10: [2022-11-26 23:53:09,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
0: [2022-11-26 23:53:09,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:53:09,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-26 23:53:09,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 0: [2022-11-26 23:53:09,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-26 23:53:09,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 10: [2022-11-26 23:53:09,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:53:09,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-26 23:53:09,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 5: [2022-11-26 23:53:09,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:53:09,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-26 23:53:09,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 1: [2022-11-26 23:53:09,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-26 23:53:09,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:53:09,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-26 23:53:09,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 1: [2022-11-26 23:53:09,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-26 23:53:09,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 1: [2022-11-26 23:53:09,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:53:09,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 11: [2022-11-26 23:53:09,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:53:09,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 11: [2022-11-26 23:53:09,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-26 23:53:09,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 5: [2022-11-26 23:53:09,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
11: [2022-11-26 23:53:09,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:53:09,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 3: [2022-11-26 23:53:09,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:53:09,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 3: [2022-11-26 23:53:09,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 11: [2022-11-26 23:53:09,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 6: [2022-11-26 23:53:09,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:53:09,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 11: [2022-11-26 23:53:09,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 6: [2022-11-26 23:53:09,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-26 23:53:09,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 7: [2022-11-26 23:53:09,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-26 23:53:09,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-26 23:53:09,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 1: [2022-11-26 23:53:09,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:53:09,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-26 23:53:09,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 4: [2022-11-26 23:53:09,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:53:09,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-26 23:53:09,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 4: [2022-11-26 23:53:09,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:53:09,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:53:09,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-26 23:53:09,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-26 23:53:09,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-26 23:53:09,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-26 23:53:09,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 4: [2022-11-26 23:53:09,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 4: [2022-11-26 23:53:09,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 10: [2022-11-26 23:53:09,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:53:09,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-26 23:53:09,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 5: [2022-11-26 23:53:09,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:53:09,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 3: [2022-11-26 23:53:09,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
5: [2022-11-26 23:53:09,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-26 23:53:09,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-26 23:53:09,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 3: [2022-11-26 23:53:09,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 5: [2022-11-26 23:53:09,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 3: [2022-11-26 23:53:09,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 5: [2022-11-26 23:53:09,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:53:09,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-26 23:53:09,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 15: [2022-11-26 23:53:09,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:53:09,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-26 23:53:09,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-26 23:53:09,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 15: [2022-11-26 23:53:09,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 13: [2022-11-26 23:53:09,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:53:09,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:53:09,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 13: [2022-11-26 23:53:09,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 2: [2022-11-26 23:53:09,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:53:09,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 13: [2022-11-26 23:53:09,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 8: [2022-11-26 23:53:09,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 13: [2022-11-26 23:53:09,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
2: [2022-11-26 23:53:09,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 8: [2022-11-26 23:53:09,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:53:09,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 8: [2022-11-26 23:53:09,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 13: [2022-11-26 23:53:09,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 8: [2022-11-26 23:53:09,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 2: [2022-11-26 23:53:09,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 0: [2022-11-26 23:53:09,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:53:09,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-26 23:53:09,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 4: [2022-11-26 23:53:09,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:53:09,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-26 23:53:09,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-26 23:53:09,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-26 23:53:09,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-26 23:53:09,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 4: [2022-11-26 23:53:09,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 4: [2022-11-26 23:53:09,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-26 23:53:09,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 10: [2022-11-26 23:53:09,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:53:09,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-26 23:53:09,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 11: [2022-11-26 23:53:09,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:53:09,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-26 23:53:09,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:53:09,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:53:09,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:53:09,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 6: [2022-11-26 23:53:09,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-26 23:53:09,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-26 23:53:09,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-26 23:53:09,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 6: [2022-11-26 23:53:09,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 0: [2022-11-26 23:53:09,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 6: [2022-11-26 23:53:09,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 5: [2022-11-26 23:53:09,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-26 23:53:09,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-26 23:53:09,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 1: [2022-11-26 23:53:09,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:53:09,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-26 23:53:09,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 15: [2022-11-26 23:53:09,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:53:09,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:53:09,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-26 23:53:09,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 15: [2022-11-26 23:53:09,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-26 23:53:09,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 5: [2022-11-26 23:53:09,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-26 23:53:09,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-26 23:53:09,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 7: [2022-11-26 23:53:09,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:53:09,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-26 23:53:09,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 14: [2022-11-26 23:53:09,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:53:09,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-26 23:53:09,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 10: [2022-11-26 23:53:09,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-26 23:53:09,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-26 23:53:09,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 8: [2022-11-26 23:53:09,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
13: [2022-11-26 23:53:09,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:53:09,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:53:09,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-26 23:53:09,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 8: [2022-11-26 23:53:09,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:53:09,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-26 23:53:09,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:53:09,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 0: [2022-11-26 23:53:09,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:53:09,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 0: [2022-11-26 23:53:09,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 8: [2022-11-26 23:53:09,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
0: [2022-11-26 23:53:09,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 7: [2022-11-26 23:53:09,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:53:09,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:53:09,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 2: [2022-11-26 23:53:09,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 7: [2022-11-26 23:53:09,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 2: [2022-11-26 23:53:09,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 2: [2022-11-26 23:53:09,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:53:09,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-26 23:53:09,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 14: [2022-11-26 23:53:09,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-26 23:53:09,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-26 23:53:09,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 7: [2022-11-26 23:53:09,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:53:09,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-26 23:53:09,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 6: [2022-11-26 23:53:09,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-26 23:53:09,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-26 23:53:09,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 14: [2022-11-26 23:53:09,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:53:09,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-26 23:53:09,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 14: [2022-11-26 23:53:09,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-26 23:53:09,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-26 23:53:09,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 2: [2022-11-26 23:53:09,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:53:09,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-26 23:53:09,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 2: [2022-11-26 23:53:09,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:53:09,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-26 23:53:09,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 1: [2022-11-26 23:53:09,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:53:09,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-26 23:53:09,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
11: [2022-11-26 23:53:09,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-26 23:53:09,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 11: [2022-11-26 23:53:09,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:53:09,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:53:09,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:53:09,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-26 23:53:09,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-26 23:53:09,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-26 23:53:09,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-26 23:53:09,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-26 23:53:09,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 11: [2022-11-26 23:53:09,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
11: [2022-11-26 23:53:09,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 11: [2022-11-26 23:53:09,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 13: [2022-11-26 23:53:09,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-26 23:53:09,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-26 23:53:09,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 13: [2022-11-26 23:53:09,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 15: [2022-11-26 23:53:09,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:53:09,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 4: [2022-11-26 23:53:09,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:53:09,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 15: [2022-11-26 23:53:09,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-26 23:53:09,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-26 23:53:09,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 4: [2022-11-26 23:53:09,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-26 23:53:09,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 7: [2022-11-26 23:53:09,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:53:09,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-26 23:53:09,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 15: [2022-11-26 23:53:09,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:53:09,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-26 23:53:09,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-26 23:53:09,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
15: [2022-11-26 23:53:09,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-26 23:53:09,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 14: [2022-11-26 23:53:09,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-26 23:53:09,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:53:09,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-26 23:53:09,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 14: [2022-11-26 23:53:09,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-26 23:53:09,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 0: [2022-11-26 23:53:09,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 14: [2022-11-26 23:53:09,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 14: [2022-11-26 23:53:09,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 0: [2022-11-26 23:53:09,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
13: [2022-11-26 23:53:09,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:53:09,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-26 23:53:09,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 13: [2022-11-26 23:53:09,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:53:09,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-26 23:53:09,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 15: [2022-11-26 23:53:09,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-26 23:53:09,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-26 23:53:09,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 1: [2022-11-26 23:53:09,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:53:09,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-26 23:53:09,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
13: [2022-11-26 23:53:09,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:53:09,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-26 23:53:09,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-26 23:53:09,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-26 23:53:09,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 13: [2022-11-26 23:53:09,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 1: [2022-11-26 23:53:09,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-26 23:53:09,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-26 23:53:09,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 2: [2022-11-26 23:53:09,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-26 23:53:09,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-26 23:53:09,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
5: [2022-11-26 23:53:09,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-26 23:53:09,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-26 23:53:09,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 7: [2022-11-26 23:53:09,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:53:09,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-26 23:53:09,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-26 23:53:09,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-26 23:53:09,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 7: [2022-11-26 23:53:09,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 0: [2022-11-26 23:53:09,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-26 23:53:09,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 9: [2022-11-26 23:53:09,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-26 23:53:09,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:53:09,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-26 23:53:09,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:53:09,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-26 23:53:09,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:53:09,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 9: [2022-11-26 23:53:09,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-26 23:53:09,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 9: [2022-11-26 23:53:09,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 9: [2022-11-26 23:53:09,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-26 23:53:09,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 9: [2022-11-26 23:53:09,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-26 23:53:09,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-26 23:53:09,518] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 9: [2022-11-26 23:53:09,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:53:09,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-26 23:53:09,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 9: [2022-11-26 23:53:09,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:53:09,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-26 23:53:09,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 9: [2022-11-26 23:53:09,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-26 23:53:09,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-26 23:53:09,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 12: [2022-11-26 23:53:09,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-26 23:53:09,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:53:09,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:53:09,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:53:09,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:53:09,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:53:09,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-26 23:53:09,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-26 23:53:09,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-26 23:53:09,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-26 23:53:09,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-26 23:53:09,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-26 23:53:09,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 12: [2022-11-26 23:53:09,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 12: [2022-11-26 23:53:09,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-26 23:53:09,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-26 23:53:09,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-26 23:53:09,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step94000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-26 23:53:09,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 12: [2022-11-26 23:53:09,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 
12: [2022-11-26 23:53:09,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 12: [2022-11-26 23:53:09,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 12: [2022-11-26 23:53:09,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 12: [2022-11-26 23:53:09,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step94000 is ready now! 0: successfully saved checkpoint at iteration 94000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3836.16 15: iteration 94010/ 125429 | consumed samples: 24066560 | consumed tokens: 49288314880 | elapsed time per iteration (s): 1.44 | learning rate: 4.697E-05 | global batch size: 256 | lm loss: 1.923394E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.940 | TFLOPs: 29.41 | 15: iteration 94020/ 125429 | consumed samples: 24069120 | consumed tokens: 49293557760 | elapsed time per iteration (s): 1.07 | learning rate: 4.695E-05 | global batch size: 256 | lm loss: 1.911498E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.311 | TFLOPs: 39.38 | 15: iteration 94030/ 125429 | consumed samples: 24071680 | consumed tokens: 49298800640 | elapsed time per iteration (s): 1.03 | learning rate: 4.694E-05 | global batch size: 256 | lm loss: 1.945858E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.711 | TFLOPs: 40.94 | 15: iteration 94040/ 125429 | consumed samples: 24074240 | consumed tokens: 49304043520 | elapsed time per iteration (s): 1.05 | learning rate: 4.692E-05 | global batch size: 256 | lm loss: 1.935805E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
243.688 | TFLOPs: 40.27 | 15: iteration 94050/ 125429 | consumed samples: 24076800 | consumed tokens: 49309286400 | elapsed time per iteration (s): 1.17 | learning rate: 4.690E-05 | global batch size: 256 | lm loss: 1.896543E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.966 | TFLOPs: 36.02 | 15: iteration 94060/ 125429 | consumed samples: 24079360 | consumed tokens: 49314529280 | elapsed time per iteration (s): 1.03 | learning rate: 4.689E-05 | global batch size: 256 | lm loss: 1.891505E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.131 | TFLOPs: 41.01 | 15: iteration 94070/ 125429 | consumed samples: 24081920 | consumed tokens: 49319772160 | elapsed time per iteration (s): 1.04 | learning rate: 4.687E-05 | global batch size: 256 | lm loss: 1.930340E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.847 | TFLOPs: 40.79 | 15: iteration 94080/ 125429 | consumed samples: 24084480 | consumed tokens: 49325015040 | elapsed time per iteration (s): 1.05 | learning rate: 4.685E-05 | global batch size: 256 | lm loss: 1.903057E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.701 | TFLOPs: 40.11 | 15: iteration 94090/ 125429 | consumed samples: 24087040 | consumed tokens: 49330257920 | elapsed time per iteration (s): 1.05 | learning rate: 4.684E-05 | global batch size: 256 | lm loss: 1.931640E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.791 | TFLOPs: 40.45 | 15: iteration 94100/ 125429 | consumed samples: 24089600 | consumed tokens: 49335500800 | elapsed time per iteration (s): 1.05 | learning rate: 4.682E-05 | global batch size: 256 | lm loss: 1.930463E+00 | grad norm: 
0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.350 | TFLOPs: 40.22 | 15: iteration 94110/ 125429 | consumed samples: 24092160 | consumed tokens: 49340743680 | elapsed time per iteration (s): 1.03 | learning rate: 4.681E-05 | global batch size: 256 | lm loss: 1.900767E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.517 | TFLOPs: 41.07 | 15: iteration 94120/ 125429 | consumed samples: 24094720 | consumed tokens: 49345986560 | elapsed time per iteration (s): 1.05 | learning rate: 4.679E-05 | global batch size: 256 | lm loss: 1.902069E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.168 | TFLOPs: 40.19 | 15: iteration 94130/ 125429 | consumed samples: 24097280 | consumed tokens: 49351229440 | elapsed time per iteration (s): 1.03 | learning rate: 4.677E-05 | global batch size: 256 | lm loss: 1.916125E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.753 | TFLOPs: 41.27 | 15: iteration 94140/ 125429 | consumed samples: 24099840 | consumed tokens: 49356472320 | elapsed time per iteration (s): 1.05 | learning rate: 4.676E-05 | global batch size: 256 | lm loss: 1.916868E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.572 | TFLOPs: 40.42 | 15: iteration 94150/ 125429 | consumed samples: 24102400 | consumed tokens: 49361715200 | elapsed time per iteration (s): 1.05 | learning rate: 4.674E-05 | global batch size: 256 | lm loss: 1.914106E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.997 | TFLOPs: 40.32 | 15: iteration 94160/ 125429 | consumed samples: 24104960 | consumed tokens: 49366958080 | elapsed time per 
iteration (s): 1.03 | learning rate: 4.672E-05 | global batch size: 256 | lm loss: 1.907509E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.487 | TFLOPs: 41.06 | 15: iteration 94170/ 125429 | consumed samples: 24107520 | consumed tokens: 49372200960 | elapsed time per iteration (s): 1.03 | learning rate: 4.671E-05 | global batch size: 256 | lm loss: 1.929404E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.687 | TFLOPs: 41.10 | 15: iteration 94180/ 125429 | consumed samples: 24110080 | consumed tokens: 49377443840 | elapsed time per iteration (s): 1.02 | learning rate: 4.669E-05 | global batch size: 256 | lm loss: 1.919905E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.939 | TFLOPs: 41.30 | 15: iteration 94190/ 125429 | consumed samples: 24112640 | consumed tokens: 49382686720 | elapsed time per iteration (s): 1.04 | learning rate: 4.668E-05 | global batch size: 256 | lm loss: 1.921618E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.087 | TFLOPs: 40.67 | 15: iteration 94200/ 125429 | consumed samples: 24115200 | consumed tokens: 49387929600 | elapsed time per iteration (s): 1.04 | learning rate: 4.666E-05 | global batch size: 256 | lm loss: 1.912299E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.296 | TFLOPs: 40.87 | 15: iteration 94210/ 125429 | consumed samples: 24117760 | consumed tokens: 49393172480 | elapsed time per iteration (s): 1.04 | learning rate: 4.664E-05 | global batch size: 256 | lm loss: 1.922009E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.687 | TFLOPs: 40.77 | 
15: iteration 94220/ 125429 | consumed samples: 24120320 | consumed tokens: 49398415360 | elapsed time per iteration (s): 1.02 | learning rate: 4.663E-05 | global batch size: 256 | lm loss: 1.914604E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.565 | TFLOPs: 41.41 | 15: iteration 94230/ 125429 | consumed samples: 24122880 | consumed tokens: 49403658240 | elapsed time per iteration (s): 1.03 | learning rate: 4.661E-05 | global batch size: 256 | lm loss: 1.901165E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.271 | TFLOPs: 41.19 | 15: iteration 94240/ 125429 | consumed samples: 24125440 | consumed tokens: 49408901120 | elapsed time per iteration (s): 1.06 | learning rate: 4.659E-05 | global batch size: 256 | lm loss: 1.925495E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.402 | TFLOPs: 39.89 | 15: iteration 94250/ 125429 | consumed samples: 24128000 | consumed tokens: 49414144000 | elapsed time per iteration (s): 1.02 | learning rate: 4.658E-05 | global batch size: 256 | lm loss: 1.904618E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.765 | TFLOPs: 41.28 | 15: iteration 94260/ 125429 | consumed samples: 24130560 | consumed tokens: 49419386880 | elapsed time per iteration (s): 1.06 | learning rate: 4.656E-05 | global batch size: 256 | lm loss: 1.940988E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.490 | TFLOPs: 39.91 | 15: iteration 94270/ 125429 | consumed samples: 24133120 | consumed tokens: 49424629760 | elapsed time per iteration (s): 1.07 | learning rate: 4.655E-05 | global batch size: 256 | lm loss: 1.908526E+00 | grad norm: 0.146 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.030 | TFLOPs: 39.67 | 15: iteration 94280/ 125429 | consumed samples: 24135680 | consumed tokens: 49429872640 | elapsed time per iteration (s): 1.08 | learning rate: 4.653E-05 | global batch size: 256 | lm loss: 1.899397E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.843 | TFLOPs: 39.14 | 15: iteration 94290/ 125429 | consumed samples: 24138240 | consumed tokens: 49435115520 | elapsed time per iteration (s): 1.07 | learning rate: 4.651E-05 | global batch size: 256 | lm loss: 1.905319E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.729 | TFLOPs: 39.45 | 15: iteration 94300/ 125429 | consumed samples: 24140800 | consumed tokens: 49440358400 | elapsed time per iteration (s): 1.02 | learning rate: 4.650E-05 | global batch size: 256 | lm loss: 1.914004E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.436 | TFLOPs: 41.39 | 15: iteration 94310/ 125429 | consumed samples: 24143360 | consumed tokens: 49445601280 | elapsed time per iteration (s): 1.10 | learning rate: 4.648E-05 | global batch size: 256 | lm loss: 1.924143E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.247 | TFLOPs: 38.38 | 15: iteration 94320/ 125429 | consumed samples: 24145920 | consumed tokens: 49450844160 | elapsed time per iteration (s): 1.02 | learning rate: 4.647E-05 | global batch size: 256 | lm loss: 1.908137E+00 | grad norm: 0.140 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.799 | TFLOPs: 41.45 | 15: iteration 94330/ 125429 | consumed samples: 24148480 | consumed tokens: 49456087040 | elapsed time per iteration (s): 1.04 | 
learning rate: 4.645E-05 | global batch size: 256 | lm loss: 1.926017E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.089 | TFLOPs: 40.83 | 15: iteration 94340/ 125429 | consumed samples: 24151040 | consumed tokens: 49461329920 | elapsed time per iteration (s): 1.05 | learning rate: 4.643E-05 | global batch size: 256 | lm loss: 1.917901E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.759 | TFLOPs: 40.12 | 15: iteration 94350/ 125429 | consumed samples: 24153600 | consumed tokens: 49466572800 | elapsed time per iteration (s): 1.04 | learning rate: 4.642E-05 | global batch size: 256 | lm loss: 1.894452E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.588 | TFLOPs: 40.59 | 15: iteration 94360/ 125429 | consumed samples: 24156160 | consumed tokens: 49471815680 | elapsed time per iteration (s): 1.04 | learning rate: 4.640E-05 | global batch size: 256 | lm loss: 1.905384E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.865 | TFLOPs: 40.80 | 15: iteration 94370/ 125429 | consumed samples: 24158720 | consumed tokens: 49477058560 | elapsed time per iteration (s): 1.05 | learning rate: 4.639E-05 | global batch size: 256 | lm loss: 1.927227E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.744 | TFLOPs: 40.28 | 15: iteration 94380/ 125429 | consumed samples: 24161280 | consumed tokens: 49482301440 | elapsed time per iteration (s): 1.06 | learning rate: 4.637E-05 | global batch size: 256 | lm loss: 1.917848E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.638 | TFLOPs: 39.77 | 15: iteration 94390/ 
125429 | consumed samples: 24163840 | consumed tokens: 49487544320 | elapsed time per iteration (s): 1.09 | learning rate: 4.635E-05 | global batch size: 256 | lm loss: 1.930017E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.545 | TFLOPs: 38.93 | 15: iteration 94400/ 125429 | consumed samples: 24166400 | consumed tokens: 49492787200 | elapsed time per iteration (s): 1.05 | learning rate: 4.634E-05 | global batch size: 256 | lm loss: 1.910597E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.051 | TFLOPs: 40.33 | 15: iteration 94410/ 125429 | consumed samples: 24168960 | consumed tokens: 49498030080 | elapsed time per iteration (s): 1.06 | learning rate: 4.632E-05 | global batch size: 256 | lm loss: 1.908466E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.084 | TFLOPs: 39.84 | 15: iteration 94420/ 125429 | consumed samples: 24171520 | consumed tokens: 49503272960 | elapsed time per iteration (s): 1.03 | learning rate: 4.630E-05 | global batch size: 256 | lm loss: 1.938314E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.823 | TFLOPs: 40.95 | 15: iteration 94430/ 125429 | consumed samples: 24174080 | consumed tokens: 49508515840 | elapsed time per iteration (s): 1.03 | learning rate: 4.629E-05 | global batch size: 256 | lm loss: 1.921838E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.350 | TFLOPs: 40.88 | 15: iteration 94440/ 125429 | consumed samples: 24176640 | consumed tokens: 49513758720 | elapsed time per iteration (s): 1.04 | learning rate: 4.627E-05 | global batch size: 256 | lm loss: 1.915727E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 245.310 | TFLOPs: 40.54 | 15: iteration 94450/ 125429 | consumed samples: 24179200 | consumed tokens: 49519001600 | elapsed time per iteration (s): 1.05 | learning rate: 4.626E-05 | global batch size: 256 | lm loss: 1.924237E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.751 | TFLOPs: 40.28 | 15: iteration 94460/ 125429 | consumed samples: 24181760 | consumed tokens: 49524244480 | elapsed time per iteration (s): 1.20 | learning rate: 4.624E-05 | global batch size: 256 | lm loss: 1.967501E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 212.879 | TFLOPs: 35.18 | 15: iteration 94470/ 125429 | consumed samples: 24184320 | consumed tokens: 49529487360 | elapsed time per iteration (s): 1.05 | learning rate: 4.622E-05 | global batch size: 256 | lm loss: 1.927706E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.727 | TFLOPs: 40.44 | 15: iteration 94480/ 125429 | consumed samples: 24186880 | consumed tokens: 49534730240 | elapsed time per iteration (s): 1.04 | learning rate: 4.621E-05 | global batch size: 256 | lm loss: 1.904766E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.093 | TFLOPs: 40.50 | 15: iteration 94490/ 125429 | consumed samples: 24189440 | consumed tokens: 49539973120 | elapsed time per iteration (s): 1.03 | learning rate: 4.619E-05 | global batch size: 256 | lm loss: 1.926329E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.597 | TFLOPs: 40.92 | 15: iteration 94500/ 125429 | consumed samples: 24192000 | consumed tokens: 49545216000 | elapsed time per iteration (s): 1.10 | learning rate: 
4.618E-05 | global batch size: 256 | lm loss: 1.906735E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.699 | TFLOPs: 38.29 | 15: iteration 94510/ 125429 | consumed samples: 24194560 | consumed tokens: 49550458880 | elapsed time per iteration (s): 1.04 | learning rate: 4.616E-05 | global batch size: 256 | lm loss: 1.901550E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.177 | TFLOPs: 40.68 | 15: iteration 94520/ 125429 | consumed samples: 24197120 | consumed tokens: 49555701760 | elapsed time per iteration (s): 1.03 | learning rate: 4.614E-05 | global batch size: 256 | lm loss: 1.885307E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.502 | TFLOPs: 41.23 | 15: iteration 94530/ 125429 | consumed samples: 24199680 | consumed tokens: 49560944640 | elapsed time per iteration (s): 1.03 | learning rate: 4.613E-05 | global batch size: 256 | lm loss: 1.954200E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.372 | TFLOPs: 41.05 | 15: iteration 94540/ 125429 | consumed samples: 24202240 | consumed tokens: 49566187520 | elapsed time per iteration (s): 1.04 | learning rate: 4.611E-05 | global batch size: 256 | lm loss: 1.899673E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.218 | TFLOPs: 40.52 | 15: iteration 94550/ 125429 | consumed samples: 24204800 | consumed tokens: 49571430400 | elapsed time per iteration (s): 1.04 | learning rate: 4.610E-05 | global batch size: 256 | lm loss: 1.929909E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.215 | TFLOPs: 40.69 | 15: iteration 94560/ 125429 | 
consumed samples: 24207360 | consumed tokens: 49576673280 | elapsed time per iteration (s): 1.05 | learning rate: 4.608E-05 | global batch size: 256 | lm loss: 1.900060E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.954 | TFLOPs: 40.48 | 15: iteration 94570/ 125429 | consumed samples: 24209920 | consumed tokens: 49581916160 | elapsed time per iteration (s): 1.03 | learning rate: 4.606E-05 | global batch size: 256 | lm loss: 1.913110E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.175 | TFLOPs: 41.01 | 15: iteration 94580/ 125429 | consumed samples: 24212480 | consumed tokens: 49587159040 | elapsed time per iteration (s): 1.04 | learning rate: 4.605E-05 | global batch size: 256 | lm loss: 1.922099E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.880 | TFLOPs: 40.80 | 15: iteration 94590/ 125429 | consumed samples: 24215040 | consumed tokens: 49592401920 | elapsed time per iteration (s): 1.07 | learning rate: 4.603E-05 | global batch size: 256 | lm loss: 1.892232E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.358 | TFLOPs: 39.72 | 15: iteration 94600/ 125429 | consumed samples: 24217600 | consumed tokens: 49597644800 | elapsed time per iteration (s): 1.04 | learning rate: 4.602E-05 | global batch size: 256 | lm loss: 1.931633E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.954 | TFLOPs: 40.65 | 15: iteration 94610/ 125429 | consumed samples: 24220160 | consumed tokens: 49602887680 | elapsed time per iteration (s): 1.03 | learning rate: 4.600E-05 | global batch size: 256 | lm loss: 1.907429E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 247.363 | TFLOPs: 40.88 | 15: iteration 94620/ 125429 | consumed samples: 24222720 | consumed tokens: 49608130560 | elapsed time per iteration (s): 1.06 | learning rate: 4.598E-05 | global batch size: 256 | lm loss: 1.943890E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.850 | TFLOPs: 39.97 | 15: iteration 94630/ 125429 | consumed samples: 24225280 | consumed tokens: 49613373440 | elapsed time per iteration (s): 1.06 | learning rate: 4.597E-05 | global batch size: 256 | lm loss: 1.925077E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.901 | TFLOPs: 39.81 | 15: iteration 94640/ 125429 | consumed samples: 24227840 | consumed tokens: 49618616320 | elapsed time per iteration (s): 1.05 | learning rate: 4.595E-05 | global batch size: 256 | lm loss: 1.917736E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.012 | TFLOPs: 40.32 | 15: iteration 94650/ 125429 | consumed samples: 24230400 | consumed tokens: 49623859200 | elapsed time per iteration (s): 1.05 | learning rate: 4.594E-05 | global batch size: 256 | lm loss: 1.914356E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.718 | TFLOPs: 40.44 | 15: iteration 94660/ 125429 | consumed samples: 24232960 | consumed tokens: 49629102080 | elapsed time per iteration (s): 1.06 | learning rate: 4.592E-05 | global batch size: 256 | lm loss: 1.912960E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.680 | TFLOPs: 39.77 | 15: iteration 94670/ 125429 | consumed samples: 24235520 | consumed tokens: 49634344960 | elapsed time per iteration (s): 1.18 | learning rate: 4.590E-05 | global batch 
size: 256 | lm loss: 1.911706E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.356 | TFLOPs: 35.92 | 15: iteration 94680/ 125429 | consumed samples: 24238080 | consumed tokens: 49639587840 | elapsed time per iteration (s): 1.03 | learning rate: 4.589E-05 | global batch size: 256 | lm loss: 1.873517E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.163 | TFLOPs: 41.18 | 15: iteration 94690/ 125429 | consumed samples: 24240640 | consumed tokens: 49644830720 | elapsed time per iteration (s): 1.05 | learning rate: 4.587E-05 | global batch size: 256 | lm loss: 1.904688E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.239 | TFLOPs: 40.36 | 15: iteration 94700/ 125429 | consumed samples: 24243200 | consumed tokens: 49650073600 | elapsed time per iteration (s): 1.02 | learning rate: 4.586E-05 | global batch size: 256 | lm loss: 1.895268E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.988 | TFLOPs: 41.31 | 15: iteration 94710/ 125429 | consumed samples: 24245760 | consumed tokens: 49655316480 | elapsed time per iteration (s): 1.04 | learning rate: 4.584E-05 | global batch size: 256 | lm loss: 1.894558E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.703 | TFLOPs: 40.77 | 15: iteration 94720/ 125429 | consumed samples: 24248320 | consumed tokens: 49660559360 | elapsed time per iteration (s): 1.49 | learning rate: 4.582E-05 | global batch size: 256 | lm loss: 1.944373E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 171.739 | TFLOPs: 28.38 | 15: iteration 94730/ 125429 | consumed samples: 24250880 | 
consumed tokens: 49665802240 | elapsed time per iteration (s): 1.06 | learning rate: 4.581E-05 | global batch size: 256 | lm loss: 1.893182E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.787 | TFLOPs: 39.79 | 15: iteration 94740/ 125429 | consumed samples: 24253440 | consumed tokens: 49671045120 | elapsed time per iteration (s): 1.02 | learning rate: 4.579E-05 | global batch size: 256 | lm loss: 1.913801E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.072 | TFLOPs: 41.49 | 15: iteration 94750/ 125429 | consumed samples: 24256000 | consumed tokens: 49676288000 | elapsed time per iteration (s): 1.19 | learning rate: 4.578E-05 | global batch size: 256 | lm loss: 1.912712E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.929 | TFLOPs: 35.68 | 15: iteration 94760/ 125429 | consumed samples: 24258560 | consumed tokens: 49681530880 | elapsed time per iteration (s): 1.04 | learning rate: 4.576E-05 | global batch size: 256 | lm loss: 1.934264E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.068 | TFLOPs: 40.50 | 15: iteration 94770/ 125429 | consumed samples: 24261120 | consumed tokens: 49686773760 | elapsed time per iteration (s): 1.03 | learning rate: 4.574E-05 | global batch size: 256 | lm loss: 1.922176E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.566 | TFLOPs: 41.24 | 15: iteration 94780/ 125429 | consumed samples: 24263680 | consumed tokens: 49692016640 | elapsed time per iteration (s): 1.03 | learning rate: 4.573E-05 | global batch size: 256 | lm loss: 1.919949E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 249.601 | TFLOPs: 41.25 | 15: iteration 94790/ 125429 | consumed samples: 24266240 | consumed tokens: 49697259520 | elapsed time per iteration (s): 1.05 | learning rate: 4.571E-05 | global batch size: 256 | lm loss: 1.932812E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.726 | TFLOPs: 40.11 | 15: iteration 94800/ 125429 | consumed samples: 24268800 | consumed tokens: 49702502400 | elapsed time per iteration (s): 1.03 | learning rate: 4.570E-05 | global batch size: 256 | lm loss: 1.912052E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.760 | TFLOPs: 41.11 | 15: iteration 94810/ 125429 | consumed samples: 24271360 | consumed tokens: 49707745280 | elapsed time per iteration (s): 1.03 | learning rate: 4.568E-05 | global batch size: 256 | lm loss: 1.922175E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.439 | TFLOPs: 40.89 | 15: iteration 94820/ 125429 | consumed samples: 24273920 | consumed tokens: 49712988160 | elapsed time per iteration (s): 1.04 | learning rate: 4.566E-05 | global batch size: 256 | lm loss: 1.925830E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.631 | TFLOPs: 40.76 | 15: iteration 94830/ 125429 | consumed samples: 24276480 | consumed tokens: 49718231040 | elapsed time per iteration (s): 1.04 | learning rate: 4.565E-05 | global batch size: 256 | lm loss: 1.912304E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.130 | TFLOPs: 40.51 | 15: iteration 94840/ 125429 | consumed samples: 24279040 | consumed tokens: 49723473920 | elapsed time per iteration (s): 1.05 | learning rate: 4.563E-05 | global batch size: 256 | lm loss: 
1.905539E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.727 | TFLOPs: 40.44 | 15: iteration 94850/ 125429 | consumed samples: 24281600 | consumed tokens: 49728716800 | elapsed time per iteration (s): 1.07 | learning rate: 4.562E-05 | global batch size: 256 | lm loss: 1.900731E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.857 | TFLOPs: 39.64 | 15: iteration 94860/ 125429 | consumed samples: 24284160 | consumed tokens: 49733959680 | elapsed time per iteration (s): 1.06 | learning rate: 4.560E-05 | global batch size: 256 | lm loss: 1.905476E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.690 | TFLOPs: 39.94 | 15: iteration 94870/ 125429 | consumed samples: 24286720 | consumed tokens: 49739202560 | elapsed time per iteration (s): 1.03 | learning rate: 4.558E-05 | global batch size: 256 | lm loss: 1.908198E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.399 | TFLOPs: 41.05 | 15: iteration 94880/ 125429 | consumed samples: 24289280 | consumed tokens: 49744445440 | elapsed time per iteration (s): 1.05 | learning rate: 4.557E-05 | global batch size: 256 | lm loss: 1.944499E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.929 | TFLOPs: 40.15 | 15: iteration 94890/ 125429 | consumed samples: 24291840 | consumed tokens: 49749688320 | elapsed time per iteration (s): 1.04 | learning rate: 4.555E-05 | global batch size: 256 | lm loss: 1.961963E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.804 | TFLOPs: 40.62 | 15: iteration 94900/ 125429 | consumed samples: 24294400 | consumed tokens: 
49754931200 | elapsed time per iteration (s): 3.01 | learning rate: 4.554E-05 | global batch size: 256 | lm loss: 1.940008E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 85.175 | TFLOPs: 14.08 | 15: iteration 94910/ 125429 | consumed samples: 24296960 | consumed tokens: 49760174080 | elapsed time per iteration (s): 1.04 | learning rate: 4.552E-05 | global batch size: 256 | lm loss: 1.922207E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.453 | TFLOPs: 40.73 | 15: iteration 94920/ 125429 | consumed samples: 24299520 | consumed tokens: 49765416960 | elapsed time per iteration (s): 1.02 | learning rate: 4.551E-05 | global batch size: 256 | lm loss: 1.934249E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.816 | TFLOPs: 41.28 | 15: iteration 94930/ 125429 | consumed samples: 24302080 | consumed tokens: 49770659840 | elapsed time per iteration (s): 1.03 | learning rate: 4.549E-05 | global batch size: 256 | lm loss: 1.936452E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.283 | TFLOPs: 41.03 | 15: iteration 94940/ 125429 | consumed samples: 24304640 | consumed tokens: 49775902720 | elapsed time per iteration (s): 1.09 | learning rate: 4.547E-05 | global batch size: 256 | lm loss: 1.909521E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.805 | TFLOPs: 38.97 | 15: iteration 94950/ 125429 | consumed samples: 24307200 | consumed tokens: 49781145600 | elapsed time per iteration (s): 1.05 | learning rate: 4.546E-05 | global batch size: 256 | lm loss: 1.931768E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 243.932 | TFLOPs: 40.31 | 15: iteration 94960/ 125429 | consumed samples: 24309760 | consumed tokens: 49786388480 | elapsed time per iteration (s): 1.05 | learning rate: 4.544E-05 | global batch size: 256 | lm loss: 1.934399E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.975 | TFLOPs: 40.32 | 15: iteration 94970/ 125429 | consumed samples: 24312320 | consumed tokens: 49791631360 | elapsed time per iteration (s): 1.05 | learning rate: 4.543E-05 | global batch size: 256 | lm loss: 1.913771E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.411 | TFLOPs: 40.39 | 15: iteration 94980/ 125429 | consumed samples: 24314880 | consumed tokens: 49796874240 | elapsed time per iteration (s): 1.04 | learning rate: 4.541E-05 | global batch size: 256 | lm loss: 1.908044E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.824 | TFLOPs: 40.62 | 15: iteration 94990/ 125429 | consumed samples: 24317440 | consumed tokens: 49802117120 | elapsed time per iteration (s): 1.04 | learning rate: 4.539E-05 | global batch size: 256 | lm loss: 1.907227E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.570 | TFLOPs: 40.58 | 15: iteration 95000/ 125429 | consumed samples: 24320000 | consumed tokens: 49807360000 | elapsed time per iteration (s): 1.05 | learning rate: 4.538E-05 | global batch size: 256 | lm loss: 1.917522E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.938 | TFLOPs: 40.48 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 95000 | lm loss value: 1.828116E+00 | lm loss PPL: 6.222153E+00 | 15: 
------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 95000 to checkpoints_1b5 0: [2022-11-27 00:11:04,534] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step95000 is begin to save! 0: [2022-11-27 00:11:04,540] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_01-model_00-model_states.pt... 0: [2022-11-27 00:11:04,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_01-model_00-model_states.pt. 0: [2022-11-27 00:11:04,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_03-model_00-model_states.pt... 0: [2022-11-27 00:11:04,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_03-model_00-model_states.pt. 0: [2022-11-27 00:11:04,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_04-model_00-model_states.pt... 0: [2022-11-27 00:11:04,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_04-model_00-model_states.pt. 0: [2022-11-27 00:11:04,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_05-model_00-model_states.pt... 0: [2022-11-27 00:11:05,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_05-model_00-model_states.pt. 0: [2022-11-27 00:11:05,091] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_06-model_00-model_states.pt... 0: [2022-11-27 00:11:05,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_06-model_00-model_states.pt. 0: [2022-11-27 00:11:05,196] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 00:11:05,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_07-model_00-model_states.pt. 0: [2022-11-27 00:11:05,299] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_08-model_00-model_states.pt... 0: [2022-11-27 00:11:05,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_08-model_00-model_states.pt. 0: [2022-11-27 00:11:05,404] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_09-model_00-model_states.pt... 0: [2022-11-27 00:11:05,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_09-model_00-model_states.pt. 0: [2022-11-27 00:11:05,507] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_10-model_00-model_states.pt... 0: [2022-11-27 00:11:05,608] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_10-model_00-model_states.pt. 0: [2022-11-27 00:11:05,608] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_11-model_00-model_states.pt... 0: [2022-11-27 00:11:05,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_11-model_00-model_states.pt. 0: [2022-11-27 00:11:05,716] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_12-model_00-model_states.pt... 0: [2022-11-27 00:11:05,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_12-model_00-model_states.pt. 0: [2022-11-27 00:11:05,821] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 00:11:05,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_13-model_00-model_states.pt. 0: [2022-11-27 00:11:05,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_14-model_00-model_states.pt... 0: [2022-11-27 00:11:06,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_14-model_00-model_states.pt. 0: [2022-11-27 00:11:06,043] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_15-model_00-model_states.pt... 0: [2022-11-27 00:11:06,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_15-model_00-model_states.pt. 0: [2022-11-27 00:11:06,155] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_16-model_00-model_states.pt... 0: [2022-11-27 00:11:06,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_16-model_00-model_states.pt. 0: [2022-11-27 00:11:06,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_17-model_00-model_states.pt... 0: [2022-11-27 00:11:06,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_17-model_00-model_states.pt. 0: [2022-11-27 00:11:06,383] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_18-model_00-model_states.pt... 0: [2022-11-27 00:11:06,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_18-model_00-model_states.pt. 0: [2022-11-27 00:11:06,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 00:11:06,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_19-model_00-model_states.pt. 0: [2022-11-27 00:11:06,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_20-model_00-model_states.pt... 0: [2022-11-27 00:11:06,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_20-model_00-model_states.pt. 0: [2022-11-27 00:11:06,717] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_21-model_00-model_states.pt... 0: [2022-11-27 00:11:06,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_21-model_00-model_states.pt. 0: [2022-11-27 00:11:06,829] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_22-model_00-model_states.pt... 0: [2022-11-27 00:11:06,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_22-model_00-model_states.pt. 0: [2022-11-27 00:11:06,943] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_23-model_00-model_states.pt... 0: [2022-11-27 00:11:07,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_23-model_00-model_states.pt. 0: [2022-11-27 00:11:07,052] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_24-model_00-model_states.pt... 0: [2022-11-27 00:11:07,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_24-model_00-model_states.pt. 0: [2022-11-27 00:11:07,163] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 00:11:07,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_25-model_00-model_states.pt. 0: [2022-11-27 00:11:07,275] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_26-model_00-model_states.pt... 0: [2022-11-27 00:11:07,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_26-model_00-model_states.pt. 0: [2022-11-27 00:11:07,386] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_27-model_00-model_states.pt... 0: [2022-11-27 00:11:07,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_27-model_00-model_states.pt. 0: [2022-11-27 00:11:07,497] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_28-model_00-model_states.pt... 0: [2022-11-27 00:11:07,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_28-model_00-model_states.pt. 0: [2022-11-27 00:11:07,601] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_29-model_00-model_states.pt... 0: [2022-11-27 00:11:07,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_29-model_00-model_states.pt. 0: [2022-11-27 00:11:07,709] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_30-model_00-model_states.pt... 0: [2022-11-27 00:11:07,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_30-model_00-model_states.pt. 0: [2022-11-27 00:11:07,818] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/layer_32-model_00-model_states.pt... 
0: [2022-11-27 00:11:07,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/layer_32-model_00-model_states.pt. 0: [2022-11-27 00:11:07,823] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step95000/mp_rank_00_model_states.pt 0: [2022-11-27 00:11:07,824] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/mp_rank_00_model_states.pt... 0: [2022-11-27 00:11:07,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/mp_rank_00_model_states.pt. 0: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
4: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 
7: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
12: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
5: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
14: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
6: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
8: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
13: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:11:07,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step95000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:11:08,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
2: [2022-11-27 00:11:08,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:11:08,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 00:11:08,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 7: [2022-11-27 00:11:08,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:11:08,026] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 00:11:08,026] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 11: [2022-11-27 00:11:08,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:11:08,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:11:08,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 00:11:08,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 2: [2022-11-27 00:11:08,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-27 00:11:08,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 00:11:08,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 2: [2022-11-27 00:11:08,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:11:08,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 00:11:08,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 12: [2022-11-27 00:11:08,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:11:08,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:11:08,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:11:08,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-27 00:11:08,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 00:11:08,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 00:11:08,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 00:11:08,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 00:11:08,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 12: [2022-11-27 00:11:08,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 12: [2022-11-27 00:11:08,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 12: [2022-11-27 00:11:08,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 1: [2022-11-27 00:11:08,025] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 00:11:08,025] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 1: [2022-11-27 00:11:08,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-27 00:11:08,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 00:11:08,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 5: [2022-11-27 00:11:08,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:11:08,037] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 00:11:08,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 9: [2022-11-27 00:11:08,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:11:08,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 00:11:08,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 6: [2022-11-27 00:11:08,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:11:08,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-27 00:11:08,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 00:11:08,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 00:11:08,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 6: [2022-11-27 00:11:08,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 15: [2022-11-27 00:11:08,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:11:08,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:11:08,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:11:08,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 00:11:08,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 14: [2022-11-27 00:11:08,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:11:08,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
14: [2022-11-27 00:11:08,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 00:11:08,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 14: [2022-11-27 00:11:08,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:11:08,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 2: [2022-11-27 00:11:08,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 14: [2022-11-27 00:11:08,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 14: [2022-11-27 00:11:08,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:11:08,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 14: [2022-11-27 00:11:08,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 00:11:08,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
15: [2022-11-27 00:11:08,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 00:11:08,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 00:11:08,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 15: [2022-11-27 00:11:08,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:11:08,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 15: [2022-11-27 00:11:08,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 00:11:08,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 15: [2022-11-27 00:11:08,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:11:08,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:11:08,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:11:08,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 00:11:08,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
15: [2022-11-27 00:11:08,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 5: [2022-11-27 00:11:08,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 15: [2022-11-27 00:11:08,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 5: [2022-11-27 00:11:08,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 10: [2022-11-27 00:11:08,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:11:08,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:11:08,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 10: [2022-11-27 00:11:08,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 15: [2022-11-27 00:11:08,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 10: [2022-11-27 00:11:08,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 5: [2022-11-27 00:11:08,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:11:08,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
5: [2022-11-27 00:11:08,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 00:11:08,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 9: [2022-11-27 00:11:08,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 00:11:08,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 5: [2022-11-27 00:11:08,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:11:08,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 00:11:08,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 2: [2022-11-27 00:11:08,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:11:08,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 00:11:08,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 9: [2022-11-27 00:11:08,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-27 00:11:08,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 00:11:08,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 12: [2022-11-27 00:11:08,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:11:08,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:11:08,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 10: [2022-11-27 00:11:08,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:11:08,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 00:11:08,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 10: [2022-11-27 00:11:08,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:11:08,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
10: [2022-11-27 00:11:08,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 00:11:08,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 00:11:08,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 10: [2022-11-27 00:11:08,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 2: [2022-11-27 00:11:08,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:11:08,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 00:11:08,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 10: [2022-11-27 00:11:08,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:11:08,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:11:08,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 00:11:08,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
10: [2022-11-27 00:11:08,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 00:11:08,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 6: [2022-11-27 00:11:08,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:11:08,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 00:11:08,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 9: [2022-11-27 00:11:08,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:11:08,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 00:11:08,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 7: [2022-11-27 00:11:08,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:11:08,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 00:11:08,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 5: [2022-11-27 00:11:08,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-27 00:11:08,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 00:11:08,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 10: [2022-11-27 00:11:08,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:11:08,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 7: [2022-11-27 00:11:08,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:11:08,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 9: [2022-11-27 00:11:08,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:11:08,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 9: [2022-11-27 00:11:08,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 7: [2022-11-27 00:11:08,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 5: [2022-11-27 00:11:08,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:11:08,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
5: [2022-11-27 00:11:08,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 00:11:08,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 1: [2022-11-27 00:11:08,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:11:08,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 00:11:08,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 7: [2022-11-27 00:11:08,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:11:08,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 00:11:08,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 1: [2022-11-27 00:11:08,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:11:08,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 00:11:08,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 7: [2022-11-27 00:11:08,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-27 00:11:08,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 00:11:08,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 1: [2022-11-27 00:11:08,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:11:08,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 00:11:08,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 1: [2022-11-27 00:11:08,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:11:08,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:11:08,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:11:08,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 00:11:08,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 15: [2022-11-27 00:11:08,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 1: [2022-11-27 00:11:08,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
15: [2022-11-27 00:11:08,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 1: [2022-11-27 00:11:08,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 7: [2022-11-27 00:11:08,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:11:08,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 00:11:08,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 10: [2022-11-27 00:11:08,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:11:08,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 00:11:08,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 9: [2022-11-27 00:11:08,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:11:08,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 00:11:08,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 6: [2022-11-27 00:11:08,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-27 00:11:08,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 00:11:08,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 6: [2022-11-27 00:11:08,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:11:08,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 00:11:08,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 2: [2022-11-27 00:11:08,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:11:08,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 00:11:08,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:11:08,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 2: [2022-11-27 00:11:08,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 00:11:08,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 6: [2022-11-27 00:11:08,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-27 00:11:08,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 00:11:08,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 6: [2022-11-27 00:11:08,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:11:08,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 00:11:08,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 0: [2022-11-27 00:11:08,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:11:08,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:11:08,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 00:11:08,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 0: [2022-11-27 00:11:08,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:11:08,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 00:11:08,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
0: [2022-11-27 00:11:08,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:11:08,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 00:11:08,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 5: [2022-11-27 00:11:08,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:11:08,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 00:11:08,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 12: [2022-11-27 00:11:08,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:11:08,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:11:08,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 00:11:08,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 00:11:08,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 12: [2022-11-27 00:11:08,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
7: [2022-11-27 00:11:08,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:11:08,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 00:11:08,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 9: [2022-11-27 00:11:08,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:11:08,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 00:11:08,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 9: [2022-11-27 00:11:08,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:11:08,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 00:11:08,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 5: [2022-11-27 00:11:08,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:11:08,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 00:11:08,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
15: [2022-11-27 00:11:08,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:11:08,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 00:11:08,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 1: [2022-11-27 00:11:08,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:11:08,076] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 00:11:08,076] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 8: [2022-11-27 00:11:08,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:11:08,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 00:11:08,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 8: [2022-11-27 00:11:08,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:11:08,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 00:11:08,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
8: [2022-11-27 00:11:08,041] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:11:08,041] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 00:11:08,041] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 8: [2022-11-27 00:11:08,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:11:08,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 00:11:08,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 6: [2022-11-27 00:11:08,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:11:08,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:11:08,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 00:11:08,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 6: [2022-11-27 00:11:08,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 8: [2022-11-27 00:11:08,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-27 00:11:08,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 00:11:08,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 6: [2022-11-27 00:11:08,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 8: [2022-11-27 00:11:08,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:11:08,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 00:11:08,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 8: [2022-11-27 00:11:08,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:11:08,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 00:11:08,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 11: [2022-11-27 00:11:08,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 00:11:08,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 11: [2022-11-27 00:11:08,040] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-27 00:11:08,040] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 00:11:08,040] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 11: [2022-11-27 00:11:08,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:11:08,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 00:11:08,042] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 11: [2022-11-27 00:11:08,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:11:08,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 00:11:08,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 11: [2022-11-27 00:11:08,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:11:08,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 00:11:08,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 11: [2022-11-27 00:11:08,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-27 00:11:08,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 00:11:08,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 11: [2022-11-27 00:11:08,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:11:08,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 00:11:08,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 4: [2022-11-27 00:11:08,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:11:08,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:11:08,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:11:08,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:11:08,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:11:08,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-27 00:11:08,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 00:11:08,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 00:11:08,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:11:08,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:11:08,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 00:11:08,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 00:11:08,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 4: [2022-11-27 00:11:08,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 4: [2022-11-27 00:11:08,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 00:11:08,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 00:11:08,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 4: [2022-11-27 00:11:08,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
4: [2022-11-27 00:11:08,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 00:11:08,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 00:11:08,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 4: [2022-11-27 00:11:08,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 4: [2022-11-27 00:11:08,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 4: [2022-11-27 00:11:08,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 10: [2022-11-27 00:11:08,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:11:08,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 00:11:08,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 14: [2022-11-27 00:11:08,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:11:08,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 00:11:08,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 14: [2022-11-27 00:11:08,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-27 00:11:08,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 00:11:08,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 14: [2022-11-27 00:11:08,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:11:08,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 00:11:08,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:11:08,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 14: [2022-11-27 00:11:08,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 00:11:08,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 14: [2022-11-27 00:11:08,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:11:08,089] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 00:11:08,089] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 13: [2022-11-27 00:11:08,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-27 00:11:08,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 00:11:08,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 13: [2022-11-27 00:11:08,090] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:11:08,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 00:11:08,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 13: [2022-11-27 00:11:08,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:11:08,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 00:11:08,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 13: [2022-11-27 00:11:08,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:11:08,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 00:11:08,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 13: [2022-11-27 00:11:08,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-27 00:11:08,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 00:11:08,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 13: [2022-11-27 00:11:08,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:11:08,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:11:08,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:11:08,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 00:11:08,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 00:11:08,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 00:11:08,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 13: [2022-11-27 00:11:08,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 13: [2022-11-27 00:11:08,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 0: [2022-11-27 00:11:08,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-27 00:11:08,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 00:11:08,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 0: [2022-11-27 00:11:08,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:11:08,129] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 00:11:08,129] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 0: [2022-11-27 00:11:08,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:11:08,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 00:11:08,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 11: [2022-11-27 00:11:08,124] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:11:08,124] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 00:11:08,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 
0: [2022-11-27 00:11:08,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 00:11:08,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 3: [2022-11-27 00:11:08,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:11:08,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:11:08,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:11:08,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 00:11:08,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 00:11:08,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 00:11:08,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 3: [2022-11-27 00:11:08,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 3: [2022-11-27 00:11:08,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 3: [2022-11-27 00:11:08,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-27 00:11:08,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:11:08,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 00:11:08,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 00:11:08,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 3: [2022-11-27 00:11:08,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 3: [2022-11-27 00:11:08,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:11:08,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 00:11:08,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:11:08,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 3: [2022-11-27 00:11:08,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 00:11:08,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 3: [2022-11-27 00:11:08,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-27 00:11:08,293] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step95000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 00:11:08,293] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step95000 is ready now! 0: successfully saved checkpoint at iteration 95000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3767.09 15: iteration 95010/ 125429 | consumed samples: 24322560 | consumed tokens: 49812602880 | elapsed time per iteration (s): 1.47 | learning rate: 4.536E-05 | global batch size: 256 | lm loss: 1.920743E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 173.576 | TFLOPs: 28.68 | 15: iteration 95020/ 125429 | consumed samples: 24325120 | consumed tokens: 49817845760 | elapsed time per iteration (s): 1.05 | learning rate: 4.535E-05 | global batch size: 256 | lm loss: 1.905273E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.658 | TFLOPs: 40.43 | 15: iteration 95030/ 125429 | consumed samples: 24327680 | consumed tokens: 49823088640 | elapsed time per iteration (s): 1.03 | learning rate: 4.533E-05 | global batch size: 256 | lm loss: 1.928726E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.821 | TFLOPs: 40.95 | 15: iteration 95040/ 125429 | consumed samples: 24330240 | consumed tokens: 49828331520 | elapsed time per iteration (s): 1.08 | learning rate: 4.532E-05 | global batch size: 256 | lm loss: 1.939750E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.933 | TFLOPs: 39.15 | 15: iteration 95050/ 125429 | consumed samples: 24332800 | consumed tokens: 49833574400 | elapsed time per iteration (s): 1.07 | learning rate: 4.530E-05 | global batch size: 
256 | lm loss: 1.902774E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.587 | TFLOPs: 39.43 | 15: iteration 95060/ 125429 | consumed samples: 24335360 | consumed tokens: 49838817280 | elapsed time per iteration (s): 1.04 | learning rate: 4.528E-05 | global batch size: 256 | lm loss: 1.883412E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.198 | TFLOPs: 40.69 | 15: iteration 95070/ 125429 | consumed samples: 24337920 | consumed tokens: 49844060160 | elapsed time per iteration (s): 1.06 | learning rate: 4.527E-05 | global batch size: 256 | lm loss: 1.900139E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.950 | TFLOPs: 39.82 | 15: iteration 95080/ 125429 | consumed samples: 24340480 | consumed tokens: 49849303040 | elapsed time per iteration (s): 1.05 | learning rate: 4.525E-05 | global batch size: 256 | lm loss: 1.916328E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.293 | TFLOPs: 40.37 | 15: iteration 95090/ 125429 | consumed samples: 24343040 | consumed tokens: 49854545920 | elapsed time per iteration (s): 1.03 | learning rate: 4.524E-05 | global batch size: 256 | lm loss: 1.917110E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.800 | TFLOPs: 41.12 | 15: iteration 95100/ 125429 | consumed samples: 24345600 | consumed tokens: 49859788800 | elapsed time per iteration (s): 1.04 | learning rate: 4.522E-05 | global batch size: 256 | lm loss: 1.945728E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.873 | TFLOPs: 40.63 | 15: iteration 95110/ 125429 | consumed samples: 24348160 | consumed 
tokens: 49865031680 | elapsed time per iteration (s): 1.04 | learning rate: 4.520E-05 | global batch size: 256 | lm loss: 1.945932E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.176 | TFLOPs: 40.85 | 15: iteration 95120/ 125429 | consumed samples: 24350720 | consumed tokens: 49870274560 | elapsed time per iteration (s): 1.04 | learning rate: 4.519E-05 | global batch size: 256 | lm loss: 1.940946E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.468 | TFLOPs: 40.57 | 15: iteration 95130/ 125429 | consumed samples: 24353280 | consumed tokens: 49875517440 | elapsed time per iteration (s): 1.03 | learning rate: 4.517E-05 | global batch size: 256 | lm loss: 1.933562E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.765 | TFLOPs: 41.11 | 15: iteration 95140/ 125429 | consumed samples: 24355840 | consumed tokens: 49880760320 | elapsed time per iteration (s): 1.02 | learning rate: 4.516E-05 | global batch size: 256 | lm loss: 1.914038E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.990 | TFLOPs: 41.48 | 15: iteration 95150/ 125429 | consumed samples: 24358400 | consumed tokens: 49886003200 | elapsed time per iteration (s): 1.02 | learning rate: 4.514E-05 | global batch size: 256 | lm loss: 1.902863E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.867 | TFLOPs: 41.29 | 15: iteration 95160/ 125429 | consumed samples: 24360960 | consumed tokens: 49891246080 | elapsed time per iteration (s): 1.03 | learning rate: 4.513E-05 | global batch size: 256 | lm loss: 1.912935E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 248.315 | TFLOPs: 41.04 | 15: iteration 95170/ 125429 | consumed samples: 24363520 | consumed tokens: 49896488960 | elapsed time per iteration (s): 1.03 | learning rate: 4.511E-05 | global batch size: 256 | lm loss: 1.932291E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.707 | TFLOPs: 41.10 | 15: iteration 95180/ 125429 | consumed samples: 24366080 | consumed tokens: 49901731840 | elapsed time per iteration (s): 1.04 | learning rate: 4.509E-05 | global batch size: 256 | lm loss: 1.917585E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.324 | TFLOPs: 40.71 | 15: iteration 95190/ 125429 | consumed samples: 24368640 | consumed tokens: 49906974720 | elapsed time per iteration (s): 1.04 | learning rate: 4.508E-05 | global batch size: 256 | lm loss: 1.929360E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.138 | TFLOPs: 40.68 | 15: iteration 95200/ 125429 | consumed samples: 24371200 | consumed tokens: 49912217600 | elapsed time per iteration (s): 1.02 | learning rate: 4.506E-05 | global batch size: 256 | lm loss: 1.909388E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.879 | TFLOPs: 41.46 | 15: iteration 95210/ 125429 | consumed samples: 24373760 | consumed tokens: 49917460480 | elapsed time per iteration (s): 1.04 | learning rate: 4.505E-05 | global batch size: 256 | lm loss: 1.927550E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.694 | TFLOPs: 40.60 | 15: iteration 95220/ 125429 | consumed samples: 24376320 | consumed tokens: 49922703360 | elapsed time per iteration (s): 1.03 | learning rate: 4.503E-05 | global batch size: 256 | lm loss: 1.930970E+00 | 
grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.535 | TFLOPs: 41.07 | 15: iteration 95230/ 125429 | consumed samples: 24378880 | consumed tokens: 49927946240 | elapsed time per iteration (s): 1.03 | learning rate: 4.502E-05 | global batch size: 256 | lm loss: 1.902732E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.084 | TFLOPs: 41.16 | 15: iteration 95240/ 125429 | consumed samples: 24381440 | consumed tokens: 49933189120 | elapsed time per iteration (s): 1.04 | learning rate: 4.500E-05 | global batch size: 256 | lm loss: 1.922602E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.335 | TFLOPs: 40.87 | 15: iteration 95250/ 125429 | consumed samples: 24384000 | consumed tokens: 49938432000 | elapsed time per iteration (s): 1.04 | learning rate: 4.498E-05 | global batch size: 256 | lm loss: 1.924706E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.031 | TFLOPs: 40.66 | 15: iteration 95260/ 125429 | consumed samples: 24386560 | consumed tokens: 49943674880 | elapsed time per iteration (s): 1.04 | learning rate: 4.497E-05 | global batch size: 256 | lm loss: 1.906306E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.585 | TFLOPs: 40.75 | 15: iteration 95270/ 125429 | consumed samples: 24389120 | consumed tokens: 49948917760 | elapsed time per iteration (s): 1.03 | learning rate: 4.495E-05 | global batch size: 256 | lm loss: 1.895890E+00 | grad norm: 0.141 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.483 | TFLOPs: 41.06 | 15: iteration 95280/ 125429 | consumed samples: 24391680 | consumed tokens: 49954160640 | elapsed 
time per iteration (s): 1.03 | learning rate: 4.494E-05 | global batch size: 256 | lm loss: 1.946090E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.380 | TFLOPs: 41.05 | 15: iteration 95290/ 125429 | consumed samples: 24394240 | consumed tokens: 49959403520 | elapsed time per iteration (s): 1.03 | learning rate: 4.492E-05 | global batch size: 256 | lm loss: 1.887840E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.918 | TFLOPs: 41.14 | 15: iteration 95300/ 125429 | consumed samples: 24396800 | consumed tokens: 49964646400 | elapsed time per iteration (s): 1.04 | learning rate: 4.490E-05 | global batch size: 256 | lm loss: 1.931958E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.088 | TFLOPs: 40.83 | 15: iteration 95310/ 125429 | consumed samples: 24399360 | consumed tokens: 49969889280 | elapsed time per iteration (s): 1.05 | learning rate: 4.489E-05 | global batch size: 256 | lm loss: 1.909228E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.660 | TFLOPs: 40.43 | 15: iteration 95320/ 125429 | consumed samples: 24401920 | consumed tokens: 49975132160 | elapsed time per iteration (s): 1.07 | learning rate: 4.487E-05 | global batch size: 256 | lm loss: 1.897824E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.750 | TFLOPs: 39.62 | 15: iteration 95330/ 125429 | consumed samples: 24404480 | consumed tokens: 49980375040 | elapsed time per iteration (s): 1.33 | learning rate: 4.486E-05 | global batch size: 256 | lm loss: 1.925898E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 192.152 | TFLOPs: 
31.75 | 15: iteration 95340/ 125429 | consumed samples: 24407040 | consumed tokens: 49985617920 | elapsed time per iteration (s): 1.03 | learning rate: 4.484E-05 | global batch size: 256 | lm loss: 1.936407E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.494 | TFLOPs: 41.23 | 15: iteration 95350/ 125429 | consumed samples: 24409600 | consumed tokens: 49990860800 | elapsed time per iteration (s): 1.03 | learning rate: 4.483E-05 | global batch size: 256 | lm loss: 1.890130E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.697 | TFLOPs: 41.10 | 15: iteration 95360/ 125429 | consumed samples: 24412160 | consumed tokens: 49996103680 | elapsed time per iteration (s): 1.05 | learning rate: 4.481E-05 | global batch size: 256 | lm loss: 1.881464E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.683 | TFLOPs: 40.44 | 15: iteration 95370/ 125429 | consumed samples: 24414720 | consumed tokens: 50001346560 | elapsed time per iteration (s): 1.05 | learning rate: 4.479E-05 | global batch size: 256 | lm loss: 1.925216E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.859 | TFLOPs: 40.13 | 15: iteration 95380/ 125429 | consumed samples: 24417280 | consumed tokens: 50006589440 | elapsed time per iteration (s): 1.26 | learning rate: 4.478E-05 | global batch size: 256 | lm loss: 1.909730E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 203.649 | TFLOPs: 33.65 | 15: iteration 95390/ 125429 | consumed samples: 24419840 | consumed tokens: 50011832320 | elapsed time per iteration (s): 1.03 | learning rate: 4.476E-05 | global batch size: 256 | lm loss: 1.934093E+00 | grad norm: 0.160 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.111 | TFLOPs: 41.17 | 15: iteration 95400/ 125429 | consumed samples: 24422400 | consumed tokens: 50017075200 | elapsed time per iteration (s): 1.07 | learning rate: 4.475E-05 | global batch size: 256 | lm loss: 1.907922E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.423 | TFLOPs: 39.57 | 15: iteration 95410/ 125429 | consumed samples: 24424960 | consumed tokens: 50022318080 | elapsed time per iteration (s): 1.08 | learning rate: 4.473E-05 | global batch size: 256 | lm loss: 1.891475E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.693 | TFLOPs: 39.28 | 15: iteration 95420/ 125429 | consumed samples: 24427520 | consumed tokens: 50027560960 | elapsed time per iteration (s): 1.03 | learning rate: 4.472E-05 | global batch size: 256 | lm loss: 1.909748E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.825 | TFLOPs: 41.12 | 15: iteration 95430/ 125429 | consumed samples: 24430080 | consumed tokens: 50032803840 | elapsed time per iteration (s): 1.04 | learning rate: 4.470E-05 | global batch size: 256 | lm loss: 1.891168E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.517 | TFLOPs: 40.57 | 15: iteration 95440/ 125429 | consumed samples: 24432640 | consumed tokens: 50038046720 | elapsed time per iteration (s): 1.04 | learning rate: 4.469E-05 | global batch size: 256 | lm loss: 1.925902E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.714 | TFLOPs: 40.77 | 15: iteration 95450/ 125429 | consumed samples: 24435200 | consumed tokens: 50043289600 | elapsed time per iteration (s): 1.18 | 
learning rate: 4.467E-05 | global batch size: 256 | lm loss: 1.912179E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.467 | TFLOPs: 35.77 | 15: iteration 95460/ 125429 | consumed samples: 24437760 | consumed tokens: 50048532480 | elapsed time per iteration (s): 1.04 | learning rate: 4.465E-05 | global batch size: 256 | lm loss: 1.936296E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.724 | TFLOPs: 40.61 | 15: iteration 95470/ 125429 | consumed samples: 24440320 | consumed tokens: 50053775360 | elapsed time per iteration (s): 1.02 | learning rate: 4.464E-05 | global batch size: 256 | lm loss: 1.934583E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.599 | TFLOPs: 41.41 | 15: iteration 95480/ 125429 | consumed samples: 24442880 | consumed tokens: 50059018240 | elapsed time per iteration (s): 1.03 | learning rate: 4.462E-05 | global batch size: 256 | lm loss: 1.908130E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.971 | TFLOPs: 41.14 | 15: iteration 95490/ 125429 | consumed samples: 24445440 | consumed tokens: 50064261120 | elapsed time per iteration (s): 1.05 | learning rate: 4.461E-05 | global batch size: 256 | lm loss: 1.901632E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.236 | TFLOPs: 40.20 | 15: iteration 95500/ 125429 | consumed samples: 24448000 | consumed tokens: 50069504000 | elapsed time per iteration (s): 1.04 | learning rate: 4.459E-05 | global batch size: 256 | lm loss: 1.939758E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.626 | TFLOPs: 40.76 | 15: iteration 95510/ 
125429 | consumed samples: 24450560 | consumed tokens: 50074746880 | elapsed time per iteration (s): 1.04 | learning rate: 4.458E-05 | global batch size: 256 | lm loss: 1.924595E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.679 | TFLOPs: 40.60 | 15: iteration 95520/ 125429 | consumed samples: 24453120 | consumed tokens: 50079989760 | elapsed time per iteration (s): 1.04 | learning rate: 4.456E-05 | global batch size: 256 | lm loss: 1.951326E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.287 | TFLOPs: 40.87 | 15: iteration 95530/ 125429 | consumed samples: 24455680 | consumed tokens: 50085232640 | elapsed time per iteration (s): 1.04 | learning rate: 4.454E-05 | global batch size: 256 | lm loss: 1.939211E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.022 | TFLOPs: 40.82 | 15: iteration 95540/ 125429 | consumed samples: 24458240 | consumed tokens: 50090475520 | elapsed time per iteration (s): 1.03 | learning rate: 4.453E-05 | global batch size: 256 | lm loss: 1.900307E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.594 | TFLOPs: 41.25 | 15: iteration 95550/ 125429 | consumed samples: 24460800 | consumed tokens: 50095718400 | elapsed time per iteration (s): 1.04 | learning rate: 4.451E-05 | global batch size: 256 | lm loss: 1.904973E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.424 | TFLOPs: 40.56 | 15: iteration 95560/ 125429 | consumed samples: 24463360 | consumed tokens: 50100961280 | elapsed time per iteration (s): 1.04 | learning rate: 4.450E-05 | global batch size: 256 | lm loss: 1.916953E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 245.501 | TFLOPs: 40.57 | 15: iteration 95570/ 125429 | consumed samples: 24465920 | consumed tokens: 50106204160 | elapsed time per iteration (s): 1.06 | learning rate: 4.448E-05 | global batch size: 256 | lm loss: 1.920431E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.651 | TFLOPs: 40.10 | 15: iteration 95580/ 125429 | consumed samples: 24468480 | consumed tokens: 50111447040 | elapsed time per iteration (s): 1.06 | learning rate: 4.447E-05 | global batch size: 256 | lm loss: 1.902509E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.042 | TFLOPs: 40.00 | 15: iteration 95590/ 125429 | consumed samples: 24471040 | consumed tokens: 50116689920 | elapsed time per iteration (s): 1.05 | learning rate: 4.445E-05 | global batch size: 256 | lm loss: 1.912065E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.245 | TFLOPs: 40.20 | 15: iteration 95600/ 125429 | consumed samples: 24473600 | consumed tokens: 50121932800 | elapsed time per iteration (s): 1.04 | learning rate: 4.444E-05 | global batch size: 256 | lm loss: 1.896466E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.765 | TFLOPs: 40.78 | 15: iteration 95610/ 125429 | consumed samples: 24476160 | consumed tokens: 50127175680 | elapsed time per iteration (s): 1.03 | learning rate: 4.442E-05 | global batch size: 256 | lm loss: 1.907536E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.553 | TFLOPs: 40.91 | 15: iteration 95620/ 125429 | consumed samples: 24478720 | consumed tokens: 50132418560 | elapsed time per iteration (s): 1.18 | learning rate: 
4.440E-05 | global batch size: 256 | lm loss: 1.917629E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.342 | TFLOPs: 35.75 | 15: iteration 95630/ 125429 | consumed samples: 24481280 | consumed tokens: 50137661440 | elapsed time per iteration (s): 1.05 | learning rate: 4.439E-05 | global batch size: 256 | lm loss: 1.922865E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.810 | TFLOPs: 40.46 | 15: iteration 95640/ 125429 | consumed samples: 24483840 | consumed tokens: 50142904320 | elapsed time per iteration (s): 1.05 | learning rate: 4.437E-05 | global batch size: 256 | lm loss: 1.920781E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.397 | TFLOPs: 40.39 | 15: iteration 95650/ 125429 | consumed samples: 24486400 | consumed tokens: 50148147200 | elapsed time per iteration (s): 1.05 | learning rate: 4.436E-05 | global batch size: 256 | lm loss: 1.927403E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.678 | TFLOPs: 40.10 | 15: iteration 95660/ 125429 | consumed samples: 24488960 | consumed tokens: 50153390080 | elapsed time per iteration (s): 1.02 | learning rate: 4.434E-05 | global batch size: 256 | lm loss: 1.891596E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.353 | TFLOPs: 41.37 | 15: iteration 95670/ 125429 | consumed samples: 24491520 | consumed tokens: 50158632960 | elapsed time per iteration (s): 1.06 | learning rate: 4.433E-05 | global batch size: 256 | lm loss: 1.897131E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.917 | TFLOPs: 39.81 | 15: iteration 95680/ 125429 | 
consumed samples: 24494080 | consumed tokens: 50163875840 | elapsed time per iteration (s): 1.05 | learning rate: 4.431E-05 | global batch size: 256 | lm loss: 1.928704E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.444 | TFLOPs: 40.40 | 15: iteration 95690/ 125429 | consumed samples: 24496640 | consumed tokens: 50169118720 | elapsed time per iteration (s): 1.05 | learning rate: 4.429E-05 | global batch size: 256 | lm loss: 1.916002E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.507 | TFLOPs: 40.41 | 15: iteration 95700/ 125429 | consumed samples: 24499200 | consumed tokens: 50174361600 | elapsed time per iteration (s): 1.03 | learning rate: 4.428E-05 | global batch size: 256 | lm loss: 1.953323E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.002 | TFLOPs: 41.15 | 15: iteration 95710/ 125429 | consumed samples: 24501760 | consumed tokens: 50179604480 | elapsed time per iteration (s): 1.10 | learning rate: 4.426E-05 | global batch size: 256 | lm loss: 1.919062E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.145 | TFLOPs: 38.53 | 15: iteration 95720/ 125429 | consumed samples: 24504320 | consumed tokens: 50184847360 | elapsed time per iteration (s): 1.61 | learning rate: 4.425E-05 | global batch size: 256 | lm loss: 1.913765E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 158.643 | TFLOPs: 26.22 | 15: iteration 95730/ 125429 | consumed samples: 24506880 | consumed tokens: 50190090240 | elapsed time per iteration (s): 1.03 | learning rate: 4.423E-05 | global batch size: 256 | lm loss: 1.885843E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 248.049 | TFLOPs: 40.99 | 15: iteration 95740/ 125429 | consumed samples: 24509440 | consumed tokens: 50195333120 | elapsed time per iteration (s): 1.20 | learning rate: 4.422E-05 | global batch size: 256 | lm loss: 1.915973E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 212.480 | TFLOPs: 35.11 | 15: iteration 95750/ 125429 | consumed samples: 24512000 | consumed tokens: 50200576000 | elapsed time per iteration (s): 1.04 | learning rate: 4.420E-05 | global batch size: 256 | lm loss: 1.889742E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.761 | TFLOPs: 40.61 | 15: iteration 95760/ 125429 | consumed samples: 24514560 | consumed tokens: 50205818880 | elapsed time per iteration (s): 1.05 | learning rate: 4.419E-05 | global batch size: 256 | lm loss: 1.894583E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.749 | TFLOPs: 40.12 | 15: iteration 95770/ 125429 | consumed samples: 24517120 | consumed tokens: 50211061760 | elapsed time per iteration (s): 1.10 | learning rate: 4.417E-05 | global batch size: 256 | lm loss: 1.908811E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.693 | TFLOPs: 38.45 | 15: iteration 95780/ 125429 | consumed samples: 24519680 | consumed tokens: 50216304640 | elapsed time per iteration (s): 1.03 | learning rate: 4.416E-05 | global batch size: 256 | lm loss: 1.913947E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.965 | TFLOPs: 41.14 | 15: iteration 95790/ 125429 | consumed samples: 24522240 | consumed tokens: 50221547520 | elapsed time per iteration (s): 1.02 | learning rate: 4.414E-05 | global batch 
size: 256 | lm loss: 1.928658E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.543 | TFLOPs: 41.40 | 15: iteration 95800/ 125429 | consumed samples: 24524800 | consumed tokens: 50226790400 | elapsed time per iteration (s): 1.03 | learning rate: 4.412E-05 | global batch size: 256 | lm loss: 1.925572E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.787 | TFLOPs: 41.11 | 15: iteration 95810/ 125429 | consumed samples: 24527360 | consumed tokens: 50232033280 | elapsed time per iteration (s): 1.03 | learning rate: 4.411E-05 | global batch size: 256 | lm loss: 1.917551E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.280 | TFLOPs: 41.03 | 15: iteration 95820/ 125429 | consumed samples: 24529920 | consumed tokens: 50237276160 | elapsed time per iteration (s): 1.04 | learning rate: 4.409E-05 | global batch size: 256 | lm loss: 1.928656E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.068 | TFLOPs: 40.83 | 15: iteration 95830/ 125429 | consumed samples: 24532480 | consumed tokens: 50242519040 | elapsed time per iteration (s): 1.13 | learning rate: 4.408E-05 | global batch size: 256 | lm loss: 1.939620E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.873 | TFLOPs: 37.33 | 15: iteration 95840/ 125429 | consumed samples: 24535040 | consumed tokens: 50247761920 | elapsed time per iteration (s): 1.03 | learning rate: 4.406E-05 | global batch size: 256 | lm loss: 1.913363E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.274 | TFLOPs: 41.03 | 15: iteration 95850/ 125429 | consumed samples: 24537600 | 
consumed tokens: 50253004800 | elapsed time per iteration (s): 1.09 | learning rate: 4.405E-05 | global batch size: 256 | lm loss: 1.923400E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.237 | TFLOPs: 38.71 | 15: iteration 95860/ 125429 | consumed samples: 24540160 | consumed tokens: 50258247680 | elapsed time per iteration (s): 1.06 | learning rate: 4.403E-05 | global batch size: 256 | lm loss: 1.929581E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.177 | TFLOPs: 40.02 | 15: iteration 95870/ 125429 | consumed samples: 24542720 | consumed tokens: 50263490560 | elapsed time per iteration (s): 1.07 | learning rate: 4.402E-05 | global batch size: 256 | lm loss: 1.947966E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.861 | TFLOPs: 39.47 | 15: iteration 95880/ 125429 | consumed samples: 24545280 | consumed tokens: 50268733440 | elapsed time per iteration (s): 1.03 | learning rate: 4.400E-05 | global batch size: 256 | lm loss: 1.891960E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.394 | TFLOPs: 40.88 | 15: iteration 95890/ 125429 | consumed samples: 24547840 | consumed tokens: 50273976320 | elapsed time per iteration (s): 1.05 | learning rate: 4.398E-05 | global batch size: 256 | lm loss: 1.921794E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.866 | TFLOPs: 40.14 | 15: iteration 95900/ 125429 | consumed samples: 24550400 | consumed tokens: 50279219200 | elapsed time per iteration (s): 1.03 | learning rate: 4.397E-05 | global batch size: 256 | lm loss: 1.920904E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 248.383 | TFLOPs: 41.05 | 15: iteration 95910/ 125429 | consumed samples: 24552960 | consumed tokens: 50284462080 | elapsed time per iteration (s): 1.06 | learning rate: 4.395E-05 | global batch size: 256 | lm loss: 1.921915E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.579 | TFLOPs: 39.92 | 15: iteration 95920/ 125429 | consumed samples: 24555520 | consumed tokens: 50289704960 | elapsed time per iteration (s): 1.04 | learning rate: 4.394E-05 | global batch size: 256 | lm loss: 1.900014E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.894 | TFLOPs: 40.64 | 15: iteration 95930/ 125429 | consumed samples: 24558080 | consumed tokens: 50294947840 | elapsed time per iteration (s): 1.05 | learning rate: 4.392E-05 | global batch size: 256 | lm loss: 1.908781E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.768 | TFLOPs: 40.12 | 15: iteration 95940/ 125429 | consumed samples: 24560640 | consumed tokens: 50300190720 | elapsed time per iteration (s): 1.03 | learning rate: 4.391E-05 | global batch size: 256 | lm loss: 1.893363E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.597 | TFLOPs: 41.25 | 15: iteration 95950/ 125429 | consumed samples: 24563200 | consumed tokens: 50305433600 | elapsed time per iteration (s): 1.03 | learning rate: 4.389E-05 | global batch size: 256 | lm loss: 1.908160E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.610 | TFLOPs: 41.25 | 15: iteration 95960/ 125429 | consumed samples: 24565760 | consumed tokens: 50310676480 | elapsed time per iteration (s): 1.04 | learning rate: 4.388E-05 | global batch size: 256 | lm loss: 
1.900024E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.745 | TFLOPs: 40.78 | 15: iteration 95970/ 125429 | consumed samples: 24568320 | consumed tokens: 50315919360 | elapsed time per iteration (s): 1.05 | learning rate: 4.386E-05 | global batch size: 256 | lm loss: 1.881845E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.260 | TFLOPs: 40.20 | 15: iteration 95980/ 125429 | consumed samples: 24570880 | consumed tokens: 50321162240 | elapsed time per iteration (s): 1.03 | learning rate: 4.385E-05 | global batch size: 256 | lm loss: 1.899637E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.613 | TFLOPs: 40.92 | 15: iteration 95990/ 125429 | consumed samples: 24573440 | consumed tokens: 50326405120 | elapsed time per iteration (s): 1.05 | learning rate: 4.383E-05 | global batch size: 256 | lm loss: 1.940081E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.435 | TFLOPs: 40.23 | 0: [2022-11-27 00:28:47,280] [INFO] [logging.py:68:log_dist] [Rank 0] step=96000, skipped=0, lr=[4.381453430280132e-05, 4.381453430280132e-05, 4.381453430280132e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 96000/ 125429 | consumed samples: 24576000 | consumed tokens: 50331648000 | elapsed time per iteration (s): 1.04 | learning rate: 4.381E-05 | global batch size: 256 | lm loss: 1.940323E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.464 | TFLOPs: 40.56 | 0: steps: 96000 loss: 1.9266 iter time (s): 1.064 samples/sec: 240.582 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 96000 | lm loss value: 
1.859991E+00 | lm loss PPL: 6.423679E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 96000 to checkpoints_1b5 0: [2022-11-27 00:28:47,651] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step96000 is begin to save! 0: [2022-11-27 00:28:47,661] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_01-model_00-model_states.pt... 0: [2022-11-27 00:28:47,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_01-model_00-model_states.pt. 0: [2022-11-27 00:28:47,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_03-model_00-model_states.pt... 0: [2022-11-27 00:28:48,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_03-model_00-model_states.pt. 0: [2022-11-27 00:28:48,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_04-model_00-model_states.pt... 0: [2022-11-27 00:28:48,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_04-model_00-model_states.pt. 0: [2022-11-27 00:28:48,160] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_05-model_00-model_states.pt... 0: [2022-11-27 00:28:48,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_05-model_00-model_states.pt. 0: [2022-11-27 00:28:48,269] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_06-model_00-model_states.pt... 0: [2022-11-27 00:28:48,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_06-model_00-model_states.pt. 
0: [2022-11-27 00:28:48,385] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_07-model_00-model_states.pt... 0: [2022-11-27 00:28:48,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_07-model_00-model_states.pt. 0: [2022-11-27 00:28:48,504] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_08-model_00-model_states.pt... 0: [2022-11-27 00:28:48,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_08-model_00-model_states.pt. 0: [2022-11-27 00:28:48,619] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_09-model_00-model_states.pt... 0: [2022-11-27 00:28:48,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_09-model_00-model_states.pt. 0: [2022-11-27 00:28:48,737] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_10-model_00-model_states.pt... 0: [2022-11-27 00:28:48,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_10-model_00-model_states.pt. 0: [2022-11-27 00:28:48,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_11-model_00-model_states.pt... 0: [2022-11-27 00:28:48,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_11-model_00-model_states.pt. 0: [2022-11-27 00:28:48,973] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_12-model_00-model_states.pt... 0: [2022-11-27 00:28:49,088] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_12-model_00-model_states.pt. 
0: [2022-11-27 00:28:49,089] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_13-model_00-model_states.pt... 0: [2022-11-27 00:28:49,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_13-model_00-model_states.pt. 0: [2022-11-27 00:28:49,210] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_14-model_00-model_states.pt... 0: [2022-11-27 00:28:49,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_14-model_00-model_states.pt. 0: [2022-11-27 00:28:49,328] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_15-model_00-model_states.pt... 0: [2022-11-27 00:28:49,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_15-model_00-model_states.pt. 0: [2022-11-27 00:28:49,444] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_16-model_00-model_states.pt... 0: [2022-11-27 00:28:49,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_16-model_00-model_states.pt. 0: [2022-11-27 00:28:49,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_17-model_00-model_states.pt... 0: [2022-11-27 00:28:49,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_17-model_00-model_states.pt. 0: [2022-11-27 00:28:49,676] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_18-model_00-model_states.pt... 0: [2022-11-27 00:28:49,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_18-model_00-model_states.pt. 
0: [2022-11-27 00:28:49,792] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_19-model_00-model_states.pt... 0: [2022-11-27 00:28:49,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_19-model_00-model_states.pt. 0: [2022-11-27 00:28:49,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_20-model_00-model_states.pt... 0: [2022-11-27 00:28:50,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_20-model_00-model_states.pt. 0: [2022-11-27 00:28:50,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_21-model_00-model_states.pt... 0: [2022-11-27 00:28:50,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_21-model_00-model_states.pt. 0: [2022-11-27 00:28:50,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_22-model_00-model_states.pt... 0: [2022-11-27 00:28:50,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_22-model_00-model_states.pt. 0: [2022-11-27 00:28:50,250] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_23-model_00-model_states.pt... 0: [2022-11-27 00:28:50,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_23-model_00-model_states.pt. 0: [2022-11-27 00:28:50,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_24-model_00-model_states.pt... 0: [2022-11-27 00:28:50,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_24-model_00-model_states.pt. 
0: [2022-11-27 00:28:50,479] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_25-model_00-model_states.pt... 0: [2022-11-27 00:28:50,594] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_25-model_00-model_states.pt. 0: [2022-11-27 00:28:50,595] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_26-model_00-model_states.pt... 0: [2022-11-27 00:28:50,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_26-model_00-model_states.pt. 0: [2022-11-27 00:28:50,712] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_27-model_00-model_states.pt... 0: [2022-11-27 00:28:50,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_27-model_00-model_states.pt. 0: [2022-11-27 00:28:50,824] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_28-model_00-model_states.pt... 0: [2022-11-27 00:28:50,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_28-model_00-model_states.pt. 0: [2022-11-27 00:28:50,936] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_29-model_00-model_states.pt... 0: [2022-11-27 00:28:51,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_29-model_00-model_states.pt. 0: [2022-11-27 00:28:51,049] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_30-model_00-model_states.pt... 0: [2022-11-27 00:28:51,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_30-model_00-model_states.pt. 
0: [2022-11-27 00:28:51,162] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/layer_32-model_00-model_states.pt... 0: [2022-11-27 00:28:51,166] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/layer_32-model_00-model_states.pt. 0: [2022-11-27 00:28:51,167] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step96000/mp_rank_00_model_states.pt 0: [2022-11-27 00:28:51,168] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/mp_rank_00_model_states.pt... 0: [2022-11-27 00:28:51,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/mp_rank_00_model_states.pt. 0: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
12: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 
8: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
13: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
2: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 
11: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
10: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
13: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:28:51,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step96000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
12: [2022-11-27 00:28:51,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:28:51,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 00:28:51,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 0: [2022-11-27 00:28:51,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:28:51,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 00:28:51,390] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 4: [2022-11-27 00:28:51,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:28:51,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 00:28:51,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 11: [2022-11-27 00:28:51,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:28:51,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
11: [2022-11-27 00:28:51,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 00:28:51,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 11: [2022-11-27 00:28:51,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:28:51,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 00:28:51,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 4: [2022-11-27 00:28:51,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:28:51,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:28:51,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:28:51,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 2: [2022-11-27 00:28:51,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 3: [2022-11-27 00:28:51,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 4: [2022-11-27 00:28:51,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
2: [2022-11-27 00:28:51,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 3: [2022-11-27 00:28:51,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 3: [2022-11-27 00:28:51,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:28:51,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 00:28:51,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 3: [2022-11-27 00:28:51,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:28:51,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 00:28:51,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 0: [2022-11-27 00:28:51,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:28:51,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 00:28:51,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 0: [2022-11-27 00:28:51,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-27 00:28:51,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:28:51,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 00:28:51,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 00:28:51,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 0: [2022-11-27 00:28:51,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 4: [2022-11-27 00:28:51,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:28:51,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 00:28:51,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 0: [2022-11-27 00:28:51,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:28:51,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 00:28:51,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 4: [2022-11-27 00:28:51,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-27 00:28:51,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 00:28:51,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 2: [2022-11-27 00:28:51,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:28:51,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 00:28:51,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 12: [2022-11-27 00:28:51,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:28:51,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:28:51,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 00:28:51,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 00:28:51,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 12: [2022-11-27 00:28:51,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 10: [2022-11-27 00:28:51,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-27 00:28:51,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:28:51,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 00:28:51,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 00:28:51,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:28:51,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:28:51,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 10: [2022-11-27 00:28:51,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 10: [2022-11-27 00:28:51,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 00:28:51,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 00:28:51,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 10: [2022-11-27 00:28:51,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 4: [2022-11-27 00:28:51,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
1: [2022-11-27 00:28:51,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 4: [2022-11-27 00:28:51,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 1: [2022-11-27 00:28:51,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 1: [2022-11-27 00:28:51,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:28:51,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 1: [2022-11-27 00:28:51,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 00:28:51,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 1: [2022-11-27 00:28:51,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:28:51,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 12: [2022-11-27 00:28:51,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:28:51,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
12: [2022-11-27 00:28:51,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 00:28:51,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 12: [2022-11-27 00:28:51,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:28:51,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 7: [2022-11-27 00:28:51,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:28:51,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 7: [2022-11-27 00:28:51,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 00:28:51,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 3: [2022-11-27 00:28:51,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:28:51,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-27 00:28:51,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 00:28:51,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 00:28:51,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 3: [2022-11-27 00:28:51,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 1: [2022-11-27 00:28:51,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:28:51,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:28:51,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:28:51,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 00:28:51,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 2: [2022-11-27 00:28:51,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:28:51,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 00:28:51,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
13: [2022-11-27 00:28:51,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:28:51,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 00:28:51,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 0: [2022-11-27 00:28:51,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:28:51,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:28:51,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 00:28:51,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 13: [2022-11-27 00:28:51,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:28:51,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 13: [2022-11-27 00:28:51,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 00:28:51,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 13: [2022-11-27 00:28:51,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-27 00:28:51,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 00:28:51,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 0: [2022-11-27 00:28:51,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 13: [2022-11-27 00:28:51,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:28:51,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 00:28:51,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 13: [2022-11-27 00:28:51,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:28:51,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 00:28:51,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 7: [2022-11-27 00:28:51,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:28:51,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 00:28:51,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
2: [2022-11-27 00:28:51,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:28:51,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 00:28:51,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 3: [2022-11-27 00:28:51,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:28:51,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:28:51,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 00:28:51,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 4: [2022-11-27 00:28:51,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 00:28:51,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 11: [2022-11-27 00:28:51,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:28:51,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:28:51,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-27 00:28:51,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:28:51,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 00:28:51,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 00:28:51,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 3: [2022-11-27 00:28:51,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 4: [2022-11-27 00:28:51,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:28:51,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 00:28:51,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 7: [2022-11-27 00:28:51,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:28:51,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-27 00:28:51,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 00:28:51,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 00:28:51,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 7: [2022-11-27 00:28:51,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 2: [2022-11-27 00:28:51,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:28:51,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 00:28:51,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 2: [2022-11-27 00:28:51,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:28:51,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:28:51,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 00:28:51,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 00:28:51,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
2: [2022-11-27 00:28:51,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 2: [2022-11-27 00:28:51,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:28:51,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 00:28:51,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 14: [2022-11-27 00:28:51,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:28:51,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:28:51,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:28:51,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:28:51,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:28:51,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-27 00:28:51,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 00:28:51,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 00:28:51,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 00:28:51,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 00:28:51,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 14: [2022-11-27 00:28:51,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 14: [2022-11-27 00:28:51,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 00:28:51,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 14: [2022-11-27 00:28:51,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 00:28:51,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 14: [2022-11-27 00:28:51,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 14: [2022-11-27 00:28:51,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
14: [2022-11-27 00:28:51,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:28:51,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 00:28:51,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 12: [2022-11-27 00:28:51,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:28:51,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 00:28:51,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 10: [2022-11-27 00:28:51,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:28:51,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 00:28:51,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 12: [2022-11-27 00:28:51,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:28:51,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 7: [2022-11-27 00:28:51,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-27 00:28:51,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:28:51,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 7: [2022-11-27 00:28:51,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 00:28:51,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 7: [2022-11-27 00:28:51,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 00:28:51,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 11: [2022-11-27 00:28:51,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 00:28:51,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 00:28:51,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 11: [2022-11-27 00:28:51,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 11: [2022-11-27 00:28:51,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-27 00:28:51,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 00:28:51,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 11: [2022-11-27 00:28:51,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:28:51,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 00:28:51,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 11: [2022-11-27 00:28:51,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:28:51,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 00:28:51,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 11: [2022-11-27 00:28:51,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:28:51,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 00:28:51,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 12: [2022-11-27 00:28:51,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-27 00:28:51,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 00:28:51,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 10: [2022-11-27 00:28:51,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:28:51,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 00:28:51,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 10: [2022-11-27 00:28:51,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:28:51,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 00:28:51,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 10: [2022-11-27 00:28:51,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:28:51,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 00:28:51,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 5: [2022-11-27 00:28:51,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-27 00:28:51,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:28:51,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:28:51,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:28:51,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 00:28:51,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 00:28:51,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 5: [2022-11-27 00:28:51,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 00:28:51,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 00:28:51,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 5: [2022-11-27 00:28:51,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 5: [2022-11-27 00:28:51,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 5: [2022-11-27 00:28:51,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-27 00:28:51,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 00:28:51,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 5: [2022-11-27 00:28:51,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:28:51,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 00:28:51,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:28:51,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 5: [2022-11-27 00:28:51,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 00:28:51,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 5: [2022-11-27 00:28:51,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:28:51,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 00:28:51,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 7: [2022-11-27 00:28:51,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-27 00:28:51,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:28:51,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 00:28:51,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 00:28:51,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 7: [2022-11-27 00:28:51,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 1: [2022-11-27 00:28:51,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 00:28:51,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 00:28:51,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 1: [2022-11-27 00:28:51,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 1: [2022-11-27 00:28:51,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:28:51,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 00:28:51,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
1: [2022-11-27 00:28:51,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:28:51,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 00:28:51,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 1: [2022-11-27 00:28:51,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:28:51,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 00:28:51,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 13: [2022-11-27 00:28:51,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:28:51,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 00:28:51,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 13: [2022-11-27 00:28:51,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:28:51,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 00:28:51,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
14: [2022-11-27 00:28:51,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:28:51,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 00:28:51,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 8: [2022-11-27 00:28:51,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:28:51,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:28:51,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:28:51,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 00:28:51,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 00:28:51,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 00:28:51,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:28:51,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 8: [2022-11-27 00:28:51,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
8: [2022-11-27 00:28:51,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 8: [2022-11-27 00:28:51,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 0: [2022-11-27 00:28:51,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:28:51,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 8: [2022-11-27 00:28:51,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:28:51,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:28:51,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:28:51,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 00:28:51,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 00:28:51,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 00:28:51,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:28:51,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
8: [2022-11-27 00:28:51,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 8: [2022-11-27 00:28:51,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 00:28:51,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 8: [2022-11-27 00:28:51,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 0: [2022-11-27 00:28:51,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 00:28:51,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 0: [2022-11-27 00:28:51,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:28:51,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:28:51,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:28:51,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:28:51,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:28:51,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-27 00:28:51,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:28:51,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:28:51,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 00:28:51,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 00:28:51,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 00:28:51,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 00:28:51,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:28:51,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 00:28:51,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 00:28:51,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 00:28:51,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
15: [2022-11-27 00:28:51,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 15: [2022-11-27 00:28:51,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 15: [2022-11-27 00:28:51,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 15: [2022-11-27 00:28:51,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 00:28:51,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 15: [2022-11-27 00:28:51,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 15: [2022-11-27 00:28:51,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 15: [2022-11-27 00:28:51,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 6: [2022-11-27 00:28:51,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:28:51,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:28:51,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:28:51,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:28:51,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-27 00:28:51,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:28:51,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:28:51,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:28:51,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 00:28:51,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 00:28:51,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 00:28:51,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 00:28:51,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 00:28:51,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 00:28:51,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 00:28:51,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 00:28:51,564] 
[INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 6: [2022-11-27 00:28:51,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 6: [2022-11-27 00:28:51,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 6: [2022-11-27 00:28:51,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 6: [2022-11-27 00:28:51,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 6: [2022-11-27 00:28:51,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 6: [2022-11-27 00:28:51,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 6: [2022-11-27 00:28:51,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 0: [2022-11-27 00:28:51,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 00:28:51,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 9: [2022-11-27 00:28:51,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:28:51,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:28:51,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-27 00:28:51,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 00:28:51,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:28:51,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:28:51,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 00:28:51,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 9: [2022-11-27 00:28:51,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 00:28:51,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 00:28:51,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 9: [2022-11-27 00:28:51,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 00:28:51,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 9: [2022-11-27 00:28:51,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 9: [2022-11-27 00:28:51,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
9: [2022-11-27 00:28:51,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:28:51,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:28:51,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 00:28:51,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 00:28:51,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 9: [2022-11-27 00:28:51,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 9: [2022-11-27 00:28:51,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:28:51,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step96000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 00:28:51,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step96000 is ready now! 
0: successfully saved checkpoint at iteration 96000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 4066.47
15: iteration 96010/ 125429 | consumed samples: 24578560 | consumed tokens: 50336890880 | elapsed time per iteration (s): 1.51 | learning rate: 4.380E-05 | global batch size: 256 | lm loss: 1.911062E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 169.286 | TFLOPs: 27.98 |
15: iteration 96020/ 125429 | consumed samples: 24581120 | consumed tokens: 50342133760 | elapsed time per iteration (s): 1.02 | learning rate: 4.378E-05 | global batch size: 256 | lm loss: 1.936209E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.815 | TFLOPs: 41.28 |
15: iteration 96030/ 125429 | consumed samples: 24583680 | consumed tokens: 50347376640 | elapsed time per iteration (s): 1.03 | learning rate: 4.377E-05 | global batch size: 256 | lm loss: 1.938152E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.973 | TFLOPs: 40.98 |
15: iteration 96040/ 125429 | consumed samples: 24586240 | consumed tokens: 50352619520 | elapsed time per iteration (s): 1.02 | learning rate: 4.375E-05 | global batch size: 256 | lm loss: 1.913280E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.289 | TFLOPs: 41.36 |
15: iteration 96050/ 125429 | consumed samples: 24588800 | consumed tokens: 50357862400 | elapsed time per iteration (s): 1.08 | learning rate: 4.374E-05 | global batch size: 256 | lm loss: 1.919506E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.589 | TFLOPs: 39.10 |
15: iteration 96060/ 125429 | consumed samples: 24591360 | consumed tokens: 50363105280 | elapsed time per iteration (s): 1.04 |
learning rate: 4.372E-05 | global batch size: 256 | lm loss: 1.888807E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.167 | TFLOPs: 40.85 | 15: iteration 96070/ 125429 | consumed samples: 24593920 | consumed tokens: 50368348160 | elapsed time per iteration (s): 1.04 | learning rate: 4.371E-05 | global batch size: 256 | lm loss: 1.917500E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.348 | TFLOPs: 40.71 | 15: iteration 96080/ 125429 | consumed samples: 24596480 | consumed tokens: 50373591040 | elapsed time per iteration (s): 1.05 | learning rate: 4.369E-05 | global batch size: 256 | lm loss: 1.914645E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.717 | TFLOPs: 40.44 | 15: iteration 96090/ 125429 | consumed samples: 24599040 | consumed tokens: 50378833920 | elapsed time per iteration (s): 1.05 | learning rate: 4.368E-05 | global batch size: 256 | lm loss: 1.917208E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.931 | TFLOPs: 40.31 | 15: iteration 96100/ 125429 | consumed samples: 24601600 | consumed tokens: 50384076800 | elapsed time per iteration (s): 1.05 | learning rate: 4.366E-05 | global batch size: 256 | lm loss: 1.932420E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.248 | TFLOPs: 40.20 | 15: iteration 96110/ 125429 | consumed samples: 24604160 | consumed tokens: 50389319680 | elapsed time per iteration (s): 1.08 | learning rate: 4.365E-05 | global batch size: 256 | lm loss: 1.918173E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.026 | TFLOPs: 39.34 | 15: iteration 96120/ 
125429 | consumed samples: 24606720 | consumed tokens: 50394562560 | elapsed time per iteration (s): 1.04 | learning rate: 4.363E-05 | global batch size: 256 | lm loss: 1.940587E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.731 | TFLOPs: 40.77 | 15: iteration 96130/ 125429 | consumed samples: 24609280 | consumed tokens: 50399805440 | elapsed time per iteration (s): 1.04 | learning rate: 4.361E-05 | global batch size: 256 | lm loss: 1.918845E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.217 | TFLOPs: 40.52 | 15: iteration 96140/ 125429 | consumed samples: 24611840 | consumed tokens: 50405048320 | elapsed time per iteration (s): 1.08 | learning rate: 4.360E-05 | global batch size: 256 | lm loss: 1.899041E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.291 | TFLOPs: 39.05 | 15: iteration 96150/ 125429 | consumed samples: 24614400 | consumed tokens: 50410291200 | elapsed time per iteration (s): 1.05 | learning rate: 4.358E-05 | global batch size: 256 | lm loss: 1.919874E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.782 | TFLOPs: 40.45 | 15: iteration 96160/ 125429 | consumed samples: 24616960 | consumed tokens: 50415534080 | elapsed time per iteration (s): 1.04 | learning rate: 4.357E-05 | global batch size: 256 | lm loss: 1.910858E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.738 | TFLOPs: 40.78 | 15: iteration 96170/ 125429 | consumed samples: 24619520 | consumed tokens: 50420776960 | elapsed time per iteration (s): 1.06 | learning rate: 4.355E-05 | global batch size: 256 | lm loss: 1.913936E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 240.868 | TFLOPs: 39.81 | 15: iteration 96180/ 125429 | consumed samples: 24622080 | consumed tokens: 50426019840 | elapsed time per iteration (s): 1.06 | learning rate: 4.354E-05 | global batch size: 256 | lm loss: 1.896648E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.117 | TFLOPs: 40.01 | 15: iteration 96190/ 125429 | consumed samples: 24624640 | consumed tokens: 50431262720 | elapsed time per iteration (s): 1.05 | learning rate: 4.352E-05 | global batch size: 256 | lm loss: 1.887763E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.475 | TFLOPs: 40.24 | 15: iteration 96200/ 125429 | consumed samples: 24627200 | consumed tokens: 50436505600 | elapsed time per iteration (s): 1.02 | learning rate: 4.351E-05 | global batch size: 256 | lm loss: 1.899092E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.791 | TFLOPs: 41.28 | 15: iteration 96210/ 125429 | consumed samples: 24629760 | consumed tokens: 50441748480 | elapsed time per iteration (s): 1.03 | learning rate: 4.349E-05 | global batch size: 256 | lm loss: 1.937638E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.892 | TFLOPs: 40.97 | 15: iteration 96220/ 125429 | consumed samples: 24632320 | consumed tokens: 50446991360 | elapsed time per iteration (s): 1.04 | learning rate: 4.348E-05 | global batch size: 256 | lm loss: 1.898481E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.345 | TFLOPs: 40.55 | 15: iteration 96230/ 125429 | consumed samples: 24634880 | consumed tokens: 50452234240 | elapsed time per iteration (s): 1.06 | learning rate: 
4.346E-05 | global batch size: 256 | lm loss: 1.917902E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.636 | TFLOPs: 39.77 | 15: iteration 96240/ 125429 | consumed samples: 24637440 | consumed tokens: 50457477120 | elapsed time per iteration (s): 1.04 | learning rate: 4.345E-05 | global batch size: 256 | lm loss: 1.911335E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.047 | TFLOPs: 40.50 | 15: iteration 96250/ 125429 | consumed samples: 24640000 | consumed tokens: 50462720000 | elapsed time per iteration (s): 1.06 | learning rate: 4.343E-05 | global batch size: 256 | lm loss: 1.908100E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.473 | TFLOPs: 39.74 | 15: iteration 96260/ 125429 | consumed samples: 24642560 | consumed tokens: 50467962880 | elapsed time per iteration (s): 1.03 | learning rate: 4.341E-05 | global batch size: 256 | lm loss: 1.919948E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.946 | TFLOPs: 40.97 | 15: iteration 96270/ 125429 | consumed samples: 24645120 | consumed tokens: 50473205760 | elapsed time per iteration (s): 1.04 | learning rate: 4.340E-05 | global batch size: 256 | lm loss: 1.901371E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.232 | TFLOPs: 40.69 | 15: iteration 96280/ 125429 | consumed samples: 24647680 | consumed tokens: 50478448640 | elapsed time per iteration (s): 1.03 | learning rate: 4.338E-05 | global batch size: 256 | lm loss: 1.916796E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.472 | TFLOPs: 41.06 | 15: iteration 96290/ 125429 | 
consumed samples: 24650240 | consumed tokens: 50483691520 | elapsed time per iteration (s): 1.03 | learning rate: 4.337E-05 | global batch size: 256 | lm loss: 1.894802E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.902 | TFLOPs: 40.97 | 15: iteration 96300/ 125429 | consumed samples: 24652800 | consumed tokens: 50488934400 | elapsed time per iteration (s): 1.04 | learning rate: 4.335E-05 | global batch size: 256 | lm loss: 1.895615E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.658 | TFLOPs: 40.60 | 15: iteration 96310/ 125429 | consumed samples: 24655360 | consumed tokens: 50494177280 | elapsed time per iteration (s): 1.02 | learning rate: 4.334E-05 | global batch size: 256 | lm loss: 1.917298E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.172 | TFLOPs: 41.34 | 15: iteration 96320/ 125429 | consumed samples: 24657920 | consumed tokens: 50499420160 | elapsed time per iteration (s): 1.03 | learning rate: 4.332E-05 | global batch size: 256 | lm loss: 1.917979E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.376 | TFLOPs: 40.88 | 15: iteration 96330/ 125429 | consumed samples: 24660480 | consumed tokens: 50504663040 | elapsed time per iteration (s): 1.03 | learning rate: 4.331E-05 | global batch size: 256 | lm loss: 1.938904E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.917 | TFLOPs: 41.14 | 15: iteration 96340/ 125429 | consumed samples: 24663040 | consumed tokens: 50509905920 | elapsed time per iteration (s): 1.04 | learning rate: 4.329E-05 | global batch size: 256 | lm loss: 1.908401E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 246.717 | TFLOPs: 40.77 | 15: iteration 96350/ 125429 | consumed samples: 24665600 | consumed tokens: 50515148800 | elapsed time per iteration (s): 1.02 | learning rate: 4.328E-05 | global batch size: 256 | lm loss: 1.902473E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.761 | TFLOPs: 41.44 | 15: iteration 96360/ 125429 | consumed samples: 24668160 | consumed tokens: 50520391680 | elapsed time per iteration (s): 1.03 | learning rate: 4.326E-05 | global batch size: 256 | lm loss: 1.915860E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.276 | TFLOPs: 41.19 | 15: iteration 96370/ 125429 | consumed samples: 24670720 | consumed tokens: 50525634560 | elapsed time per iteration (s): 1.05 | learning rate: 4.325E-05 | global batch size: 256 | lm loss: 1.931611E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.843 | TFLOPs: 40.30 | 15: iteration 96380/ 125429 | consumed samples: 24673280 | consumed tokens: 50530877440 | elapsed time per iteration (s): 1.02 | learning rate: 4.323E-05 | global batch size: 256 | lm loss: 1.891061E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.638 | TFLOPs: 41.42 | 15: iteration 96390/ 125429 | consumed samples: 24675840 | consumed tokens: 50536120320 | elapsed time per iteration (s): 1.02 | learning rate: 4.322E-05 | global batch size: 256 | lm loss: 1.924340E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.146 | TFLOPs: 41.34 | 15: iteration 96400/ 125429 | consumed samples: 24678400 | consumed tokens: 50541363200 | elapsed time per iteration (s): 1.05 | learning rate: 4.320E-05 | global batch 
size: 256 | lm loss: 1.917023E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.720 | TFLOPs: 40.44 | 15: iteration 96410/ 125429 | consumed samples: 24680960 | consumed tokens: 50546606080 | elapsed time per iteration (s): 1.04 | learning rate: 4.319E-05 | global batch size: 256 | lm loss: 1.915928E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.891 | TFLOPs: 40.80 | 15: iteration 96420/ 125429 | consumed samples: 24683520 | consumed tokens: 50551848960 | elapsed time per iteration (s): 1.04 | learning rate: 4.317E-05 | global batch size: 256 | lm loss: 1.898753E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.933 | TFLOPs: 40.64 | 15: iteration 96430/ 125429 | consumed samples: 24686080 | consumed tokens: 50557091840 | elapsed time per iteration (s): 1.07 | learning rate: 4.315E-05 | global batch size: 256 | lm loss: 1.938155E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.302 | TFLOPs: 39.55 | 15: iteration 96440/ 125429 | consumed samples: 24688640 | consumed tokens: 50562334720 | elapsed time per iteration (s): 1.07 | learning rate: 4.314E-05 | global batch size: 256 | lm loss: 1.896978E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.448 | TFLOPs: 39.57 | 15: iteration 96450/ 125429 | consumed samples: 24691200 | consumed tokens: 50567577600 | elapsed time per iteration (s): 1.06 | learning rate: 4.312E-05 | global batch size: 256 | lm loss: 1.925044E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.496 | TFLOPs: 40.07 | 15: iteration 96460/ 125429 | consumed samples: 24693760 | 
consumed tokens: 50572820480 | elapsed time per iteration (s): 1.02 | learning rate: 4.311E-05 | global batch size: 256 | lm loss: 1.920503E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.646 | TFLOPs: 41.59 | 15: iteration 96470/ 125429 | consumed samples: 24696320 | consumed tokens: 50578063360 | elapsed time per iteration (s): 1.04 | learning rate: 4.309E-05 | global batch size: 256 | lm loss: 1.902869E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.010 | TFLOPs: 40.82 | 15: iteration 96480/ 125429 | consumed samples: 24698880 | consumed tokens: 50583306240 | elapsed time per iteration (s): 1.03 | learning rate: 4.308E-05 | global batch size: 256 | lm loss: 1.909695E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.091 | TFLOPs: 41.00 | 15: iteration 96490/ 125429 | consumed samples: 24701440 | consumed tokens: 50588549120 | elapsed time per iteration (s): 1.05 | learning rate: 4.306E-05 | global batch size: 256 | lm loss: 1.907138E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.074 | TFLOPs: 40.17 | 15: iteration 96500/ 125429 | consumed samples: 24704000 | consumed tokens: 50593792000 | elapsed time per iteration (s): 1.02 | learning rate: 4.305E-05 | global batch size: 256 | lm loss: 1.946129E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.298 | TFLOPs: 41.53 | 15: iteration 96510/ 125429 | consumed samples: 24706560 | consumed tokens: 50599034880 | elapsed time per iteration (s): 1.04 | learning rate: 4.303E-05 | global batch size: 256 | lm loss: 1.927556E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 246.250 | TFLOPs: 40.69 | 15: iteration 96520/ 125429 | consumed samples: 24709120 | consumed tokens: 50604277760 | elapsed time per iteration (s): 1.04 | learning rate: 4.302E-05 | global batch size: 256 | lm loss: 1.922597E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.159 | TFLOPs: 40.51 | 15: iteration 96530/ 125429 | consumed samples: 24711680 | consumed tokens: 50609520640 | elapsed time per iteration (s): 1.03 | learning rate: 4.300E-05 | global batch size: 256 | lm loss: 1.920003E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.879 | TFLOPs: 41.13 | 15: iteration 96540/ 125429 | consumed samples: 24714240 | consumed tokens: 50614763520 | elapsed time per iteration (s): 1.03 | learning rate: 4.299E-05 | global batch size: 256 | lm loss: 1.924078E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.811 | TFLOPs: 41.12 | 15: iteration 96550/ 125429 | consumed samples: 24716800 | consumed tokens: 50620006400 | elapsed time per iteration (s): 1.04 | learning rate: 4.297E-05 | global batch size: 256 | lm loss: 1.917175E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.226 | TFLOPs: 40.86 | 15: iteration 96560/ 125429 | consumed samples: 24719360 | consumed tokens: 50625249280 | elapsed time per iteration (s): 1.04 | learning rate: 4.296E-05 | global batch size: 256 | lm loss: 1.929714E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.018 | TFLOPs: 40.66 | 15: iteration 96570/ 125429 | consumed samples: 24721920 | consumed tokens: 50630492160 | elapsed time per iteration (s): 1.03 | learning rate: 4.294E-05 | global batch size: 256 | lm loss: 
1.908124E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.364 | TFLOPs: 41.04 | 15: iteration 96580/ 125429 | consumed samples: 24724480 | consumed tokens: 50635735040 | elapsed time per iteration (s): 1.03 | learning rate: 4.293E-05 | global batch size: 256 | lm loss: 1.900966E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.694 | TFLOPs: 41.10 | 15: iteration 96590/ 125429 | consumed samples: 24727040 | consumed tokens: 50640977920 | elapsed time per iteration (s): 1.02 | learning rate: 4.291E-05 | global batch size: 256 | lm loss: 1.887870E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.031 | TFLOPs: 41.32 | 15: iteration 96600/ 125429 | consumed samples: 24729600 | consumed tokens: 50646220800 | elapsed time per iteration (s): 1.04 | learning rate: 4.290E-05 | global batch size: 256 | lm loss: 1.920097E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.738 | TFLOPs: 40.78 | 15: iteration 96610/ 125429 | consumed samples: 24732160 | consumed tokens: 50651463680 | elapsed time per iteration (s): 1.04 | learning rate: 4.288E-05 | global batch size: 256 | lm loss: 1.943685E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.973 | TFLOPs: 40.65 | 15: iteration 96620/ 125429 | consumed samples: 24734720 | consumed tokens: 50656706560 | elapsed time per iteration (s): 1.05 | learning rate: 4.287E-05 | global batch size: 256 | lm loss: 1.930815E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.970 | TFLOPs: 40.15 | 15: iteration 96630/ 125429 | consumed samples: 24737280 | consumed tokens: 
50661949440 | elapsed time per iteration (s): 1.03 | learning rate: 4.285E-05 | global batch size: 256 | lm loss: 1.903305E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.751 | TFLOPs: 41.27 | 15: iteration 96640/ 125429 | consumed samples: 24739840 | consumed tokens: 50667192320 | elapsed time per iteration (s): 1.04 | learning rate: 4.284E-05 | global batch size: 256 | lm loss: 1.905699E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.171 | TFLOPs: 40.85 | 15: iteration 96650/ 125429 | consumed samples: 24742400 | consumed tokens: 50672435200 | elapsed time per iteration (s): 1.05 | learning rate: 4.282E-05 | global batch size: 256 | lm loss: 1.921673E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.585 | TFLOPs: 40.25 | 15: iteration 96660/ 125429 | consumed samples: 24744960 | consumed tokens: 50677678080 | elapsed time per iteration (s): 1.05 | learning rate: 4.281E-05 | global batch size: 256 | lm loss: 1.938156E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.014 | TFLOPs: 40.33 | 15: iteration 96670/ 125429 | consumed samples: 24747520 | consumed tokens: 50682920960 | elapsed time per iteration (s): 1.04 | learning rate: 4.279E-05 | global batch size: 256 | lm loss: 1.901771E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.218 | TFLOPs: 40.52 | 15: iteration 96680/ 125429 | consumed samples: 24750080 | consumed tokens: 50688163840 | elapsed time per iteration (s): 1.05 | learning rate: 4.278E-05 | global batch size: 256 | lm loss: 1.927131E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 243.679 | TFLOPs: 40.27 | 15: iteration 96690/ 125429 | consumed samples: 24752640 | consumed tokens: 50693406720 | elapsed time per iteration (s): 1.06 | learning rate: 4.276E-05 | global batch size: 256 | lm loss: 1.909188E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.864 | TFLOPs: 39.80 | 15: iteration 96700/ 125429 | consumed samples: 24755200 | consumed tokens: 50698649600 | elapsed time per iteration (s): 1.05 | learning rate: 4.274E-05 | global batch size: 256 | lm loss: 1.920349E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.555 | TFLOPs: 40.41 | 15: iteration 96710/ 125429 | consumed samples: 24757760 | consumed tokens: 50703892480 | elapsed time per iteration (s): 1.04 | learning rate: 4.273E-05 | global batch size: 256 | lm loss: 1.925748E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.449 | TFLOPs: 40.56 | 15: iteration 96720/ 125429 | consumed samples: 24760320 | consumed tokens: 50709135360 | elapsed time per iteration (s): 1.07 | learning rate: 4.271E-05 | global batch size: 256 | lm loss: 1.911407E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.410 | TFLOPs: 39.40 | 15: iteration 96730/ 125429 | consumed samples: 24762880 | consumed tokens: 50714378240 | elapsed time per iteration (s): 1.03 | learning rate: 4.270E-05 | global batch size: 256 | lm loss: 1.910942E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.513 | TFLOPs: 41.07 | 15: iteration 96740/ 125429 | consumed samples: 24765440 | consumed tokens: 50719621120 | elapsed time per iteration (s): 1.06 | learning rate: 4.268E-05 | global batch size: 256 | lm loss: 1.887539E+00 | grad 
norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.494 | TFLOPs: 40.07 | 15: iteration 96750/ 125429 | consumed samples: 24768000 | consumed tokens: 50724864000 | elapsed time per iteration (s): 1.03 | learning rate: 4.267E-05 | global batch size: 256 | lm loss: 1.924304E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.432 | TFLOPs: 41.22 | 15: iteration 96760/ 125429 | consumed samples: 24770560 | consumed tokens: 50730106880 | elapsed time per iteration (s): 1.03 | learning rate: 4.265E-05 | global batch size: 256 | lm loss: 1.862908E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.422 | TFLOPs: 40.89 | 15: iteration 96770/ 125429 | consumed samples: 24773120 | consumed tokens: 50735349760 | elapsed time per iteration (s): 1.03 | learning rate: 4.264E-05 | global batch size: 256 | lm loss: 1.925547E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.536 | TFLOPs: 41.07 | 15: iteration 96780/ 125429 | consumed samples: 24775680 | consumed tokens: 50740592640 | elapsed time per iteration (s): 1.04 | learning rate: 4.262E-05 | global batch size: 256 | lm loss: 1.903529E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.238 | TFLOPs: 40.69 | 15: iteration 96790/ 125429 | consumed samples: 24778240 | consumed tokens: 50745835520 | elapsed time per iteration (s): 1.02 | learning rate: 4.261E-05 | global batch size: 256 | lm loss: 1.919607E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.000 | TFLOPs: 41.48 | 15: iteration 96800/ 125429 | consumed samples: 24780800 | consumed tokens: 50751078400 | elapsed time 
per iteration (s): 1.04 | learning rate: 4.259E-05 | global batch size: 256 | lm loss: 1.898283E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.017 | TFLOPs: 40.66 | 15: iteration 96810/ 125429 | consumed samples: 24783360 | consumed tokens: 50756321280 | elapsed time per iteration (s): 1.03 | learning rate: 4.258E-05 | global batch size: 256 | lm loss: 1.900632E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.720 | TFLOPs: 41.27 | 15: iteration 96820/ 125429 | consumed samples: 24785920 | consumed tokens: 50761564160 | elapsed time per iteration (s): 1.03 | learning rate: 4.256E-05 | global batch size: 256 | lm loss: 1.905032E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.222 | TFLOPs: 41.02 | 15: iteration 96830/ 125429 | consumed samples: 24788480 | consumed tokens: 50766807040 | elapsed time per iteration (s): 1.03 | learning rate: 4.255E-05 | global batch size: 256 | lm loss: 1.916784E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.515 | TFLOPs: 41.23 | 15: iteration 96840/ 125429 | consumed samples: 24791040 | consumed tokens: 50772049920 | elapsed time per iteration (s): 1.03 | learning rate: 4.253E-05 | global batch size: 256 | lm loss: 1.922808E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.756 | TFLOPs: 41.11 | 15: iteration 96850/ 125429 | consumed samples: 24793600 | consumed tokens: 50777292800 | elapsed time per iteration (s): 1.07 | learning rate: 4.252E-05 | global batch size: 256 | lm loss: 1.937506E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.278 | TFLOPs: 
39.54 | 15: iteration 96860/ 125429 | consumed samples: 24796160 | consumed tokens: 50782535680 | elapsed time per iteration (s): 1.03 | learning rate: 4.250E-05 | global batch size: 256 | lm loss: 1.906858E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.464 | TFLOPs: 41.06 | 15: iteration 96870/ 125429 | consumed samples: 24798720 | consumed tokens: 50787778560 | elapsed time per iteration (s): 1.05 | learning rate: 4.249E-05 | global batch size: 256 | lm loss: 1.900397E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.152 | TFLOPs: 40.35 | 15: iteration 96880/ 125429 | consumed samples: 24801280 | consumed tokens: 50793021440 | elapsed time per iteration (s): 1.04 | learning rate: 4.247E-05 | global batch size: 256 | lm loss: 1.897973E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.091 | TFLOPs: 40.50 | 15: iteration 96890/ 125429 | consumed samples: 24803840 | consumed tokens: 50798264320 | elapsed time per iteration (s): 1.03 | learning rate: 4.246E-05 | global batch size: 256 | lm loss: 1.904093E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.047 | TFLOPs: 40.99 | 15: iteration 96900/ 125429 | consumed samples: 24806400 | consumed tokens: 50803507200 | elapsed time per iteration (s): 1.02 | learning rate: 4.244E-05 | global batch size: 256 | lm loss: 1.896989E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.380 | TFLOPs: 41.38 | 15: iteration 96910/ 125429 | consumed samples: 24808960 | consumed tokens: 50808750080 | elapsed time per iteration (s): 1.05 | learning rate: 4.243E-05 | global batch size: 256 | lm loss: 1.938392E+00 | grad norm: 0.153 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.641 | TFLOPs: 40.26 | 15: iteration 96920/ 125429 | consumed samples: 24811520 | consumed tokens: 50813992960 | elapsed time per iteration (s): 1.02 | learning rate: 4.241E-05 | global batch size: 256 | lm loss: 1.902627E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.313 | TFLOPs: 41.37 | 15: iteration 96930/ 125429 | consumed samples: 24814080 | consumed tokens: 50819235840 | elapsed time per iteration (s): 1.03 | learning rate: 4.240E-05 | global batch size: 256 | lm loss: 1.918954E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.024 | TFLOPs: 40.99 | 15: iteration 96940/ 125429 | consumed samples: 24816640 | consumed tokens: 50824478720 | elapsed time per iteration (s): 1.06 | learning rate: 4.238E-05 | global batch size: 256 | lm loss: 1.898895E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.795 | TFLOPs: 39.79 | 15: iteration 96950/ 125429 | consumed samples: 24819200 | consumed tokens: 50829721600 | elapsed time per iteration (s): 1.05 | learning rate: 4.237E-05 | global batch size: 256 | lm loss: 1.899943E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.552 | TFLOPs: 40.25 | 15: iteration 96960/ 125429 | consumed samples: 24821760 | consumed tokens: 50834964480 | elapsed time per iteration (s): 1.04 | learning rate: 4.235E-05 | global batch size: 256 | lm loss: 1.915350E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.489 | TFLOPs: 40.57 | 15: iteration 96970/ 125429 | consumed samples: 24824320 | consumed tokens: 50840207360 | elapsed time per iteration (s): 1.03 | 
learning rate: 4.234E-05 | global batch size: 256 | lm loss: 1.917772E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.593 | TFLOPs: 41.25 | 15: iteration 96980/ 125429 | consumed samples: 24826880 | consumed tokens: 50845450240 | elapsed time per iteration (s): 1.05 | learning rate: 4.232E-05 | global batch size: 256 | lm loss: 1.932574E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.498 | TFLOPs: 40.41 | 15: iteration 96990/ 125429 | consumed samples: 24829440 | consumed tokens: 50850693120 | elapsed time per iteration (s): 1.04 | learning rate: 4.231E-05 | global batch size: 256 | lm loss: 1.908327E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.546 | TFLOPs: 40.74 | 15: iteration 97000/ 125429 | consumed samples: 24832000 | consumed tokens: 50855936000 | elapsed time per iteration (s): 1.05 | learning rate: 4.229E-05 | global batch size: 256 | lm loss: 1.940444E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.669 | TFLOPs: 40.27 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 97000 | lm loss value: 1.888625E+00 | lm loss PPL: 6.610273E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 97000 to checkpoints_1b5 0: [2022-11-27 00:46:13,251] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step97000 is begin to save! 0: [2022-11-27 00:46:13,257] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_01-model_00-model_states.pt... 
0: [2022-11-27 00:46:13,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_01-model_00-model_states.pt. 0: [2022-11-27 00:46:13,510] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_03-model_00-model_states.pt... 0: [2022-11-27 00:46:13,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_03-model_00-model_states.pt. 0: [2022-11-27 00:46:13,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_04-model_00-model_states.pt... 0: [2022-11-27 00:46:13,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_04-model_00-model_states.pt. 0: [2022-11-27 00:46:13,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_05-model_00-model_states.pt... 0: [2022-11-27 00:46:13,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_05-model_00-model_states.pt. 0: [2022-11-27 00:46:13,832] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_06-model_00-model_states.pt... 0: [2022-11-27 00:46:13,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_06-model_00-model_states.pt. 0: [2022-11-27 00:46:13,938] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_07-model_00-model_states.pt... 0: [2022-11-27 00:46:14,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_07-model_00-model_states.pt. 0: [2022-11-27 00:46:14,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_08-model_00-model_states.pt... 
0: [2022-11-27 00:46:14,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_08-model_00-model_states.pt. 0: [2022-11-27 00:46:14,153] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_09-model_00-model_states.pt... 0: [2022-11-27 00:46:14,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_09-model_00-model_states.pt. 0: [2022-11-27 00:46:14,257] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_10-model_00-model_states.pt... 0: [2022-11-27 00:46:14,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_10-model_00-model_states.pt. 0: [2022-11-27 00:46:14,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_11-model_00-model_states.pt... 0: [2022-11-27 00:46:14,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_11-model_00-model_states.pt. 0: [2022-11-27 00:46:14,469] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_12-model_00-model_states.pt... 0: [2022-11-27 00:46:14,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_12-model_00-model_states.pt. 0: [2022-11-27 00:46:14,575] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_13-model_00-model_states.pt... 0: [2022-11-27 00:46:14,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_13-model_00-model_states.pt. 0: [2022-11-27 00:46:14,679] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_14-model_00-model_states.pt... 
0: [2022-11-27 00:46:14,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_14-model_00-model_states.pt. 0: [2022-11-27 00:46:14,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_15-model_00-model_states.pt... 0: [2022-11-27 00:46:14,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_15-model_00-model_states.pt. 0: [2022-11-27 00:46:14,891] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_16-model_00-model_states.pt... 0: [2022-11-27 00:46:14,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_16-model_00-model_states.pt. 0: [2022-11-27 00:46:14,992] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_17-model_00-model_states.pt... 0: [2022-11-27 00:46:15,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_17-model_00-model_states.pt. 0: [2022-11-27 00:46:15,098] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_18-model_00-model_states.pt... 0: [2022-11-27 00:46:15,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_18-model_00-model_states.pt. 0: [2022-11-27 00:46:15,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_19-model_00-model_states.pt... 0: [2022-11-27 00:46:15,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_19-model_00-model_states.pt. 0: [2022-11-27 00:46:15,308] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_20-model_00-model_states.pt... 
0: [2022-11-27 00:46:15,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_20-model_00-model_states.pt. 0: [2022-11-27 00:46:15,411] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_21-model_00-model_states.pt... 0: [2022-11-27 00:46:15,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_21-model_00-model_states.pt. 0: [2022-11-27 00:46:15,515] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_22-model_00-model_states.pt... 0: [2022-11-27 00:46:15,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_22-model_00-model_states.pt. 0: [2022-11-27 00:46:15,620] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_23-model_00-model_states.pt... 0: [2022-11-27 00:46:15,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_23-model_00-model_states.pt. 0: [2022-11-27 00:46:15,723] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_24-model_00-model_states.pt... 0: [2022-11-27 00:46:15,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_24-model_00-model_states.pt. 0: [2022-11-27 00:46:15,827] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_25-model_00-model_states.pt... 0: [2022-11-27 00:46:15,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_25-model_00-model_states.pt. 0: [2022-11-27 00:46:15,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_26-model_00-model_states.pt... 
0: [2022-11-27 00:46:16,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_26-model_00-model_states.pt. 0: [2022-11-27 00:46:16,034] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_27-model_00-model_states.pt... 0: [2022-11-27 00:46:16,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_27-model_00-model_states.pt. 0: [2022-11-27 00:46:16,136] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_28-model_00-model_states.pt... 0: [2022-11-27 00:46:16,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_28-model_00-model_states.pt. 0: [2022-11-27 00:46:16,241] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_29-model_00-model_states.pt... 0: [2022-11-27 00:46:16,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_29-model_00-model_states.pt. 0: [2022-11-27 00:46:16,345] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_30-model_00-model_states.pt... 0: [2022-11-27 00:46:16,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_30-model_00-model_states.pt. 0: [2022-11-27 00:46:16,452] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/layer_32-model_00-model_states.pt... 0: [2022-11-27 00:46:16,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/layer_32-model_00-model_states.pt. 
0: [2022-11-27 00:46:16,458] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step97000/mp_rank_00_model_states.pt 0: [2022-11-27 00:46:16,458] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/mp_rank_00_model_states.pt... 0: [2022-11-27 00:46:16,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/mp_rank_00_model_states.pt. 0: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
1: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
5: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
9: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
14: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
6: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
2: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
8: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
14: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 11: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 5: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 00:46:16,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step97000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 15: [2022-11-27 00:46:16,658] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:46:16,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-27 00:46:16,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 00:46:16,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 2: [2022-11-27 00:46:16,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:46:16,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 00:46:16,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 2: [2022-11-27 00:46:16,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:46:16,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 00:46:16,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 11: [2022-11-27 00:46:16,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:46:16,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:46:16,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 00:46:16,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
1: [2022-11-27 00:46:16,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:46:16,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 00:46:16,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 1: [2022-11-27 00:46:16,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:46:16,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 00:46:16,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 2: [2022-11-27 00:46:16,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:46:16,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 00:46:16,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 7: [2022-11-27 00:46:16,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:46:16,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 00:46:16,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
7: [2022-11-27 00:46:16,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:46:16,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 00:46:16,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 15: [2022-11-27 00:46:16,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 00:46:16,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 15: [2022-11-27 00:46:16,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:46:16,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 00:46:16,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 4: [2022-11-27 00:46:16,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:46:16,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:46:16,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 00:46:16,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
4: [2022-11-27 00:46:16,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 00:46:16,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 4: [2022-11-27 00:46:16,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:46:16,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:46:16,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 00:46:16,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 00:46:16,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 4: [2022-11-27 00:46:16,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 15: [2022-11-27 00:46:16,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:46:16,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 00:46:16,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 6: [2022-11-27 00:46:16,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-27 00:46:16,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:46:16,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 00:46:16,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 00:46:16,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 6: [2022-11-27 00:46:16,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 7: [2022-11-27 00:46:16,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:46:16,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 00:46:16,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 9: [2022-11-27 00:46:16,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:46:16,675] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 00:46:16,675] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 6: [2022-11-27 00:46:16,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-27 00:46:16,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:46:16,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:46:16,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 00:46:16,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 00:46:16,677] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 00:46:16,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 6: [2022-11-27 00:46:16,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 6: [2022-11-27 00:46:16,677] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 6: [2022-11-27 00:46:16,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:46:16,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 00:46:16,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 3: [2022-11-27 00:46:16,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-27 00:46:16,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 00:46:16,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 6: [2022-11-27 00:46:16,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 00:46:16,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 00:46:16,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 9: [2022-11-27 00:46:16,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:46:16,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 00:46:16,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 0: [2022-11-27 00:46:16,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:46:16,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 00:46:16,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:46:16,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-27 00:46:16,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 0: [2022-11-27 00:46:16,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 00:46:16,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 00:46:16,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 0: [2022-11-27 00:46:16,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 13: [2022-11-27 00:46:16,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:46:16,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:46:16,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 00:46:16,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 8: [2022-11-27 00:46:16,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:46:16,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 00:46:16,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
8: [2022-11-27 00:46:16,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:46:16,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 00:46:16,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 9: [2022-11-27 00:46:16,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:46:16,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:46:16,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 00:46:16,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 2: [2022-11-27 00:46:16,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:46:16,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 00:46:16,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 9: [2022-11-27 00:46:16,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-27 00:46:16,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 00:46:16,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 2: [2022-11-27 00:46:16,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:46:16,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 00:46:16,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 3: [2022-11-27 00:46:16,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:46:16,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 00:46:16,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:46:16,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 3: [2022-11-27 00:46:16,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 00:46:16,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 3: [2022-11-27 00:46:16,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-27 00:46:16,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 9: [2022-11-27 00:46:16,683] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:46:16,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 3: [2022-11-27 00:46:16,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 9: [2022-11-27 00:46:16,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 3: [2022-11-27 00:46:16,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:46:16,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 00:46:16,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 3: [2022-11-27 00:46:16,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:46:16,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 00:46:16,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 6: [2022-11-27 00:46:16,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-27 00:46:16,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 11: [2022-11-27 00:46:16,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 00:46:16,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 6: [2022-11-27 00:46:16,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 11: [2022-11-27 00:46:16,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:46:16,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 00:46:16,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 11: [2022-11-27 00:46:16,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:46:16,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 00:46:16,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 11: [2022-11-27 00:46:16,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-27 00:46:16,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 00:46:16,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 11: [2022-11-27 00:46:16,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:46:16,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 00:46:16,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 11: [2022-11-27 00:46:16,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:46:16,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 00:46:16,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 3: [2022-11-27 00:46:16,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 00:46:16,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 00:46:16,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 7: [2022-11-27 00:46:16,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
0: [2022-11-27 00:46:16,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:46:16,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 00:46:16,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 2: [2022-11-27 00:46:16,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:46:16,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 00:46:16,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 2: [2022-11-27 00:46:16,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 3: [2022-11-27 00:46:16,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:46:16,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 3: [2022-11-27 00:46:16,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 00:46:16,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 9: [2022-11-27 00:46:16,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-27 00:46:16,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 00:46:16,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 9: [2022-11-27 00:46:16,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:46:16,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 00:46:16,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 2: [2022-11-27 00:46:16,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:46:16,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 00:46:16,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 15: [2022-11-27 00:46:16,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:46:16,690] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 00:46:16,690] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 7: [2022-11-27 00:46:16,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-27 00:46:16,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 00:46:16,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 5: [2022-11-27 00:46:16,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:46:16,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:46:16,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:46:16,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 00:46:16,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 00:46:16,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 00:46:16,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 5: [2022-11-27 00:46:16,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 5: [2022-11-27 00:46:16,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 7: [2022-11-27 00:46:16,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-27 00:46:16,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 00:46:16,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 9: [2022-11-27 00:46:16,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 00:46:16,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 00:46:16,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 1: [2022-11-27 00:46:16,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:46:16,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:46:16,706] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 00:46:16,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 2: [2022-11-27 00:46:16,706] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 00:46:16,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 00:46:16,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
7: [2022-11-27 00:46:16,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 00:46:16,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 00:46:16,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 15: [2022-11-27 00:46:16,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:46:16,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 00:46:16,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 8: [2022-11-27 00:46:16,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:46:16,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 8: [2022-11-27 00:46:16,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 13: [2022-11-27 00:46:16,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 8: [2022-11-27 00:46:16,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 13: [2022-11-27 00:46:16,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
8: [2022-11-27 00:46:16,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:46:16,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 8: [2022-11-27 00:46:16,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 13: [2022-11-27 00:46:16,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 8: [2022-11-27 00:46:16,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 8: [2022-11-27 00:46:16,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:46:16,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 00:46:16,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 8: [2022-11-27 00:46:16,695] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 00:46:16,695] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 00:46:16,695] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 5: [2022-11-27 00:46:16,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-27 00:46:16,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 00:46:16,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 5: [2022-11-27 00:46:16,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:46:16,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 00:46:16,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 5: [2022-11-27 00:46:16,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:46:16,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 00:46:16,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 5: [2022-11-27 00:46:16,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:46:16,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:46:16,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 00:46:16,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-27 00:46:16,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 5: [2022-11-27 00:46:16,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 0: [2022-11-27 00:46:16,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 00:46:16,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 00:46:16,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 5: [2022-11-27 00:46:16,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 0: [2022-11-27 00:46:16,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 0: [2022-11-27 00:46:16,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 0: [2022-11-27 00:46:16,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 15: [2022-11-27 00:46:16,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 00:46:16,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 15: [2022-11-27 00:46:16,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-27 00:46:16,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 00:46:16,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 15: [2022-11-27 00:46:16,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:46:16,710] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 00:46:16,710] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 15: [2022-11-27 00:46:16,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 00:46:16,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 00:46:16,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 1: [2022-11-27 00:46:16,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:46:16,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 00:46:16,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
1: [2022-11-27 00:46:16,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 00:46:16,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 11: [2022-11-27 00:46:16,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:46:16,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:46:16,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 00:46:16,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 4: [2022-11-27 00:46:16,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:46:16,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:46:16,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 00:46:16,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 00:46:16,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 4: [2022-11-27 00:46:16,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
4: [2022-11-27 00:46:16,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 00:46:16,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 00:46:16,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 1: [2022-11-27 00:46:16,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-27 00:46:16,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 00:46:16,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 00:46:16,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 00:46:16,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 12: [2022-11-27 00:46:16,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 00:46:16,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 12: [2022-11-27 00:46:16,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 12: [2022-11-27 00:46:16,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 12: [2022-11-27 00:46:16,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:46:16,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:46:16,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-27 00:46:16,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 00:46:16,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 00:46:16,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 12: [2022-11-27 00:46:16,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 00:46:16,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 12: [2022-11-27 00:46:16,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 12: [2022-11-27 00:46:16,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 00:46:16,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 00:46:16,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 1: [2022-11-27 00:46:16,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 00:46:16,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 1: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-27 00:46:16,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 1: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 00:46:16,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 14: [2022-11-27 00:46:16,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:46:16,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-27 00:46:16,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 00:46:16,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 00:46:16,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 00:46:16,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 00:46:16,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 00:46:16,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 00:46:16,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 14: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 14: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 14: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 14: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
14: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 14: [2022-11-27 00:46:16,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 14: [2022-11-27 00:46:16,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 00:46:16,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 00:46:16,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 13: [2022-11-27 00:46:16,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:46:16,680] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 00:46:16,680] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 13: [2022-11-27 00:46:16,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:46:16,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 00:46:16,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 13: [2022-11-27 00:46:16,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-27 00:46:16,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 00:46:16,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:46:16,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 13: [2022-11-27 00:46:16,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 00:46:16,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 13: [2022-11-27 00:46:16,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:46:16,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 00:46:16,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 13: [2022-11-27 00:46:16,707] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 00:46:16,707] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 00:46:16,707] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 
0: [2022-11-27 00:46:16,756] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 00:46:16,757] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 11: [2022-11-27 00:46:16,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 00:46:16,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 11: [2022-11-27 00:46:16,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 00:46:16,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 00:46:16,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 10: [2022-11-27 00:46:16,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:46:16,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:46:16,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:46:16,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:46:16,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-27 00:46:16,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:46:16,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:46:16,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 00:46:16,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 00:46:16,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 00:46:16,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 00:46:16,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 00:46:16,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 00:46:16,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 00:46:16,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 00:46:16,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step97000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 
00:46:16,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 10: [2022-11-27 00:46:16,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 10: [2022-11-27 00:46:16,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 10: [2022-11-27 00:46:16,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 10: [2022-11-27 00:46:16,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 10: [2022-11-27 00:46:16,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 10: [2022-11-27 00:46:16,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 10: [2022-11-27 00:46:16,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step97000 is ready now! 0: successfully saved checkpoint at iteration 97000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3575.90 15: iteration 97010/ 125429 | consumed samples: 24834560 | consumed tokens: 50861178880 | elapsed time per iteration (s): 1.43 | learning rate: 4.228E-05 | global batch size: 256 | lm loss: 1.934233E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 179.181 | TFLOPs: 29.61 | 15: iteration 97020/ 125429 | consumed samples: 24837120 | consumed tokens: 50866421760 | elapsed time per iteration (s): 1.04 | learning rate: 4.226E-05 | global batch size: 256 | lm loss: 1.898391E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.010 | TFLOPs: 40.66 | 15: iteration 97030/ 125429 | consumed samples: 24839680 | consumed tokens: 50871664640 | elapsed time per iteration (s): 1.02 | learning rate: 4.225E-05 | global batch size: 256 | lm loss: 
1.904383E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.605 | TFLOPs: 41.58 | 15: iteration 97040/ 125429 | consumed samples: 24842240 | consumed tokens: 50876907520 | elapsed time per iteration (s): 1.03 | learning rate: 4.223E-05 | global batch size: 256 | lm loss: 1.897494E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.858 | TFLOPs: 41.13 | 15: iteration 97050/ 125429 | consumed samples: 24844800 | consumed tokens: 50882150400 | elapsed time per iteration (s): 1.06 | learning rate: 4.222E-05 | global batch size: 256 | lm loss: 1.907003E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.233 | TFLOPs: 39.87 | 15: iteration 97060/ 125429 | consumed samples: 24847360 | consumed tokens: 50887393280 | elapsed time per iteration (s): 1.07 | learning rate: 4.220E-05 | global batch size: 256 | lm loss: 1.909705E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.328 | TFLOPs: 39.55 | 15: iteration 97070/ 125429 | consumed samples: 24849920 | consumed tokens: 50892636160 | elapsed time per iteration (s): 1.04 | learning rate: 4.219E-05 | global batch size: 256 | lm loss: 1.924217E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.163 | TFLOPs: 40.52 | 15: iteration 97080/ 125429 | consumed samples: 24852480 | consumed tokens: 50897879040 | elapsed time per iteration (s): 1.04 | learning rate: 4.217E-05 | global batch size: 256 | lm loss: 1.914358E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.800 | TFLOPs: 40.79 | 15: iteration 97090/ 125429 | consumed samples: 24855040 | consumed tokens: 
50903121920 | elapsed time per iteration (s): 1.03 | learning rate: 4.216E-05 | global batch size: 256 | lm loss: 1.938350E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.885 | TFLOPs: 41.13 | 15: iteration 97100/ 125429 | consumed samples: 24857600 | consumed tokens: 50908364800 | elapsed time per iteration (s): 1.14 | learning rate: 4.214E-05 | global batch size: 256 | lm loss: 1.900943E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 223.985 | TFLOPs: 37.02 | 15: iteration 97110/ 125429 | consumed samples: 24860160 | consumed tokens: 50913607680 | elapsed time per iteration (s): 1.03 | learning rate: 4.213E-05 | global batch size: 256 | lm loss: 1.909156E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.003 | TFLOPs: 40.98 | 15: iteration 97120/ 125429 | consumed samples: 24862720 | consumed tokens: 50918850560 | elapsed time per iteration (s): 1.06 | learning rate: 4.211E-05 | global batch size: 256 | lm loss: 1.916038E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.840 | TFLOPs: 39.97 | 15: iteration 97130/ 125429 | consumed samples: 24865280 | consumed tokens: 50924093440 | elapsed time per iteration (s): 1.05 | learning rate: 4.210E-05 | global batch size: 256 | lm loss: 1.933477E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.927 | TFLOPs: 40.15 | 15: iteration 97140/ 125429 | consumed samples: 24867840 | consumed tokens: 50929336320 | elapsed time per iteration (s): 1.07 | learning rate: 4.208E-05 | global batch size: 256 | lm loss: 1.916261E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per 
second: 238.464 | TFLOPs: 39.41 | 15: iteration 97150/ 125429 | consumed samples: 24870400 | consumed tokens: 50934579200 | elapsed time per iteration (s): 1.03 | learning rate: 4.207E-05 | global batch size: 256 | lm loss: 1.943800E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.821 | TFLOPs: 40.95 | 15: iteration 97160/ 125429 | consumed samples: 24872960 | consumed tokens: 50939822080 | elapsed time per iteration (s): 1.05 | learning rate: 4.205E-05 | global batch size: 256 | lm loss: 1.942792E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.967 | TFLOPs: 40.15 | 15: iteration 97170/ 125429 | consumed samples: 24875520 | consumed tokens: 50945064960 | elapsed time per iteration (s): 1.03 | learning rate: 4.204E-05 | global batch size: 256 | lm loss: 1.934240E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.401 | TFLOPs: 41.05 | 15: iteration 97180/ 125429 | consumed samples: 24878080 | consumed tokens: 50950307840 | elapsed time per iteration (s): 1.02 | learning rate: 4.202E-05 | global batch size: 256 | lm loss: 1.942212E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.246 | TFLOPs: 41.36 | 15: iteration 97190/ 125429 | consumed samples: 24880640 | consumed tokens: 50955550720 | elapsed time per iteration (s): 1.03 | learning rate: 4.201E-05 | global batch size: 256 | lm loss: 1.890114E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.392 | TFLOPs: 41.21 | 15: iteration 97200/ 125429 | consumed samples: 24883200 | consumed tokens: 50960793600 | elapsed time per iteration (s): 1.03 | learning rate: 4.199E-05 | global batch size: 256 | lm loss: 1.904836E+00 | grad 
norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.008 | TFLOPs: 41.15 | 15: iteration 97210/ 125429 | consumed samples: 24885760 | consumed tokens: 50966036480 | elapsed time per iteration (s): 1.03 | learning rate: 4.198E-05 | global batch size: 256 | lm loss: 1.918653E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.447 | TFLOPs: 40.89 | 15: iteration 97220/ 125429 | consumed samples: 24888320 | consumed tokens: 50971279360 | elapsed time per iteration (s): 1.05 | learning rate: 4.196E-05 | global batch size: 256 | lm loss: 1.873898E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.091 | TFLOPs: 40.34 | 15: iteration 97230/ 125429 | consumed samples: 24890880 | consumed tokens: 50976522240 | elapsed time per iteration (s): 1.02 | learning rate: 4.195E-05 | global batch size: 256 | lm loss: 1.895244E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.708 | TFLOPs: 41.60 | 15: iteration 97240/ 125429 | consumed samples: 24893440 | consumed tokens: 50981765120 | elapsed time per iteration (s): 1.03 | learning rate: 4.193E-05 | global batch size: 256 | lm loss: 1.899923E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.472 | TFLOPs: 40.90 | 15: iteration 97250/ 125429 | consumed samples: 24896000 | consumed tokens: 50987008000 | elapsed time per iteration (s): 1.02 | learning rate: 4.192E-05 | global batch size: 256 | lm loss: 1.925327E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.432 | TFLOPs: 41.39 | 15: iteration 97260/ 125429 | consumed samples: 24898560 | consumed tokens: 50992250880 | elapsed time 
per iteration (s): 1.03 | learning rate: 4.190E-05 | global batch size: 256 | lm loss: 1.889671E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.131 | TFLOPs: 41.01 | 15: iteration 97270/ 125429 | consumed samples: 24901120 | consumed tokens: 50997493760 | elapsed time per iteration (s): 1.02 | learning rate: 4.189E-05 | global batch size: 256 | lm loss: 1.916495E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.033 | TFLOPs: 41.32 | 15: iteration 97280/ 125429 | consumed samples: 24903680 | consumed tokens: 51002736640 | elapsed time per iteration (s): 1.03 | learning rate: 4.187E-05 | global batch size: 256 | lm loss: 1.911160E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.969 | TFLOPs: 40.98 | 15: iteration 97290/ 125429 | consumed samples: 24906240 | consumed tokens: 51007979520 | elapsed time per iteration (s): 1.04 | learning rate: 4.186E-05 | global batch size: 256 | lm loss: 1.897654E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.237 | TFLOPs: 40.53 | 15: iteration 97300/ 125429 | consumed samples: 24908800 | consumed tokens: 51013222400 | elapsed time per iteration (s): 1.04 | learning rate: 4.184E-05 | global batch size: 256 | lm loss: 1.898070E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.335 | TFLOPs: 40.71 | 15: iteration 97310/ 125429 | consumed samples: 24911360 | consumed tokens: 51018465280 | elapsed time per iteration (s): 1.04 | learning rate: 4.183E-05 | global batch size: 256 | lm loss: 1.925135E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.565 | TFLOPs: 
40.75 | 15: iteration 97320/ 125429 | consumed samples: 24913920 | consumed tokens: 51023708160 | elapsed time per iteration (s): 1.04 | learning rate: 4.182E-05 | global batch size: 256 | lm loss: 1.905926E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.970 | TFLOPs: 40.65 | 15: iteration 97330/ 125429 | consumed samples: 24916480 | consumed tokens: 51028951040 | elapsed time per iteration (s): 1.03 | learning rate: 4.180E-05 | global batch size: 256 | lm loss: 1.904514E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.236 | TFLOPs: 41.02 | 15: iteration 97340/ 125429 | consumed samples: 24919040 | consumed tokens: 51034193920 | elapsed time per iteration (s): 1.05 | learning rate: 4.179E-05 | global batch size: 256 | lm loss: 1.911271E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.855 | TFLOPs: 40.46 | 15: iteration 97350/ 125429 | consumed samples: 24921600 | consumed tokens: 51039436800 | elapsed time per iteration (s): 1.03 | learning rate: 4.177E-05 | global batch size: 256 | lm loss: 1.919108E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.645 | TFLOPs: 41.09 | 15: iteration 97360/ 125429 | consumed samples: 24924160 | consumed tokens: 51044679680 | elapsed time per iteration (s): 1.03 | learning rate: 4.176E-05 | global batch size: 256 | lm loss: 1.910263E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.570 | TFLOPs: 40.91 | 15: iteration 97370/ 125429 | consumed samples: 24926720 | consumed tokens: 51049922560 | elapsed time per iteration (s): 1.03 | learning rate: 4.174E-05 | global batch size: 256 | lm loss: 1.903315E+00 | grad norm: 0.150 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.942 | TFLOPs: 41.14 | 15: iteration 97380/ 125429 | consumed samples: 24929280 | consumed tokens: 51055165440 | elapsed time per iteration (s): 1.04 | learning rate: 4.173E-05 | global batch size: 256 | lm loss: 1.909321E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.269 | TFLOPs: 40.70 | 15: iteration 97390/ 125429 | consumed samples: 24931840 | consumed tokens: 51060408320 | elapsed time per iteration (s): 1.03 | learning rate: 4.171E-05 | global batch size: 256 | lm loss: 1.925463E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.462 | TFLOPs: 41.23 | 15: iteration 97400/ 125429 | consumed samples: 24934400 | consumed tokens: 51065651200 | elapsed time per iteration (s): 1.05 | learning rate: 4.170E-05 | global batch size: 256 | lm loss: 1.944418E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.704 | TFLOPs: 40.44 | 15: iteration 97410/ 125429 | consumed samples: 24936960 | consumed tokens: 51070894080 | elapsed time per iteration (s): 1.04 | learning rate: 4.168E-05 | global batch size: 256 | lm loss: 1.928650E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.282 | TFLOPs: 40.70 | 15: iteration 97420/ 125429 | consumed samples: 24939520 | consumed tokens: 51076136960 | elapsed time per iteration (s): 1.07 | learning rate: 4.167E-05 | global batch size: 256 | lm loss: 1.906833E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.508 | TFLOPs: 39.58 | 15: iteration 97430/ 125429 | consumed samples: 24942080 | consumed tokens: 51081379840 | elapsed time per iteration (s): 1.06 | 
learning rate: 4.165E-05 | global batch size: 256 | lm loss: 1.897671E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.952 | TFLOPs: 39.98 | 15: iteration 97440/ 125429 | consumed samples: 24944640 | consumed tokens: 51086622720 | elapsed time per iteration (s): 1.06 | learning rate: 4.164E-05 | global batch size: 256 | lm loss: 1.905577E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.039 | TFLOPs: 39.83 | 15: iteration 97450/ 125429 | consumed samples: 24947200 | consumed tokens: 51091865600 | elapsed time per iteration (s): 1.03 | learning rate: 4.162E-05 | global batch size: 256 | lm loss: 1.913571E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.016 | TFLOPs: 40.99 | 15: iteration 97460/ 125429 | consumed samples: 24949760 | consumed tokens: 51097108480 | elapsed time per iteration (s): 1.04 | learning rate: 4.161E-05 | global batch size: 256 | lm loss: 1.914715E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.326 | TFLOPs: 40.87 | 15: iteration 97470/ 125429 | consumed samples: 24952320 | consumed tokens: 51102351360 | elapsed time per iteration (s): 1.03 | learning rate: 4.159E-05 | global batch size: 256 | lm loss: 1.917211E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.519 | TFLOPs: 41.23 | 15: iteration 97480/ 125429 | consumed samples: 24954880 | consumed tokens: 51107594240 | elapsed time per iteration (s): 1.02 | learning rate: 4.158E-05 | global batch size: 256 | lm loss: 1.884537E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.034 | TFLOPs: 41.32 | 15: iteration 97490/ 
125429 | consumed samples: 24957440 | consumed tokens: 51112837120 | elapsed time per iteration (s): 1.07 | learning rate: 4.156E-05 | global batch size: 256 | lm loss: 1.908214E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.284 | TFLOPs: 39.71 | 15: iteration 97500/ 125429 | consumed samples: 24960000 | consumed tokens: 51118080000 | elapsed time per iteration (s): 1.04 | learning rate: 4.155E-05 | global batch size: 256 | lm loss: 1.899213E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.128 | TFLOPs: 40.67 | 15: iteration 97510/ 125429 | consumed samples: 24962560 | consumed tokens: 51123322880 | elapsed time per iteration (s): 1.03 | learning rate: 4.153E-05 | global batch size: 256 | lm loss: 1.892478E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.565 | TFLOPs: 41.24 | 15: iteration 97520/ 125429 | consumed samples: 24965120 | consumed tokens: 51128565760 | elapsed time per iteration (s): 1.02 | learning rate: 4.152E-05 | global batch size: 256 | lm loss: 1.920482E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.482 | TFLOPs: 41.56 | 15: iteration 97530/ 125429 | consumed samples: 24967680 | consumed tokens: 51133808640 | elapsed time per iteration (s): 1.03 | learning rate: 4.150E-05 | global batch size: 256 | lm loss: 1.920688E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.490 | TFLOPs: 41.23 | 15: iteration 97540/ 125429 | consumed samples: 24970240 | consumed tokens: 51139051520 | elapsed time per iteration (s): 1.03 | learning rate: 4.149E-05 | global batch size: 256 | lm loss: 1.925678E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 249.553 | TFLOPs: 41.24 | 15: iteration 97550/ 125429 | consumed samples: 24972800 | consumed tokens: 51144294400 | elapsed time per iteration (s): 1.03 | learning rate: 4.147E-05 | global batch size: 256 | lm loss: 1.913602E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.959 | TFLOPs: 40.98 | 15: iteration 97560/ 125429 | consumed samples: 24975360 | consumed tokens: 51149537280 | elapsed time per iteration (s): 1.05 | learning rate: 4.146E-05 | global batch size: 256 | lm loss: 1.892985E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.861 | TFLOPs: 40.30 | 15: iteration 97570/ 125429 | consumed samples: 24977920 | consumed tokens: 51154780160 | elapsed time per iteration (s): 1.04 | learning rate: 4.144E-05 | global batch size: 256 | lm loss: 1.928986E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.804 | TFLOPs: 40.79 | 15: iteration 97580/ 125429 | consumed samples: 24980480 | consumed tokens: 51160023040 | elapsed time per iteration (s): 1.04 | learning rate: 4.143E-05 | global batch size: 256 | lm loss: 1.913768E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.241 | TFLOPs: 40.69 | 15: iteration 97590/ 125429 | consumed samples: 24983040 | consumed tokens: 51165265920 | elapsed time per iteration (s): 1.02 | learning rate: 4.142E-05 | global batch size: 256 | lm loss: 1.914069E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.814 | TFLOPs: 41.28 | 15: iteration 97600/ 125429 | consumed samples: 24985600 | consumed tokens: 51170508800 | elapsed time per iteration (s): 1.03 | learning rate: 
4.140E-05 | global batch size: 256 | lm loss: 1.914799E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.556 | TFLOPs: 40.91 | 15: iteration 97610/ 125429 | consumed samples: 24988160 | consumed tokens: 51175751680 | elapsed time per iteration (s): 1.04 | learning rate: 4.139E-05 | global batch size: 256 | lm loss: 1.900689E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.622 | TFLOPs: 40.76 | 15: iteration 97620/ 125429 | consumed samples: 24990720 | consumed tokens: 51180994560 | elapsed time per iteration (s): 1.05 | learning rate: 4.137E-05 | global batch size: 256 | lm loss: 1.951738E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.248 | TFLOPs: 40.20 | 15: iteration 97630/ 125429 | consumed samples: 24993280 | consumed tokens: 51186237440 | elapsed time per iteration (s): 1.04 | learning rate: 4.136E-05 | global batch size: 256 | lm loss: 1.940198E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.968 | TFLOPs: 40.81 | 15: iteration 97640/ 125429 | consumed samples: 24995840 | consumed tokens: 51191480320 | elapsed time per iteration (s): 1.05 | learning rate: 4.134E-05 | global batch size: 256 | lm loss: 1.922112E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.304 | TFLOPs: 40.37 | 15: iteration 97650/ 125429 | consumed samples: 24998400 | consumed tokens: 51196723200 | elapsed time per iteration (s): 1.04 | learning rate: 4.133E-05 | global batch size: 256 | lm loss: 1.921087E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.958 | TFLOPs: 40.81 | 15: iteration 97660/ 125429 | 
consumed samples: 25000960 | consumed tokens: 51201966080 | elapsed time per iteration (s): 1.05 | learning rate: 4.131E-05 | global batch size: 256 | lm loss: 1.939284E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.214 | TFLOPs: 40.36 | 15: iteration 97670/ 125429 | consumed samples: 25003520 | consumed tokens: 51207208960 | elapsed time per iteration (s): 1.04 | learning rate: 4.130E-05 | global batch size: 256 | lm loss: 1.918858E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.304 | TFLOPs: 40.70 | 15: iteration 97680/ 125429 | consumed samples: 25006080 | consumed tokens: 51212451840 | elapsed time per iteration (s): 1.05 | learning rate: 4.128E-05 | global batch size: 256 | lm loss: 1.923034E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.053 | TFLOPs: 40.33 | 15: iteration 97690/ 125429 | consumed samples: 25008640 | consumed tokens: 51217694720 | elapsed time per iteration (s): 1.04 | learning rate: 4.127E-05 | global batch size: 256 | lm loss: 1.914908E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.974 | TFLOPs: 40.81 | 15: iteration 97700/ 125429 | consumed samples: 25011200 | consumed tokens: 51222937600 | elapsed time per iteration (s): 1.04 | learning rate: 4.125E-05 | global batch size: 256 | lm loss: 1.950599E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.490 | TFLOPs: 40.73 | 15: iteration 97710/ 125429 | consumed samples: 25013760 | consumed tokens: 51228180480 | elapsed time per iteration (s): 1.03 | learning rate: 4.124E-05 | global batch size: 256 | lm loss: 1.900838E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 248.038 | TFLOPs: 40.99 | 15: iteration 97720/ 125429 | consumed samples: 25016320 | consumed tokens: 51233423360 | elapsed time per iteration (s): 1.09 | learning rate: 4.122E-05 | global batch size: 256 | lm loss: 1.923602E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.713 | TFLOPs: 38.79 | 15: iteration 97730/ 125429 | consumed samples: 25018880 | consumed tokens: 51238666240 | elapsed time per iteration (s): 1.04 | learning rate: 4.121E-05 | global batch size: 256 | lm loss: 1.900864E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.901 | TFLOPs: 40.80 | 15: iteration 97740/ 125429 | consumed samples: 25021440 | consumed tokens: 51243909120 | elapsed time per iteration (s): 1.18 | learning rate: 4.119E-05 | global batch size: 256 | lm loss: 1.897412E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.698 | TFLOPs: 35.81 | 15: iteration 97750/ 125429 | consumed samples: 25024000 | consumed tokens: 51249152000 | elapsed time per iteration (s): 1.03 | learning rate: 4.118E-05 | global batch size: 256 | lm loss: 1.930637E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.430 | TFLOPs: 41.22 | 15: iteration 97760/ 125429 | consumed samples: 25026560 | consumed tokens: 51254394880 | elapsed time per iteration (s): 1.09 | learning rate: 4.117E-05 | global batch size: 256 | lm loss: 1.924735E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.524 | TFLOPs: 38.76 | 15: iteration 97770/ 125429 | consumed samples: 25029120 | consumed tokens: 51259637760 | elapsed time per iteration (s): 1.02 | learning rate: 4.115E-05 | global batch 
size: 256 | lm loss: 1.925030E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.762 | TFLOPs: 41.28 | 15: iteration 97780/ 125429 | consumed samples: 25031680 | consumed tokens: 51264880640 | elapsed time per iteration (s): 1.02 | learning rate: 4.114E-05 | global batch size: 256 | lm loss: 1.903091E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.629 | TFLOPs: 41.42 | 15: iteration 97790/ 125429 | consumed samples: 25034240 | consumed tokens: 51270123520 | elapsed time per iteration (s): 1.03 | learning rate: 4.112E-05 | global batch size: 256 | lm loss: 1.928684E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.523 | TFLOPs: 41.07 | 15: iteration 97800/ 125429 | consumed samples: 25036800 | consumed tokens: 51275366400 | elapsed time per iteration (s): 1.07 | learning rate: 4.111E-05 | global batch size: 256 | lm loss: 1.903670E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.635 | TFLOPs: 39.60 | 15: iteration 97810/ 125429 | consumed samples: 25039360 | consumed tokens: 51280609280 | elapsed time per iteration (s): 1.02 | learning rate: 4.109E-05 | global batch size: 256 | lm loss: 1.929262E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.500 | TFLOPs: 41.40 | 15: iteration 97820/ 125429 | consumed samples: 25041920 | consumed tokens: 51285852160 | elapsed time per iteration (s): 1.05 | learning rate: 4.108E-05 | global batch size: 256 | lm loss: 1.907216E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.109 | TFLOPs: 40.18 | 15: iteration 97830/ 125429 | consumed samples: 25044480 | 
consumed tokens: 51291095040 | elapsed time per iteration (s): 1.04 | learning rate: 4.106E-05 | global batch size: 256 | lm loss: 1.909523E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.426 | TFLOPs: 40.72 | 15: iteration 97840/ 125429 | consumed samples: 25047040 | consumed tokens: 51296337920 | elapsed time per iteration (s): 1.03 | learning rate: 4.105E-05 | global batch size: 256 | lm loss: 1.919811E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.525 | TFLOPs: 41.07 | 15: iteration 97850/ 125429 | consumed samples: 25049600 | consumed tokens: 51301580800 | elapsed time per iteration (s): 1.03 | learning rate: 4.103E-05 | global batch size: 256 | lm loss: 1.916666E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.911 | TFLOPs: 41.13 | 15: iteration 97860/ 125429 | consumed samples: 25052160 | consumed tokens: 51306823680 | elapsed time per iteration (s): 1.03 | learning rate: 4.102E-05 | global batch size: 256 | lm loss: 1.910832E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.203 | TFLOPs: 41.18 | 15: iteration 97870/ 125429 | consumed samples: 25054720 | consumed tokens: 51312066560 | elapsed time per iteration (s): 1.02 | learning rate: 4.100E-05 | global batch size: 256 | lm loss: 1.939465E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.421 | TFLOPs: 41.38 | 15: iteration 97880/ 125429 | consumed samples: 25057280 | consumed tokens: 51317309440 | elapsed time per iteration (s): 1.02 | learning rate: 4.099E-05 | global batch size: 256 | lm loss: 1.896700E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 
| samples per second: 250.831 | TFLOPs: 41.45 | 15: iteration 97890/ 125429 | consumed samples: 25059840 | consumed tokens: 51322552320 | elapsed time per iteration (s): 1.03 | learning rate: 4.098E-05 | global batch size: 256 | lm loss: 1.908552E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.434 | TFLOPs: 40.89 | 15: iteration 97900/ 125429 | consumed samples: 25062400 | consumed tokens: 51327795200 | elapsed time per iteration (s): 1.04 | learning rate: 4.096E-05 | global batch size: 256 | lm loss: 1.899090E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.032 | TFLOPs: 40.66 | 15: iteration 97910/ 125429 | consumed samples: 25064960 | consumed tokens: 51333038080 | elapsed time per iteration (s): 1.05 | learning rate: 4.095E-05 | global batch size: 256 | lm loss: 1.919707E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.740 | TFLOPs: 40.28 | 15: iteration 97920/ 125429 | consumed samples: 25067520 | consumed tokens: 51338280960 | elapsed time per iteration (s): 1.02 | learning rate: 4.093E-05 | global batch size: 256 | lm loss: 1.909575E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.011 | TFLOPs: 41.48 | 15: iteration 97930/ 125429 | consumed samples: 25070080 | consumed tokens: 51343523840 | elapsed time per iteration (s): 1.04 | learning rate: 4.092E-05 | global batch size: 256 | lm loss: 1.931281E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.259 | TFLOPs: 40.53 | 15: iteration 97940/ 125429 | consumed samples: 25072640 | consumed tokens: 51348766720 | elapsed time per iteration (s): 1.04 | learning rate: 4.090E-05 | global batch size: 256 | lm loss: 
1.931988E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.823 | TFLOPs: 40.62 | 15: iteration 97950/ 125429 | consumed samples: 25075200 | consumed tokens: 51354009600 | elapsed time per iteration (s): 1.04 | learning rate: 4.089E-05 | global batch size: 256 | lm loss: 1.903190E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.251 | TFLOPs: 40.86 | 15: iteration 97960/ 125429 | consumed samples: 25077760 | consumed tokens: 51359252480 | elapsed time per iteration (s): 1.04 | learning rate: 4.087E-05 | global batch size: 256 | lm loss: 1.945039E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.941 | TFLOPs: 40.64 | 15: iteration 97970/ 125429 | consumed samples: 25080320 | consumed tokens: 51364495360 | elapsed time per iteration (s): 1.04 | learning rate: 4.086E-05 | global batch size: 256 | lm loss: 1.909422E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.189 | TFLOPs: 40.52 | 15: iteration 97980/ 125429 | consumed samples: 25082880 | consumed tokens: 51369738240 | elapsed time per iteration (s): 1.03 | learning rate: 4.084E-05 | global batch size: 256 | lm loss: 1.937289E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.226 | TFLOPs: 41.02 | 15: iteration 97990/ 125429 | consumed samples: 25085440 | consumed tokens: 51374981120 | elapsed time per iteration (s): 1.06 | learning rate: 4.083E-05 | global batch size: 256 | lm loss: 1.912870E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.713 | TFLOPs: 39.94 | 0: [2022-11-27 01:03:38,362] [INFO] [logging.py:68:log_dist] [Rank 0] step=98000, 
skipped=0, lr=[4.081461791116172e-05, 4.081461791116172e-05, 4.081461791116172e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 98000/ 125429 | consumed samples: 25088000 | consumed tokens: 51380224000 | elapsed time per iteration (s): 1.07 | learning rate: 4.081E-05 | global batch size: 256 | lm loss: 1.881955E+00 | grad norm: 0.143 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.303 | TFLOPs: 39.55 | 0: steps: 98000 loss: 1.8475 iter time (s): 1.039 samples/sec: 246.425 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 98000 | lm loss value: 1.905949E+00 | lm loss PPL: 6.725787E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 98000 to checkpoints_1b5 0: [2022-11-27 01:03:38,715] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step98000 is begin to save! 0: [2022-11-27 01:03:38,722] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_01-model_00-model_states.pt... 0: [2022-11-27 01:03:38,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_01-model_00-model_states.pt. 0: [2022-11-27 01:03:38,982] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_03-model_00-model_states.pt... 0: [2022-11-27 01:03:39,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_03-model_00-model_states.pt. 0: [2022-11-27 01:03:39,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_04-model_00-model_states.pt... 0: [2022-11-27 01:03:39,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_04-model_00-model_states.pt. 
0: [2022-11-27 01:03:39,201] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_05-model_00-model_states.pt... 0: [2022-11-27 01:03:39,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_05-model_00-model_states.pt. 0: [2022-11-27 01:03:39,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_06-model_00-model_states.pt... 0: [2022-11-27 01:03:39,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_06-model_00-model_states.pt. 0: [2022-11-27 01:03:39,424] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_07-model_00-model_states.pt... 0: [2022-11-27 01:03:39,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_07-model_00-model_states.pt. 0: [2022-11-27 01:03:39,542] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_08-model_00-model_states.pt... 0: [2022-11-27 01:03:39,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_08-model_00-model_states.pt. 0: [2022-11-27 01:03:39,657] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_09-model_00-model_states.pt... 0: [2022-11-27 01:03:39,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_09-model_00-model_states.pt. 0: [2022-11-27 01:03:39,773] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_10-model_00-model_states.pt... 0: [2022-11-27 01:03:39,888] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_10-model_00-model_states.pt. 
0: [2022-11-27 01:03:39,889] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_11-model_00-model_states.pt... 0: [2022-11-27 01:03:40,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_11-model_00-model_states.pt. 0: [2022-11-27 01:03:40,005] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_12-model_00-model_states.pt... 0: [2022-11-27 01:03:40,120] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_12-model_00-model_states.pt. 0: [2022-11-27 01:03:40,120] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_13-model_00-model_states.pt... 0: [2022-11-27 01:03:40,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_13-model_00-model_states.pt. 0: [2022-11-27 01:03:40,238] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_14-model_00-model_states.pt... 0: [2022-11-27 01:03:40,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_14-model_00-model_states.pt. 0: [2022-11-27 01:03:40,356] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_15-model_00-model_states.pt... 0: [2022-11-27 01:03:40,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_15-model_00-model_states.pt. 0: [2022-11-27 01:03:40,475] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_16-model_00-model_states.pt... 0: [2022-11-27 01:03:40,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_16-model_00-model_states.pt. 
0: [2022-11-27 01:03:40,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_17-model_00-model_states.pt... 0: [2022-11-27 01:03:40,710] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_17-model_00-model_states.pt. 0: [2022-11-27 01:03:40,711] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_18-model_00-model_states.pt... 0: [2022-11-27 01:03:40,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_18-model_00-model_states.pt. 0: [2022-11-27 01:03:40,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_19-model_00-model_states.pt... 0: [2022-11-27 01:03:40,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_19-model_00-model_states.pt. 0: [2022-11-27 01:03:40,948] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_20-model_00-model_states.pt... 0: [2022-11-27 01:03:41,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_20-model_00-model_states.pt. 0: [2022-11-27 01:03:41,064] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_21-model_00-model_states.pt... 0: [2022-11-27 01:03:41,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_21-model_00-model_states.pt. 0: [2022-11-27 01:03:41,180] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_22-model_00-model_states.pt... 0: [2022-11-27 01:03:41,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_22-model_00-model_states.pt. 
0: [2022-11-27 01:03:41,291] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_23-model_00-model_states.pt... 0: [2022-11-27 01:03:41,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_23-model_00-model_states.pt. 0: [2022-11-27 01:03:41,407] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_24-model_00-model_states.pt... 0: [2022-11-27 01:03:41,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_24-model_00-model_states.pt. 0: [2022-11-27 01:03:41,519] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_25-model_00-model_states.pt... 0: [2022-11-27 01:03:41,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_25-model_00-model_states.pt. 0: [2022-11-27 01:03:41,632] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_26-model_00-model_states.pt... 0: [2022-11-27 01:03:41,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_26-model_00-model_states.pt. 0: [2022-11-27 01:03:41,742] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_27-model_00-model_states.pt... 0: [2022-11-27 01:03:41,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_27-model_00-model_states.pt. 0: [2022-11-27 01:03:41,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_28-model_00-model_states.pt... 0: [2022-11-27 01:03:41,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_28-model_00-model_states.pt. 
0: [2022-11-27 01:03:41,966] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_29-model_00-model_states.pt... 0: [2022-11-27 01:03:42,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_29-model_00-model_states.pt. 0: [2022-11-27 01:03:42,077] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_30-model_00-model_states.pt... 0: [2022-11-27 01:03:42,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_30-model_00-model_states.pt. 0: [2022-11-27 01:03:42,188] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/layer_32-model_00-model_states.pt... 0: [2022-11-27 01:03:42,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/layer_32-model_00-model_states.pt. 0: [2022-11-27 01:03:42,193] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step98000/mp_rank_00_model_states.pt 0: [2022-11-27 01:03:42,194] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/mp_rank_00_model_states.pt... 0: [2022-11-27 01:03:42,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/mp_rank_00_model_states.pt. 0: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
0: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
1: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
5: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
14: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
6: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
2: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
3: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
0: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
10: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:03:42,236] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step98000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:03:42,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:03:42,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 01:03:42,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 14: [2022-11-27 01:03:42,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:03:42,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 01:03:42,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 3: [2022-11-27 01:03:42,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-27 01:03:42,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 01:03:42,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 0: [2022-11-27 01:03:42,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:03:42,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 01:03:42,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 9: [2022-11-27 01:03:42,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:03:42,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 01:03:42,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 14: [2022-11-27 01:03:42,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:03:42,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 01:03:42,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 3: [2022-11-27 01:03:42,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-27 01:03:42,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 01:03:42,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 3: [2022-11-27 01:03:42,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:03:42,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 01:03:42,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 4: [2022-11-27 01:03:42,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:03:42,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 01:03:42,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 15: [2022-11-27 01:03:42,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:03:42,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:03:42,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-27 01:03:42,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 01:03:42,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 01:03:42,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 7: [2022-11-27 01:03:42,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 12: [2022-11-27 01:03:42,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:03:42,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 01:03:42,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 12: [2022-11-27 01:03:42,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:03:42,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 01:03:42,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 0: [2022-11-27 01:03:42,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-27 01:03:42,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 01:03:42,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 5: [2022-11-27 01:03:42,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:03:42,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 01:03:42,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 7: [2022-11-27 01:03:42,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:03:42,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:03:42,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
7: [2022-11-27 01:03:42,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 01:03:42,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 0: [2022-11-27 01:03:42,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 9: [2022-11-27 01:03:42,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:03:42,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 7: [2022-11-27 01:03:42,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 0: [2022-11-27 01:03:42,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 9: [2022-11-27 01:03:42,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 01:03:42,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 8: [2022-11-27 01:03:42,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:03:42,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 01:03:42,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
11: [2022-11-27 01:03:42,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:03:42,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:03:42,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:03:42,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 01:03:42,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:03:42,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:03:42,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
8: [2022-11-27 01:03:42,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 01:03:42,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 11: [2022-11-27 01:03:42,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 8: [2022-11-27 01:03:42,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 01:03:42,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 8: [2022-11-27 01:03:42,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 11: [2022-11-27 01:03:42,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 8: [2022-11-27 01:03:42,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 1: [2022-11-27 01:03:42,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:03:42,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:03:42,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 01:03:42,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
0: [2022-11-27 01:03:42,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:03:42,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 01:03:42,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 10: [2022-11-27 01:03:42,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:03:42,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 01:03:42,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 4: [2022-11-27 01:03:42,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:03:42,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:03:42,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 01:03:42,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 4: [2022-11-27 01:03:42,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 01:03:42,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
14: [2022-11-27 01:03:42,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:03:42,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 01:03:42,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 10: [2022-11-27 01:03:42,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:03:42,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:03:42,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 01:03:42,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 01:03:42,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 10: [2022-11-27 01:03:42,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 15: [2022-11-27 01:03:42,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 01:03:42,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 15: [2022-11-27 01:03:42,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-27 01:03:42,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 01:03:42,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 15: [2022-11-27 01:03:42,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:03:42,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 01:03:42,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 15: [2022-11-27 01:03:42,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:03:42,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 01:03:42,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 12: [2022-11-27 01:03:42,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:03:42,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-27 01:03:42,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 01:03:42,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 01:03:42,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 12: [2022-11-27 01:03:42,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 9: [2022-11-27 01:03:42,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:03:42,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 01:03:42,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 3: [2022-11-27 01:03:42,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:03:42,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 01:03:42,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 15: [2022-11-27 01:03:42,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-27 01:03:42,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 01:03:42,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 15: [2022-11-27 01:03:42,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:03:42,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 01:03:42,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 0: [2022-11-27 01:03:42,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:03:42,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:03:42,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 5: [2022-11-27 01:03:42,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:03:42,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 0: [2022-11-27 01:03:42,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 15: [2022-11-27 01:03:42,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
5: [2022-11-27 01:03:42,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 01:03:42,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:03:42,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 5: [2022-11-27 01:03:42,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:03:42,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 01:03:42,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 01:03:42,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 5: [2022-11-27 01:03:42,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 9: [2022-11-27 01:03:42,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:03:42,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 01:03:42,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 3: [2022-11-27 01:03:42,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-27 01:03:42,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:03:42,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 01:03:42,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 01:03:42,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 3: [2022-11-27 01:03:42,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 12: [2022-11-27 01:03:42,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:03:42,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 01:03:42,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 5: [2022-11-27 01:03:42,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:03:42,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 01:03:42,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 9: [2022-11-27 01:03:42,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-27 01:03:42,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:03:42,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 01:03:42,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 01:03:42,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 9: [2022-11-27 01:03:42,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 0: [2022-11-27 01:03:42,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:03:42,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:03:42,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 01:03:42,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 0: [2022-11-27 01:03:42,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:03:42,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
0: [2022-11-27 01:03:42,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 9: [2022-11-27 01:03:42,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 13: [2022-11-27 01:03:42,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:03:42,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:03:42,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:03:42,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 9: [2022-11-27 01:03:42,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 13: [2022-11-27 01:03:42,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 01:03:42,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 13: [2022-11-27 01:03:42,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:03:42,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 01:03:42,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
13: [2022-11-27 01:03:42,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:03:42,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 01:03:42,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 13: [2022-11-27 01:03:42,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:03:42,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 01:03:42,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 14: [2022-11-27 01:03:42,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:03:42,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 01:03:42,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 7: [2022-11-27 01:03:42,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:03:42,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 01:03:42,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
14: [2022-11-27 01:03:42,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:03:42,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:03:42,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 01:03:42,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 01:03:42,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 14: [2022-11-27 01:03:42,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 10: [2022-11-27 01:03:42,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:03:42,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:03:42,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 01:03:42,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 01:03:42,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 10: [2022-11-27 01:03:42,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
7: [2022-11-27 01:03:42,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:03:42,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 01:03:42,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 7: [2022-11-27 01:03:42,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:03:42,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 01:03:42,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 15: [2022-11-27 01:03:42,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:03:42,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 01:03:42,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 5: [2022-11-27 01:03:42,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:03:42,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 01:03:42,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
1: [2022-11-27 01:03:42,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 01:03:42,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 1: [2022-11-27 01:03:42,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:03:42,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:03:42,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 01:03:42,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 01:03:42,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 1: [2022-11-27 01:03:42,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 11: [2022-11-27 01:03:42,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:03:42,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 01:03:42,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 11: [2022-11-27 01:03:42,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-27 01:03:42,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 01:03:42,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 11: [2022-11-27 01:03:42,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:03:42,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 01:03:42,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 11: [2022-11-27 01:03:42,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:03:42,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 01:03:42,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 11: [2022-11-27 01:03:42,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:03:42,440] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 01:03:42,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 11: [2022-11-27 01:03:42,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-27 01:03:42,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 01:03:42,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 1: [2022-11-27 01:03:42,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:03:42,442] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 01:03:42,442] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 8: [2022-11-27 01:03:42,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 01:03:42,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 01:03:42,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 8: [2022-11-27 01:03:42,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 13: [2022-11-27 01:03:42,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:03:42,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 01:03:42,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
13: [2022-11-27 01:03:42,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:03:42,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 01:03:42,440] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 8: [2022-11-27 01:03:42,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:03:42,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:03:42,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:03:42,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 01:03:42,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 8: [2022-11-27 01:03:42,437] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:03:42,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 7: [2022-11-27 01:03:42,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 8: [2022-11-27 01:03:42,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
7: [2022-11-27 01:03:42,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 13: [2022-11-27 01:03:42,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:03:42,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 01:03:42,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 14: [2022-11-27 01:03:42,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:03:42,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 01:03:42,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 4: [2022-11-27 01:03:42,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:03:42,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 01:03:42,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 4: [2022-11-27 01:03:42,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-27 01:03:42,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 01:03:42,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 4: [2022-11-27 01:03:42,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:03:42,429] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 01:03:42,429] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 4: [2022-11-27 01:03:42,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:03:42,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 01:03:42,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 4: [2022-11-27 01:03:42,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:03:42,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 01:03:42,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 12: [2022-11-27 01:03:42,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-27 01:03:42,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:03:42,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:03:42,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 01:03:42,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 01:03:42,454] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 01:03:42,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 12: [2022-11-27 01:03:42,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 12: [2022-11-27 01:03:42,454] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 13: [2022-11-27 01:03:42,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:03:42,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 01:03:42,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 9: [2022-11-27 01:03:42,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-27 01:03:42,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 01:03:42,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 10: [2022-11-27 01:03:42,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:03:42,459] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 01:03:42,459] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 10: [2022-11-27 01:03:42,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:03:42,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 01:03:42,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 10: [2022-11-27 01:03:42,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:03:42,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 01:03:42,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 5: [2022-11-27 01:03:42,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-27 01:03:42,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 01:03:42,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 3: [2022-11-27 01:03:42,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:03:42,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:03:42,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 01:03:42,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 01:03:42,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 3: [2022-11-27 01:03:42,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 0: [2022-11-27 01:03:42,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 01:03:42,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 6: [2022-11-27 01:03:42,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-27 01:03:42,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 01:03:42,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 6: [2022-11-27 01:03:42,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:03:42,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 01:03:42,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 6: [2022-11-27 01:03:42,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:03:42,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:03:42,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 01:03:42,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 6: [2022-11-27 01:03:42,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 01:03:42,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now! 
1: [2022-11-27 01:03:42,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt
1: [2022-11-27 01:03:42,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now!
1: [2022-11-27 01:03:42,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt.
1: [2022-11-27 01:03:42,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt
1: [2022-11-27 01:03:42,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now!
1: [2022-11-27 01:03:42,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt.
1: [2022-11-27 01:03:42,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt
1: [2022-11-27 01:03:42,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now!
1: [2022-11-27 01:03:42,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt.
1: [2022-11-27 01:03:42,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt
1: [2022-11-27 01:03:42,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now!
6: [2022-11-27 01:03:42,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt.
6: [2022-11-27 01:03:42,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt.
6: [2022-11-27 01:03:42,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt
6: [2022-11-27 01:03:42,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt.
6: [2022-11-27 01:03:42,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt
6: [2022-11-27 01:03:42,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now!
6: [2022-11-27 01:03:42,524] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt.
6: [2022-11-27 01:03:42,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now!
6: [2022-11-27 01:03:42,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt
6: [2022-11-27 01:03:42,524] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt
6: [2022-11-27 01:03:42,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now!
6: [2022-11-27 01:03:42,524] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now!
2: [2022-11-27 01:03:42,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt.
2: [2022-11-27 01:03:42,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt.
2: [2022-11-27 01:03:42,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt.
2: [2022-11-27 01:03:42,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt.
2: [2022-11-27 01:03:42,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt
2: [2022-11-27 01:03:42,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt
2: [2022-11-27 01:03:42,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt.
2: [2022-11-27 01:03:42,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt
2: [2022-11-27 01:03:42,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now!
2: [2022-11-27 01:03:42,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now!
2: [2022-11-27 01:03:42,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt
2: [2022-11-27 01:03:42,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt.
2: [2022-11-27 01:03:42,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt
2: [2022-11-27 01:03:42,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt.
2: [2022-11-27 01:03:42,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now!
2: [2022-11-27 01:03:42,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now!
2: [2022-11-27 01:03:42,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt
2: [2022-11-27 01:03:42,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt
2: [2022-11-27 01:03:42,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now!
2: [2022-11-27 01:03:42,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now!
2: [2022-11-27 01:03:42,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now!
2: [2022-11-27 01:03:42,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt.
2: [2022-11-27 01:03:42,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step98000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt
2: [2022-11-27 01:03:42,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step98000 is ready now!
0: successfully saved checkpoint at iteration 98000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3928.53
15: iteration 98010/ 125429 | consumed samples: 25090560 | consumed tokens: 51385466880 | elapsed time per iteration (s): 1.47 | learning rate: 4.080E-05 | global batch size: 256 | lm loss: 1.938283E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 174.198 | TFLOPs: 28.79 |
15: iteration 98020/ 125429 | consumed samples: 25093120 | consumed tokens: 51390709760 | elapsed time per iteration (s): 1.04 | learning rate: 4.079E-05 | global batch size: 256 | lm loss: 1.870421E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.507 | TFLOPs: 40.57 |
15: iteration 98030/ 125429 | consumed samples: 25095680 | consumed tokens: 51395952640 | elapsed time per iteration (s): 1.03 | learning rate: 4.077E-05 | global batch size: 256 | lm loss: 1.927733E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.471 | TFLOPs: 40.90 |
15: iteration 98040/ 125429 | consumed samples: 25098240 | consumed tokens: 51401195520 | elapsed time per iteration (s): 1.04 | learning rate: 4.076E-05 | global batch size: 256 | lm loss: 1.925341E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.299 | TFLOPs: 40.87 |
15: iteration 98050/ 125429 | consumed samples: 25100800 | consumed tokens: 51406438400 | elapsed time per iteration (s): 1.05 | learning rate: 4.074E-05 | global batch size: 256 | lm loss: 1.931516E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.408 | TFLOPs: 40.39 |
15: iteration 98060/ 125429 | consumed samples: 25103360 | consumed tokens: 51411681280 | elapsed time per iteration (s): 1.06 | learning rate: 4.073E-05 | global batch size: 256 | lm loss: 1.920582E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.029 | TFLOPs: 40.00 |
15: iteration 98070/ 125429 | consumed samples: 25105920 | consumed tokens: 51416924160 | elapsed time per iteration (s): 1.07 | learning rate: 4.071E-05 | global batch size: 256 | lm loss: 1.945362E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.434 | TFLOPs: 39.57 |
15: iteration 98080/ 125429 | consumed samples: 25108480 | consumed tokens: 51422167040 | elapsed time per iteration (s): 1.04 | learning rate: 4.070E-05 | global batch size: 256 | lm loss: 1.895607E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.836 | TFLOPs: 40.63 |
15: iteration 98090/ 125429 | consumed samples: 25111040 | consumed tokens: 51427409920 | elapsed time per iteration (s): 1.03 | learning rate: 4.068E-05 | global batch size: 256 | lm loss: 1.928122E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.423 | TFLOPs: 40.89 |
15: iteration 98100/ 125429 | consumed samples: 25113600 | consumed tokens: 51432652800 | elapsed time per iteration (s): 1.05 | learning rate: 4.067E-05 | global batch size: 256 | lm loss: 1.905262E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.138 | TFLOPs: 40.35 |
15: iteration 98110/ 125429 | consumed samples: 25116160 | consumed tokens: 51437895680 | elapsed time per iteration (s): 1.06 | learning rate: 4.065E-05 | global batch size: 256 | lm loss: 1.925731E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.115 | TFLOPs: 39.85 |
15: iteration 98120/ 125429 | consumed samples: 25118720 | consumed tokens: 51443138560 | elapsed time per iteration (s): 1.05 | learning rate: 4.064E-05 | global batch size: 256 | lm loss: 1.903391E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.074 | TFLOPs: 40.34 |
15: iteration 98130/ 125429 | consumed samples: 25121280 | consumed tokens: 51448381440 | elapsed time per iteration (s): 1.08 | learning rate: 4.063E-05 | global batch size: 256 | lm loss: 1.897859E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.329 | TFLOPs: 39.06 |
15: iteration 98140/ 125429 | consumed samples: 25123840 | consumed tokens: 51453624320 | elapsed time per iteration (s): 1.08 | learning rate: 4.061E-05 | global batch size: 256 | lm loss: 1.909221E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.267 | TFLOPs: 39.21 |
15: iteration 98150/ 125429 | consumed samples: 25126400 | consumed tokens: 51458867200 | elapsed time per iteration (s): 1.08 | learning rate: 4.060E-05 | global batch size: 256 | lm loss: 1.913825E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.482 | TFLOPs: 39.25 |
15: iteration 98160/ 125429 | consumed samples: 25128960 | consumed tokens: 51464110080 | elapsed time per iteration (s): 1.03 | learning rate: 4.058E-05 | global batch size: 256 | lm loss: 1.896215E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.367 | TFLOPs: 41.04 |
15: iteration 98170/ 125429 | consumed samples: 25131520 | consumed tokens: 51469352960 | elapsed time per iteration (s): 1.07 | learning rate: 4.057E-05 | global batch size: 256 | lm loss: 1.921564E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.432 | TFLOPs: 39.57 |
15: iteration 98180/ 125429 | consumed samples: 25134080 | consumed tokens: 51474595840 | elapsed time per iteration (s): 1.05 | learning rate: 4.055E-05 | global batch size: 256 | lm loss: 1.937029E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.366 | TFLOPs: 40.38 |
15: iteration 98190/ 125429 | consumed samples: 25136640 | consumed tokens: 51479838720 | elapsed time per iteration (s): 1.02 | learning rate: 4.054E-05 | global batch size: 256 | lm loss: 1.911558E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.549 | TFLOPs: 41.41 |
15: iteration 98200/ 125429 | consumed samples: 25139200 | consumed tokens: 51485081600 | elapsed time per iteration (s): 1.05 | learning rate: 4.052E-05 | global batch size: 256 | lm loss: 1.921461E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.346 | TFLOPs: 40.21 |
15: iteration 98210/ 125429 | consumed samples: 25141760 | consumed tokens: 51490324480 | elapsed time per iteration (s): 1.07 | learning rate: 4.051E-05 | global batch size: 256 | lm loss: 1.923068E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.383 | TFLOPs: 39.39 |
15: iteration 98220/ 125429 | consumed samples: 25144320 | consumed tokens: 51495567360 | elapsed time per iteration (s): 1.03 | learning rate: 4.050E-05 | global batch size: 256 | lm loss: 1.899525E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.163 | TFLOPs: 41.18 |
15: iteration 98230/ 125429 | consumed samples: 25146880 | consumed tokens: 51500810240 | elapsed time per iteration (s): 1.04 | learning rate: 4.048E-05 | global batch size: 256 | lm loss: 1.923702E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.110 | TFLOPs: 40.51 |
15: iteration 98240/ 125429 | consumed samples: 25149440 | consumed tokens: 51506053120 | elapsed time per iteration (s): 1.04 | learning rate: 4.047E-05 | global batch size: 256 | lm loss: 1.899768E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.097 | TFLOPs: 40.50 |
15: iteration 98250/ 125429 | consumed samples: 25152000 | consumed tokens: 51511296000 | elapsed time per iteration (s): 1.02 | learning rate: 4.045E-05 | global batch size: 256 | lm loss: 1.910191E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.959 | TFLOPs: 41.31 |
15: iteration 98260/ 125429 | consumed samples: 25154560 | consumed tokens: 51516538880 | elapsed time per iteration (s): 1.02 | learning rate: 4.044E-05 | global batch size: 256 | lm loss: 1.931651E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.774 | TFLOPs: 41.28 |
15: iteration 98270/ 125429 | consumed samples: 25157120 | consumed tokens: 51521781760 | elapsed time per iteration (s): 1.09 | learning rate: 4.042E-05 | global batch size: 256 | lm loss: 1.908810E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.895 | TFLOPs: 38.65 |
15: iteration 98280/ 125429 | consumed samples: 25159680 | consumed tokens: 51527024640 | elapsed time per iteration (s): 1.03 | learning rate: 4.041E-05 | global batch size: 256 | lm loss: 1.942134E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.813 | TFLOPs: 40.95 |
15: iteration 98290/ 125429 | consumed samples: 25162240 | consumed tokens: 51532267520 | elapsed time per iteration (s): 1.04 | learning rate: 4.039E-05 | global batch size: 256 | lm loss: 1.891386E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.059 | TFLOPs: 40.66 |
15: iteration 98300/ 125429 | consumed samples: 25164800 | consumed tokens: 51537510400 | elapsed time per iteration (s): 1.06 | learning rate: 4.038E-05 | global batch size: 256 | lm loss: 1.893424E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.415 | TFLOPs: 40.06 |
15: iteration 98310/ 125429 | consumed samples: 25167360 | consumed tokens: 51542753280 | elapsed time per iteration (s): 1.07 | learning rate: 4.037E-05 | global batch size: 256 | lm loss: 1.924596E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.206 | TFLOPs: 39.37 |
15: iteration 98320/ 125429 | consumed samples: 25169920 | consumed tokens: 51547996160 | elapsed time per iteration (s): 1.08 | learning rate: 4.035E-05 | global batch size: 256 | lm loss: 1.909893E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.116 | TFLOPs: 39.19 |
15: iteration 98330/ 125429 | consumed samples: 25172480 | consumed tokens: 51553239040 | elapsed time per iteration (s): 1.03 | learning rate: 4.034E-05 | global batch size: 256 | lm loss: 1.922334E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.401 | TFLOPs: 41.22 |
15: iteration 98340/ 125429 | consumed samples: 25175040 | consumed tokens: 51558481920 | elapsed time per iteration (s): 1.06 | learning rate: 4.032E-05 | global batch size: 256 | lm loss: 1.894945E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.613 | TFLOPs: 39.76 |
15: iteration 98350/ 125429 | consumed samples: 25177600 | consumed tokens: 51563724800 | elapsed time per iteration (s): 1.18 | learning rate: 4.031E-05 | global batch size: 256 | lm loss: 1.919204E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.438 | TFLOPs: 35.93 |
15: iteration 98360/ 125429 | consumed samples: 25180160 | consumed tokens: 51568967680 | elapsed time per iteration (s): 1.05 | learning rate: 4.029E-05 | global batch size: 256 | lm loss: 1.912150E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.238 | TFLOPs: 40.20 |
15: iteration 98370/ 125429 | consumed samples: 25182720 | consumed tokens: 51574210560 | elapsed time per iteration (s): 1.07 | learning rate: 4.028E-05 | global batch size: 256 | lm loss: 1.909753E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.916 | TFLOPs: 39.48 |
15: iteration 98380/ 125429 | consumed samples: 25185280 | consumed tokens: 51579453440 | elapsed time per iteration (s): 1.03 | learning rate: 4.026E-05 | global batch size: 256 | lm loss: 1.949070E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.713 | TFLOPs: 40.94 |
15: iteration 98390/ 125429 | consumed samples: 25187840 | consumed tokens: 51584696320 | elapsed time per iteration (s): 1.02 | learning rate: 4.025E-05 | global batch size: 256 | lm loss: 1.908140E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.279 | TFLOPs: 41.36 |
15: iteration 98400/ 125429 | consumed samples: 25190400 | consumed tokens: 51589939200 | elapsed time per iteration (s): 1.02 | learning rate: 4.024E-05 | global batch size: 256 | lm loss: 1.915099E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.959 | TFLOPs: 41.31 |
15: iteration 98410/ 125429 | consumed samples: 25192960 | consumed tokens: 51595182080 | elapsed time per iteration (s): 1.06 | learning rate: 4.022E-05 | global batch size: 256 | lm loss: 1.889263E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.901 | TFLOPs: 39.98 |
15: iteration 98420/ 125429 | consumed samples: 25195520 | consumed tokens: 51600424960 | elapsed time per iteration (s): 1.04 | learning rate: 4.021E-05 | global batch size: 256 | lm loss: 1.937167E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.138 | TFLOPs: 40.51 |
15: iteration 98430/ 125429 | consumed samples: 25198080 | consumed tokens: 51605667840 | elapsed time per iteration (s): 1.05 | learning rate: 4.019E-05 | global batch size: 256 | lm loss: 1.887765E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.050 | TFLOPs: 40.33 |
15: iteration 98440/ 125429 | consumed samples: 25200640 | consumed tokens: 51610910720 | elapsed time per iteration (s): 1.19 | learning rate: 4.018E-05 | global batch size: 256 | lm loss: 1.931783E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.787 | TFLOPs: 35.66 |
15: iteration 98450/ 125429 | consumed samples: 25203200 | consumed tokens: 51616153600 | elapsed time per iteration (s): 1.06 | learning rate: 4.016E-05 | global batch size: 256 | lm loss: 1.920846E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.591 | TFLOPs: 39.76 |
15: iteration 98460/ 125429 | consumed samples: 25205760 | consumed tokens: 51621396480 | elapsed time per iteration (s): 1.02 | learning rate: 4.015E-05 | global batch size: 256 | lm loss: 1.891609E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.195 | TFLOPs: 41.35 |
15: iteration 98470/ 125429 | consumed samples: 25208320 | consumed tokens: 51626639360 | elapsed time per iteration (s): 1.15 | learning rate: 4.014E-05 | global batch size: 256 | lm loss: 1.921703E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 223.410 | TFLOPs: 36.92 |
15: iteration 98480/ 125429 | consumed samples: 25210880 | consumed tokens: 51631882240 | elapsed time per iteration (s): 1.04 | learning rate: 4.012E-05 | global batch size: 256 | lm loss: 1.935905E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.324 | TFLOPs: 40.71 |
15: iteration 98490/ 125429 | consumed samples: 25213440 | consumed tokens: 51637125120 | elapsed time per iteration (s): 1.02 | learning rate: 4.011E-05 | global batch size: 256 | lm loss: 1.924847E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.591 | TFLOPs: 41.41 |
15: iteration 98500/ 125429 | consumed samples: 25216000 | consumed tokens: 51642368000 | elapsed time per iteration (s): 1.07 | learning rate: 4.009E-05 | global batch size: 256 | lm loss: 1.922090E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.396 | TFLOPs: 39.56 |
15: iteration 98510/ 125429 | consumed samples: 25218560 | consumed tokens: 51647610880 | elapsed time per iteration (s): 1.05 | learning rate: 4.008E-05 | global batch size: 256 | lm loss: 1.901364E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.496 | TFLOPs: 40.24 |
15: iteration 98520/ 125429 | consumed samples: 25221120 | consumed tokens: 51652853760 | elapsed time per iteration (s): 1.04 | learning rate: 4.006E-05 | global batch size: 256 | lm loss: 1.898787E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.153 | TFLOPs: 40.84 |
15: iteration 98530/ 125429 | consumed samples: 25223680 | consumed tokens: 51658096640 | elapsed time per iteration (s): 1.03 | learning rate: 4.005E-05 | global batch size: 256 | lm loss: 1.914881E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.315 | TFLOPs: 41.04 |
15: iteration 98540/ 125429 | consumed samples: 25226240 | consumed tokens: 51663339520 | elapsed time per iteration (s): 1.03 | learning rate: 4.003E-05 | global batch size: 256 | lm loss: 1.895303E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.500 | TFLOPs: 40.90 |
15: iteration 98550/ 125429 | consumed samples: 25228800 | consumed tokens: 51668582400 | elapsed time per iteration (s): 1.02 | learning rate: 4.002E-05 | global batch size: 256 | lm loss: 1.917195E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.514 | TFLOPs: 41.40 |
15: iteration 98560/ 125429 | consumed samples: 25231360 | consumed tokens: 51673825280 | elapsed time per iteration (s): 1.05 | learning rate: 4.001E-05 | global batch size: 256 | lm loss: 1.906080E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.371 | TFLOPs: 40.22 |
15: iteration 98570/ 125429 | consumed samples: 25233920 | consumed tokens: 51679068160 | elapsed time per iteration (s): 1.07 | learning rate: 3.999E-05 | global batch size: 256 | lm loss: 1.946188E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.096 | TFLOPs: 39.51 |
15: iteration 98580/ 125429 | consumed samples: 25236480 | consumed tokens: 51684311040 | elapsed time per iteration (s): 1.03 | learning rate: 3.998E-05 | global batch size: 256 | lm loss: 1.898013E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.045 | TFLOPs: 41.16 |
15: iteration 98590/ 125429 | consumed samples: 25239040 | consumed tokens: 51689553920 | elapsed time per iteration (s): 1.03 | learning rate: 3.996E-05 | global batch size: 256 | lm loss: 1.881969E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.062 | TFLOPs: 41.16 |
15: iteration 98600/ 125429 | consumed samples: 25241600 | consumed tokens: 51694796800 | elapsed time per iteration (s): 1.04 | learning rate: 3.995E-05 | global batch size: 256 | lm loss: 1.925871E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.332 | TFLOPs: 40.87 |
15: iteration 98610/ 125429 | consumed samples: 25244160 | consumed tokens: 51700039680 | elapsed time per iteration (s): 1.03 | learning rate: 3.993E-05 | global batch size: 256 | lm loss: 1.897579E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.653 | TFLOPs: 41.26 |
15: iteration 98620/ 125429 | consumed samples: 25246720 | consumed tokens: 51705282560 | elapsed time per iteration (s): 1.03 | learning rate: 3.992E-05 | global batch size: 256 | lm loss: 1.912561E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.120 | TFLOPs: 41.00 |
15: iteration 98630/ 125429 | consumed samples: 25249280 | consumed tokens: 51710525440 | elapsed time per iteration (s): 1.06 | learning rate: 3.991E-05 | global batch size: 256 | lm loss: 1.912571E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.844 | TFLOPs: 39.80 |
15: iteration 98640/ 125429 | consumed samples: 25251840 | consumed tokens: 51715768320 | elapsed time per iteration (s): 1.02 | learning rate: 3.989E-05 | global batch size: 256 | lm loss: 1.878471E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.980 | TFLOPs: 41.64 |
15: iteration 98650/ 125429 | consumed samples: 25254400 | consumed tokens: 51721011200 | elapsed time per iteration (s): 1.06 | learning rate: 3.988E-05 | global batch size: 256 | lm loss: 1.893016E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.876 | TFLOPs: 39.97 |
15: iteration 98660/ 125429 | consumed samples: 25256960 | consumed tokens: 51726254080 | elapsed time per iteration (s): 1.02 | learning rate: 3.986E-05 | global batch size: 256 | lm loss: 1.891910E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.846 | TFLOPs: 41.29 |
15: iteration 98670/ 125429 | consumed samples: 25259520 | consumed tokens: 51731496960 | elapsed time per iteration (s): 1.03 | learning rate: 3.985E-05 | global batch size: 256 | lm loss: 1.936396E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.374 | TFLOPs: 41.21 |
15: iteration 98680/ 125429 | consumed samples: 25262080 | consumed tokens: 51736739840 | elapsed time per iteration (s): 1.08 | learning rate: 3.983E-05 | global batch size: 256 | lm loss: 1.905960E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.437 | TFLOPs: 39.24 |
15: iteration 98690/ 125429 | consumed samples: 25264640 | consumed tokens: 51741982720 | elapsed time per iteration (s): 1.03 | learning rate: 3.982E-05 | global batch size: 256 | lm loss: 1.903767E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.989 | TFLOPs: 40.98 |
15: iteration 98700/ 125429 | consumed samples: 25267200 | consumed tokens: 51747225600 | elapsed time per iteration (s): 1.09 | learning rate: 3.981E-05 | global batch size: 256 | lm loss: 1.899021E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.290 | TFLOPs: 38.88 |
15: iteration 98710/ 125429 | consumed samples: 25269760 | consumed tokens: 51752468480 | elapsed time per iteration (s): 1.05 | learning rate: 3.979E-05 | global batch size: 256 | lm loss: 1.918130E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.066 | TFLOPs: 40.33 |
15: iteration 98720/ 125429 | consumed samples: 25272320 | consumed tokens: 51757711360 | elapsed time per iteration (s): 1.04 | learning rate: 3.978E-05 | global batch size: 256 | lm loss: 1.900359E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.532 | TFLOPs: 40.74 |
15: iteration 98730/ 125429 | consumed samples: 25274880 | consumed tokens: 51762954240 | elapsed time per iteration (s): 1.03 | learning rate: 3.976E-05 | global batch size: 256 | lm loss: 1.881641E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.665 | TFLOPs: 41.09 |
15: iteration 98740/ 125429 | consumed samples: 25277440 | consumed tokens: 51768197120 | elapsed time per iteration (s): 1.07 | learning rate: 3.975E-05 | global batch size: 256 | lm loss: 1.888116E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.720 | TFLOPs: 39.62 |
15: iteration 98750/ 125429 | consumed samples: 25280000 | consumed tokens: 51773440000 | elapsed time per iteration (s): 1.07 | learning rate: 3.973E-05 | global batch size: 256 | lm loss: 1.918222E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.369 | TFLOPs: 39.72 |
15: iteration 98760/ 125429 | consumed samples: 25282560 | consumed tokens: 51778682880 | elapsed time per iteration (s): 1.05 | learning rate: 3.972E-05 | global batch size: 256 | lm loss: 1.882542E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.933 | TFLOPs: 40.48 |
15: iteration 98770/ 125429 | consumed samples: 25285120 | consumed tokens: 51783925760 | elapsed time per iteration (s): 1.03 | learning rate: 3.971E-05 | global batch size: 256 | lm loss: 1.923332E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.955 | TFLOPs: 40.98 |
15: iteration 98780/ 125429 | consumed samples: 25287680 | consumed tokens: 51789168640 | elapsed time per iteration (s): 1.22 | learning rate: 3.969E-05 | global batch size: 256 | lm loss: 1.897088E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 209.338 | TFLOPs: 34.59 |
15: iteration 98790/ 125429 | consumed samples: 25290240 | consumed tokens: 51794411520 | elapsed time per iteration (s): 1.03 | learning rate: 3.968E-05 | global batch size: 256 | lm loss: 1.910509E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.414 | TFLOPs: 41.22 |
15: iteration 98800/ 125429 | consumed samples: 25292800 | consumed tokens: 51799654400 | elapsed time per iteration (s): 1.04 | learning rate: 3.966E-05 | global batch size: 256 | lm loss: 1.909717E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.991 | TFLOPs: 40.82 |
15: iteration 98810/ 125429 | consumed samples: 25295360 | consumed tokens: 51804897280 | elapsed time per iteration (s): 1.05 | learning rate: 3.965E-05 | global batch size: 256 | lm loss: 1.924008E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.888 | TFLOPs: 40.30 |
15: iteration 98820/ 125429 | consumed samples: 25297920 | consumed tokens: 51810140160 | elapsed time per iteration (s): 1.08 | learning rate: 3.964E-05 | global batch size: 256 | lm loss: 1.941811E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.659 | TFLOPs: 39.27 |
15: iteration 98830/ 125429 | consumed samples: 25300480 | consumed tokens: 51815383040 | elapsed time per iteration (s): 1.03 | learning rate: 3.962E-05 | global batch size: 256 | lm loss: 1.890541E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.635 | TFLOPs: 40.92 |
15: iteration 98840/ 125429 | consumed samples: 25303040 | consumed tokens: 51820625920 | elapsed time per iteration (s): 1.05 | learning rate: 3.961E-05 | global batch size: 256 | lm loss: 1.933549E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.629 | TFLOPs: 40.43 |
15: iteration 98850/ 125429 | consumed samples: 25305600 | consumed tokens: 51825868800 | elapsed time per iteration (s): 1.03 | learning rate: 3.959E-05 | global batch size: 256 | lm loss: 1.949123E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.850 | TFLOPs: 41.12 |
15: iteration 98860/ 125429 | consumed samples: 25308160 | consumed tokens: 51831111680 | elapsed time per iteration (s): 1.09 | learning rate: 3.958E-05 | global batch size: 256 | lm loss: 1.926170E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.817 | TFLOPs: 38.81 |
15: iteration 98870/ 125429 | consumed samples: 25310720 | consumed tokens: 51836354560 | elapsed time per iteration (s): 1.04 | learning rate: 3.956E-05 | global batch size: 256 | lm loss: 1.883614E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.292 | TFLOPs: 40.87 |
15: iteration 98880/ 125429 | consumed samples: 25313280 | consumed tokens: 51841597440 | elapsed time per iteration (s): 1.10 | learning rate: 3.955E-05 | global batch size: 256 | lm loss: 1.916641E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.352 | TFLOPs: 38.56 |
15: iteration 98890/ 125429 | consumed samples: 25315840 | consumed tokens: 51846840320 | elapsed time per iteration (s): 1.06 | learning rate: 3.954E-05 | global batch size: 256 | lm loss: 1.936733E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.796 | TFLOPs: 39.79 |
15: iteration 98900/ 125429 | consumed samples: 25318400 | consumed tokens: 51852083200 | elapsed time per iteration (s): 1.03 | learning rate: 3.952E-05 | global batch size: 256 | lm loss: 1.913949E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.108 | TFLOPs: 41.00 |
15: iteration 98910/ 125429 | consumed samples: 25320960 | consumed tokens: 51857326080 | elapsed time per iteration (s): 1.05 | learning rate: 3.951E-05 | global batch size: 256 | lm loss: 1.927095E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.631 | TFLOPs: 40.26 |
15: iteration 98920/ 125429 | consumed samples: 25323520 | consumed tokens: 51862568960 | elapsed time per iteration (s): 1.02 | learning rate: 3.949E-05 | global batch size: 256 | lm loss: 1.969438E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.976 | TFLOPs: 41.31 |
15: iteration 98930/ 125429 | consumed samples: 25326080 | consumed tokens: 51867811840 | elapsed time per iteration (s): 1.06 | learning rate: 3.948E-05 | global batch size: 256 | lm loss: 1.902999E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.516 | TFLOPs: 40.08 |
15: iteration 98940/ 125429 | consumed samples: 25328640 | consumed tokens: 51873054720 | elapsed time per iteration (s): 1.04 | learning rate: 3.947E-05 | global batch size: 256 | lm loss: 1.920161E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.676 | TFLOPs: 40.60 |
15: iteration 98950/ 125429 | consumed samples: 25331200 | consumed tokens: 51878297600 | elapsed time per iteration (s): 1.03 | learning rate: 3.945E-05 | global batch size: 256 | lm loss: 1.904771E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.049 | TFLOPs: 41.16 |
15: iteration 98960/ 125429 | consumed samples: 25333760 | consumed tokens: 51883540480 | elapsed time per iteration (s): 1.04 | learning rate: 3.944E-05 | global batch size: 256 | lm loss: 1.887976E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.943 | TFLOPs: 40.81 |
15: iteration 98970/ 125429 | consumed samples: 25336320 | consumed tokens: 51888783360 | elapsed time per iteration (s): 1.02 |
learning rate: 3.942E-05 | global batch size: 256 | lm loss: 1.906524E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.917 | TFLOPs: 41.47 | 15: iteration 98980/ 125429 | consumed samples: 25338880 | consumed tokens: 51894026240 | elapsed time per iteration (s): 1.03 | learning rate: 3.941E-05 | global batch size: 256 | lm loss: 1.917373E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.653 | TFLOPs: 41.09 | 15: iteration 98990/ 125429 | consumed samples: 25341440 | consumed tokens: 51899269120 | elapsed time per iteration (s): 1.04 | learning rate: 3.939E-05 | global batch size: 256 | lm loss: 1.914392E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.226 | TFLOPs: 40.69 | 15: iteration 99000/ 125429 | consumed samples: 25344000 | consumed tokens: 51904512000 | elapsed time per iteration (s): 1.04 | learning rate: 3.938E-05 | global batch size: 256 | lm loss: 1.913981E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.263 | TFLOPs: 40.86 | 15: ------------------------------------------------------------------------------------------- 15: valid loss at iteration 99000 | lm loss value: 1.976692E+00 | lm loss PPL: 7.218827E+00 | 15: ------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 99000 to checkpoints_1b5 0: [2022-11-27 01:21:14,961] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step99000 is begin to save! 0: [2022-11-27 01:21:14,969] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_01-model_00-model_states.pt... 
0: [2022-11-27 01:21:15,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_01-model_00-model_states.pt. 0: [2022-11-27 01:21:15,224] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_03-model_00-model_states.pt... 0: [2022-11-27 01:21:15,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_03-model_00-model_states.pt. 0: [2022-11-27 01:21:15,343] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_04-model_00-model_states.pt... 0: [2022-11-27 01:21:15,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_04-model_00-model_states.pt. 0: [2022-11-27 01:21:15,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_05-model_00-model_states.pt... 0: [2022-11-27 01:21:15,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_05-model_00-model_states.pt. 0: [2022-11-27 01:21:15,580] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_06-model_00-model_states.pt... 0: [2022-11-27 01:21:15,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_06-model_00-model_states.pt. 0: [2022-11-27 01:21:15,701] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_07-model_00-model_states.pt... 0: [2022-11-27 01:21:15,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_07-model_00-model_states.pt. 0: [2022-11-27 01:21:15,818] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_08-model_00-model_states.pt... 
0: [2022-11-27 01:21:15,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_08-model_00-model_states.pt. 0: [2022-11-27 01:21:15,936] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_09-model_00-model_states.pt... 0: [2022-11-27 01:21:16,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_09-model_00-model_states.pt. 0: [2022-11-27 01:21:16,054] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_10-model_00-model_states.pt... 0: [2022-11-27 01:21:16,169] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_10-model_00-model_states.pt. 0: [2022-11-27 01:21:16,170] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_11-model_00-model_states.pt... 0: [2022-11-27 01:21:16,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_11-model_00-model_states.pt. 0: [2022-11-27 01:21:16,292] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_12-model_00-model_states.pt... 0: [2022-11-27 01:21:16,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_12-model_00-model_states.pt. 0: [2022-11-27 01:21:16,410] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_13-model_00-model_states.pt... 0: [2022-11-27 01:21:16,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_13-model_00-model_states.pt. 0: [2022-11-27 01:21:16,529] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_14-model_00-model_states.pt... 
0: [2022-11-27 01:21:16,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_14-model_00-model_states.pt. 0: [2022-11-27 01:21:16,648] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_15-model_00-model_states.pt... 0: [2022-11-27 01:21:16,759] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_15-model_00-model_states.pt. 0: [2022-11-27 01:21:16,759] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_16-model_00-model_states.pt... 0: [2022-11-27 01:21:16,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_16-model_00-model_states.pt. 0: [2022-11-27 01:21:16,869] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_17-model_00-model_states.pt... 0: [2022-11-27 01:21:16,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_17-model_00-model_states.pt. 0: [2022-11-27 01:21:16,981] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_18-model_00-model_states.pt... 0: [2022-11-27 01:21:17,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_18-model_00-model_states.pt. 0: [2022-11-27 01:21:17,091] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_19-model_00-model_states.pt... 0: [2022-11-27 01:21:17,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_19-model_00-model_states.pt. 0: [2022-11-27 01:21:17,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_20-model_00-model_states.pt... 
0: [2022-11-27 01:21:17,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_20-model_00-model_states.pt. 0: [2022-11-27 01:21:17,321] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_21-model_00-model_states.pt... 0: [2022-11-27 01:21:17,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_21-model_00-model_states.pt. 0: [2022-11-27 01:21:17,433] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_22-model_00-model_states.pt... 0: [2022-11-27 01:21:17,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_22-model_00-model_states.pt. 0: [2022-11-27 01:21:17,548] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_23-model_00-model_states.pt... 0: [2022-11-27 01:21:17,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_23-model_00-model_states.pt. 0: [2022-11-27 01:21:17,662] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_24-model_00-model_states.pt... 0: [2022-11-27 01:21:17,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_24-model_00-model_states.pt. 0: [2022-11-27 01:21:17,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_25-model_00-model_states.pt... 0: [2022-11-27 01:21:17,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_25-model_00-model_states.pt. 0: [2022-11-27 01:21:17,887] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_26-model_00-model_states.pt... 
0: [2022-11-27 01:21:17,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_26-model_00-model_states.pt. 0: [2022-11-27 01:21:17,995] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_27-model_00-model_states.pt... 0: [2022-11-27 01:21:18,110] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_27-model_00-model_states.pt. 0: [2022-11-27 01:21:18,110] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_28-model_00-model_states.pt... 0: [2022-11-27 01:21:18,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_28-model_00-model_states.pt. 0: [2022-11-27 01:21:18,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_29-model_00-model_states.pt... 0: [2022-11-27 01:21:18,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_29-model_00-model_states.pt. 0: [2022-11-27 01:21:18,327] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_30-model_00-model_states.pt... 0: [2022-11-27 01:21:18,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_30-model_00-model_states.pt. 0: [2022-11-27 01:21:18,436] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/layer_32-model_00-model_states.pt... 0: [2022-11-27 01:21:18,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/layer_32-model_00-model_states.pt. 
0: [2022-11-27 01:21:18,442] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step99000/mp_rank_00_model_states.pt 0: [2022-11-27 01:21:18,442] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/mp_rank_00_model_states.pt... 0: [2022-11-27 01:21:18,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/mp_rank_00_model_states.pt. 0: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 
15: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
0: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
9: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
3: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
14: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
3: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:21:18,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step99000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:21:18,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
7: [2022-11-27 01:21:18,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:21:18,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 01:21:18,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 6: [2022-11-27 01:21:18,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:21:18,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 01:21:18,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 2: [2022-11-27 01:21:18,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:21:18,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 01:21:18,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 10: [2022-11-27 01:21:18,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:21:18,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 01:21:18,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
2: [2022-11-27 01:21:18,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:21:18,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 14: [2022-11-27 01:21:18,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:21:18,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 14: [2022-11-27 01:21:18,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 01:21:18,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 5: [2022-11-27 01:21:18,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:21:18,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 01:21:18,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 5: [2022-11-27 01:21:18,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:21:18,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 01:21:18,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
9: [2022-11-27 01:21:18,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:21:18,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 01:21:18,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 10: [2022-11-27 01:21:18,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:21:18,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:21:18,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 01:21:18,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 01:21:18,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:21:18,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 10: [2022-11-27 01:21:18,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 10: [2022-11-27 01:21:18,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-27 01:21:18,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 01:21:18,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 6: [2022-11-27 01:21:18,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:21:18,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 10: [2022-11-27 01:21:18,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 6: [2022-11-27 01:21:18,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 9: [2022-11-27 01:21:18,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:21:18,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 9: [2022-11-27 01:21:18,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 01:21:18,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 7: [2022-11-27 01:21:18,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-27 01:21:18,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 01:21:18,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 14: [2022-11-27 01:21:18,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:21:18,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 01:21:18,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 12: [2022-11-27 01:21:18,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:21:18,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:21:18,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 01:21:18,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 01:21:18,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 12: [2022-11-27 01:21:18,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 4: [2022-11-27 01:21:18,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-27 01:21:18,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 01:21:18,671] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 4: [2022-11-27 01:21:18,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:21:18,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 01:21:18,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 1: [2022-11-27 01:21:18,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:21:18,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:21:18,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 7: [2022-11-27 01:21:18,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:21:18,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 01:21:18,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
7: [2022-11-27 01:21:18,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 1: [2022-11-27 01:21:18,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 7: [2022-11-27 01:21:18,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 6: [2022-11-27 01:21:18,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:21:18,674] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 01:21:18,674] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 8: [2022-11-27 01:21:18,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:21:18,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 01:21:18,662] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 12: [2022-11-27 01:21:18,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:21:18,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 01:21:18,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
7: [2022-11-27 01:21:18,676] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:21:18,676] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 01:21:18,676] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 11: [2022-11-27 01:21:18,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 01:21:18,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 11: [2022-11-27 01:21:18,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:21:18,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 01:21:18,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 11: [2022-11-27 01:21:18,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:21:18,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 01:21:18,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 11: [2022-11-27 01:21:18,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-27 01:21:18,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 01:21:18,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 11: [2022-11-27 01:21:18,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:21:18,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 01:21:18,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 11: [2022-11-27 01:21:18,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:21:18,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 01:21:18,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 2: [2022-11-27 01:21:18,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:21:18,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 01:21:18,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 11: [2022-11-27 01:21:18,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-27 01:21:18,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 01:21:18,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 10: [2022-11-27 01:21:18,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:21:18,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 14: [2022-11-27 01:21:18,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:21:18,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 14: [2022-11-27 01:21:18,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 01:21:18,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 7: [2022-11-27 01:21:18,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:21:18,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 01:21:18,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 10: [2022-11-27 01:21:18,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-27 01:21:18,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:21:18,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:21:18,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:21:18,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 10: [2022-11-27 01:21:18,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 01:21:18,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 9: [2022-11-27 01:21:18,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 10: [2022-11-27 01:21:18,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 10: [2022-11-27 01:21:18,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 9: [2022-11-27 01:21:18,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 14: [2022-11-27 01:21:18,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 14: [2022-11-27 01:21:18,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
12: [2022-11-27 01:21:18,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:21:18,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 12: [2022-11-27 01:21:18,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 14: [2022-11-27 01:21:18,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 12: [2022-11-27 01:21:18,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 12: [2022-11-27 01:21:18,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:21:18,683] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 01:21:18,683] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 1: [2022-11-27 01:21:18,684] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:21:18,684] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 01:21:18,684] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 5: [2022-11-27 01:21:18,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-27 01:21:18,685] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 01:21:18,685] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 9: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:21:18,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 6: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:21:18,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 6: [2022-11-27 01:21:18,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 5: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
2: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:21:18,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 2: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:21:18,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 12: [2022-11-27 01:21:18,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 5: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 5: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:21:18,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 12: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 2: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
5: [2022-11-27 01:21:18,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 2: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 5: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 15: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:21:18,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 01:21:18,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 01:21:18,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 01:21:18,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 01:21:18,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
15: [2022-11-27 01:21:18,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 15: [2022-11-27 01:21:18,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 15: [2022-11-27 01:21:18,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 9: [2022-11-27 01:21:18,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:21:18,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 01:21:18,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 2: [2022-11-27 01:21:18,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:21:18,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 01:21:18,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 15: [2022-11-27 01:21:18,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:21:18,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 2: [2022-11-27 01:21:18,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:21:18,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
2: [2022-11-27 01:21:18,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 01:21:18,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 2: [2022-11-27 01:21:18,689] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:21:18,689] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 01:21:18,689] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 1: [2022-11-27 01:21:18,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:21:18,691] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 01:21:18,691] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 14: [2022-11-27 01:21:18,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:21:18,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-27 01:21:18,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 01:21:18,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 01:21:18,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 14: [2022-11-27 01:21:18,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 14: [2022-11-27 01:21:18,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:21:18,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 01:21:18,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 4: [2022-11-27 01:21:18,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:21:18,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:21:18,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-27 01:21:18,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 01:21:18,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 01:21:18,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 01:21:18,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 4: [2022-11-27 01:21:18,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 4: [2022-11-27 01:21:18,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 1: [2022-11-27 01:21:18,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:21:18,693] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 01:21:18,693] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 1: [2022-11-27 01:21:18,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:21:18,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 01:21:18,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
1: [2022-11-27 01:21:18,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:21:18,694] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:21:18,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 01:21:18,694] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 01:21:18,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 1: [2022-11-27 01:21:18,694] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 8: [2022-11-27 01:21:18,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:21:18,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:21:18,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 01:21:18,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 01:21:18,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 8: [2022-11-27 01:21:18,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
8: [2022-11-27 01:21:18,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:21:18,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:21:18,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 01:21:18,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 01:21:18,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 8: [2022-11-27 01:21:18,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 8: [2022-11-27 01:21:18,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:21:18,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 01:21:18,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 8: [2022-11-27 01:21:18,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:21:18,687] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 01:21:18,687] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
13: [2022-11-27 01:21:18,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:21:18,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:21:18,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:21:18,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:21:18,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 01:21:18,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 01:21:18,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 01:21:18,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 13: [2022-11-27 01:21:18,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:21:18,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 13: [2022-11-27 01:21:18,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-27 01:21:18,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 01:21:18,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 13: [2022-11-27 01:21:18,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:21:18,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 01:21:18,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 01:21:18,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 13: [2022-11-27 01:21:18,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 01:21:18,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 13: [2022-11-27 01:21:18,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 13: [2022-11-27 01:21:18,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 9: [2022-11-27 01:21:18,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-27 01:21:18,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 01:21:18,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 6: [2022-11-27 01:21:18,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:21:18,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 01:21:18,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:21:18,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 6: [2022-11-27 01:21:18,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 01:21:18,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 5: [2022-11-27 01:21:18,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:21:18,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-27 01:21:18,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 01:21:18,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 01:21:18,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 5: [2022-11-27 01:21:18,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 5: [2022-11-27 01:21:18,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:21:18,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 01:21:18,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 6: [2022-11-27 01:21:18,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:21:18,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 01:21:18,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 7: [2022-11-27 01:21:18,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-27 01:21:18,700] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 01:21:18,700] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 12: [2022-11-27 01:21:18,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:21:18,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:21:18,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 01:21:18,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 01:21:18,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 12: [2022-11-27 01:21:18,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 7: [2022-11-27 01:21:18,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:21:18,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 01:21:18,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 8: [2022-11-27 01:21:18,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-27 01:21:18,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 01:21:18,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 11: [2022-11-27 01:21:18,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:21:18,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 01:21:18,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 0: [2022-11-27 01:21:18,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:21:18,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 01:21:18,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 0: [2022-11-27 01:21:18,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:21:18,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 01:21:18,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 0: [2022-11-27 01:21:18,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-27 01:21:18,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 01:21:18,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 0: [2022-11-27 01:21:18,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:21:18,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 01:21:18,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 7: [2022-11-27 01:21:18,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:21:18,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 01:21:18,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 13: [2022-11-27 01:21:18,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:21:18,714] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 01:21:18,714] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 15: [2022-11-27 01:21:18,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-27 01:21:18,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 01:21:18,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 15: [2022-11-27 01:21:18,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:21:18,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 01:21:18,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 9: [2022-11-27 01:21:18,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:21:18,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:21:18,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 01:21:18,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 01:21:18,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 9: [2022-11-27 01:21:18,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 15: [2022-11-27 01:21:18,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-27 01:21:18,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 01:21:18,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 4: [2022-11-27 01:21:18,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:21:18,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 01:21:18,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 4: [2022-11-27 01:21:18,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:21:18,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 01:21:18,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 4: [2022-11-27 01:21:18,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:21:18,703] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 01:21:18,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 3: [2022-11-27 01:21:18,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-27 01:21:18,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:21:18,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:21:18,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:21:18,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:21:18,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:21:18,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:21:18,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-27 01:21:18,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 01:21:18,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 01:21:18,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 01:21:18,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 01:21:18,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 01:21:18,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 01:21:18,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 01:21:18,758] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 01:21:18,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 3: [2022-11-27 01:21:18,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 3: [2022-11-27 01:21:18,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 3: [2022-11-27 01:21:18,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
3: [2022-11-27 01:21:18,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 3: [2022-11-27 01:21:18,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 3: [2022-11-27 01:21:18,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 3: [2022-11-27 01:21:18,758] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 0: [2022-11-27 01:21:18,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:21:18,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:21:18,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:21:18,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:21:18,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 01:21:18,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 0: [2022-11-27 01:21:18,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 01:21:18,770] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 01:21:18,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 
0: [2022-11-27 01:21:18,770] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 0: [2022-11-27 01:21:18,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step99000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 01:21:18,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step99000 is ready now! 0: successfully saved checkpoint at iteration 99000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3908.29 15: iteration 99010/ 125429 | consumed samples: 25346560 | consumed tokens: 51909754880 | elapsed time per iteration (s): 1.50 | learning rate: 3.937E-05 | global batch size: 256 | lm loss: 1.900089E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 170.481 | TFLOPs: 28.17 | 15: iteration 99020/ 125429 | consumed samples: 25349120 | consumed tokens: 51914997760 | elapsed time per iteration (s): 1.04 | learning rate: 3.935E-05 | global batch size: 256 | lm loss: 1.924197E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.776 | TFLOPs: 40.62 | 15: iteration 99030/ 125429 | consumed samples: 25351680 | consumed tokens: 51920240640 | elapsed time per iteration (s): 1.04 | learning rate: 3.934E-05 | global batch size: 256 | lm loss: 1.901699E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.015 | TFLOPs: 40.49 | 15: iteration 99040/ 125429 | consumed samples: 25354240 | consumed tokens: 51925483520 | elapsed time per iteration (s): 1.18 | learning rate: 3.932E-05 | global batch size: 256 | lm loss: 1.922326E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.773 | TFLOPs: 35.99 | 15: iteration 99050/ 125429 | consumed samples: 
25356800 | consumed tokens: 51930726400 | elapsed time per iteration (s): 1.05 | learning rate: 3.931E-05 | global batch size: 256 | lm loss: 1.919400E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.926 | TFLOPs: 40.48 | 15: iteration 99060/ 125429 | consumed samples: 25359360 | consumed tokens: 51935969280 | elapsed time per iteration (s): 1.03 | learning rate: 3.930E-05 | global batch size: 256 | lm loss: 1.895104E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.476 | TFLOPs: 41.06 | 15: iteration 99070/ 125429 | consumed samples: 25361920 | consumed tokens: 51941212160 | elapsed time per iteration (s): 1.02 | learning rate: 3.928E-05 | global batch size: 256 | lm loss: 1.898773E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.873 | TFLOPs: 41.29 | 15: iteration 99080/ 125429 | consumed samples: 25364480 | consumed tokens: 51946455040 | elapsed time per iteration (s): 1.03 | learning rate: 3.927E-05 | global batch size: 256 | lm loss: 1.913462E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.814 | TFLOPs: 41.12 | 15: iteration 99090/ 125429 | consumed samples: 25367040 | consumed tokens: 51951697920 | elapsed time per iteration (s): 1.06 | learning rate: 3.925E-05 | global batch size: 256 | lm loss: 1.919690E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.367 | TFLOPs: 39.89 | 15: iteration 99100/ 125429 | consumed samples: 25369600 | consumed tokens: 51956940800 | elapsed time per iteration (s): 1.04 | learning rate: 3.924E-05 | global batch size: 256 | lm loss: 1.901690E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 246.568 | TFLOPs: 40.75 | 15: iteration 99110/ 125429 | consumed samples: 25372160 | consumed tokens: 51962183680 | elapsed time per iteration (s): 1.05 | learning rate: 3.923E-05 | global batch size: 256 | lm loss: 1.901341E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.771 | TFLOPs: 40.45 | 15: iteration 99120/ 125429 | consumed samples: 25374720 | consumed tokens: 51967426560 | elapsed time per iteration (s): 1.03 | learning rate: 3.921E-05 | global batch size: 256 | lm loss: 1.884384E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.788 | TFLOPs: 40.95 | 15: iteration 99130/ 125429 | consumed samples: 25377280 | consumed tokens: 51972669440 | elapsed time per iteration (s): 1.05 | learning rate: 3.920E-05 | global batch size: 256 | lm loss: 1.901467E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.700 | TFLOPs: 40.27 | 15: iteration 99140/ 125429 | consumed samples: 25379840 | consumed tokens: 51977912320 | elapsed time per iteration (s): 1.05 | learning rate: 3.918E-05 | global batch size: 256 | lm loss: 1.921683E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.009 | TFLOPs: 40.16 | 15: iteration 99150/ 125429 | consumed samples: 25382400 | consumed tokens: 51983155200 | elapsed time per iteration (s): 1.04 | learning rate: 3.917E-05 | global batch size: 256 | lm loss: 1.914497E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.114 | TFLOPs: 40.67 | 15: iteration 99160/ 125429 | consumed samples: 25384960 | consumed tokens: 51988398080 | elapsed time per iteration (s): 1.19 | learning rate: 3.916E-05 | global batch size: 256 | 
lm loss: 1.924476E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.362 | TFLOPs: 35.59 | 15: iteration 99170/ 125429 | consumed samples: 25387520 | consumed tokens: 51993640960 | elapsed time per iteration (s): 1.04 | learning rate: 3.914E-05 | global batch size: 256 | lm loss: 1.910196E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.026 | TFLOPs: 40.82 | 15: iteration 99180/ 125429 | consumed samples: 25390080 | consumed tokens: 51998883840 | elapsed time per iteration (s): 1.07 | learning rate: 3.913E-05 | global batch size: 256 | lm loss: 1.902126E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.991 | TFLOPs: 39.66 | 15: iteration 99190/ 125429 | consumed samples: 25392640 | consumed tokens: 52004126720 | elapsed time per iteration (s): 1.04 | learning rate: 3.911E-05 | global batch size: 256 | lm loss: 1.913104E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.207 | TFLOPs: 40.52 | 15: iteration 99200/ 125429 | consumed samples: 25395200 | consumed tokens: 52009369600 | elapsed time per iteration (s): 1.04 | learning rate: 3.910E-05 | global batch size: 256 | lm loss: 1.889825E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.032 | TFLOPs: 40.66 | 15: iteration 99210/ 125429 | consumed samples: 25397760 | consumed tokens: 52014612480 | elapsed time per iteration (s): 1.06 | learning rate: 3.909E-05 | global batch size: 256 | lm loss: 1.907726E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.279 | TFLOPs: 39.87 | 15: iteration 99220/ 125429 | consumed samples: 25400320 | consumed 
tokens: 52019855360 | elapsed time per iteration (s): 1.04 | learning rate: 3.907E-05 | global batch size: 256 | lm loss: 1.915813E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.990 | TFLOPs: 40.82 | 15: iteration 99230/ 125429 | consumed samples: 25402880 | consumed tokens: 52025098240 | elapsed time per iteration (s): 1.02 | learning rate: 3.906E-05 | global batch size: 256 | lm loss: 1.931402E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.869 | TFLOPs: 41.29 | 15: iteration 99240/ 125429 | consumed samples: 25405440 | consumed tokens: 52030341120 | elapsed time per iteration (s): 1.02 | learning rate: 3.904E-05 | global batch size: 256 | lm loss: 1.922551E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.836 | TFLOPs: 41.45 | 15: iteration 99250/ 125429 | consumed samples: 25408000 | consumed tokens: 52035584000 | elapsed time per iteration (s): 1.02 | learning rate: 3.903E-05 | global batch size: 256 | lm loss: 1.899237E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.374 | TFLOPs: 41.38 | 15: iteration 99260/ 125429 | consumed samples: 25410560 | consumed tokens: 52040826880 | elapsed time per iteration (s): 1.02 | learning rate: 3.902E-05 | global batch size: 256 | lm loss: 1.896884E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.152 | TFLOPs: 41.34 | 15: iteration 99270/ 125429 | consumed samples: 25413120 | consumed tokens: 52046069760 | elapsed time per iteration (s): 1.04 | learning rate: 3.900E-05 | global batch size: 256 | lm loss: 1.919090E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples 
per second: 245.524 | TFLOPs: 40.57 | 15: iteration 99280/ 125429 | consumed samples: 25415680 | consumed tokens: 52051312640 | elapsed time per iteration (s): 1.05 | learning rate: 3.899E-05 | global batch size: 256 | lm loss: 1.936685E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.703 | TFLOPs: 40.44 | 15: iteration 99290/ 125429 | consumed samples: 25418240 | consumed tokens: 52056555520 | elapsed time per iteration (s): 1.04 | learning rate: 3.897E-05 | global batch size: 256 | lm loss: 1.892604E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.415 | TFLOPs: 40.72 | 15: iteration 99300/ 125429 | consumed samples: 25420800 | consumed tokens: 52061798400 | elapsed time per iteration (s): 1.03 | learning rate: 3.896E-05 | global batch size: 256 | lm loss: 1.947186E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.541 | TFLOPs: 41.24 | 15: iteration 99310/ 125429 | consumed samples: 25423360 | consumed tokens: 52067041280 | elapsed time per iteration (s): 1.02 | learning rate: 3.895E-05 | global batch size: 256 | lm loss: 1.907188E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.914 | TFLOPs: 41.47 | 15: iteration 99320/ 125429 | consumed samples: 25425920 | consumed tokens: 52072284160 | elapsed time per iteration (s): 1.04 | learning rate: 3.893E-05 | global batch size: 256 | lm loss: 1.917159E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.625 | TFLOPs: 40.59 | 15: iteration 99330/ 125429 | consumed samples: 25428480 | consumed tokens: 52077527040 | elapsed time per iteration (s): 1.19 | learning rate: 3.892E-05 | global batch size: 256 | lm loss: 1.922888E+00 | 
grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.273 | TFLOPs: 35.58 | 15: iteration 99340/ 125429 | consumed samples: 25431040 | consumed tokens: 52082769920 | elapsed time per iteration (s): 1.05 | learning rate: 3.890E-05 | global batch size: 256 | lm loss: 1.910990E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.369 | TFLOPs: 40.22 | 15: iteration 99350/ 125429 | consumed samples: 25433600 | consumed tokens: 52088012800 | elapsed time per iteration (s): 1.02 | learning rate: 3.889E-05 | global batch size: 256 | lm loss: 1.894113E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.747 | TFLOPs: 41.44 | 15: iteration 99360/ 125429 | consumed samples: 25436160 | consumed tokens: 52093255680 | elapsed time per iteration (s): 1.08 | learning rate: 3.888E-05 | global batch size: 256 | lm loss: 1.919770E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.217 | TFLOPs: 39.04 | 15: iteration 99370/ 125429 | consumed samples: 25438720 | consumed tokens: 52098498560 | elapsed time per iteration (s): 1.05 | learning rate: 3.886E-05 | global batch size: 256 | lm loss: 1.897907E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.048 | TFLOPs: 40.17 | 15: iteration 99380/ 125429 | consumed samples: 25441280 | consumed tokens: 52103741440 | elapsed time per iteration (s): 1.04 | learning rate: 3.885E-05 | global batch size: 256 | lm loss: 1.918413E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.212 | TFLOPs: 40.85 | 15: iteration 99390/ 125429 | consumed samples: 25443840 | consumed tokens: 52108984320 | elapsed 
time per iteration (s): 1.03 | learning rate: 3.883E-05 | global batch size: 256 | lm loss: 1.931194E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.704 | TFLOPs: 41.10 | 15: iteration 99400/ 125429 | consumed samples: 25446400 | consumed tokens: 52114227200 | elapsed time per iteration (s): 1.02 | learning rate: 3.882E-05 | global batch size: 256 | lm loss: 1.925817E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.549 | TFLOPs: 41.41 | 15: iteration 99410/ 125429 | consumed samples: 25448960 | consumed tokens: 52119470080 | elapsed time per iteration (s): 1.04 | learning rate: 3.881E-05 | global batch size: 256 | lm loss: 1.916366E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.413 | TFLOPs: 40.72 | 15: iteration 99420/ 125429 | consumed samples: 25451520 | consumed tokens: 52124712960 | elapsed time per iteration (s): 1.18 | learning rate: 3.879E-05 | global batch size: 256 | lm loss: 1.913657E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.151 | TFLOPs: 35.89 | 15: iteration 99430/ 125429 | consumed samples: 25454080 | consumed tokens: 52129955840 | elapsed time per iteration (s): 1.17 | learning rate: 3.878E-05 | global batch size: 256 | lm loss: 1.909888E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 218.403 | TFLOPs: 36.09 | 15: iteration 99440/ 125429 | consumed samples: 25456640 | consumed tokens: 52135198720 | elapsed time per iteration (s): 1.03 | learning rate: 3.876E-05 | global batch size: 256 | lm loss: 1.927561E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.895 | TFLOPs: 
41.13 | 15: iteration 99450/ 125429 | consumed samples: 25459200 | consumed tokens: 52140441600 | elapsed time per iteration (s): 1.18 | learning rate: 3.875E-05 | global batch size: 256 | lm loss: 1.918964E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.085 | TFLOPs: 35.87 | 15: iteration 99460/ 125429 | consumed samples: 25461760 | consumed tokens: 52145684480 | elapsed time per iteration (s): 1.04 | learning rate: 3.874E-05 | global batch size: 256 | lm loss: 1.879211E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.381 | TFLOPs: 40.72 | 15: iteration 99470/ 125429 | consumed samples: 25464320 | consumed tokens: 52150927360 | elapsed time per iteration (s): 1.04 | learning rate: 3.872E-05 | global batch size: 256 | lm loss: 1.927971E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.016 | TFLOPs: 40.49 | 15: iteration 99480/ 125429 | consumed samples: 25466880 | consumed tokens: 52156170240 | elapsed time per iteration (s): 1.06 | learning rate: 3.871E-05 | global batch size: 256 | lm loss: 1.915522E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.810 | TFLOPs: 39.96 | 15: iteration 99490/ 125429 | consumed samples: 25469440 | consumed tokens: 52161413120 | elapsed time per iteration (s): 1.04 | learning rate: 3.869E-05 | global batch size: 256 | lm loss: 1.901424E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.154 | TFLOPs: 40.51 | 15: iteration 99500/ 125429 | consumed samples: 25472000 | consumed tokens: 52166656000 | elapsed time per iteration (s): 1.05 | learning rate: 3.868E-05 | global batch size: 256 | lm loss: 1.932673E+00 | grad norm: 0.147 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.884 | TFLOPs: 40.14 | 15: iteration 99510/ 125429 | consumed samples: 25474560 | consumed tokens: 52171898880 | elapsed time per iteration (s): 1.06 | learning rate: 3.867E-05 | global batch size: 256 | lm loss: 1.926341E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.085 | TFLOPs: 40.01 | 15: iteration 99520/ 125429 | consumed samples: 25477120 | consumed tokens: 52177141760 | elapsed time per iteration (s): 1.03 | learning rate: 3.865E-05 | global batch size: 256 | lm loss: 1.918674E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.884 | TFLOPs: 41.13 | 15: iteration 99530/ 125429 | consumed samples: 25479680 | consumed tokens: 52182384640 | elapsed time per iteration (s): 1.10 | learning rate: 3.864E-05 | global batch size: 256 | lm loss: 1.898281E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.918 | TFLOPs: 38.49 | 15: iteration 99540/ 125429 | consumed samples: 25482240 | consumed tokens: 52187627520 | elapsed time per iteration (s): 1.07 | learning rate: 3.862E-05 | global batch size: 256 | lm loss: 1.909616E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.700 | TFLOPs: 39.45 | 15: iteration 99550/ 125429 | consumed samples: 25484800 | consumed tokens: 52192870400 | elapsed time per iteration (s): 1.04 | learning rate: 3.861E-05 | global batch size: 256 | lm loss: 1.904968E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.525 | TFLOPs: 40.57 | 15: iteration 99560/ 125429 | consumed samples: 25487360 | consumed tokens: 52198113280 | elapsed time per iteration (s): 1.05 | 
learning rate: 3.860E-05 | global batch size: 256 | lm loss: 1.889939E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.505 | TFLOPs: 40.24 | 15: iteration 99570/ 125429 | consumed samples: 25489920 | consumed tokens: 52203356160 | elapsed time per iteration (s): 1.03 | learning rate: 3.858E-05 | global batch size: 256 | lm loss: 1.931837E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.524 | TFLOPs: 40.91 | 15: iteration 99580/ 125429 | consumed samples: 25492480 | consumed tokens: 52208599040 | elapsed time per iteration (s): 1.02 | learning rate: 3.857E-05 | global batch size: 256 | lm loss: 1.879862E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.229 | TFLOPs: 41.35 | 15: iteration 99590/ 125429 | consumed samples: 25495040 | consumed tokens: 52213841920 | elapsed time per iteration (s): 1.03 | learning rate: 3.856E-05 | global batch size: 256 | lm loss: 1.891704E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.463 | TFLOPs: 41.06 | 15: iteration 99600/ 125429 | consumed samples: 25497600 | consumed tokens: 52219084800 | elapsed time per iteration (s): 1.03 | learning rate: 3.854E-05 | global batch size: 256 | lm loss: 1.901100E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.449 | TFLOPs: 41.22 | 15: iteration 99610/ 125429 | consumed samples: 25500160 | consumed tokens: 52224327680 | elapsed time per iteration (s): 1.02 | learning rate: 3.853E-05 | global batch size: 256 | lm loss: 1.906668E+00 | grad norm: 0.142 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.886 | TFLOPs: 41.30 | 15: iteration 99620/ 
125429 | consumed samples: 25502720 | consumed tokens: 52229570560 | elapsed time per iteration (s): 1.03 | learning rate: 3.851E-05 | global batch size: 256 | lm loss: 1.893517E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.395 | TFLOPs: 40.88 | 15: iteration 99630/ 125429 | consumed samples: 25505280 | consumed tokens: 52234813440 | elapsed time per iteration (s): 1.04 | learning rate: 3.850E-05 | global batch size: 256 | lm loss: 1.919955E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.038 | TFLOPs: 40.49 | 15: iteration 99640/ 125429 | consumed samples: 25507840 | consumed tokens: 52240056320 | elapsed time per iteration (s): 1.04 | learning rate: 3.849E-05 | global batch size: 256 | lm loss: 1.931699E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.339 | TFLOPs: 40.71 | 15: iteration 99650/ 125429 | consumed samples: 25510400 | consumed tokens: 52245299200 | elapsed time per iteration (s): 1.05 | learning rate: 3.847E-05 | global batch size: 256 | lm loss: 1.917043E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.427 | TFLOPs: 40.23 | 15: iteration 99660/ 125429 | consumed samples: 25512960 | consumed tokens: 52250542080 | elapsed time per iteration (s): 1.06 | learning rate: 3.846E-05 | global batch size: 256 | lm loss: 1.912424E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.644 | TFLOPs: 39.93 | 15: iteration 99670/ 125429 | consumed samples: 25515520 | consumed tokens: 52255784960 | elapsed time per iteration (s): 1.03 | learning rate: 3.845E-05 | global batch size: 256 | lm loss: 1.899096E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 247.375 | TFLOPs: 40.88 | 15: iteration 99680/ 125429 | consumed samples: 25518080 | consumed tokens: 52261027840 | elapsed time per iteration (s): 1.08 | learning rate: 3.843E-05 | global batch size: 256 | lm loss: 1.934431E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.312 | TFLOPs: 39.22 | 15: iteration 99690/ 125429 | consumed samples: 25520640 | consumed tokens: 52266270720 | elapsed time per iteration (s): 1.03 | learning rate: 3.842E-05 | global batch size: 256 | lm loss: 1.935593E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.803 | TFLOPs: 40.95 | 15: iteration 99700/ 125429 | consumed samples: 25523200 | consumed tokens: 52271513600 | elapsed time per iteration (s): 1.05 | learning rate: 3.840E-05 | global batch size: 256 | lm loss: 1.915369E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.043 | TFLOPs: 40.33 | 15: iteration 99710/ 125429 | consumed samples: 25525760 | consumed tokens: 52276756480 | elapsed time per iteration (s): 1.03 | learning rate: 3.839E-05 | global batch size: 256 | lm loss: 1.919277E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.370 | TFLOPs: 41.05 | 15: iteration 99720/ 125429 | consumed samples: 25528320 | consumed tokens: 52281999360 | elapsed time per iteration (s): 1.06 | learning rate: 3.838E-05 | global batch size: 256 | lm loss: 1.939347E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.164 | TFLOPs: 39.85 | 15: iteration 99730/ 125429 | consumed samples: 25530880 | consumed tokens: 52287242240 | elapsed time per iteration (s): 1.05 | learning rate: 
3.836E-05 | global batch size: 256 | lm loss: 1.910716E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.690 | TFLOPs: 40.27 | 15: iteration 99740/ 125429 | consumed samples: 25533440 | consumed tokens: 52292485120 | elapsed time per iteration (s): 1.03 | learning rate: 3.835E-05 | global batch size: 256 | lm loss: 1.911179E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.491 | TFLOPs: 41.23 | 15: iteration 99750/ 125429 | consumed samples: 25536000 | consumed tokens: 52297728000 | elapsed time per iteration (s): 1.05 | learning rate: 3.833E-05 | global batch size: 256 | lm loss: 1.919699E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.836 | TFLOPs: 40.46 | 15: iteration 99760/ 125429 | consumed samples: 25538560 | consumed tokens: 52302970880 | elapsed time per iteration (s): 1.07 | learning rate: 3.832E-05 | global batch size: 256 | lm loss: 1.891614E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.546 | TFLOPs: 39.59 | 15: iteration 99770/ 125429 | consumed samples: 25541120 | consumed tokens: 52308213760 | elapsed time per iteration (s): 1.04 | learning rate: 3.831E-05 | global batch size: 256 | lm loss: 1.918398E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.163 | TFLOPs: 40.85 | 15: iteration 99780/ 125429 | consumed samples: 25543680 | consumed tokens: 52313456640 | elapsed time per iteration (s): 1.06 | learning rate: 3.829E-05 | global batch size: 256 | lm loss: 1.907838E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.459 | TFLOPs: 40.07 | 15: iteration 99790/ 125429 | 
consumed samples: 25546240 | consumed tokens: 52318699520 | elapsed time per iteration (s): 1.04 | learning rate: 3.828E-05 | global batch size: 256 | lm loss: 1.904106E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.577 | TFLOPs: 40.58 | 15: iteration 99800/ 125429 | consumed samples: 25548800 | consumed tokens: 52323942400 | elapsed time per iteration (s): 1.04 | learning rate: 3.827E-05 | global batch size: 256 | lm loss: 1.912585E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.554 | TFLOPs: 40.58 | 15: iteration 99810/ 125429 | consumed samples: 25551360 | consumed tokens: 52329185280 | elapsed time per iteration (s): 1.04 | learning rate: 3.825E-05 | global batch size: 256 | lm loss: 1.933716E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.019 | TFLOPs: 40.49 | 15: iteration 99820/ 125429 | consumed samples: 25553920 | consumed tokens: 52334428160 | elapsed time per iteration (s): 1.05 | learning rate: 3.824E-05 | global batch size: 256 | lm loss: 1.902376E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.374 | TFLOPs: 40.22 | 15: iteration 99830/ 125429 | consumed samples: 25556480 | consumed tokens: 52339671040 | elapsed time per iteration (s): 1.04 | learning rate: 3.822E-05 | global batch size: 256 | lm loss: 1.910967E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.675 | TFLOPs: 40.60 | 15: iteration 99840/ 125429 | consumed samples: 25559040 | consumed tokens: 52344913920 | elapsed time per iteration (s): 1.05 | learning rate: 3.821E-05 | global batch size: 256 | lm loss: 1.892071E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 243.350 | TFLOPs: 40.22 | 15: iteration 99850/ 125429 | consumed samples: 25561600 | consumed tokens: 52350156800 | elapsed time per iteration (s): 1.09 | learning rate: 3.820E-05 | global batch size: 256 | lm loss: 1.914748E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.931 | TFLOPs: 38.82 | 15: iteration 99860/ 125429 | consumed samples: 25564160 | consumed tokens: 52355399680 | elapsed time per iteration (s): 1.05 | learning rate: 3.818E-05 | global batch size: 256 | lm loss: 1.919490E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.911 | TFLOPs: 40.47 | 15: iteration 99870/ 125429 | consumed samples: 25566720 | consumed tokens: 52360642560 | elapsed time per iteration (s): 1.09 | learning rate: 3.817E-05 | global batch size: 256 | lm loss: 1.923454E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.288 | TFLOPs: 38.88 | 15: iteration 99880/ 125429 | consumed samples: 25569280 | consumed tokens: 52365885440 | elapsed time per iteration (s): 1.06 | learning rate: 3.816E-05 | global batch size: 256 | lm loss: 1.936645E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.056 | TFLOPs: 39.84 | 15: iteration 99890/ 125429 | consumed samples: 25571840 | consumed tokens: 52371128320 | elapsed time per iteration (s): 1.07 | learning rate: 3.814E-05 | global batch size: 256 | lm loss: 1.867652E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.149 | TFLOPs: 39.52 | 15: iteration 99900/ 125429 | consumed samples: 25574400 | consumed tokens: 52376371200 | elapsed time per iteration (s): 1.07 | learning rate: 3.813E-05 | global batch 
size: 256 | lm loss: 1.926593E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.229 | TFLOPs: 39.70 | 15: iteration 99910/ 125429 | consumed samples: 25576960 | consumed tokens: 52381614080 | elapsed time per iteration (s): 1.04 | learning rate: 3.811E-05 | global batch size: 256 | lm loss: 1.903425E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.120 | TFLOPs: 40.51 | 15: iteration 99920/ 125429 | consumed samples: 25579520 | consumed tokens: 52386856960 | elapsed time per iteration (s): 1.04 | learning rate: 3.810E-05 | global batch size: 256 | lm loss: 1.910996E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.223 | TFLOPs: 40.69 | 15: iteration 99930/ 125429 | consumed samples: 25582080 | consumed tokens: 52392099840 | elapsed time per iteration (s): 1.07 | learning rate: 3.809E-05 | global batch size: 256 | lm loss: 1.908098E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.200 | TFLOPs: 39.36 | 15: iteration 99940/ 125429 | consumed samples: 25584640 | consumed tokens: 52397342720 | elapsed time per iteration (s): 1.05 | learning rate: 3.807E-05 | global batch size: 256 | lm loss: 1.918317E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.088 | TFLOPs: 40.17 | 15: iteration 99950/ 125429 | consumed samples: 25587200 | consumed tokens: 52402585600 | elapsed time per iteration (s): 1.05 | learning rate: 3.806E-05 | global batch size: 256 | lm loss: 1.904556E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.495 | TFLOPs: 40.40 | 15: iteration 99960/ 125429 | consumed samples: 25589760 | 
consumed tokens: 52407828480 | elapsed time per iteration (s): 1.05 | learning rate: 3.805E-05 | global batch size: 256 | lm loss: 1.919694E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.505 | TFLOPs: 40.41 | 15: iteration 99970/ 125429 | consumed samples: 25592320 | consumed tokens: 52413071360 | elapsed time per iteration (s): 1.04 | learning rate: 3.803E-05 | global batch size: 256 | lm loss: 1.912958E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.296 | TFLOPs: 40.54 | 15: iteration 99980/ 125429 | consumed samples: 25594880 | consumed tokens: 52418314240 | elapsed time per iteration (s): 1.03 | learning rate: 3.802E-05 | global batch size: 256 | lm loss: 1.889502E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.740 | TFLOPs: 41.11 | 15: iteration 99990/ 125429 | consumed samples: 25597440 | consumed tokens: 52423557120 | elapsed time per iteration (s): 1.05 | learning rate: 3.801E-05 | global batch size: 256 | lm loss: 1.891893E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.884 | TFLOPs: 40.30 | 0: [2022-11-27 01:38:52,512] [INFO] [logging.py:68:log_dist] [Rank 0] step=100000, skipped=0, lr=[3.799179994190487e-05, 3.799179994190487e-05, 3.799179994190487e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 100000/ 125429 | consumed samples: 25600000 | consumed tokens: 52428800000 | elapsed time per iteration (s): 1.08 | learning rate: 3.799E-05 | global batch size: 256 | lm loss: 1.899096E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.085 | TFLOPs: 39.35 | 0: steps: 100000 loss: 1.9544 iter time (s): 1.050 samples/sec: 243.796 15: 
-------------------------------------------------------------------------------------------- 15: valid loss at iteration 100000 | lm loss value: 1.913801E+00 | lm loss PPL: 6.778809E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 100000 to checkpoints_1b5 0: [2022-11-27 01:38:52,861] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step100000 is begin to save! 0: [2022-11-27 01:38:52,869] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_01-model_00-model_states.pt... 0: [2022-11-27 01:38:53,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_01-model_00-model_states.pt. 0: [2022-11-27 01:38:53,135] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_03-model_00-model_states.pt... 0: [2022-11-27 01:38:53,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_03-model_00-model_states.pt. 0: [2022-11-27 01:38:53,247] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_04-model_00-model_states.pt... 0: [2022-11-27 01:38:53,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_04-model_00-model_states.pt. 0: [2022-11-27 01:38:53,356] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_05-model_00-model_states.pt... 0: [2022-11-27 01:38:53,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_05-model_00-model_states.pt. 0: [2022-11-27 01:38:53,471] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_06-model_00-model_states.pt... 
0: [2022-11-27 01:38:53,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_06-model_00-model_states.pt. 0: [2022-11-27 01:38:53,576] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_07-model_00-model_states.pt... 0: [2022-11-27 01:38:53,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_07-model_00-model_states.pt. 0: [2022-11-27 01:38:53,680] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_08-model_00-model_states.pt... 0: [2022-11-27 01:38:53,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_08-model_00-model_states.pt. 0: [2022-11-27 01:38:53,788] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_09-model_00-model_states.pt... 0: [2022-11-27 01:38:53,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_09-model_00-model_states.pt. 0: [2022-11-27 01:38:53,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_10-model_00-model_states.pt... 0: [2022-11-27 01:38:54,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_10-model_00-model_states.pt. 0: [2022-11-27 01:38:54,006] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_11-model_00-model_states.pt... 0: [2022-11-27 01:38:54,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_11-model_00-model_states.pt. 0: [2022-11-27 01:38:54,109] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_12-model_00-model_states.pt... 
0: [2022-11-27 01:38:54,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_12-model_00-model_states.pt. 0: [2022-11-27 01:38:54,216] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_13-model_00-model_states.pt... 0: [2022-11-27 01:38:54,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_13-model_00-model_states.pt. 0: [2022-11-27 01:38:54,321] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_14-model_00-model_states.pt... 0: [2022-11-27 01:38:54,429] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_14-model_00-model_states.pt. 0: [2022-11-27 01:38:54,430] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_15-model_00-model_states.pt... 0: [2022-11-27 01:38:54,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_15-model_00-model_states.pt. 0: [2022-11-27 01:38:54,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_16-model_00-model_states.pt... 0: [2022-11-27 01:38:54,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_16-model_00-model_states.pt. 0: [2022-11-27 01:38:54,636] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_17-model_00-model_states.pt... 0: [2022-11-27 01:38:54,748] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_17-model_00-model_states.pt. 0: [2022-11-27 01:38:54,748] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_18-model_00-model_states.pt... 
0: [2022-11-27 01:38:54,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_18-model_00-model_states.pt. 0: [2022-11-27 01:38:54,851] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_19-model_00-model_states.pt... 0: [2022-11-27 01:38:54,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_19-model_00-model_states.pt. 0: [2022-11-27 01:38:54,957] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_20-model_00-model_states.pt... 0: [2022-11-27 01:38:55,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_20-model_00-model_states.pt. 0: [2022-11-27 01:38:55,058] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_21-model_00-model_states.pt... 0: [2022-11-27 01:38:55,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_21-model_00-model_states.pt. 0: [2022-11-27 01:38:55,164] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_22-model_00-model_states.pt... 0: [2022-11-27 01:38:55,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_22-model_00-model_states.pt. 0: [2022-11-27 01:38:55,272] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_23-model_00-model_states.pt... 0: [2022-11-27 01:38:55,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_23-model_00-model_states.pt. 0: [2022-11-27 01:38:55,374] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_24-model_00-model_states.pt... 
0: [2022-11-27 01:38:55,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_24-model_00-model_states.pt. 0: [2022-11-27 01:38:55,484] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_25-model_00-model_states.pt... 0: [2022-11-27 01:38:55,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_25-model_00-model_states.pt. 0: [2022-11-27 01:38:55,587] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_26-model_00-model_states.pt... 0: [2022-11-27 01:38:55,687] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_26-model_00-model_states.pt. 0: [2022-11-27 01:38:55,688] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_27-model_00-model_states.pt... 0: [2022-11-27 01:38:55,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_27-model_00-model_states.pt. 0: [2022-11-27 01:38:55,793] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_28-model_00-model_states.pt... 0: [2022-11-27 01:38:55,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_28-model_00-model_states.pt. 0: [2022-11-27 01:38:55,899] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_29-model_00-model_states.pt... 0: [2022-11-27 01:38:56,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_29-model_00-model_states.pt. 0: [2022-11-27 01:38:56,004] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_30-model_00-model_states.pt... 
0: [2022-11-27 01:38:56,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_30-model_00-model_states.pt. 0: [2022-11-27 01:38:56,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/layer_32-model_00-model_states.pt... 0: [2022-11-27 01:38:56,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/layer_32-model_00-model_states.pt. 0: [2022-11-27 01:38:56,117] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step100000/mp_rank_00_model_states.pt 0: [2022-11-27 01:38:56,117] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/mp_rank_00_model_states.pt... 0: [2022-11-27 01:38:56,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/mp_rank_00_model_states.pt. 0: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
8: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
1: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
5: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
6: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
2: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 
8: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
9: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
13: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
11: [2022-11-27 01:38:56,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step100000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:38:56,328] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:38:56,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 01:38:56,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 3: [2022-11-27 01:38:56,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:38:56,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 01:38:56,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 11: [2022-11-27 01:38:56,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:38:56,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:38:56,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 01:38:56,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 3: [2022-11-27 01:38:56,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-27 01:38:56,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 01:38:56,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 3: [2022-11-27 01:38:56,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:38:56,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 01:38:56,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 2: [2022-11-27 01:38:56,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:38:56,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 01:38:56,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 13: [2022-11-27 01:38:56,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:38:56,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 01:38:56,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 3: [2022-11-27 01:38:56,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
3: [2022-11-27 01:38:56,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 01:38:56,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 2: [2022-11-27 01:38:56,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:38:56,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:38:56,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 10: [2022-11-27 01:38:56,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 2: [2022-11-27 01:38:56,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 10: [2022-11-27 01:38:56,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 2: [2022-11-27 01:38:56,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:38:56,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 01:38:56,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 4: [2022-11-27 01:38:56,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-27 01:38:56,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 01:38:56,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 4: [2022-11-27 01:38:56,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:38:56,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 01:38:56,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 2: [2022-11-27 01:38:56,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:38:56,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 01:38:56,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 10: [2022-11-27 01:38:56,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:38:56,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:38:56,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:38:56,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-27 01:38:56,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:38:56,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 01:38:56,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 01:38:56,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 01:38:56,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 10: [2022-11-27 01:38:56,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 01:38:56,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 01:38:56,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 10: [2022-11-27 01:38:56,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 10: [2022-11-27 01:38:56,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 10: [2022-11-27 01:38:56,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 6: [2022-11-27 01:38:56,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 
6: [2022-11-27 01:38:56,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 14: [2022-11-27 01:38:56,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:38:56,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 14: [2022-11-27 01:38:56,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 6: [2022-11-27 01:38:56,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:38:56,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 14: [2022-11-27 01:38:56,333] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:38:56,333] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 6: [2022-11-27 01:38:56,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 14: [2022-11-27 01:38:56,333] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 14: [2022-11-27 01:38:56,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-27 01:38:56,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 01:38:56,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 6: [2022-11-27 01:38:56,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 14: [2022-11-27 01:38:56,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:38:56,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 01:38:56,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 14: [2022-11-27 01:38:56,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:38:56,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 01:38:56,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 14: [2022-11-27 01:38:56,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:38:56,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 01:38:56,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
6: [2022-11-27 01:38:56,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:38:56,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 01:38:56,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 6: [2022-11-27 01:38:56,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:38:56,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 01:38:56,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 8: [2022-11-27 01:38:56,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:38:56,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:38:56,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 13: [2022-11-27 01:38:56,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 8: [2022-11-27 01:38:56,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 13: [2022-11-27 01:38:56,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
8: [2022-11-27 01:38:56,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:38:56,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:38:56,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 13: [2022-11-27 01:38:56,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 8: [2022-11-27 01:38:56,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 13: [2022-11-27 01:38:56,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 8: [2022-11-27 01:38:56,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:38:56,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:38:56,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 13: [2022-11-27 01:38:56,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 8: [2022-11-27 01:38:56,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 13: [2022-11-27 01:38:56,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
8: [2022-11-27 01:38:56,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:38:56,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:38:56,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 01:38:56,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 01:38:56,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 8: [2022-11-27 01:38:56,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 8: [2022-11-27 01:38:56,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:38:56,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 01:38:56,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 8: [2022-11-27 01:38:56,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:38:56,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 01:38:56,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
6: [2022-11-27 01:38:56,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:38:56,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 01:38:56,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 10: [2022-11-27 01:38:56,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:38:56,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 01:38:56,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 8: [2022-11-27 01:38:56,351] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:38:56,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 01:38:56,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 3: [2022-11-27 01:38:56,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:38:56,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 01:38:56,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
2: [2022-11-27 01:38:56,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:38:56,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 01:38:56,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 2: [2022-11-27 01:38:56,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:38:56,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 01:38:56,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 6: [2022-11-27 01:38:56,362] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:38:56,362] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 01:38:56,362] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 6: [2022-11-27 01:38:56,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:38:56,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 01:38:56,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
6: [2022-11-27 01:38:56,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:38:56,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 01:38:56,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 9: [2022-11-27 01:38:56,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:38:56,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:38:56,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:38:56,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:38:56,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:38:56,365] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:38:56,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:38:56,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 01:38:56,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
3: [2022-11-27 01:38:56,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:38:56,367] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 01:38:56,367] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 3: [2022-11-27 01:38:56,367] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:38:56,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 01:38:56,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 3: [2022-11-27 01:38:56,368] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 9: [2022-11-27 01:38:56,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 01:38:56,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 01:38:56,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 01:38:56,365] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 01:38:56,365] [INFO] 
[torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 9: [2022-11-27 01:38:56,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 9: [2022-11-27 01:38:56,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 3: [2022-11-27 01:38:56,368] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 9: [2022-11-27 01:38:56,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 9: [2022-11-27 01:38:56,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 9: [2022-11-27 01:38:56,365] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 13: [2022-11-27 01:38:56,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:38:56,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:38:56,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 01:38:56,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 01:38:56,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 13: [2022-11-27 01:38:56,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 9: [2022-11-27 01:38:56,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-27 01:38:56,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 01:38:56,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 9: [2022-11-27 01:38:56,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:38:56,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 01:38:56,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 12: [2022-11-27 01:38:56,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:38:56,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:38:56,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 01:38:56,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:38:56,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
12: [2022-11-27 01:38:56,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 01:38:56,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 01:38:56,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 12: [2022-11-27 01:38:56,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 4: [2022-11-27 01:38:56,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:38:56,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 01:38:56,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 4: [2022-11-27 01:38:56,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:38:56,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 01:38:56,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 4: [2022-11-27 01:38:56,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-27 01:38:56,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 01:38:56,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 4: [2022-11-27 01:38:56,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:38:56,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:38:56,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 01:38:56,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 4: [2022-11-27 01:38:56,372] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 01:38:56,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 12: [2022-11-27 01:38:56,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:38:56,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 01:38:56,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 12: [2022-11-27 01:38:56,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-27 01:38:56,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 01:38:56,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 12: [2022-11-27 01:38:56,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:38:56,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 01:38:56,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 12: [2022-11-27 01:38:56,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:38:56,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 01:38:56,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 14: [2022-11-27 01:38:56,356] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:38:56,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 01:38:56,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 14: [2022-11-27 01:38:56,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-27 01:38:56,376] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 01:38:56,376] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 10: [2022-11-27 01:38:56,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:38:56,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 01:38:56,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 2: [2022-11-27 01:38:56,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:38:56,380] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 01:38:56,380] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 13: [2022-11-27 01:38:56,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:38:56,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 01:38:56,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 13: [2022-11-27 01:38:56,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-27 01:38:56,381] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 01:38:56,381] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 12: [2022-11-27 01:38:56,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:38:56,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 01:38:56,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 11: [2022-11-27 01:38:56,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 01:38:56,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 11: [2022-11-27 01:38:56,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:38:56,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 01:38:56,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 11: [2022-11-27 01:38:56,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-27 01:38:56,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 01:38:56,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 11: [2022-11-27 01:38:56,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:38:56,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 01:38:56,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 11: [2022-11-27 01:38:56,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:38:56,366] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:38:56,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 01:38:56,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 11: [2022-11-27 01:38:56,366] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 01:38:56,366] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 1: [2022-11-27 01:38:56,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-27 01:38:56,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:38:56,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:38:56,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:38:56,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:38:56,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:38:56,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:38:56,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 01:38:56,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 01:38:56,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
1: [2022-11-27 01:38:56,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 01:38:56,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 01:38:56,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 01:38:56,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 01:38:56,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 01:38:56,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:38:56,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 1: [2022-11-27 01:38:56,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 1: [2022-11-27 01:38:56,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 1: [2022-11-27 01:38:56,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 1: [2022-11-27 01:38:56,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 01:38:56,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
1: [2022-11-27 01:38:56,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 1: [2022-11-27 01:38:56,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 5: [2022-11-27 01:38:56,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:38:56,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:38:56,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:38:56,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:38:56,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:38:56,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:38:56,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:38:56,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-27 01:38:56,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 01:38:56,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 01:38:56,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 01:38:56,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 5: [2022-11-27 01:38:56,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 01:38:56,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 01:38:56,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 01:38:56,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 01:38:56,433] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 01:38:56,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 5: [2022-11-27 01:38:56,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 5: [2022-11-27 01:38:56,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
5: [2022-11-27 01:38:56,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 5: [2022-11-27 01:38:56,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 5: [2022-11-27 01:38:56,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 5: [2022-11-27 01:38:56,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 0: [2022-11-27 01:38:56,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:38:56,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:38:56,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:38:56,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:38:56,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-27 01:38:56,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 01:38:56,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 01:38:56,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 01:38:56,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 01:38:56,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 01:38:56,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 0: [2022-11-27 01:38:56,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 0: [2022-11-27 01:38:56,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 0: [2022-11-27 01:38:56,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 0: [2022-11-27 01:38:56,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 0: [2022-11-27 01:38:56,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-27 01:38:56,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 01:38:56,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 0: [2022-11-27 01:38:56,447] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:38:56,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 01:38:56,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 0: [2022-11-27 01:38:56,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:38:56,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:38:56,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 01:38:56,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 11: [2022-11-27 01:38:56,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:38:56,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 01:38:56,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
15: [2022-11-27 01:38:56,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:38:56,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:38:56,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:38:56,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:38:56,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:38:56,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:38:56,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:38:56,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 01:38:56,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 01:38:56,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-27 01:38:56,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 01:38:56,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 01:38:56,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 01:38:56,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 15: [2022-11-27 01:38:56,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 01:38:56,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 01:38:56,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 15: [2022-11-27 01:38:56,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 15: [2022-11-27 01:38:56,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 15: [2022-11-27 01:38:56,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 15: [2022-11-27 01:38:56,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 01:38:56,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
15: [2022-11-27 01:38:56,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 15: [2022-11-27 01:38:56,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 0: [2022-11-27 01:38:56,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 01:38:56,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 7: [2022-11-27 01:38:56,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:38:56,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:38:56,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:38:56,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 01:38:56,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:38:56,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 01:38:56,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:38:56,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
7: [2022-11-27 01:38:56,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 01:38:56,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 01:38:56,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 01:38:56,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 7: [2022-11-27 01:38:56,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 7: [2022-11-27 01:38:56,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 7: [2022-11-27 01:38:56,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 7: [2022-11-27 01:38:56,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:38:56,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:38:56,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 01:38:56,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 01:38:56,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now! 
7: [2022-11-27 01:38:56,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now!
7: [2022-11-27 01:38:56,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt.
7: [2022-11-27 01:38:56,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step100000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt
7: [2022-11-27 01:38:56,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step100000 is ready now!
0: successfully saved checkpoint at iteration 100000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3778.73
15: iteration 100010/ 125429 | consumed samples: 25602560 | consumed tokens: 52434042880 | elapsed time per iteration (s): 1.46 | learning rate: 3.798E-05 | global batch size: 256 | lm loss: 1.950981E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 175.728 | TFLOPs: 29.04 |
15: iteration 100020/ 125429 | consumed samples: 25605120 | consumed tokens: 52439285760 | elapsed time per iteration (s): 1.03 | learning rate: 3.796E-05 | global batch size: 256 | lm loss: 1.937067E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.112 | TFLOPs: 41.17 |
15: iteration 100030/ 125429 | consumed samples: 25607680 | consumed tokens: 52444528640 | elapsed time per iteration (s): 1.07 | learning rate: 3.795E-05 | global batch size: 256 | lm loss: 1.922215E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.428 | TFLOPs: 39.57 |
15: iteration 100040/ 125429 | consumed samples: 25610240 | consumed tokens: 52449771520 | elapsed time per iteration (s): 1.04 | learning rate: 3.794E-05 | global batch size: 256 | lm loss: 1.944060E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.298 | TFLOPs: 40.87 |
15: iteration 100050/ 125429 | consumed samples: 25612800 | consumed tokens: 52455014400 | elapsed time per iteration (s): 1.05 | learning rate: 3.792E-05 | global batch size: 256 | lm loss: 1.907147E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.564 | TFLOPs: 40.25 |
15: iteration 100060/ 125429 | consumed samples: 25615360 | consumed tokens: 52460257280 | elapsed time per iteration (s): 1.06 | learning rate: 3.791E-05 | global batch size: 256 | lm loss: 1.894853E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.559 | TFLOPs: 39.92 |
15: iteration 100070/ 125429 | consumed samples: 25617920 | consumed tokens: 52465500160 | elapsed time per iteration (s): 1.04 | learning rate: 3.790E-05 | global batch size: 256 | lm loss: 1.910717E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.625 | TFLOPs: 40.59 |
15: iteration 100080/ 125429 | consumed samples: 25620480 | consumed tokens: 52470743040 | elapsed time per iteration (s): 1.05 | learning rate: 3.788E-05 | global batch size: 256 | lm loss: 1.929793E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.117 | TFLOPs: 40.34 |
15: iteration 100090/ 125429 | consumed samples: 25623040 | consumed tokens: 52475985920 | elapsed time per iteration (s): 1.04 | learning rate: 3.787E-05 | global batch size: 256 | lm loss: 1.914017E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.925 | TFLOPs: 40.64 |
15: iteration 100100/ 125429 | consumed samples: 25625600 | consumed tokens: 52481228800 | elapsed time per iteration (s): 1.05 | learning rate: 3.786E-05 | global batch size: 256 | lm loss: 1.912219E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.986 | TFLOPs: 40.32 |
15: iteration 100110/ 125429 | consumed samples: 25628160 | consumed tokens: 52486471680 | elapsed time per iteration (s): 1.05 | learning rate: 3.784E-05 | global batch size: 256 | lm loss: 1.924321E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.216 | TFLOPs: 40.36 |
15: iteration 100120/ 125429 | consumed samples: 25630720 | consumed tokens: 52491714560 | elapsed time per iteration (s): 1.06 | learning rate: 3.783E-05 | global batch size: 256 | lm loss: 1.947999E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.189 | TFLOPs: 40.02 |
15: iteration 100130/ 125429 | consumed samples: 25633280 | consumed tokens: 52496957440 | elapsed time per iteration (s): 1.06 | learning rate: 3.781E-05 | global batch size: 256 | lm loss: 1.901989E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.087 | TFLOPs: 40.01 |
15: iteration 100140/ 125429 | consumed samples: 25635840 | consumed tokens: 52502200320 | elapsed time per iteration (s): 1.06 | learning rate: 3.780E-05 | global batch size: 256 | lm loss: 1.891139E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.427 | TFLOPs: 40.06 |
15: iteration 100150/ 125429 | consumed samples: 25638400 | consumed tokens: 52507443200 | elapsed time per iteration (s): 1.06 | learning rate: 3.779E-05 | global batch size: 256 | lm loss: 1.921386E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.606 | TFLOPs: 
39.93 | 15: iteration 100160/ 125429 | consumed samples: 25640960 | consumed tokens: 52512686080 | elapsed time per iteration (s): 1.05 | learning rate: 3.777E-05 | global batch size: 256 | lm loss: 1.921389E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.265 | TFLOPs: 40.37 | 15: iteration 100170/ 125429 | consumed samples: 25643520 | consumed tokens: 52517928960 | elapsed time per iteration (s): 1.07 | learning rate: 3.776E-05 | global batch size: 256 | lm loss: 1.915048E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.257 | TFLOPs: 39.37 | 15: iteration 100180/ 125429 | consumed samples: 25646080 | consumed tokens: 52523171840 | elapsed time per iteration (s): 1.05 | learning rate: 3.775E-05 | global batch size: 256 | lm loss: 1.913869E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.454 | TFLOPs: 40.40 | 15: iteration 100190/ 125429 | consumed samples: 25648640 | consumed tokens: 52528414720 | elapsed time per iteration (s): 1.05 | learning rate: 3.773E-05 | global batch size: 256 | lm loss: 1.904079E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.712 | TFLOPs: 40.11 | 15: iteration 100200/ 125429 | consumed samples: 25651200 | consumed tokens: 52533657600 | elapsed time per iteration (s): 1.05 | learning rate: 3.772E-05 | global batch size: 256 | lm loss: 1.937572E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.014 | TFLOPs: 40.33 | 15: iteration 100210/ 125429 | consumed samples: 25653760 | consumed tokens: 52538900480 | elapsed time per iteration (s): 1.10 | learning rate: 3.771E-05 | global batch size: 256 | lm loss: 1.909852E+00 | grad norm: 0.157 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.081 | TFLOPs: 38.52 | 15: iteration 100220/ 125429 | consumed samples: 25656320 | consumed tokens: 52544143360 | elapsed time per iteration (s): 1.07 | learning rate: 3.769E-05 | global batch size: 256 | lm loss: 1.894205E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.369 | TFLOPs: 39.72 | 15: iteration 100230/ 125429 | consumed samples: 25658880 | consumed tokens: 52549386240 | elapsed time per iteration (s): 1.06 | learning rate: 3.768E-05 | global batch size: 256 | lm loss: 1.888675E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.383 | TFLOPs: 40.06 | 15: iteration 100240/ 125429 | consumed samples: 25661440 | consumed tokens: 52554629120 | elapsed time per iteration (s): 1.03 | learning rate: 3.767E-05 | global batch size: 256 | lm loss: 1.895055E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.653 | TFLOPs: 40.93 | 15: iteration 100250/ 125429 | consumed samples: 25664000 | consumed tokens: 52559872000 | elapsed time per iteration (s): 1.03 | learning rate: 3.765E-05 | global batch size: 256 | lm loss: 1.916181E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.488 | TFLOPs: 41.06 | 15: iteration 100260/ 125429 | consumed samples: 25666560 | consumed tokens: 52565114880 | elapsed time per iteration (s): 1.04 | learning rate: 3.764E-05 | global batch size: 256 | lm loss: 1.901691E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.651 | TFLOPs: 40.76 | 15: iteration 100270/ 125429 | consumed samples: 25669120 | consumed tokens: 52570357760 | elapsed time per 
iteration (s): 1.04 | learning rate: 3.762E-05 | global batch size: 256 | lm loss: 1.921104E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.226 | TFLOPs: 40.53 | 15: iteration 100280/ 125429 | consumed samples: 25671680 | consumed tokens: 52575600640 | elapsed time per iteration (s): 1.04 | learning rate: 3.761E-05 | global batch size: 256 | lm loss: 1.897168E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.319 | TFLOPs: 40.54 | 15: iteration 100290/ 125429 | consumed samples: 25674240 | consumed tokens: 52580843520 | elapsed time per iteration (s): 1.04 | learning rate: 3.760E-05 | global batch size: 256 | lm loss: 1.907611E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.447 | TFLOPs: 40.73 | 15: iteration 100300/ 125429 | consumed samples: 25676800 | consumed tokens: 52586086400 | elapsed time per iteration (s): 1.06 | learning rate: 3.758E-05 | global batch size: 256 | lm loss: 1.931134E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.505 | TFLOPs: 39.91 | 15: iteration 100310/ 125429 | consumed samples: 25679360 | consumed tokens: 52591329280 | elapsed time per iteration (s): 1.05 | learning rate: 3.757E-05 | global batch size: 256 | lm loss: 1.898256E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.484 | TFLOPs: 40.24 | 15: iteration 100320/ 125429 | consumed samples: 25681920 | consumed tokens: 52596572160 | elapsed time per iteration (s): 1.04 | learning rate: 3.756E-05 | global batch size: 256 | lm loss: 1.905612E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.716 | TFLOPs: 
40.77 | 15: iteration 100330/ 125429 | consumed samples: 25684480 | consumed tokens: 52601815040 | elapsed time per iteration (s): 1.06 | learning rate: 3.754E-05 | global batch size: 256 | lm loss: 1.916601E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.072 | TFLOPs: 39.84 | 15: iteration 100340/ 125429 | consumed samples: 25687040 | consumed tokens: 52607057920 | elapsed time per iteration (s): 1.04 | learning rate: 3.753E-05 | global batch size: 256 | lm loss: 1.939626E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.532 | TFLOPs: 40.74 | 15: iteration 100350/ 125429 | consumed samples: 25689600 | consumed tokens: 52612300800 | elapsed time per iteration (s): 1.05 | learning rate: 3.752E-05 | global batch size: 256 | lm loss: 1.908350E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.715 | TFLOPs: 40.44 | 15: iteration 100360/ 125429 | consumed samples: 25692160 | consumed tokens: 52617543680 | elapsed time per iteration (s): 1.04 | learning rate: 3.750E-05 | global batch size: 256 | lm loss: 1.924536E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.772 | TFLOPs: 40.62 | 15: iteration 100370/ 125429 | consumed samples: 25694720 | consumed tokens: 52622786560 | elapsed time per iteration (s): 1.03 | learning rate: 3.749E-05 | global batch size: 256 | lm loss: 1.912191E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.207 | TFLOPs: 41.18 | 15: iteration 100380/ 125429 | consumed samples: 25697280 | consumed tokens: 52628029440 | elapsed time per iteration (s): 1.05 | learning rate: 3.748E-05 | global batch size: 256 | lm loss: 1.931308E+00 | grad norm: 0.193 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.075 | TFLOPs: 40.17 | 15: iteration 100390/ 125429 | consumed samples: 25699840 | consumed tokens: 52633272320 | elapsed time per iteration (s): 1.03 | learning rate: 3.746E-05 | global batch size: 256 | lm loss: 1.924810E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.569 | TFLOPs: 41.24 | 15: iteration 100400/ 125429 | consumed samples: 25702400 | consumed tokens: 52638515200 | elapsed time per iteration (s): 1.04 | learning rate: 3.745E-05 | global batch size: 256 | lm loss: 1.917685E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.330 | TFLOPs: 40.71 | 15: iteration 100410/ 125429 | consumed samples: 25704960 | consumed tokens: 52643758080 | elapsed time per iteration (s): 1.03 | learning rate: 3.744E-05 | global batch size: 256 | lm loss: 1.931812E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.243 | TFLOPs: 41.02 | 15: iteration 100420/ 125429 | consumed samples: 25707520 | consumed tokens: 52649000960 | elapsed time per iteration (s): 1.03 | learning rate: 3.742E-05 | global batch size: 256 | lm loss: 1.918951E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.017 | TFLOPs: 40.99 | 15: iteration 100430/ 125429 | consumed samples: 25710080 | consumed tokens: 52654243840 | elapsed time per iteration (s): 1.06 | learning rate: 3.741E-05 | global batch size: 256 | lm loss: 1.925568E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.792 | TFLOPs: 39.79 | 15: iteration 100440/ 125429 | consumed samples: 25712640 | consumed tokens: 52659486720 | elapsed time per 
iteration (s): 1.03 | learning rate: 3.740E-05 | global batch size: 256 | lm loss: 1.899997E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.631 | TFLOPs: 41.25 | 15: iteration 100450/ 125429 | consumed samples: 25715200 | consumed tokens: 52664729600 | elapsed time per iteration (s): 1.03 | learning rate: 3.738E-05 | global batch size: 256 | lm loss: 1.911054E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.442 | TFLOPs: 41.06 | 15: iteration 100460/ 125429 | consumed samples: 25717760 | consumed tokens: 52669972480 | elapsed time per iteration (s): 1.05 | learning rate: 3.737E-05 | global batch size: 256 | lm loss: 1.918452E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.785 | TFLOPs: 40.45 | 15: iteration 100470/ 125429 | consumed samples: 25720320 | consumed tokens: 52675215360 | elapsed time per iteration (s): 1.03 | learning rate: 3.735E-05 | global batch size: 256 | lm loss: 1.895352E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.128 | TFLOPs: 41.17 | 15: iteration 100480/ 125429 | consumed samples: 25722880 | consumed tokens: 52680458240 | elapsed time per iteration (s): 1.04 | learning rate: 3.734E-05 | global batch size: 256 | lm loss: 1.905629E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.845 | TFLOPs: 40.63 | 15: iteration 100490/ 125429 | consumed samples: 25725440 | consumed tokens: 52685701120 | elapsed time per iteration (s): 1.07 | learning rate: 3.733E-05 | global batch size: 256 | lm loss: 1.926953E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.389 | TFLOPs: 
39.56 | 15: iteration 100500/ 125429 | consumed samples: 25728000 | consumed tokens: 52690944000 | elapsed time per iteration (s): 1.02 | learning rate: 3.731E-05 | global batch size: 256 | lm loss: 1.894932E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.471 | TFLOPs: 41.56 | 15: iteration 100510/ 125429 | consumed samples: 25730560 | consumed tokens: 52696186880 | elapsed time per iteration (s): 1.07 | learning rate: 3.730E-05 | global batch size: 256 | lm loss: 1.929762E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.431 | TFLOPs: 39.57 | 15: iteration 100520/ 125429 | consumed samples: 25733120 | consumed tokens: 52701429760 | elapsed time per iteration (s): 1.06 | learning rate: 3.729E-05 | global batch size: 256 | lm loss: 1.936001E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.350 | TFLOPs: 40.05 | 15: iteration 100530/ 125429 | consumed samples: 25735680 | consumed tokens: 52706672640 | elapsed time per iteration (s): 1.10 | learning rate: 3.727E-05 | global batch size: 256 | lm loss: 1.910981E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.648 | TFLOPs: 38.45 | 15: iteration 100540/ 125429 | consumed samples: 25738240 | consumed tokens: 52711915520 | elapsed time per iteration (s): 1.04 | learning rate: 3.726E-05 | global batch size: 256 | lm loss: 1.918453E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.960 | TFLOPs: 40.81 | 15: iteration 100550/ 125429 | consumed samples: 25740800 | consumed tokens: 52717158400 | elapsed time per iteration (s): 1.03 | learning rate: 3.725E-05 | global batch size: 256 | lm loss: 1.903057E+00 | grad norm: 0.151 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.081 | TFLOPs: 41.00 | 15: iteration 100560/ 125429 | consumed samples: 25743360 | consumed tokens: 52722401280 | elapsed time per iteration (s): 1.04 | learning rate: 3.723E-05 | global batch size: 256 | lm loss: 1.907104E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.215 | TFLOPs: 40.52 | 15: iteration 100570/ 125429 | consumed samples: 25745920 | consumed tokens: 52727644160 | elapsed time per iteration (s): 1.06 | learning rate: 3.722E-05 | global batch size: 256 | lm loss: 1.903674E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.278 | TFLOPs: 39.87 | 15: iteration 100580/ 125429 | consumed samples: 25748480 | consumed tokens: 52732887040 | elapsed time per iteration (s): 1.06 | learning rate: 3.721E-05 | global batch size: 256 | lm loss: 1.914239E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.071 | TFLOPs: 39.84 | 15: iteration 100590/ 125429 | consumed samples: 25751040 | consumed tokens: 52738129920 | elapsed time per iteration (s): 1.03 | learning rate: 3.719E-05 | global batch size: 256 | lm loss: 1.903740E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.799 | TFLOPs: 40.95 | 15: iteration 100600/ 125429 | consumed samples: 25753600 | consumed tokens: 52743372800 | elapsed time per iteration (s): 1.02 | learning rate: 3.718E-05 | global batch size: 256 | lm loss: 1.908350E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.593 | TFLOPs: 41.41 | 15: iteration 100610/ 125429 | consumed samples: 25756160 | consumed tokens: 52748615680 | elapsed time per 
iteration (s): 1.03 | learning rate: 3.717E-05 | global batch size: 256 | lm loss: 1.925944E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.699 | TFLOPs: 41.10 | 15: iteration 100620/ 125429 | consumed samples: 25758720 | consumed tokens: 52753858560 | elapsed time per iteration (s): 1.05 | learning rate: 3.715E-05 | global batch size: 256 | lm loss: 1.899968E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.477 | TFLOPs: 40.24 | 15: iteration 100630/ 125429 | consumed samples: 25761280 | consumed tokens: 52759101440 | elapsed time per iteration (s): 1.04 | learning rate: 3.714E-05 | global batch size: 256 | lm loss: 1.932267E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.664 | TFLOPs: 40.76 | 15: iteration 100640/ 125429 | consumed samples: 25763840 | consumed tokens: 52764344320 | elapsed time per iteration (s): 1.03 | learning rate: 3.713E-05 | global batch size: 256 | lm loss: 1.918548E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.866 | TFLOPs: 40.96 | 15: iteration 100650/ 125429 | consumed samples: 25766400 | consumed tokens: 52769587200 | elapsed time per iteration (s): 1.09 | learning rate: 3.711E-05 | global batch size: 256 | lm loss: 1.888183E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.845 | TFLOPs: 38.98 | 15: iteration 100660/ 125429 | consumed samples: 25768960 | consumed tokens: 52774830080 | elapsed time per iteration (s): 1.03 | learning rate: 3.710E-05 | global batch size: 256 | lm loss: 1.916423E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.588 | TFLOPs: 
41.25 | 15: iteration 100670/ 125429 | consumed samples: 25771520 | consumed tokens: 52780072960 | elapsed time per iteration (s): 1.05 | learning rate: 3.709E-05 | global batch size: 256 | lm loss: 1.908026E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.724 | TFLOPs: 40.44 | 15: iteration 100680/ 125429 | consumed samples: 25774080 | consumed tokens: 52785315840 | elapsed time per iteration (s): 1.06 | learning rate: 3.707E-05 | global batch size: 256 | lm loss: 1.926830E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.911 | TFLOPs: 39.81 | 15: iteration 100690/ 125429 | consumed samples: 25776640 | consumed tokens: 52790558720 | elapsed time per iteration (s): 1.03 | learning rate: 3.706E-05 | global batch size: 256 | lm loss: 1.938704E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.451 | TFLOPs: 41.06 | 15: iteration 100700/ 125429 | consumed samples: 25779200 | consumed tokens: 52795801600 | elapsed time per iteration (s): 1.03 | learning rate: 3.705E-05 | global batch size: 256 | lm loss: 1.890944E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.079 | TFLOPs: 41.00 | 15: iteration 100710/ 125429 | consumed samples: 25781760 | consumed tokens: 52801044480 | elapsed time per iteration (s): 1.03 | learning rate: 3.703E-05 | global batch size: 256 | lm loss: 1.888264E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.470 | TFLOPs: 41.06 | 15: iteration 100720/ 125429 | consumed samples: 25784320 | consumed tokens: 52806287360 | elapsed time per iteration (s): 1.03 | learning rate: 3.702E-05 | global batch size: 256 | lm loss: 1.891031E+00 | grad norm: 0.162 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.834 | TFLOPs: 41.12 | 15: iteration 100730/ 125429 | consumed samples: 25786880 | consumed tokens: 52811530240 | elapsed time per iteration (s): 1.04 | learning rate: 3.701E-05 | global batch size: 256 | lm loss: 1.912168E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.128 | TFLOPs: 40.67 | 15: iteration 100740/ 125429 | consumed samples: 25789440 | consumed tokens: 52816773120 | elapsed time per iteration (s): 1.03 | learning rate: 3.699E-05 | global batch size: 256 | lm loss: 1.899868E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.372 | TFLOPs: 41.21 | 15: iteration 100750/ 125429 | consumed samples: 25792000 | consumed tokens: 52822016000 | elapsed time per iteration (s): 1.03 | learning rate: 3.698E-05 | global batch size: 256 | lm loss: 1.921162E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.904 | TFLOPs: 40.97 | 15: iteration 100760/ 125429 | consumed samples: 25794560 | consumed tokens: 52827258880 | elapsed time per iteration (s): 1.05 | learning rate: 3.697E-05 | global batch size: 256 | lm loss: 1.904924E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.406 | TFLOPs: 40.39 | 15: iteration 100770/ 125429 | consumed samples: 25797120 | consumed tokens: 52832501760 | elapsed time per iteration (s): 1.06 | learning rate: 3.695E-05 | global batch size: 256 | lm loss: 1.913108E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.285 | TFLOPs: 39.87 | 15: iteration 100780/ 125429 | consumed samples: 25799680 | consumed tokens: 52837744640 | elapsed time per 
iteration (s): 1.03 | learning rate: 3.694E-05 | global batch size: 256 | lm loss: 1.876903E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.084 | TFLOPs: 41.16 | 15: iteration 100790/ 125429 | consumed samples: 25802240 | consumed tokens: 52842987520 | elapsed time per iteration (s): 1.05 | learning rate: 3.693E-05 | global batch size: 256 | lm loss: 1.889804E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.370 | TFLOPs: 40.38 | 15: iteration 100800/ 125429 | consumed samples: 25804800 | consumed tokens: 52848230400 | elapsed time per iteration (s): 1.06 | learning rate: 3.691E-05 | global batch size: 256 | lm loss: 1.892546E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.579 | TFLOPs: 40.09 | 15: iteration 100810/ 125429 | consumed samples: 25807360 | consumed tokens: 52853473280 | elapsed time per iteration (s): 1.04 | learning rate: 3.690E-05 | global batch size: 256 | lm loss: 1.914031E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.568 | TFLOPs: 40.58 | 15: iteration 100820/ 125429 | consumed samples: 25809920 | consumed tokens: 52858716160 | elapsed time per iteration (s): 1.04 | learning rate: 3.689E-05 | global batch size: 256 | lm loss: 1.916362E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.069 | TFLOPs: 40.66 | 15: iteration 100830/ 125429 | consumed samples: 25812480 | consumed tokens: 52863959040 | elapsed time per iteration (s): 1.04 | learning rate: 3.687E-05 | global batch size: 256 | lm loss: 1.902640E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.528 | TFLOPs: 
40.74 | 15: iteration 100840/ 125429 | consumed samples: 25815040 | consumed tokens: 52869201920 | elapsed time per iteration (s): 1.05 | learning rate: 3.686E-05 | global batch size: 256 | lm loss: 1.888948E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.283 | TFLOPs: 40.20 | 15: iteration 100850/ 125429 | consumed samples: 25817600 | consumed tokens: 52874444800 | elapsed time per iteration (s): 1.04 | learning rate: 3.685E-05 | global batch size: 256 | lm loss: 1.925496E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.662 | TFLOPs: 40.76 | 15: iteration 100860/ 125429 | consumed samples: 25820160 | consumed tokens: 52879687680 | elapsed time per iteration (s): 1.06 | learning rate: 3.683E-05 | global batch size: 256 | lm loss: 1.922494E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.248 | TFLOPs: 40.03 | 15: iteration 100870/ 125429 | consumed samples: 25822720 | consumed tokens: 52884930560 | elapsed time per iteration (s): 1.06 | learning rate: 3.682E-05 | global batch size: 256 | lm loss: 1.915470E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.871 | TFLOPs: 39.97 | 15: iteration 100880/ 125429 | consumed samples: 25825280 | consumed tokens: 52890173440 | elapsed time per iteration (s): 1.08 | learning rate: 3.681E-05 | global batch size: 256 | lm loss: 1.920925E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.989 | TFLOPs: 39.00 | 15: iteration 100890/ 125429 | consumed samples: 25827840 | consumed tokens: 52895416320 | elapsed time per iteration (s): 1.04 | learning rate: 3.679E-05 | global batch size: 256 | lm loss: 1.901152E+00 | grad norm: 0.156 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.273 | TFLOPs: 40.70 | 15: iteration 100900/ 125429 | consumed samples: 25830400 | consumed tokens: 52900659200 | elapsed time per iteration (s): 1.10 | learning rate: 3.678E-05 | global batch size: 256 | lm loss: 1.906322E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.782 | TFLOPs: 38.63 | 15: iteration 100910/ 125429 | consumed samples: 25832960 | consumed tokens: 52905902080 | elapsed time per iteration (s): 1.03 | learning rate: 3.677E-05 | global batch size: 256 | lm loss: 1.911997E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.795 | TFLOPs: 40.95 | 15: iteration 100920/ 125429 | consumed samples: 25835520 | consumed tokens: 52911144960 | elapsed time per iteration (s): 1.04 | learning rate: 3.675E-05 | global batch size: 256 | lm loss: 1.910146E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.099 | TFLOPs: 40.67 | 15: iteration 100930/ 125429 | consumed samples: 25838080 | consumed tokens: 52916387840 | elapsed time per iteration (s): 1.03 | learning rate: 3.674E-05 | global batch size: 256 | lm loss: 1.906303E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.425 | TFLOPs: 40.89 | 15: iteration 100940/ 125429 | consumed samples: 25840640 | consumed tokens: 52921630720 | elapsed time per iteration (s): 1.04 | learning rate: 3.673E-05 | global batch size: 256 | lm loss: 1.915808E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.070 | TFLOPs: 40.66 | 15: iteration 100950/ 125429 | consumed samples: 25843200 | consumed tokens: 52926873600 | elapsed time per 
iteration (s): 1.05 | learning rate: 3.672E-05 | global batch size: 256 | lm loss: 1.899631E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.494 | TFLOPs: 40.24 | 15: iteration 100960/ 125429 | consumed samples: 25845760 | consumed tokens: 52932116480 | elapsed time per iteration (s): 1.04 | learning rate: 3.670E-05 | global batch size: 256 | lm loss: 1.916233E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.025 | TFLOPs: 40.49 | 15: iteration 100970/ 125429 | consumed samples: 25848320 | consumed tokens: 52937359360 | elapsed time per iteration (s): 1.04 | learning rate: 3.669E-05 | global batch size: 256 | lm loss: 1.907353E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.971 | TFLOPs: 40.81 | 15: iteration 100980/ 125429 | consumed samples: 25850880 | consumed tokens: 52942602240 | elapsed time per iteration (s): 1.06 | learning rate: 3.668E-05 | global batch size: 256 | lm loss: 1.906103E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.470 | TFLOPs: 39.90 | 15: iteration 100990/ 125429 | consumed samples: 25853440 | consumed tokens: 52947845120 | elapsed time per iteration (s): 1.02 | learning rate: 3.666E-05 | global batch size: 256 | lm loss: 1.892348E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.377 | TFLOPs: 41.54 | 15: iteration 101000/ 125429 | consumed samples: 25856000 | consumed tokens: 52953088000 | elapsed time per iteration (s): 1.05 | learning rate: 3.665E-05 | global batch size: 256 | lm loss: 1.875968E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.934 | TFLOPs: 
40.15 | 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 101000 | lm loss value: 1.975238E+00 | lm loss PPL: 7.208335E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 101000 to checkpoints_1b5 0: [2022-11-27 01:56:23,430] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step101000 is begin to save! 0: [2022-11-27 01:56:23,438] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_01-model_00-model_states.pt... 0: [2022-11-27 01:56:23,680] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_01-model_00-model_states.pt. 0: [2022-11-27 01:56:23,681] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_03-model_00-model_states.pt... 0: [2022-11-27 01:56:23,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_03-model_00-model_states.pt. 0: [2022-11-27 01:56:23,787] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_04-model_00-model_states.pt... 0: [2022-11-27 01:56:23,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_04-model_00-model_states.pt. 0: [2022-11-27 01:56:23,902] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_05-model_00-model_states.pt... 0: [2022-11-27 01:56:24,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_05-model_00-model_states.pt. 0: [2022-11-27 01:56:24,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_06-model_00-model_states.pt... 
0: [2022-11-27 01:56:24,121] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_06-model_00-model_states.pt. 0: [2022-11-27 01:56:24,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_07-model_00-model_states.pt... 0: [2022-11-27 01:56:24,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_07-model_00-model_states.pt. 0: [2022-11-27 01:56:24,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_08-model_00-model_states.pt... 0: [2022-11-27 01:56:24,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_08-model_00-model_states.pt. 0: [2022-11-27 01:56:24,347] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_09-model_00-model_states.pt... 0: [2022-11-27 01:56:24,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_09-model_00-model_states.pt. 0: [2022-11-27 01:56:24,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_10-model_00-model_states.pt... 0: [2022-11-27 01:56:24,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_10-model_00-model_states.pt. 0: [2022-11-27 01:56:24,553] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_11-model_00-model_states.pt... 0: [2022-11-27 01:56:24,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_11-model_00-model_states.pt. 0: [2022-11-27 01:56:24,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_12-model_00-model_states.pt... 
0: [2022-11-27 01:56:24,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_12-model_00-model_states.pt. 0: [2022-11-27 01:56:24,765] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_13-model_00-model_states.pt... 0: [2022-11-27 01:56:24,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_13-model_00-model_states.pt. 0: [2022-11-27 01:56:24,868] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_14-model_00-model_states.pt... 0: [2022-11-27 01:56:24,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_14-model_00-model_states.pt. 0: [2022-11-27 01:56:24,969] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_15-model_00-model_states.pt... 0: [2022-11-27 01:56:25,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_15-model_00-model_states.pt. 0: [2022-11-27 01:56:25,079] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_16-model_00-model_states.pt... 0: [2022-11-27 01:56:25,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_16-model_00-model_states.pt. 0: [2022-11-27 01:56:25,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_17-model_00-model_states.pt... 0: [2022-11-27 01:56:25,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_17-model_00-model_states.pt. 0: [2022-11-27 01:56:25,290] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_18-model_00-model_states.pt... 
0: [2022-11-27 01:56:25,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_18-model_00-model_states.pt. 0: [2022-11-27 01:56:25,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_19-model_00-model_states.pt... 0: [2022-11-27 01:56:25,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_19-model_00-model_states.pt. 0: [2022-11-27 01:56:25,501] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_20-model_00-model_states.pt... 0: [2022-11-27 01:56:25,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_20-model_00-model_states.pt. 0: [2022-11-27 01:56:25,612] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_21-model_00-model_states.pt... 0: [2022-11-27 01:56:25,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_21-model_00-model_states.pt. 0: [2022-11-27 01:56:25,719] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_22-model_00-model_states.pt... 0: [2022-11-27 01:56:25,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_22-model_00-model_states.pt. 0: [2022-11-27 01:56:25,826] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_23-model_00-model_states.pt... 0: [2022-11-27 01:56:25,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_23-model_00-model_states.pt. 0: [2022-11-27 01:56:25,937] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_24-model_00-model_states.pt... 
0: [2022-11-27 01:56:26,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_24-model_00-model_states.pt. 0: [2022-11-27 01:56:26,047] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_25-model_00-model_states.pt... 0: [2022-11-27 01:56:26,155] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_25-model_00-model_states.pt. 0: [2022-11-27 01:56:26,156] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_26-model_00-model_states.pt... 0: [2022-11-27 01:56:26,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_26-model_00-model_states.pt. 0: [2022-11-27 01:56:26,271] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_27-model_00-model_states.pt... 0: [2022-11-27 01:56:26,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_27-model_00-model_states.pt. 0: [2022-11-27 01:56:26,382] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_28-model_00-model_states.pt... 0: [2022-11-27 01:56:26,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_28-model_00-model_states.pt. 0: [2022-11-27 01:56:26,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_29-model_00-model_states.pt... 0: [2022-11-27 01:56:26,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_29-model_00-model_states.pt. 0: [2022-11-27 01:56:26,604] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_30-model_00-model_states.pt... 
0: [2022-11-27 01:56:26,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_30-model_00-model_states.pt. 0: [2022-11-27 01:56:26,704] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/layer_32-model_00-model_states.pt... 0: [2022-11-27 01:56:26,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/layer_32-model_00-model_states.pt. 0: [2022-11-27 01:56:26,712] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step101000/mp_rank_00_model_states.pt 0: [2022-11-27 01:56:26,712] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/mp_rank_00_model_states.pt... 0: [2022-11-27 01:56:26,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/mp_rank_00_model_states.pt. 0: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
0: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
15: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
14: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
8: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
13: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
8: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 5: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 6: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 
3: [2022-11-27 01:56:26,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step101000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 1: [2022-11-27 01:56:26,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:56:26,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:56:26,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 01:56:26,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 5: [2022-11-27 01:56:26,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:56:26,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 01:56:26,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 15: [2022-11-27 01:56:26,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:56:26,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 01:56:26,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 15: [2022-11-27 01:56:26,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-27 01:56:26,919] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 01:56:26,919] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 7: [2022-11-27 01:56:26,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:56:26,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 01:56:26,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 7: [2022-11-27 01:56:26,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:56:26,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 01:56:26,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 7: [2022-11-27 01:56:26,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:56:26,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 01:56:26,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 5: [2022-11-27 01:56:26,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-27 01:56:26,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 01:56:26,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 11: [2022-11-27 01:56:26,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:56:26,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:56:26,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 01:56:26,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 2: [2022-11-27 01:56:26,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:56:26,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 01:56:26,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 2: [2022-11-27 01:56:26,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:56:26,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 01:56:26,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
2: [2022-11-27 01:56:26,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:56:26,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 01:56:26,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 9: [2022-11-27 01:56:26,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:56:26,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 01:56:26,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 6: [2022-11-27 01:56:26,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:56:26,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:56:26,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 01:56:26,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 6: [2022-11-27 01:56:26,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 01:56:26,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
13: [2022-11-27 01:56:26,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:56:26,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:56:26,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 01:56:26,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 8: [2022-11-27 01:56:26,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 13: [2022-11-27 01:56:26,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:56:26,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 13: [2022-11-27 01:56:26,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 8: [2022-11-27 01:56:26,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:56:26,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 8: [2022-11-27 01:56:26,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 01:56:26,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
8: [2022-11-27 01:56:26,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:56:26,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:56:26,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 01:56:26,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 8: [2022-11-27 01:56:26,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 01:56:26,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 11: [2022-11-27 01:56:26,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 01:56:26,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 11: [2022-11-27 01:56:26,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:56:26,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:56:26,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 01:56:26,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
11: [2022-11-27 01:56:26,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:56:26,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 12: [2022-11-27 01:56:26,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 11: [2022-11-27 01:56:26,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 12: [2022-11-27 01:56:26,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 8: [2022-11-27 01:56:26,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:56:26,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 01:56:26,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 5: [2022-11-27 01:56:26,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:56:26,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 01:56:26,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 6: [2022-11-27 01:56:26,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-27 01:56:26,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 01:56:26,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 15: [2022-11-27 01:56:26,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:56:26,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:56:26,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 01:56:26,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 01:56:26,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 15: [2022-11-27 01:56:26,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 11: [2022-11-27 01:56:26,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:56:26,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 9: [2022-11-27 01:56:26,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:56:26,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
9: [2022-11-27 01:56:26,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 01:56:26,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 9: [2022-11-27 01:56:26,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:56:26,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 01:56:26,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 8: [2022-11-27 01:56:26,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:56:26,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 01:56:26,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 7: [2022-11-27 01:56:26,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:56:26,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 01:56:26,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 15: [2022-11-27 01:56:26,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-27 01:56:26,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 01:56:26,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 11: [2022-11-27 01:56:26,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:56:26,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 01:56:26,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 12: [2022-11-27 01:56:26,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:56:26,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 01:56:26,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 4: [2022-11-27 01:56:26,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:56:26,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:56:26,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
4: [2022-11-27 01:56:26,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 01:56:26,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 01:56:26,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 4: [2022-11-27 01:56:26,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 4: [2022-11-27 01:56:26,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:56:26,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 4: [2022-11-27 01:56:26,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 01:56:26,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 4: [2022-11-27 01:56:26,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:56:26,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 12: [2022-11-27 01:56:26,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 4: [2022-11-27 01:56:26,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
4: [2022-11-27 01:56:26,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:56:26,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 11: [2022-11-27 01:56:26,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:56:26,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 11: [2022-11-27 01:56:26,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 01:56:26,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 5: [2022-11-27 01:56:26,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:56:26,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 01:56:26,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 1: [2022-11-27 01:56:26,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 01:56:26,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 1: [2022-11-27 01:56:26,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-27 01:56:26,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:56:26,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:56:26,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 01:56:26,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 13: [2022-11-27 01:56:26,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:56:26,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:56:26,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 01:56:26,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 01:56:26,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 13: [2022-11-27 01:56:26,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 13: [2022-11-27 01:56:26,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:56:26,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-27 01:56:26,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 01:56:26,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 13: [2022-11-27 01:56:26,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 15: [2022-11-27 01:56:26,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:56:26,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:56:26,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 01:56:26,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 01:56:26,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 15: [2022-11-27 01:56:26,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 12: [2022-11-27 01:56:26,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:56:26,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 01:56:26,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
7: [2022-11-27 01:56:26,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:56:26,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 01:56:26,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 13: [2022-11-27 01:56:26,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 13: [2022-11-27 01:56:26,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:56:26,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 01:56:26,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 01:56:26,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 01:56:26,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 13: [2022-11-27 01:56:26,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 5: [2022-11-27 01:56:26,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-27 01:56:26,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 01:56:26,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 8: [2022-11-27 01:56:26,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 01:56:26,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 01:56:26,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 9: [2022-11-27 01:56:26,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:56:26,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:56:26,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:56:26,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 01:56:26,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
9: [2022-11-27 01:56:26,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 01:56:26,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 01:56:26,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 9: [2022-11-27 01:56:26,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 6: [2022-11-27 01:56:26,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:56:26,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:56:26,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:56:26,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 01:56:26,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-27 01:56:26,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 01:56:26,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 01:56:26,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 01:56:26,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 01:56:26,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 01:56:26,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 6: [2022-11-27 01:56:26,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 6: [2022-11-27 01:56:26,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 6: [2022-11-27 01:56:26,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 6: [2022-11-27 01:56:26,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 7: [2022-11-27 01:56:26,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 15: [2022-11-27 01:56:26,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
7: [2022-11-27 01:56:26,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 15: [2022-11-27 01:56:26,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 01:56:26,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 7: [2022-11-27 01:56:26,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 7: [2022-11-27 01:56:26,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:56:26,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 01:56:26,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 1: [2022-11-27 01:56:26,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 01:56:26,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 01:56:26,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 1: [2022-11-27 01:56:26,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 1: [2022-11-27 01:56:26,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
2: [2022-11-27 01:56:26,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:56:26,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:56:26,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:56:26,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 2: [2022-11-27 01:56:26,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:56:26,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:56:26,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 2: [2022-11-27 01:56:26,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 1: [2022-11-27 01:56:26,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 2: [2022-11-27 01:56:26,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 1: [2022-11-27 01:56:26,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
2: [2022-11-27 01:56:26,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 01:56:26,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 01:56:26,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 2: [2022-11-27 01:56:26,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 2: [2022-11-27 01:56:26,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 2: [2022-11-27 01:56:26,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 7: [2022-11-27 01:56:26,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 01:56:26,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 01:56:26,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 1: [2022-11-27 01:56:26,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:56:26,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 01:56:26,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
14: [2022-11-27 01:56:26,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:56:26,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 01:56:26,921] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 1: [2022-11-27 01:56:26,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:56:26,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:56:26,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:56:26,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 14: [2022-11-27 01:56:26,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 01:56:26,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 01:56:26,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 14: [2022-11-27 01:56:26,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 1: [2022-11-27 01:56:26,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
14: [2022-11-27 01:56:26,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:56:26,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 01:56:26,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 14: [2022-11-27 01:56:26,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:56:26,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 01:56:26,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 14: [2022-11-27 01:56:26,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:56:26,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 01:56:26,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 14: [2022-11-27 01:56:26,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:56:26,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 01:56:26,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
14: [2022-11-27 01:56:26,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 01:56:26,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 01:56:26,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 11: [2022-11-27 01:56:26,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 01:56:26,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 01:56:26,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 12: [2022-11-27 01:56:26,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:56:26,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 01:56:26,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 2: [2022-11-27 01:56:26,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 01:56:26,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 01:56:26,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
5: [2022-11-27 01:56:26,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:56:26,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 01:56:26,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:56:26,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 5: [2022-11-27 01:56:26,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 01:56:26,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 5: [2022-11-27 01:56:26,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 01:56:26,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 01:56:26,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 1: [2022-11-27 01:56:26,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 01:56:26,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 01:56:26,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
11: [2022-11-27 01:56:26,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:56:26,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 01:56:26,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 01:56:26,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 4: [2022-11-27 01:56:26,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:56:26,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 01:56:26,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 4: [2022-11-27 01:56:26,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 01:56:26,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 01:56:26,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 8: [2022-11-27 01:56:26,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-27 01:56:26,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 01:56:26,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 11: [2022-11-27 01:56:26,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 01:56:26,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 12: [2022-11-27 01:56:26,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:56:26,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 01:56:26,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 12: [2022-11-27 01:56:26,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 01:56:26,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 01:56:26,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 10: [2022-11-27 01:56:27,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:56:27,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-27 01:56:27,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:56:27,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 01:56:27,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 01:56:27,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:56:27,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 01:56:27,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 10: [2022-11-27 01:56:27,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 10: [2022-11-27 01:56:27,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 01:56:27,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 10: [2022-11-27 01:56:27,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 0: [2022-11-27 01:56:27,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:56:27,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-27 01:56:27,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:56:27,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 01:56:27,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 01:56:27,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 01:56:27,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 0: [2022-11-27 01:56:27,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 0: [2022-11-27 01:56:27,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 0: [2022-11-27 01:56:27,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:56:27,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:56:27,015] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 01:56:27,015] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 0: [2022-11-27 01:56:27,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-27 01:56:27,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 01:56:27,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 3: [2022-11-27 01:56:27,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:56:27,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:56:27,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:56:27,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:56:27,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 01:56:27,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 01:56:27,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 01:56:27,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 01:56:27,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 3: [2022-11-27 01:56:27,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
3: [2022-11-27 01:56:27,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 3: [2022-11-27 01:56:27,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 0: [2022-11-27 01:56:27,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:56:27,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 01:56:27,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 0: [2022-11-27 01:56:27,016] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 01:56:27,016] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 01:56:27,016] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 3: [2022-11-27 01:56:27,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:56:27,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:56:27,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 01:56:27,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-27 01:56:27,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 01:56:27,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 01:56:27,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 01:56:27,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 01:56:27,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 3: [2022-11-27 01:56:27,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 3: [2022-11-27 01:56:27,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 3: [2022-11-27 01:56:27,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 10: [2022-11-27 01:56:27,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:56:27,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 01:56:27,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 10: [2022-11-27 01:56:27,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-27 01:56:27,053] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 01:56:27,053] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 10: [2022-11-27 01:56:27,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:56:27,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 01:56:27,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 10: [2022-11-27 01:56:27,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 01:56:27,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 01:56:27,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 0: [2022-11-27 01:56:27,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step101000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 01:56:27,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step101000 is ready now! 
0: successfully saved checkpoint at iteration 101000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3737.75
15: iteration 101010/ 125429 | consumed samples: 25858560 | consumed tokens: 52958330880 | elapsed time per iteration (s): 1.43 | learning rate: 3.664E-05 | global batch size: 256 | lm loss: 1.914346E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 178.726 | TFLOPs: 29.54 |
15: iteration 101020/ 125429 | consumed samples: 25861120 | consumed tokens: 52963573760 | elapsed time per iteration (s): 1.06 | learning rate: 3.662E-05 | global batch size: 256 | lm loss: 1.895885E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.574 | TFLOPs: 39.76 |
15: iteration 101030/ 125429 | consumed samples: 25863680 | consumed tokens: 52968816640 | elapsed time per iteration (s): 1.04 | learning rate: 3.661E-05 | global batch size: 256 | lm loss: 1.923684E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.828 | TFLOPs: 40.62 |
15: iteration 101040/ 125429 | consumed samples: 25866240 | consumed tokens: 52974059520 | elapsed time per iteration (s): 1.04 | learning rate: 3.660E-05 | global batch size: 256 | lm loss: 1.916763E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.418 | TFLOPs: 40.56 |
15: iteration 101050/ 125429 | consumed samples: 25868800 | consumed tokens: 52979302400 | elapsed time per iteration (s): 1.05 | learning rate: 3.658E-05 | global batch size: 256 | lm loss: 1.912194E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.053 | TFLOPs: 40.17 |
15: iteration 101060/ 125429 | consumed samples: 25871360 | consumed tokens: 52984545280 | elapsed time per iteration (s): 1.08 | learning rate: 3.657E-05 | global batch size: 256 | lm loss: 1.869637E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.049 | TFLOPs: 39.34 |
15: iteration 101070/ 125429 | consumed samples: 25873920 | consumed tokens: 52989788160 | elapsed time per iteration (s): 1.04 | learning rate: 3.656E-05 | global batch size: 256 | lm loss: 1.898814E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.167 | TFLOPs: 40.52 |
15: iteration 101080/ 125429 | consumed samples: 25876480 | consumed tokens: 52995031040 | elapsed time per iteration (s): 1.04 | learning rate: 3.654E-05 | global batch size: 256 | lm loss: 1.932274E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.441 | TFLOPs: 40.56 |
15: iteration 101090/ 125429 | consumed samples: 25879040 | consumed tokens: 53000273920 | elapsed time per iteration (s): 1.07 | learning rate: 3.653E-05 | global batch size: 256 | lm loss: 1.896085E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.461 | TFLOPs: 39.57 |
15: iteration 101100/ 125429 | consumed samples: 25881600 | consumed tokens: 53005516800 | elapsed time per iteration (s): 1.02 | learning rate: 3.652E-05 | global batch size: 256 | lm loss: 1.890564E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.238 | TFLOPs: 41.52 |
15: iteration 101110/ 125429 | consumed samples: 25884160 | consumed tokens: 53010759680 | elapsed time per iteration (s): 1.04 | learning rate: 3.650E-05 | global batch size: 256 | lm loss: 1.917609E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.913 | TFLOPs: 40.80 |
15: iteration 101120/ 125429 | consumed samples: 25886720 | consumed tokens: 53016002560 | elapsed time per iteration (s): 1.04 | learning rate: 3.649E-05 | global batch size: 256 | lm loss: 1.875807E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.610 | TFLOPs: 40.59 |
15: iteration 101130/ 125429 | consumed samples: 25889280 | consumed tokens: 53021245440 | elapsed time per iteration (s): 1.03 | learning rate: 3.648E-05 | global batch size: 256 | lm loss: 1.902933E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.388 | TFLOPs: 40.88 |
15: iteration 101140/ 125429 | consumed samples: 25891840 | consumed tokens: 53026488320 | elapsed time per iteration (s): 1.03 | learning rate: 3.646E-05 | global batch size: 256 | lm loss: 1.902620E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.473 | TFLOPs: 41.06 |
15: iteration 101150/ 125429 | consumed samples: 25894400 | consumed tokens: 53031731200 | elapsed time per iteration (s): 1.04 | learning rate: 3.645E-05 | global batch size: 256 | lm loss: 1.921248E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.928 | TFLOPs: 40.64 |
15: iteration 101160/ 125429 | consumed samples: 25896960 | consumed tokens: 53036974080 | elapsed time per iteration (s): 1.07 | learning rate: 3.644E-05 | global batch size: 256 | lm loss: 1.881143E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.043 | TFLOPs: 39.67 |
15: iteration 101170/ 125429 | consumed samples: 25899520 | consumed tokens: 53042216960 | elapsed time per iteration (s): 1.06 | learning rate: 3.643E-05 | global batch size: 256 | lm loss: 1.907600E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.805 | TFLOPs: 39.79 |
15: iteration 101180/ 125429 | consumed samples: 25902080 | consumed tokens: 53047459840 | elapsed time per iteration (s): 1.05 | learning rate: 3.641E-05 | global batch size: 256 | lm loss: 1.914780E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.532 | TFLOPs: 40.25 |
15: iteration 101190/ 125429 | consumed samples: 25904640 | consumed tokens: 53052702720 | elapsed time per iteration (s): 1.06 | learning rate: 3.640E-05 | global batch size: 256 | lm loss: 1.895601E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.591 | TFLOPs: 39.76 |
15: iteration 101200/ 125429 | consumed samples: 25907200 | consumed tokens: 53057945600 | elapsed time per iteration (s): 1.04 | learning rate: 3.639E-05 | global batch size: 256 | lm loss: 1.916449E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.230 | TFLOPs: 40.69 |
15: iteration 101210/ 125429 | consumed samples: 25909760 | consumed tokens: 53063188480 | elapsed time per iteration (s): 1.06 | learning rate: 3.637E-05 | global batch size: 256 | lm loss: 1.925758E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.389 | TFLOPs: 39.73 |
15: iteration 101220/ 125429 | consumed samples: 25912320 | consumed tokens: 53068431360 | elapsed time per iteration (s): 1.05 | learning rate: 3.636E-05 | global batch size: 256 | lm loss: 1.886371E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.659 | TFLOPs: 40.27 |
15: iteration 101230/ 125429 | consumed samples: 25914880 | consumed tokens: 53073674240 | elapsed time per iteration (s): 1.08 |
learning rate: 3.635E-05 | global batch size: 256 | lm loss: 1.900193E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.259 | TFLOPs: 39.04 | 15: iteration 101240/ 125429 | consumed samples: 25917440 | consumed tokens: 53078917120 | elapsed time per iteration (s): 1.03 | learning rate: 3.633E-05 | global batch size: 256 | lm loss: 1.919694E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.714 | TFLOPs: 40.94 | 15: iteration 101250/ 125429 | consumed samples: 25920000 | consumed tokens: 53084160000 | elapsed time per iteration (s): 1.04 | learning rate: 3.632E-05 | global batch size: 256 | lm loss: 1.895371E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.867 | TFLOPs: 40.80 | 15: iteration 101260/ 125429 | consumed samples: 25922560 | consumed tokens: 53089402880 | elapsed time per iteration (s): 1.05 | learning rate: 3.631E-05 | global batch size: 256 | lm loss: 1.902596E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.530 | TFLOPs: 40.25 | 15: iteration 101270/ 125429 | consumed samples: 25925120 | consumed tokens: 53094645760 | elapsed time per iteration (s): 1.04 | learning rate: 3.629E-05 | global batch size: 256 | lm loss: 1.886831E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.555 | TFLOPs: 40.75 | 15: iteration 101280/ 125429 | consumed samples: 25927680 | consumed tokens: 53099888640 | elapsed time per iteration (s): 1.04 | learning rate: 3.628E-05 | global batch size: 256 | lm loss: 1.902701E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.624 | TFLOPs: 40.76 | 15: iteration 
101290/ 125429 | consumed samples: 25930240 | consumed tokens: 53105131520 | elapsed time per iteration (s): 1.04 | learning rate: 3.627E-05 | global batch size: 256 | lm loss: 1.937516E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.995 | TFLOPs: 40.49 | 15: iteration 101300/ 125429 | consumed samples: 25932800 | consumed tokens: 53110374400 | elapsed time per iteration (s): 1.02 | learning rate: 3.626E-05 | global batch size: 256 | lm loss: 1.901650E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.672 | TFLOPs: 41.59 | 15: iteration 101310/ 125429 | consumed samples: 25935360 | consumed tokens: 53115617280 | elapsed time per iteration (s): 1.02 | learning rate: 3.624E-05 | global batch size: 256 | lm loss: 1.890437E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.071 | TFLOPs: 41.33 | 15: iteration 101320/ 125429 | consumed samples: 25937920 | consumed tokens: 53120860160 | elapsed time per iteration (s): 1.05 | learning rate: 3.623E-05 | global batch size: 256 | lm loss: 1.900472E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.472 | TFLOPs: 40.40 | 15: iteration 101330/ 125429 | consumed samples: 25940480 | consumed tokens: 53126103040 | elapsed time per iteration (s): 1.03 | learning rate: 3.622E-05 | global batch size: 256 | lm loss: 1.936912E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.642 | TFLOPs: 40.92 | 15: iteration 101340/ 125429 | consumed samples: 25943040 | consumed tokens: 53131345920 | elapsed time per iteration (s): 1.06 | learning rate: 3.620E-05 | global batch size: 256 | lm loss: 1.938312E+00 | grad norm: 0.147 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.655 | TFLOPs: 39.77 | 15: iteration 101350/ 125429 | consumed samples: 25945600 | consumed tokens: 53136588800 | elapsed time per iteration (s): 1.06 | learning rate: 3.619E-05 | global batch size: 256 | lm loss: 1.912258E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.160 | TFLOPs: 39.85 | 15: iteration 101360/ 125429 | consumed samples: 25948160 | consumed tokens: 53141831680 | elapsed time per iteration (s): 1.04 | learning rate: 3.618E-05 | global batch size: 256 | lm loss: 1.910474E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.434 | TFLOPs: 40.56 | 15: iteration 101370/ 125429 | consumed samples: 25950720 | consumed tokens: 53147074560 | elapsed time per iteration (s): 1.05 | learning rate: 3.616E-05 | global batch size: 256 | lm loss: 1.911575E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.371 | TFLOPs: 40.38 | 15: iteration 101380/ 125429 | consumed samples: 25953280 | consumed tokens: 53152317440 | elapsed time per iteration (s): 1.03 | learning rate: 3.615E-05 | global batch size: 256 | lm loss: 1.908456E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.643 | TFLOPs: 40.92 | 15: iteration 101390/ 125429 | consumed samples: 25955840 | consumed tokens: 53157560320 | elapsed time per iteration (s): 1.03 | learning rate: 3.614E-05 | global batch size: 256 | lm loss: 1.904679E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.530 | TFLOPs: 40.91 | 15: iteration 101400/ 125429 | consumed samples: 25958400 | consumed tokens: 53162803200 | elapsed time per iteration (s): 1.03 | learning 
rate: 3.613E-05 | global batch size: 256 | lm loss: 1.910658E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.490 | TFLOPs: 41.23 | 15: iteration 101410/ 125429 | consumed samples: 25960960 | consumed tokens: 53168046080 | elapsed time per iteration (s): 1.05 | learning rate: 3.611E-05 | global batch size: 256 | lm loss: 1.908269E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.813 | TFLOPs: 40.13 | 15: iteration 101420/ 125429 | consumed samples: 25963520 | consumed tokens: 53173288960 | elapsed time per iteration (s): 1.05 | learning rate: 3.610E-05 | global batch size: 256 | lm loss: 1.931415E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.366 | TFLOPs: 40.22 | 15: iteration 101430/ 125429 | consumed samples: 25966080 | consumed tokens: 53178531840 | elapsed time per iteration (s): 1.10 | learning rate: 3.609E-05 | global batch size: 256 | lm loss: 1.919676E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.659 | TFLOPs: 38.61 | 15: iteration 101440/ 125429 | consumed samples: 25968640 | consumed tokens: 53183774720 | elapsed time per iteration (s): 1.20 | learning rate: 3.607E-05 | global batch size: 256 | lm loss: 1.911977E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 213.660 | TFLOPs: 35.31 | 15: iteration 101450/ 125429 | consumed samples: 25971200 | consumed tokens: 53189017600 | elapsed time per iteration (s): 1.03 | learning rate: 3.606E-05 | global batch size: 256 | lm loss: 1.887023E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.090 | TFLOPs: 41.00 | 15: iteration 101460/ 
125429 | consumed samples: 25973760 | consumed tokens: 53194260480 | elapsed time per iteration (s): 1.02 | learning rate: 3.605E-05 | global batch size: 256 | lm loss: 1.909260E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.240 | TFLOPs: 41.35 | 15: iteration 101470/ 125429 | consumed samples: 25976320 | consumed tokens: 53199503360 | elapsed time per iteration (s): 1.03 | learning rate: 3.603E-05 | global batch size: 256 | lm loss: 1.920855E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.793 | TFLOPs: 41.11 | 15: iteration 101480/ 125429 | consumed samples: 25978880 | consumed tokens: 53204746240 | elapsed time per iteration (s): 1.03 | learning rate: 3.602E-05 | global batch size: 256 | lm loss: 1.887352E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.803 | TFLOPs: 41.12 | 15: iteration 101490/ 125429 | consumed samples: 25981440 | consumed tokens: 53209989120 | elapsed time per iteration (s): 1.05 | learning rate: 3.601E-05 | global batch size: 256 | lm loss: 1.909270E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.895 | TFLOPs: 40.47 | 15: iteration 101500/ 125429 | consumed samples: 25984000 | consumed tokens: 53215232000 | elapsed time per iteration (s): 1.04 | learning rate: 3.600E-05 | global batch size: 256 | lm loss: 1.903523E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.373 | TFLOPs: 40.72 | 15: iteration 101510/ 125429 | consumed samples: 25986560 | consumed tokens: 53220474880 | elapsed time per iteration (s): 1.05 | learning rate: 3.598E-05 | global batch size: 256 | lm loss: 1.915919E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 244.851 | TFLOPs: 40.46 | 15: iteration 101520/ 125429 | consumed samples: 25989120 | consumed tokens: 53225717760 | elapsed time per iteration (s): 1.18 | learning rate: 3.597E-05 | global batch size: 256 | lm loss: 1.893353E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.848 | TFLOPs: 36.00 | 15: iteration 101530/ 125429 | consumed samples: 25991680 | consumed tokens: 53230960640 | elapsed time per iteration (s): 1.05 | learning rate: 3.596E-05 | global batch size: 256 | lm loss: 1.892717E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.854 | TFLOPs: 40.46 | 15: iteration 101540/ 125429 | consumed samples: 25994240 | consumed tokens: 53236203520 | elapsed time per iteration (s): 1.03 | learning rate: 3.594E-05 | global batch size: 256 | lm loss: 1.911490E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.564 | TFLOPs: 41.24 | 15: iteration 101550/ 125429 | consumed samples: 25996800 | consumed tokens: 53241446400 | elapsed time per iteration (s): 1.08 | learning rate: 3.593E-05 | global batch size: 256 | lm loss: 1.905133E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.573 | TFLOPs: 39.10 | 15: iteration 101560/ 125429 | consumed samples: 25999360 | consumed tokens: 53246689280 | elapsed time per iteration (s): 1.03 | learning rate: 3.592E-05 | global batch size: 256 | lm loss: 1.911311E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.110 | TFLOPs: 41.17 | 15: iteration 101570/ 125429 | consumed samples: 26001920 | consumed tokens: 53251932160 | elapsed time per iteration (s): 1.04 | learning rate: 
3.590E-05 | global batch size: 256 | lm loss: 1.907490E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.596 | TFLOPs: 40.59 | 15: iteration 101580/ 125429 | consumed samples: 26004480 | consumed tokens: 53257175040 | elapsed time per iteration (s): 1.03 | learning rate: 3.589E-05 | global batch size: 256 | lm loss: 1.904634E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.201 | TFLOPs: 41.02 | 15: iteration 101590/ 125429 | consumed samples: 26007040 | consumed tokens: 53262417920 | elapsed time per iteration (s): 1.03 | learning rate: 3.588E-05 | global batch size: 256 | lm loss: 1.898288E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.491 | TFLOPs: 40.90 | 15: iteration 101600/ 125429 | consumed samples: 26009600 | consumed tokens: 53267660800 | elapsed time per iteration (s): 1.03 | learning rate: 3.587E-05 | global batch size: 256 | lm loss: 1.892897E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.637 | TFLOPs: 41.25 | 15: iteration 101610/ 125429 | consumed samples: 26012160 | consumed tokens: 53272903680 | elapsed time per iteration (s): 1.07 | learning rate: 3.585E-05 | global batch size: 256 | lm loss: 1.919028E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.340 | TFLOPs: 39.55 | 15: iteration 101620/ 125429 | consumed samples: 26014720 | consumed tokens: 53278146560 | elapsed time per iteration (s): 1.05 | learning rate: 3.584E-05 | global batch size: 256 | lm loss: 1.866377E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.163 | TFLOPs: 40.18 | 15: iteration 101630/ 125429 | 
consumed samples: 26017280 | consumed tokens: 53283389440 | elapsed time per iteration (s): 1.05 | learning rate: 3.583E-05 | global batch size: 256 | lm loss: 1.902299E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.453 | TFLOPs: 40.40 | 15: iteration 101640/ 125429 | consumed samples: 26019840 | consumed tokens: 53288632320 | elapsed time per iteration (s): 1.07 | learning rate: 3.581E-05 | global batch size: 256 | lm loss: 1.929632E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.900 | TFLOPs: 39.65 | 15: iteration 101650/ 125429 | consumed samples: 26022400 | consumed tokens: 53293875200 | elapsed time per iteration (s): 1.02 | learning rate: 3.580E-05 | global batch size: 256 | lm loss: 1.924427E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.189 | TFLOPs: 41.51 | 15: iteration 101660/ 125429 | consumed samples: 26024960 | consumed tokens: 53299118080 | elapsed time per iteration (s): 1.04 | learning rate: 3.579E-05 | global batch size: 256 | lm loss: 1.908556E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.634 | TFLOPs: 40.59 | 15: iteration 101670/ 125429 | consumed samples: 26027520 | consumed tokens: 53304360960 | elapsed time per iteration (s): 1.07 | learning rate: 3.578E-05 | global batch size: 256 | lm loss: 1.906862E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.646 | TFLOPs: 39.44 | 15: iteration 101680/ 125429 | consumed samples: 26030080 | consumed tokens: 53309603840 | elapsed time per iteration (s): 1.04 | learning rate: 3.576E-05 | global batch size: 256 | lm loss: 1.902954E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 246.599 | TFLOPs: 40.75 | 15: iteration 101690/ 125429 | consumed samples: 26032640 | consumed tokens: 53314846720 | elapsed time per iteration (s): 1.03 | learning rate: 3.575E-05 | global batch size: 256 | lm loss: 1.917288E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.439 | TFLOPs: 40.89 | 15: iteration 101700/ 125429 | consumed samples: 26035200 | consumed tokens: 53320089600 | elapsed time per iteration (s): 1.04 | learning rate: 3.574E-05 | global batch size: 256 | lm loss: 1.923091E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.939 | TFLOPs: 40.64 | 15: iteration 101710/ 125429 | consumed samples: 26037760 | consumed tokens: 53325332480 | elapsed time per iteration (s): 1.06 | learning rate: 3.572E-05 | global batch size: 256 | lm loss: 1.878105E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.349 | TFLOPs: 39.88 | 15: iteration 101720/ 125429 | consumed samples: 26040320 | consumed tokens: 53330575360 | elapsed time per iteration (s): 1.03 | learning rate: 3.571E-05 | global batch size: 256 | lm loss: 1.906306E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.075 | TFLOPs: 41.00 | 15: iteration 101730/ 125429 | consumed samples: 26042880 | consumed tokens: 53335818240 | elapsed time per iteration (s): 1.06 | learning rate: 3.570E-05 | global batch size: 256 | lm loss: 1.904387E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.998 | TFLOPs: 39.83 | 15: iteration 101740/ 125429 | consumed samples: 26045440 | consumed tokens: 53341061120 | elapsed time per iteration (s): 1.05 | learning rate: 
3.569E-05 | global batch size: 256 | lm loss: 1.925051E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.317 | TFLOPs: 40.38 | 15: iteration 101750/ 125429 | consumed samples: 26048000 | consumed tokens: 53346304000 | elapsed time per iteration (s): 1.03 | learning rate: 3.567E-05 | global batch size: 256 | lm loss: 1.883229E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.413 | TFLOPs: 41.05 | 15: iteration 101760/ 125429 | consumed samples: 26050560 | consumed tokens: 53351546880 | elapsed time per iteration (s): 1.04 | learning rate: 3.566E-05 | global batch size: 256 | lm loss: 1.924354E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.118 | TFLOPs: 40.84 | 15: iteration 101770/ 125429 | consumed samples: 26053120 | consumed tokens: 53356789760 | elapsed time per iteration (s): 1.05 | learning rate: 3.565E-05 | global batch size: 256 | lm loss: 1.906783E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.472 | TFLOPs: 40.24 | 15: iteration 101780/ 125429 | consumed samples: 26055680 | consumed tokens: 53362032640 | elapsed time per iteration (s): 1.02 | learning rate: 3.563E-05 | global batch size: 256 | lm loss: 1.894512E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.119 | TFLOPs: 41.50 | 15: iteration 101790/ 125429 | consumed samples: 26058240 | consumed tokens: 53367275520 | elapsed time per iteration (s): 1.04 | learning rate: 3.562E-05 | global batch size: 256 | lm loss: 1.900903E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.548 | TFLOPs: 40.58 | 15: iteration 101800/ 125429 | 
consumed samples: 26060800 | consumed tokens: 53372518400 | elapsed time per iteration (s): 1.04 | learning rate: 3.561E-05 | global batch size: 256 | lm loss: 1.865334E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.838 | TFLOPs: 40.79 | 15: iteration 101810/ 125429 | consumed samples: 26063360 | consumed tokens: 53377761280 | elapsed time per iteration (s): 1.06 | learning rate: 3.560E-05 | global batch size: 256 | lm loss: 1.923662E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.461 | TFLOPs: 40.07 | 15: iteration 101820/ 125429 | consumed samples: 26065920 | consumed tokens: 53383004160 | elapsed time per iteration (s): 1.03 | learning rate: 3.558E-05 | global batch size: 256 | lm loss: 1.907203E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.822 | TFLOPs: 40.95 | 15: iteration 101830/ 125429 | consumed samples: 26068480 | consumed tokens: 53388247040 | elapsed time per iteration (s): 1.04 | learning rate: 3.557E-05 | global batch size: 256 | lm loss: 1.882246E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.301 | TFLOPs: 40.70 | 15: iteration 101840/ 125429 | consumed samples: 26071040 | consumed tokens: 53393489920 | elapsed time per iteration (s): 1.04 | learning rate: 3.556E-05 | global batch size: 256 | lm loss: 1.930269E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.098 | TFLOPs: 40.83 | 15: iteration 101850/ 125429 | consumed samples: 26073600 | consumed tokens: 53398732800 | elapsed time per iteration (s): 1.18 | learning rate: 3.554E-05 | global batch size: 256 | lm loss: 1.906459E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 217.580 | TFLOPs: 35.96 | 15: iteration 101860/ 125429 | consumed samples: 26076160 | consumed tokens: 53403975680 | elapsed time per iteration (s): 1.05 | learning rate: 3.553E-05 | global batch size: 256 | lm loss: 1.909085E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.993 | TFLOPs: 40.16 | 15: iteration 101870/ 125429 | consumed samples: 26078720 | consumed tokens: 53409218560 | elapsed time per iteration (s): 1.06 | learning rate: 3.552E-05 | global batch size: 256 | lm loss: 1.903856E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.340 | TFLOPs: 39.88 | 15: iteration 101880/ 125429 | consumed samples: 26081280 | consumed tokens: 53414461440 | elapsed time per iteration (s): 1.04 | learning rate: 3.551E-05 | global batch size: 256 | lm loss: 1.910801E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.924 | TFLOPs: 40.64 | 15: iteration 101890/ 125429 | consumed samples: 26083840 | consumed tokens: 53419704320 | elapsed time per iteration (s): 1.04 | learning rate: 3.549E-05 | global batch size: 256 | lm loss: 1.904066E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.301 | TFLOPs: 40.70 | 15: iteration 101900/ 125429 | consumed samples: 26086400 | consumed tokens: 53424947200 | elapsed time per iteration (s): 1.03 | learning rate: 3.548E-05 | global batch size: 256 | lm loss: 1.922070E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.450 | TFLOPs: 41.22 | 15: iteration 101910/ 125429 | consumed samples: 26088960 | consumed tokens: 53430190080 | elapsed time per iteration (s): 1.03 | learning rate: 
3.547E-05 | global batch size: 256 | lm loss: 1.887176E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.651 | TFLOPs: 41.09 | 15: iteration 101920/ 125429 | consumed samples: 26091520 | consumed tokens: 53435432960 | elapsed time per iteration (s): 1.05 | learning rate: 3.546E-05 | global batch size: 256 | lm loss: 1.898996E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.831 | TFLOPs: 40.46 | 15: iteration 101930/ 125429 | consumed samples: 26094080 | consumed tokens: 53440675840 | elapsed time per iteration (s): 1.06 | learning rate: 3.544E-05 | global batch size: 256 | lm loss: 1.892233E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.430 | TFLOPs: 39.73 | 15: iteration 101940/ 125429 | consumed samples: 26096640 | consumed tokens: 53445918720 | elapsed time per iteration (s): 1.04 | learning rate: 3.543E-05 | global batch size: 256 | lm loss: 1.898420E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.702 | TFLOPs: 40.77 | 15: iteration 101950/ 125429 | consumed samples: 26099200 | consumed tokens: 53451161600 | elapsed time per iteration (s): 1.03 | learning rate: 3.542E-05 | global batch size: 256 | lm loss: 1.903903E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.514 | TFLOPs: 40.90 | 15: iteration 101960/ 125429 | consumed samples: 26101760 | consumed tokens: 53456404480 | elapsed time per iteration (s): 1.04 | learning rate: 3.540E-05 | global batch size: 256 | lm loss: 1.894296E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.346 | TFLOPs: 40.71 | 15: iteration 101970/ 125429 | 
consumed samples: 26104320 | consumed tokens: 53461647360 | elapsed time per iteration (s): 1.04 | learning rate: 3.539E-05 | global batch size: 256 | lm loss: 1.912053E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.255 | TFLOPs: 40.70 | 15: iteration 101980/ 125429 | consumed samples: 26106880 | consumed tokens: 53466890240 | elapsed time per iteration (s): 1.04 | learning rate: 3.538E-05 | global batch size: 256 | lm loss: 1.930712E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.208 | TFLOPs: 40.85 | 15: iteration 101990/ 125429 | consumed samples: 26109440 | consumed tokens: 53472133120 | elapsed time per iteration (s): 1.03 | learning rate: 3.537E-05 | global batch size: 256 | lm loss: 1.936656E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.582 | TFLOPs: 41.25 | 0: [2022-11-27 02:13:55,181] [INFO] [logging.py:68:log_dist] [Rank 0] step=102000, skipped=0, lr=[3.535330614986421e-05, 3.535330614986421e-05, 3.535330614986421e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 0: steps: 102000 loss: 1.8818 iter time (s): 1.045 samples/sec: 245.047 15: iteration 102000/ 125429 | consumed samples: 26112000 | consumed tokens: 53477376000 | elapsed time per iteration (s): 1.05 | learning rate: 3.535E-05 | global batch size: 256 | lm loss: 1.896311E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.954 | TFLOPs: 40.32 | 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 102000 | lm loss value: 1.787602E+00 | lm loss PPL: 5.975106E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 102000 to 
checkpoints_1b5 0: [2022-11-27 02:13:55,684] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step102000 is begin to save! 0: [2022-11-27 02:13:55,691] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_01-model_00-model_states.pt... 0: [2022-11-27 02:13:55,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_01-model_00-model_states.pt. 0: [2022-11-27 02:13:55,940] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_03-model_00-model_states.pt... 0: [2022-11-27 02:13:56,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_03-model_00-model_states.pt. 0: [2022-11-27 02:13:56,044] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_04-model_00-model_states.pt... 0: [2022-11-27 02:13:56,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_04-model_00-model_states.pt. 0: [2022-11-27 02:13:56,154] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_05-model_00-model_states.pt... 0: [2022-11-27 02:13:56,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_05-model_00-model_states.pt. 0: [2022-11-27 02:13:56,258] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_06-model_00-model_states.pt... 0: [2022-11-27 02:13:56,360] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_06-model_00-model_states.pt. 0: [2022-11-27 02:13:56,360] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 02:13:56,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_07-model_00-model_states.pt. 0: [2022-11-27 02:13:56,468] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_08-model_00-model_states.pt... 0: [2022-11-27 02:13:56,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_08-model_00-model_states.pt. 0: [2022-11-27 02:13:56,572] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_09-model_00-model_states.pt... 0: [2022-11-27 02:13:56,675] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_09-model_00-model_states.pt. 0: [2022-11-27 02:13:56,676] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_10-model_00-model_states.pt... 0: [2022-11-27 02:13:56,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_10-model_00-model_states.pt. 0: [2022-11-27 02:13:56,776] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_11-model_00-model_states.pt... 0: [2022-11-27 02:13:56,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_11-model_00-model_states.pt. 0: [2022-11-27 02:13:56,882] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_12-model_00-model_states.pt... 0: [2022-11-27 02:13:56,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_12-model_00-model_states.pt. 0: [2022-11-27 02:13:56,985] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 02:13:57,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_13-model_00-model_states.pt. 0: [2022-11-27 02:13:57,087] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_14-model_00-model_states.pt... 0: [2022-11-27 02:13:57,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_14-model_00-model_states.pt. 0: [2022-11-27 02:13:57,193] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_15-model_00-model_states.pt... 0: [2022-11-27 02:13:57,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_15-model_00-model_states.pt. 0: [2022-11-27 02:13:57,293] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_16-model_00-model_states.pt... 0: [2022-11-27 02:13:57,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_16-model_00-model_states.pt. 0: [2022-11-27 02:13:57,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_17-model_00-model_states.pt... 0: [2022-11-27 02:13:57,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_17-model_00-model_states.pt. 0: [2022-11-27 02:13:57,499] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_18-model_00-model_states.pt... 0: [2022-11-27 02:13:57,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_18-model_00-model_states.pt. 0: [2022-11-27 02:13:57,602] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 02:13:57,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_19-model_00-model_states.pt. 0: [2022-11-27 02:13:57,705] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_20-model_00-model_states.pt... 0: [2022-11-27 02:13:57,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_20-model_00-model_states.pt. 0: [2022-11-27 02:13:57,812] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_21-model_00-model_states.pt... 0: [2022-11-27 02:13:57,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_21-model_00-model_states.pt. 0: [2022-11-27 02:13:57,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_22-model_00-model_states.pt... 0: [2022-11-27 02:13:58,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_22-model_00-model_states.pt. 0: [2022-11-27 02:13:58,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_23-model_00-model_states.pt... 0: [2022-11-27 02:13:58,117] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_23-model_00-model_states.pt. 0: [2022-11-27 02:13:58,118] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_24-model_00-model_states.pt... 0: [2022-11-27 02:13:58,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_24-model_00-model_states.pt. 0: [2022-11-27 02:13:58,223] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 02:13:58,325] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_25-model_00-model_states.pt. 0: [2022-11-27 02:13:58,325] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_26-model_00-model_states.pt... 0: [2022-11-27 02:13:58,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_26-model_00-model_states.pt. 0: [2022-11-27 02:13:58,424] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_27-model_00-model_states.pt... 0: [2022-11-27 02:13:58,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_27-model_00-model_states.pt. 0: [2022-11-27 02:13:58,528] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_28-model_00-model_states.pt... 0: [2022-11-27 02:13:58,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_28-model_00-model_states.pt. 0: [2022-11-27 02:13:58,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_29-model_00-model_states.pt... 0: [2022-11-27 02:13:58,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_29-model_00-model_states.pt. 0: [2022-11-27 02:13:58,732] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_30-model_00-model_states.pt... 0: [2022-11-27 02:13:58,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_30-model_00-model_states.pt. 0: [2022-11-27 02:13:58,836] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/layer_32-model_00-model_states.pt... 
0: [2022-11-27 02:13:58,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/layer_32-model_00-model_states.pt. 0: [2022-11-27 02:13:58,842] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step102000/mp_rank_00_model_states.pt 0: [2022-11-27 02:13:58,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/mp_rank_00_model_states.pt... 0: [2022-11-27 02:13:58,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/mp_rank_00_model_states.pt. 0: [2022-11-27 02:13:58,883] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:13:58,883] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:13:58,883] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
10: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
8: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:13:58,883] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
10: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
9: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
6: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
0: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
6: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:13:58,884] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step102000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:13:59,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-27 02:13:59,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 02:13:59,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 6: [2022-11-27 02:13:59,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:13:59,044] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 02:13:59,044] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 14: [2022-11-27 02:13:59,044] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:13:59,045] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 02:13:59,045] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 2: [2022-11-27 02:13:59,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:13:59,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 02:13:59,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 12: [2022-11-27 02:13:59,047] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-27 02:13:59,047] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 02:13:59,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 10: [2022-11-27 02:13:59,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:13:59,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:13:59,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 02:13:59,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 02:13:59,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 10: [2022-11-27 02:13:59,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 1: [2022-11-27 02:13:59,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:13:59,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 02:13:59,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 15: [2022-11-27 02:13:59,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-27 02:13:59,049] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 02:13:59,049] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 7: [2022-11-27 02:13:59,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:13:59,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 02:13:59,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 1: [2022-11-27 02:13:59,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:13:59,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 02:13:59,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 2: [2022-11-27 02:13:59,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:13:59,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 02:13:59,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 2: [2022-11-27 02:13:59,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-27 02:13:59,052] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 02:13:59,052] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 11: [2022-11-27 02:13:59,053] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:13:59,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 6: [2022-11-27 02:13:59,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:13:59,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:13:59,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 10: [2022-11-27 02:13:59,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 6: [2022-11-27 02:13:59,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 10: [2022-11-27 02:13:59,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 3: [2022-11-27 02:13:59,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-27 02:13:59,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 02:13:59,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 10: [2022-11-27 02:13:59,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:13:59,055] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 02:13:59,055] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 12: [2022-11-27 02:13:59,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:13:59,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 4: [2022-11-27 02:13:59,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:13:59,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 12: [2022-11-27 02:13:59,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 4: [2022-11-27 02:13:59,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 4: [2022-11-27 02:13:59,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-27 02:13:59,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 02:13:59,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 1: [2022-11-27 02:13:59,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:13:59,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:13:59,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 7: [2022-11-27 02:13:59,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:13:59,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 7: [2022-11-27 02:13:59,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 02:13:59,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 12: [2022-11-27 02:13:59,057] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:13:59,057] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 02:13:59,057] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
11: [2022-11-27 02:13:59,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 11: [2022-11-27 02:13:59,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:13:59,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 02:13:59,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 6: [2022-11-27 02:13:59,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:13:59,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 02:13:59,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 7: [2022-11-27 02:13:59,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:13:59,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 02:13:59,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 6: [2022-11-27 02:13:59,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-27 02:13:59,059] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 02:13:59,059] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 15: [2022-11-27 02:13:59,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:13:59,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:13:59,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 02:13:59,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 14: [2022-11-27 02:13:59,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:13:59,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:13:59,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-27 02:13:59,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 02:13:59,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 02:13:59,061] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 02:13:59,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 14: [2022-11-27 02:13:59,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 14: [2022-11-27 02:13:59,061] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 2: [2022-11-27 02:13:59,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:13:59,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 02:13:59,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 5: [2022-11-27 02:13:59,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:13:59,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:13:59,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 02:13:59,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 5: [2022-11-27 02:13:59,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 02:13:59,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 11: [2022-11-27 02:13:59,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:13:59,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:13:59,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:13:59,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:13:59,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
9: [2022-11-27 02:13:59,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 5: [2022-11-27 02:13:59,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 02:13:59,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 02:13:59,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 9: [2022-11-27 02:13:59,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 9: [2022-11-27 02:13:59,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:13:59,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 5: [2022-11-27 02:13:59,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 5: [2022-11-27 02:13:59,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 9: [2022-11-27 02:13:59,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 02:13:59,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
1: [2022-11-27 02:13:59,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 02:13:59,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 9: [2022-11-27 02:13:59,063] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:13:59,063] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 02:13:59,063] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 13: [2022-11-27 02:13:59,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:13:59,046] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:13:59,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 02:13:59,047] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 8: [2022-11-27 02:13:59,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:13:59,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
8: [2022-11-27 02:13:59,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 02:13:59,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 1: [2022-11-27 02:13:59,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 8: [2022-11-27 02:13:59,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:13:59,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:13:59,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 02:13:59,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 1: [2022-11-27 02:13:59,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 10: [2022-11-27 02:13:59,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 8: [2022-11-27 02:13:59,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:13:59,054] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 02:13:59,054] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
8: [2022-11-27 02:13:59,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:13:59,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 8: [2022-11-27 02:13:59,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 02:13:59,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 8: [2022-11-27 02:13:59,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:13:59,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 02:13:59,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 2: [2022-11-27 02:13:59,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:13:59,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 12: [2022-11-27 02:13:59,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:13:59,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 2: [2022-11-27 02:13:59,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
12: [2022-11-27 02:13:59,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 15: [2022-11-27 02:13:59,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 12: [2022-11-27 02:13:59,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 15: [2022-11-27 02:13:59,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:13:59,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:13:59,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 02:13:59,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 02:13:59,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 15: [2022-11-27 02:13:59,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 1: [2022-11-27 02:13:59,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:13:59,067] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 02:13:59,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
13: [2022-11-27 02:13:59,046] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 02:13:59,046] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 13: [2022-11-27 02:13:59,048] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:13:59,048] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 02:13:59,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 13: [2022-11-27 02:13:59,051] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:13:59,051] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 02:13:59,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 13: [2022-11-27 02:13:59,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:13:59,064] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 02:13:59,064] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 13: [2022-11-27 02:13:59,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-27 02:13:59,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 02:13:59,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 13: [2022-11-27 02:13:59,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:13:59,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 02:13:59,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 6: [2022-11-27 02:13:59,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:13:59,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 02:13:59,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 11: [2022-11-27 02:13:59,062] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 02:13:59,062] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 11: [2022-11-27 02:13:59,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-27 02:13:59,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 02:13:59,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 11: [2022-11-27 02:13:59,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:13:59,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:13:59,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 02:13:59,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 9: [2022-11-27 02:13:59,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:13:59,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:13:59,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 02:13:59,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 9: [2022-11-27 02:13:59,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 02:13:59,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
12: [2022-11-27 02:13:59,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:13:59,075] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 02:13:59,075] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 4: [2022-11-27 02:13:59,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:13:59,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 02:13:59,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 4: [2022-11-27 02:13:59,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:13:59,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 15: [2022-11-27 02:13:59,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:13:59,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 4: [2022-11-27 02:13:59,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-27 02:13:59,069] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 02:13:59,069] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 4: [2022-11-27 02:13:59,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:13:59,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 02:13:59,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 4: [2022-11-27 02:13:59,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:13:59,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 02:13:59,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 4: [2022-11-27 02:13:59,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:13:59,073] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 02:13:59,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 6: [2022-11-27 02:13:59,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-27 02:13:59,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 5: [2022-11-27 02:13:59,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:13:59,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 5: [2022-11-27 02:13:59,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 02:13:59,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 5: [2022-11-27 02:13:59,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:13:59,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 02:13:59,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 10: [2022-11-27 02:13:59,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:13:59,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 02:13:59,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 10: [2022-11-27 02:13:59,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-27 02:13:59,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 02:13:59,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 7: [2022-11-27 02:13:59,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:13:59,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 02:13:59,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 6: [2022-11-27 02:13:59,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:13:59,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 02:13:59,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 9: [2022-11-27 02:13:59,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:13:59,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 02:13:59,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 3: [2022-11-27 02:13:59,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-27 02:13:59,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:13:59,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:13:59,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:13:59,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:13:59,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 02:13:59,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 0: [2022-11-27 02:13:59,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:13:59,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:13:59,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 02:13:59,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
3: [2022-11-27 02:13:59,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 02:13:59,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 02:13:59,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 02:13:59,081] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 02:13:59,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 3: [2022-11-27 02:13:59,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 3: [2022-11-27 02:13:59,081] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 3: [2022-11-27 02:13:59,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 3: [2022-11-27 02:13:59,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:13:59,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 02:13:59,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 0: [2022-11-27 02:13:59,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-27 02:13:59,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 02:13:59,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:13:59,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 0: [2022-11-27 02:13:59,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 02:13:59,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 7: [2022-11-27 02:13:59,083] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:13:59,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 02:13:59,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 7: [2022-11-27 02:13:59,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:13:59,084] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 02:13:59,084] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 1: [2022-11-27 02:13:59,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-27 02:13:59,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 02:13:59,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 5: [2022-11-27 02:13:59,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:13:59,085] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 02:13:59,085] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 3: [2022-11-27 02:13:59,085] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:13:59,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 02:13:59,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 14: [2022-11-27 02:13:59,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:13:59,065] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 02:13:59,065] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 14: [2022-11-27 02:13:59,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-27 02:13:59,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 02:13:59,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 14: [2022-11-27 02:13:59,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:13:59,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 02:13:59,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 14: [2022-11-27 02:13:59,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:13:59,080] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 02:13:59,080] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 9: [2022-11-27 02:13:59,089] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:13:59,071] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 02:13:59,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
9: [2022-11-27 02:13:59,090] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 11: [2022-11-27 02:13:59,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:13:59,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 02:13:59,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 9: [2022-11-27 02:13:59,090] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 11: [2022-11-27 02:13:59,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:13:59,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 02:13:59,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 11: [2022-11-27 02:13:59,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:13:59,083] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 02:13:59,083] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
15: [2022-11-27 02:13:59,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 02:13:59,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 15: [2022-11-27 02:13:59,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:13:59,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 02:13:59,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 15: [2022-11-27 02:13:59,079] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:13:59,079] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 02:13:59,079] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 15: [2022-11-27 02:13:59,087] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:13:59,087] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 02:13:59,087] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 1: [2022-11-27 02:13:59,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
2: [2022-11-27 02:13:59,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:13:59,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 02:13:59,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 9: [2022-11-27 02:13:59,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:13:59,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 02:13:59,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 6: [2022-11-27 02:13:59,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:13:59,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 02:13:59,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 3: [2022-11-27 02:13:59,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:13:59,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 02:13:59,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
2: [2022-11-27 02:13:59,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:13:59,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:13:59,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 02:13:59,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 02:13:59,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 2: [2022-11-27 02:13:59,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 13: [2022-11-27 02:13:59,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:13:59,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:13:59,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 13: [2022-11-27 02:13:59,082] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 8: [2022-11-27 02:13:59,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 13: [2022-11-27 02:13:59,082] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
8: [2022-11-27 02:13:59,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:13:59,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:13:59,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 13: [2022-11-27 02:13:59,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 8: [2022-11-27 02:13:59,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 13: [2022-11-27 02:13:59,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 1: [2022-11-27 02:13:59,094] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 02:13:59,094] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 1: [2022-11-27 02:13:59,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:13:59,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 02:13:59,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 12: [2022-11-27 02:13:59,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-27 02:13:59,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 02:13:59,110] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 0: [2022-11-27 02:13:59,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:13:59,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:13:59,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:13:59,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 02:13:59,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 02:13:59,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 02:13:59,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 0: [2022-11-27 02:13:59,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 0: [2022-11-27 02:13:59,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now! 
0: [2022-11-27 02:13:59,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
0: [2022-11-27 02:13:59,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now!
0: [2022-11-27 02:13:59,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt.
0: [2022-11-27 02:13:59,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step102000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
0: [2022-11-27 02:13:59,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step102000 is ready now!
0: successfully saved checkpoint at iteration 102000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3603.50
15: iteration 102010/ 125429 | consumed samples: 26114560 | consumed tokens: 53482618880 | elapsed time per iteration (s): 1.42 | learning rate: 3.534E-05 | global batch size: 256 | lm loss: 1.910381E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 180.345 | TFLOPs: 29.80 |
15: iteration 102020/ 125429 | consumed samples: 26117120 | consumed tokens: 53487861760 | elapsed time per iteration (s): 1.04 | learning rate: 3.533E-05 | global batch size: 256 | lm loss: 1.891815E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.614 | TFLOPs: 40.75 |
15: iteration 102030/ 125429 | consumed samples: 26119680 | consumed tokens: 53493104640 | elapsed time per iteration (s): 1.04 | learning rate: 3.532E-05 | global batch size: 256 | lm loss: 1.893998E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.451 | TFLOPs: 40.56 |
15: iteration 102040/ 125429 | consumed samples: 26122240 | consumed tokens: 53498347520 | elapsed time per iteration (s): 1.05 | learning rate: 3.530E-05 | global batch size: 256 | lm loss: 1.877199E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.274 | TFLOPs: 40.37 |
15: iteration 102050/ 125429 | consumed samples: 26124800 | consumed tokens: 53503590400 | elapsed time per iteration (s): 1.03 | learning rate: 3.529E-05 | global batch size: 256 | lm loss: 1.901303E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.511 | TFLOPs: 41.07 |
15: iteration 102060/ 125429 | consumed samples: 26127360 | consumed tokens: 53508833280 | elapsed time per iteration (s): 1.18 | learning rate: 3.528E-05 | global batch size: 256 | lm loss: 1.930804E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.183 | TFLOPs: 35.89 |
15: iteration 102070/ 125429 | consumed samples: 26129920 | consumed tokens: 53514076160 | elapsed time per iteration (s): 1.04 | learning rate: 3.526E-05 | global batch size: 256 | lm loss: 1.896913E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.562 | TFLOPs: 40.58 |
15: iteration 102080/ 125429 | consumed samples: 26132480 | consumed tokens: 53519319040 | elapsed time per iteration (s): 1.04 | learning rate: 3.525E-05 | global batch size: 256 | lm loss: 1.903465E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.821 | TFLOPs: 40.79 |
15: iteration 102090/ 125429 | consumed samples: 26135040 | consumed tokens: 53524561920 | elapsed time per iteration (s): 1.05 | learning rate: 3.524E-05 | global batch size: 256 | lm loss: 1.950258E+00 | grad norm: 0.145 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.310 | TFLOPs: 40.21 |
15: iteration 102100/ 125429 | consumed samples: 26137600 | consumed tokens: 53529804800 | elapsed time per iteration (s): 1.03 | learning rate: 3.523E-05 | global batch size: 256 | lm loss: 1.915984E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.666 | TFLOPs: 41.09 |
15: iteration 102110/ 125429 | consumed samples: 26140160 | consumed tokens: 53535047680 | elapsed time per iteration (s): 1.04 | learning rate: 3.521E-05 | global batch size: 256 | lm loss: 1.907650E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.748 | TFLOPs: 40.78 |
15: iteration 102120/ 125429 | consumed samples: 26142720 | consumed tokens: 53540290560 | elapsed time per iteration (s): 1.06 | learning rate: 3.520E-05 | global batch size: 256 | lm loss: 1.872896E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.464 | TFLOPs: 40.07 |
15: iteration 102130/ 125429 | consumed samples: 26145280 | consumed tokens: 53545533440 | elapsed time per iteration (s): 1.03 | learning rate: 3.519E-05 | global batch size: 256 | lm loss: 1.906181E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.320 | TFLOPs: 41.20 |
15: iteration 102140/ 125429 | consumed samples: 26147840 | consumed tokens: 53550776320 | elapsed time per iteration (s): 1.20 | learning rate: 3.518E-05 | global batch size: 256 | lm loss: 1.912440E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 213.493 | TFLOPs: 35.28 |
15: iteration 102150/ 125429 | consumed samples: 26150400 | consumed tokens: 53556019200 | elapsed time per iteration (s): 1.05 | learning rate: 3.516E-05 | global batch size: 256 | lm loss: 1.897420E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.814 | TFLOPs: 40.46 |
15: iteration 102160/ 125429 | consumed samples: 26152960 | consumed tokens: 53561262080 | elapsed time per iteration (s): 1.04 | learning rate: 3.515E-05 | global batch size: 256 | lm loss: 1.890740E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.056 | TFLOPs: 40.66 |
15: iteration 102170/ 125429 | consumed samples: 26155520 | consumed tokens: 53566504960 | elapsed time per iteration (s): 1.07 | learning rate: 3.514E-05 | global batch size: 256 | lm loss: 1.899958E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.552 | TFLOPs: 39.59 |
15: iteration 102180/ 125429 | consumed samples: 26158080 | consumed tokens: 53571747840 | elapsed time per iteration (s): 1.03 | learning rate: 3.513E-05 | global batch size: 256 | lm loss: 1.876354E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.547 | TFLOPs: 40.91 |
15: iteration 102190/ 125429 | consumed samples: 26160640 | consumed tokens: 53576990720 | elapsed time per iteration (s): 1.02 | learning rate: 3.511E-05 | global batch size: 256 | lm loss: 1.879817E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.622 | TFLOPs: 41.42 |
15: iteration 102200/ 125429 | consumed samples: 26163200 | consumed tokens: 53582233600 | elapsed time per iteration (s): 1.02 | learning rate: 3.510E-05 | global batch size: 256 | lm loss: 1.905793E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.872 | TFLOPs: 41.29 |
15: iteration 102210/ 125429 | consumed samples: 26165760 | consumed tokens: 53587476480 | elapsed time per iteration (s): 1.05 | learning rate: 3.509E-05 | global batch size: 256 | lm loss: 1.894628E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.849 | TFLOPs: 40.46 |
15: iteration 102220/ 125429 | consumed samples: 26168320 | consumed tokens: 53592719360 | elapsed time per iteration (s): 1.04 | learning rate: 3.507E-05 | global batch size: 256 | lm loss: 1.889813E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.909 | TFLOPs: 40.80 |
15: iteration 102230/ 125429 | consumed samples: 26170880 | consumed tokens: 53597962240 | elapsed time per iteration (s): 1.08 | learning rate: 3.506E-05 | global batch size: 256 | lm loss: 1.887981E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.039 | TFLOPs: 39.34 |
15: iteration 102240/ 125429 | consumed samples: 26173440 | consumed tokens: 53603205120 | elapsed time per iteration (s): 1.03 | learning rate: 3.505E-05 | global batch size: 256 | lm loss: 1.909430E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.656 | TFLOPs: 41.09 |
15: iteration 102250/ 125429 | consumed samples: 26176000 | consumed tokens: 53608448000 | elapsed time per iteration (s): 1.02 | learning rate: 3.504E-05 | global batch size: 256 | lm loss: 1.903864E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.896 | TFLOPs: 41.30 |
15: iteration 102260/ 125429 | consumed samples: 26178560 | consumed tokens: 53613690880 | elapsed time per iteration (s): 1.03 | learning rate: 3.502E-05 | global batch size: 256 | lm loss: 1.917594E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of
nan iterations: 0 | samples per second: 247.665 | TFLOPs: 40.93 | 15: iteration 102270/ 125429 | consumed samples: 26181120 | consumed tokens: 53618933760 | elapsed time per iteration (s): 1.04 | learning rate: 3.501E-05 | global batch size: 256 | lm loss: 1.900153E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.064 | TFLOPs: 40.83 | 15: iteration 102280/ 125429 | consumed samples: 26183680 | consumed tokens: 53624176640 | elapsed time per iteration (s): 1.04 | learning rate: 3.500E-05 | global batch size: 256 | lm loss: 1.920536E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.357 | TFLOPs: 40.55 | 15: iteration 102290/ 125429 | consumed samples: 26186240 | consumed tokens: 53629419520 | elapsed time per iteration (s): 1.02 | learning rate: 3.499E-05 | global batch size: 256 | lm loss: 1.899161E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.028 | TFLOPs: 41.32 | 15: iteration 102300/ 125429 | consumed samples: 26188800 | consumed tokens: 53634662400 | elapsed time per iteration (s): 1.03 | learning rate: 3.497E-05 | global batch size: 256 | lm loss: 1.912730E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.612 | TFLOPs: 41.25 | 15: iteration 102310/ 125429 | consumed samples: 26191360 | consumed tokens: 53639905280 | elapsed time per iteration (s): 1.05 | learning rate: 3.496E-05 | global batch size: 256 | lm loss: 1.891745E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.376 | TFLOPs: 40.38 | 15: iteration 102320/ 125429 | consumed samples: 26193920 | consumed tokens: 53645148160 | elapsed time per iteration (s): 1.04 | learning rate: 3.495E-05 | global batch 
size: 256 | lm loss: 1.882740E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.342 | TFLOPs: 40.71 | 15: iteration 102330/ 125429 | consumed samples: 26196480 | consumed tokens: 53650391040 | elapsed time per iteration (s): 1.03 | learning rate: 3.494E-05 | global batch size: 256 | lm loss: 1.905424E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.340 | TFLOPs: 41.04 | 15: iteration 102340/ 125429 | consumed samples: 26199040 | consumed tokens: 53655633920 | elapsed time per iteration (s): 1.05 | learning rate: 3.492E-05 | global batch size: 256 | lm loss: 1.941720E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.369 | TFLOPs: 40.22 | 15: iteration 102350/ 125429 | consumed samples: 26201600 | consumed tokens: 53660876800 | elapsed time per iteration (s): 1.03 | learning rate: 3.491E-05 | global batch size: 256 | lm loss: 1.911466E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.457 | TFLOPs: 41.22 | 15: iteration 102360/ 125429 | consumed samples: 26204160 | consumed tokens: 53666119680 | elapsed time per iteration (s): 1.09 | learning rate: 3.490E-05 | global batch size: 256 | lm loss: 1.922095E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.873 | TFLOPs: 38.65 | 15: iteration 102370/ 125429 | consumed samples: 26206720 | consumed tokens: 53671362560 | elapsed time per iteration (s): 1.04 | learning rate: 3.489E-05 | global batch size: 256 | lm loss: 1.905348E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.271 | TFLOPs: 40.70 | 15: iteration 102380/ 125429 | consumed samples: 26209280 
| consumed tokens: 53676605440 | elapsed time per iteration (s): 1.04 | learning rate: 3.487E-05 | global batch size: 256 | lm loss: 1.905836E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.503 | TFLOPs: 40.74 | 15: iteration 102390/ 125429 | consumed samples: 26211840 | consumed tokens: 53681848320 | elapsed time per iteration (s): 1.03 | learning rate: 3.486E-05 | global batch size: 256 | lm loss: 1.919430E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.177 | TFLOPs: 41.18 | 15: iteration 102400/ 125429 | consumed samples: 26214400 | consumed tokens: 53687091200 | elapsed time per iteration (s): 1.03 | learning rate: 3.485E-05 | global batch size: 256 | lm loss: 1.890141E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.506 | TFLOPs: 40.90 | 15: iteration 102410/ 125429 | consumed samples: 26216960 | consumed tokens: 53692334080 | elapsed time per iteration (s): 1.19 | learning rate: 3.484E-05 | global batch size: 256 | lm loss: 1.893914E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.716 | TFLOPs: 35.65 | 15: iteration 102420/ 125429 | consumed samples: 26219520 | consumed tokens: 53697576960 | elapsed time per iteration (s): 1.02 | learning rate: 3.482E-05 | global batch size: 256 | lm loss: 1.927947E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.010 | TFLOPs: 41.48 | 15: iteration 102430/ 125429 | consumed samples: 26222080 | consumed tokens: 53702819840 | elapsed time per iteration (s): 1.07 | learning rate: 3.481E-05 | global batch size: 256 | lm loss: 1.877343E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 240.375 | TFLOPs: 39.72 | 15: iteration 102440/ 125429 | consumed samples: 26224640 | consumed tokens: 53708062720 | elapsed time per iteration (s): 1.03 | learning rate: 3.480E-05 | global batch size: 256 | lm loss: 1.918752E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.482 | TFLOPs: 41.06 | 15: iteration 102450/ 125429 | consumed samples: 26227200 | consumed tokens: 53713305600 | elapsed time per iteration (s): 1.04 | learning rate: 3.479E-05 | global batch size: 256 | lm loss: 1.902878E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.329 | TFLOPs: 40.87 | 15: iteration 102460/ 125429 | consumed samples: 26229760 | consumed tokens: 53718548480 | elapsed time per iteration (s): 1.06 | learning rate: 3.477E-05 | global batch size: 256 | lm loss: 1.898013E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.946 | TFLOPs: 39.98 | 15: iteration 102470/ 125429 | consumed samples: 26232320 | consumed tokens: 53723791360 | elapsed time per iteration (s): 1.04 | learning rate: 3.476E-05 | global batch size: 256 | lm loss: 1.899691E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.757 | TFLOPs: 40.61 | 15: iteration 102480/ 125429 | consumed samples: 26234880 | consumed tokens: 53729034240 | elapsed time per iteration (s): 1.20 | learning rate: 3.475E-05 | global batch size: 256 | lm loss: 1.947408E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 213.618 | TFLOPs: 35.30 | 15: iteration 102490/ 125429 | consumed samples: 26237440 | consumed tokens: 53734277120 | elapsed time per iteration (s): 1.06 | learning rate: 3.474E-05 | global batch size: 
256 | lm loss: 1.903991E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.425 | TFLOPs: 39.73 | 15: iteration 102500/ 125429 | consumed samples: 26240000 | consumed tokens: 53739520000 | elapsed time per iteration (s): 1.03 | learning rate: 3.472E-05 | global batch size: 256 | lm loss: 1.909637E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.792 | TFLOPs: 41.11 | 15: iteration 102510/ 125429 | consumed samples: 26242560 | consumed tokens: 53744762880 | elapsed time per iteration (s): 1.06 | learning rate: 3.471E-05 | global batch size: 256 | lm loss: 1.915431E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.542 | TFLOPs: 39.75 | 15: iteration 102520/ 125429 | consumed samples: 26245120 | consumed tokens: 53750005760 | elapsed time per iteration (s): 1.02 | learning rate: 3.470E-05 | global batch size: 256 | lm loss: 1.917674E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.066 | TFLOPs: 41.49 | 15: iteration 102530/ 125429 | consumed samples: 26247680 | consumed tokens: 53755248640 | elapsed time per iteration (s): 1.10 | learning rate: 3.469E-05 | global batch size: 256 | lm loss: 1.898871E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.491 | TFLOPs: 38.59 | 15: iteration 102540/ 125429 | consumed samples: 26250240 | consumed tokens: 53760491520 | elapsed time per iteration (s): 1.02 | learning rate: 3.467E-05 | global batch size: 256 | lm loss: 1.932652E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.736 | TFLOPs: 41.44 | 15: iteration 102550/ 125429 | consumed samples: 26252800 | 
consumed tokens: 53765734400 | elapsed time per iteration (s): 1.05 | learning rate: 3.466E-05 | global batch size: 256 | lm loss: 1.905579E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.080 | TFLOPs: 40.34 | 15: iteration 102560/ 125429 | consumed samples: 26255360 | consumed tokens: 53770977280 | elapsed time per iteration (s): 1.03 | learning rate: 3.465E-05 | global batch size: 256 | lm loss: 1.913461E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.772 | TFLOPs: 41.11 | 15: iteration 102570/ 125429 | consumed samples: 26257920 | consumed tokens: 53776220160 | elapsed time per iteration (s): 1.04 | learning rate: 3.464E-05 | global batch size: 256 | lm loss: 1.926076E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.164 | TFLOPs: 40.85 | 15: iteration 102580/ 125429 | consumed samples: 26260480 | consumed tokens: 53781463040 | elapsed time per iteration (s): 1.04 | learning rate: 3.462E-05 | global batch size: 256 | lm loss: 1.891673E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.869 | TFLOPs: 40.80 | 15: iteration 102590/ 125429 | consumed samples: 26263040 | consumed tokens: 53786705920 | elapsed time per iteration (s): 1.05 | learning rate: 3.461E-05 | global batch size: 256 | lm loss: 1.904854E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.735 | TFLOPs: 40.44 | 15: iteration 102600/ 125429 | consumed samples: 26265600 | consumed tokens: 53791948800 | elapsed time per iteration (s): 1.03 | learning rate: 3.460E-05 | global batch size: 256 | lm loss: 1.897068E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 248.492 | TFLOPs: 41.07 | 15: iteration 102610/ 125429 | consumed samples: 26268160 | consumed tokens: 53797191680 | elapsed time per iteration (s): 1.03 | learning rate: 3.459E-05 | global batch size: 256 | lm loss: 1.918058E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.502 | TFLOPs: 41.23 | 15: iteration 102620/ 125429 | consumed samples: 26270720 | consumed tokens: 53802434560 | elapsed time per iteration (s): 1.04 | learning rate: 3.457E-05 | global batch size: 256 | lm loss: 1.909788E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.885 | TFLOPs: 40.80 | 15: iteration 102630/ 125429 | consumed samples: 26273280 | consumed tokens: 53807677440 | elapsed time per iteration (s): 1.03 | learning rate: 3.456E-05 | global batch size: 256 | lm loss: 1.897260E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.565 | TFLOPs: 41.08 | 15: iteration 102640/ 125429 | consumed samples: 26275840 | consumed tokens: 53812920320 | elapsed time per iteration (s): 1.05 | learning rate: 3.455E-05 | global batch size: 256 | lm loss: 1.917251E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.368 | TFLOPs: 40.38 | 15: iteration 102650/ 125429 | consumed samples: 26278400 | consumed tokens: 53818163200 | elapsed time per iteration (s): 1.05 | learning rate: 3.454E-05 | global batch size: 256 | lm loss: 1.898988E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.742 | TFLOPs: 40.11 | 15: iteration 102660/ 125429 | consumed samples: 26280960 | consumed tokens: 53823406080 | elapsed time per iteration (s): 1.02 | learning rate: 3.452E-05 | global batch size: 
256 | lm loss: 1.907005E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.088 | TFLOPs: 41.33 | 15: iteration 102670/ 125429 | consumed samples: 26283520 | consumed tokens: 53828648960 | elapsed time per iteration (s): 1.04 | learning rate: 3.451E-05 | global batch size: 256 | lm loss: 1.898625E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.503 | TFLOPs: 40.57 | 15: iteration 102680/ 125429 | consumed samples: 26286080 | consumed tokens: 53833891840 | elapsed time per iteration (s): 1.03 | learning rate: 3.450E-05 | global batch size: 256 | lm loss: 1.912526E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.135 | TFLOPs: 41.01 | 15: iteration 102690/ 125429 | consumed samples: 26288640 | consumed tokens: 53839134720 | elapsed time per iteration (s): 1.03 | learning rate: 3.449E-05 | global batch size: 256 | lm loss: 1.934291E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.727 | TFLOPs: 41.27 | 15: iteration 102700/ 125429 | consumed samples: 26291200 | consumed tokens: 53844377600 | elapsed time per iteration (s): 1.04 | learning rate: 3.447E-05 | global batch size: 256 | lm loss: 1.905136E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.128 | TFLOPs: 40.84 | 15: iteration 102710/ 125429 | consumed samples: 26293760 | consumed tokens: 53849620480 | elapsed time per iteration (s): 1.05 | learning rate: 3.446E-05 | global batch size: 256 | lm loss: 1.912745E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.093 | TFLOPs: 40.17 | 15: iteration 102720/ 125429 | consumed samples: 26296320 | 
consumed tokens: 53854863360 | elapsed time per iteration (s): 1.05 | learning rate: 3.445E-05 | global batch size: 256 | lm loss: 1.921087E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.666 | TFLOPs: 40.27 | 15: iteration 102730/ 125429 | consumed samples: 26298880 | consumed tokens: 53860106240 | elapsed time per iteration (s): 1.03 | learning rate: 3.444E-05 | global batch size: 256 | lm loss: 1.900995E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.459 | TFLOPs: 40.89 | 15: iteration 102740/ 125429 | consumed samples: 26301440 | consumed tokens: 53865349120 | elapsed time per iteration (s): 1.24 | learning rate: 3.443E-05 | global batch size: 256 | lm loss: 1.873390E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 206.716 | TFLOPs: 34.16 | 15: iteration 102750/ 125429 | consumed samples: 26304000 | consumed tokens: 53870592000 | elapsed time per iteration (s): 1.05 | learning rate: 3.441E-05 | global batch size: 256 | lm loss: 1.926602E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.544 | TFLOPs: 40.25 | 15: iteration 102760/ 125429 | consumed samples: 26306560 | consumed tokens: 53875834880 | elapsed time per iteration (s): 1.02 | learning rate: 3.440E-05 | global batch size: 256 | lm loss: 1.928309E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.649 | TFLOPs: 41.42 | 15: iteration 102770/ 125429 | consumed samples: 26309120 | consumed tokens: 53881077760 | elapsed time per iteration (s): 1.04 | learning rate: 3.439E-05 | global batch size: 256 | lm loss: 1.914278E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 245.091 | TFLOPs: 40.50 | 15: iteration 102780/ 125429 | consumed samples: 26311680 | consumed tokens: 53886320640 | elapsed time per iteration (s): 1.04 | learning rate: 3.438E-05 | global batch size: 256 | lm loss: 1.885937E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.238 | TFLOPs: 40.53 | 15: iteration 102790/ 125429 | consumed samples: 26314240 | consumed tokens: 53891563520 | elapsed time per iteration (s): 1.05 | learning rate: 3.436E-05 | global batch size: 256 | lm loss: 1.934633E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.313 | TFLOPs: 40.21 | 15: iteration 102800/ 125429 | consumed samples: 26316800 | consumed tokens: 53896806400 | elapsed time per iteration (s): 1.05 | learning rate: 3.435E-05 | global batch size: 256 | lm loss: 1.932729E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.416 | TFLOPs: 40.39 | 15: iteration 102810/ 125429 | consumed samples: 26319360 | consumed tokens: 53902049280 | elapsed time per iteration (s): 1.03 | learning rate: 3.434E-05 | global batch size: 256 | lm loss: 1.929078E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.592 | TFLOPs: 41.25 | 15: iteration 102820/ 125429 | consumed samples: 26321920 | consumed tokens: 53907292160 | elapsed time per iteration (s): 1.04 | learning rate: 3.433E-05 | global batch size: 256 | lm loss: 1.913327E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.132 | TFLOPs: 40.84 | 15: iteration 102830/ 125429 | consumed samples: 26324480 | consumed tokens: 53912535040 | elapsed time per iteration (s): 1.05 | learning rate: 3.431E-05 | global batch size: 
256 | lm loss: 1.900454E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.240 | TFLOPs: 40.36 | 15: iteration 102840/ 125429 | consumed samples: 26327040 | consumed tokens: 53917777920 | elapsed time per iteration (s): 1.05 | learning rate: 3.430E-05 | global batch size: 256 | lm loss: 1.943503E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.127 | TFLOPs: 40.18 | 15: iteration 102850/ 125429 | consumed samples: 26329600 | consumed tokens: 53923020800 | elapsed time per iteration (s): 1.03 | learning rate: 3.429E-05 | global batch size: 256 | lm loss: 1.904037E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.732 | TFLOPs: 41.10 | 15: iteration 102860/ 125429 | consumed samples: 26332160 | consumed tokens: 53928263680 | elapsed time per iteration (s): 1.19 | learning rate: 3.428E-05 | global batch size: 256 | lm loss: 1.905391E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.369 | TFLOPs: 35.43 | 15: iteration 102870/ 125429 | consumed samples: 26334720 | consumed tokens: 53933506560 | elapsed time per iteration (s): 1.02 | learning rate: 3.426E-05 | global batch size: 256 | lm loss: 1.909631E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.862 | TFLOPs: 41.46 | 15: iteration 102880/ 125429 | consumed samples: 26337280 | consumed tokens: 53938749440 | elapsed time per iteration (s): 1.08 | learning rate: 3.425E-05 | global batch size: 256 | lm loss: 1.895954E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.893 | TFLOPs: 39.31 | 15: iteration 102890/ 125429 | consumed samples: 26339840 | 
consumed tokens: 53943992320 | elapsed time per iteration (s): 1.04 | learning rate: 3.424E-05 | global batch size: 256 | lm loss: 1.918671E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.771 | TFLOPs: 40.62 | 15: iteration 102900/ 125429 | consumed samples: 26342400 | consumed tokens: 53949235200 | elapsed time per iteration (s): 1.05 | learning rate: 3.423E-05 | global batch size: 256 | lm loss: 1.902504E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.924 | TFLOPs: 40.31 | 15: iteration 102910/ 125429 | consumed samples: 26344960 | consumed tokens: 53954478080 | elapsed time per iteration (s): 1.04 | learning rate: 3.422E-05 | global batch size: 256 | lm loss: 1.898914E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.320 | TFLOPs: 40.87 | 15: iteration 102920/ 125429 | consumed samples: 26347520 | consumed tokens: 53959720960 | elapsed time per iteration (s): 1.02 | learning rate: 3.420E-05 | global batch size: 256 | lm loss: 1.931229E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.503 | TFLOPs: 41.40 | 15: iteration 102930/ 125429 | consumed samples: 26350080 | consumed tokens: 53964963840 | elapsed time per iteration (s): 1.23 | learning rate: 3.419E-05 | global batch size: 256 | lm loss: 1.913274E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 208.560 | TFLOPs: 34.47 | 15: iteration 102940/ 125429 | consumed samples: 26352640 | consumed tokens: 53970206720 | elapsed time per iteration (s): 1.20 | learning rate: 3.418E-05 | global batch size: 256 | lm loss: 1.916872E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 212.808 | TFLOPs: 35.17 | 15: iteration 102950/ 125429 | consumed samples: 26355200 | consumed tokens: 53975449600 | elapsed time per iteration (s): 1.02 | learning rate: 3.417E-05 | global batch size: 256 | lm loss: 1.901453E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.868 | TFLOPs: 41.46 | 15: iteration 102960/ 125429 | consumed samples: 26357760 | consumed tokens: 53980692480 | elapsed time per iteration (s): 1.04 | learning rate: 3.415E-05 | global batch size: 256 | lm loss: 1.912422E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.428 | TFLOPs: 40.72 | 15: iteration 102970/ 125429 | consumed samples: 26360320 | consumed tokens: 53985935360 | elapsed time per iteration (s): 1.03 | learning rate: 3.414E-05 | global batch size: 256 | lm loss: 1.888700E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.786 | TFLOPs: 41.11 | 15: iteration 102980/ 125429 | consumed samples: 26362880 | consumed tokens: 53991178240 | elapsed time per iteration (s): 1.03 | learning rate: 3.413E-05 | global batch size: 256 | lm loss: 1.907535E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.140 | TFLOPs: 41.01 | 15: iteration 102990/ 125429 | consumed samples: 26365440 | consumed tokens: 53996421120 | elapsed time per iteration (s): 1.02 | learning rate: 3.412E-05 | global batch size: 256 | lm loss: 1.882965E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.858 | TFLOPs: 41.29 | 15: iteration 103000/ 125429 | consumed samples: 26368000 | consumed tokens: 54001664000 | elapsed time per iteration (s): 1.06 | learning rate: 3.411E-05 | global batch size: 
256 | lm loss: 1.917631E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.256 | TFLOPs: 40.03 | 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 103000 | lm loss value: 1.902872E+00 | lm loss PPL: 6.705121E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 103000 to checkpoints_1b5 0: [2022-11-27 02:31:32,571] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step103000 is begin to save! 0: [2022-11-27 02:31:32,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_01-model_00-model_states.pt... 0: [2022-11-27 02:31:32,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_01-model_00-model_states.pt. 0: [2022-11-27 02:31:32,847] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_03-model_00-model_states.pt... 0: [2022-11-27 02:31:32,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_03-model_00-model_states.pt. 0: [2022-11-27 02:31:32,953] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_04-model_00-model_states.pt... 0: [2022-11-27 02:31:33,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_04-model_00-model_states.pt. 0: [2022-11-27 02:31:33,062] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_05-model_00-model_states.pt... 0: [2022-11-27 02:31:33,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_05-model_00-model_states.pt. 
0: [2022-11-27 02:31:33,175] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_06-model_00-model_states.pt... 0: [2022-11-27 02:31:33,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_06-model_00-model_states.pt. 0: [2022-11-27 02:31:33,286] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_07-model_00-model_states.pt... 0: [2022-11-27 02:31:33,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_07-model_00-model_states.pt. 0: [2022-11-27 02:31:33,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_08-model_00-model_states.pt... 0: [2022-11-27 02:31:33,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_08-model_00-model_states.pt. 0: [2022-11-27 02:31:33,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_09-model_00-model_states.pt... 0: [2022-11-27 02:31:33,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_09-model_00-model_states.pt. 0: [2022-11-27 02:31:33,617] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_10-model_00-model_states.pt... 0: [2022-11-27 02:31:33,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_10-model_00-model_states.pt. 0: [2022-11-27 02:31:33,727] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_11-model_00-model_states.pt... 0: [2022-11-27 02:31:33,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_11-model_00-model_states.pt. 
0: [2022-11-27 02:31:33,835] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_12-model_00-model_states.pt... 0: [2022-11-27 02:31:33,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_12-model_00-model_states.pt. 0: [2022-11-27 02:31:33,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_13-model_00-model_states.pt... 0: [2022-11-27 02:31:34,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_13-model_00-model_states.pt. 0: [2022-11-27 02:31:34,052] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_14-model_00-model_states.pt... 0: [2022-11-27 02:31:34,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_14-model_00-model_states.pt. 0: [2022-11-27 02:31:34,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_15-model_00-model_states.pt... 0: [2022-11-27 02:31:34,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_15-model_00-model_states.pt. 0: [2022-11-27 02:31:34,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_16-model_00-model_states.pt... 0: [2022-11-27 02:31:34,378] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_16-model_00-model_states.pt. 0: [2022-11-27 02:31:34,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_17-model_00-model_states.pt... 0: [2022-11-27 02:31:34,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_17-model_00-model_states.pt. 
0: [2022-11-27 02:31:34,484] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_18-model_00-model_states.pt... 0: [2022-11-27 02:31:34,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_18-model_00-model_states.pt. 0: [2022-11-27 02:31:34,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_19-model_00-model_states.pt... 0: [2022-11-27 02:31:34,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_19-model_00-model_states.pt. 0: [2022-11-27 02:31:34,701] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_20-model_00-model_states.pt... 0: [2022-11-27 02:31:34,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_20-model_00-model_states.pt. 0: [2022-11-27 02:31:34,808] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_21-model_00-model_states.pt... 0: [2022-11-27 02:31:34,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_21-model_00-model_states.pt. 0: [2022-11-27 02:31:34,914] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_22-model_00-model_states.pt... 0: [2022-11-27 02:31:35,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_22-model_00-model_states.pt. 0: [2022-11-27 02:31:35,021] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_23-model_00-model_states.pt... 0: [2022-11-27 02:31:35,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_23-model_00-model_states.pt. 
0: [2022-11-27 02:31:35,128] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_24-model_00-model_states.pt... 0: [2022-11-27 02:31:35,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_24-model_00-model_states.pt. 0: [2022-11-27 02:31:35,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_25-model_00-model_states.pt... 0: [2022-11-27 02:31:35,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_25-model_00-model_states.pt. 0: [2022-11-27 02:31:35,342] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_26-model_00-model_states.pt... 0: [2022-11-27 02:31:35,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_26-model_00-model_states.pt. 0: [2022-11-27 02:31:35,449] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_27-model_00-model_states.pt... 0: [2022-11-27 02:31:35,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_27-model_00-model_states.pt. 0: [2022-11-27 02:31:35,556] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_28-model_00-model_states.pt... 0: [2022-11-27 02:31:35,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_28-model_00-model_states.pt. 0: [2022-11-27 02:31:35,669] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_29-model_00-model_states.pt... 0: [2022-11-27 02:31:35,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_29-model_00-model_states.pt. 
0: [2022-11-27 02:31:35,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_30-model_00-model_states.pt... 0: [2022-11-27 02:31:35,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_30-model_00-model_states.pt. 0: [2022-11-27 02:31:35,880] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/layer_32-model_00-model_states.pt... 0: [2022-11-27 02:31:35,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/layer_32-model_00-model_states.pt. 0: [2022-11-27 02:31:35,887] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step103000/mp_rank_00_model_states.pt 0: [2022-11-27 02:31:35,887] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/mp_rank_00_model_states.pt... 0: [2022-11-27 02:31:35,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/mp_rank_00_model_states.pt. 0: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 
0: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
13: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
1: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
15: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
4: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
2: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
12: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
11: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
15: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
13: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
2: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:31:35,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step103000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:31:36,091] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:31:36,091] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 02:31:36,091] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 14: [2022-11-27 02:31:36,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:31:36,092] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 02:31:36,092] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 2: [2022-11-27 02:31:36,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:31:36,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 02:31:36,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 0: [2022-11-27 02:31:36,096] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-27 02:31:36,096] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 02:31:36,096] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 11: [2022-11-27 02:31:36,097] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:31:36,098] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:31:36,098] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 02:31:36,098] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 0: [2022-11-27 02:31:36,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:31:36,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 02:31:36,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 0: [2022-11-27 02:31:36,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:31:36,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 02:31:36,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
12: [2022-11-27 02:31:36,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:31:36,100] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 02:31:36,100] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 0: [2022-11-27 02:31:36,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:31:36,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 02:31:36,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 14: [2022-11-27 02:31:36,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:31:36,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:31:36,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 12: [2022-11-27 02:31:36,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:31:36,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
14: [2022-11-27 02:31:36,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 12: [2022-11-27 02:31:36,101] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 14: [2022-11-27 02:31:36,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 3: [2022-11-27 02:31:36,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:31:36,101] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 3: [2022-11-27 02:31:36,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 02:31:36,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 6: [2022-11-27 02:31:36,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:31:36,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 02:31:36,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 14: [2022-11-27 02:31:36,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:31:36,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
14: [2022-11-27 02:31:36,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 3: [2022-11-27 02:31:36,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 14: [2022-11-27 02:31:36,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 3: [2022-11-27 02:31:36,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 12: [2022-11-27 02:31:36,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:31:36,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 02:31:36,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 14: [2022-11-27 02:31:36,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:31:36,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 02:31:36,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 3: [2022-11-27 02:31:36,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:31:36,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
3: [2022-11-27 02:31:36,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 02:31:36,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 14: [2022-11-27 02:31:36,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 02:31:36,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 5: [2022-11-27 02:31:36,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:31:36,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 02:31:36,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 7: [2022-11-27 02:31:36,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:31:36,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 02:31:36,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 14: [2022-11-27 02:31:36,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-27 02:31:36,108] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 02:31:36,108] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 1: [2022-11-27 02:31:36,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:31:36,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 02:31:36,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 5: [2022-11-27 02:31:36,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:31:36,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 02:31:36,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 12: [2022-11-27 02:31:36,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:31:36,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 2: [2022-11-27 02:31:36,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:31:36,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
2: [2022-11-27 02:31:36,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 02:31:36,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 3: [2022-11-27 02:31:36,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:31:36,111] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 02:31:36,111] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 0: [2022-11-27 02:31:36,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:31:36,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 02:31:36,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 1: [2022-11-27 02:31:36,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:31:36,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 02:31:36,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 1: [2022-11-27 02:31:36,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
1: [2022-11-27 02:31:36,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 02:31:36,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 0: [2022-11-27 02:31:36,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:31:36,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:31:36,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 1: [2022-11-27 02:31:36,112] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 02:31:36,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 0: [2022-11-27 02:31:36,112] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 3: [2022-11-27 02:31:36,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:31:36,113] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 02:31:36,113] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 12: [2022-11-27 02:31:36,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-27 02:31:36,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 02:31:36,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 3: [2022-11-27 02:31:36,114] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:31:36,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 02:31:36,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 11: [2022-11-27 02:31:36,097] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 02:31:36,097] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 11: [2022-11-27 02:31:36,099] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:31:36,099] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 02:31:36,099] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 11: [2022-11-27 02:31:36,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-27 02:31:36,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 02:31:36,102] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 11: [2022-11-27 02:31:36,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:31:36,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 02:31:36,104] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 11: [2022-11-27 02:31:36,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:31:36,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 02:31:36,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 11: [2022-11-27 02:31:36,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:31:36,109] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-27 02:31:36,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 02:31:36,109] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 02:31:36,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 11: [2022-11-27 02:31:36,109] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 7: [2022-11-27 02:31:36,115] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:31:36,115] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 02:31:36,115] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 7: [2022-11-27 02:31:36,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:31:36,116] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 02:31:36,116] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 6: [2022-11-27 02:31:36,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:31:36,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-27 02:31:36,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 02:31:36,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 02:31:36,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 6: [2022-11-27 02:31:36,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 1: [2022-11-27 02:31:36,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:31:36,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:31:36,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:31:36,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:31:36,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 02:31:36,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 02:31:36,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
9: [2022-11-27 02:31:36,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 02:31:36,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 9: [2022-11-27 02:31:36,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 1: [2022-11-27 02:31:36,118] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 02:31:36,118] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 1: [2022-11-27 02:31:36,119] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:31:36,119] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 02:31:36,119] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 6: [2022-11-27 02:31:36,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:31:36,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 02:31:36,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 5: [2022-11-27 02:31:36,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:31:36,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 02:31:36,123] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 1: [2022-11-27 02:31:36,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:31:36,123] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 02:31:36,124] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 12: [2022-11-27 02:31:36,126] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:31:36,126] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 02:31:36,126] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 12: [2022-11-27 02:31:36,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:31:36,127] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 02:31:36,127] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 12: [2022-11-27 02:31:36,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
7: [2022-11-27 02:31:36,128] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:31:36,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 7: [2022-11-27 02:31:36,128] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 02:31:36,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 12: [2022-11-27 02:31:36,128] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 6: [2022-11-27 02:31:36,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:31:36,131] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 02:31:36,131] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 6: [2022-11-27 02:31:36,131] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:31:36,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 02:31:36,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 6: [2022-11-27 02:31:36,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-27 02:31:36,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 02:31:36,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 5: [2022-11-27 02:31:36,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:31:36,132] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 02:31:36,132] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 2: [2022-11-27 02:31:36,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:31:36,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:31:36,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:31:36,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 02:31:36,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 02:31:36,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 02:31:36,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
2: [2022-11-27 02:31:36,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 2: [2022-11-27 02:31:36,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 2: [2022-11-27 02:31:36,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:31:36,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 02:31:36,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:31:36,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 2: [2022-11-27 02:31:36,133] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 02:31:36,133] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 5: [2022-11-27 02:31:36,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:31:36,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:31:36,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 02:31:36,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 02:31:36,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 5: [2022-11-27 02:31:36,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 2: [2022-11-27 02:31:36,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:31:36,134] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 02:31:36,134] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 7: [2022-11-27 02:31:36,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:31:36,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 02:31:36,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 6: [2022-11-27 02:31:36,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:31:36,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:31:36,138] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:31:36,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 5: [2022-11-27 02:31:36,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 02:31:36,138] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 6: [2022-11-27 02:31:36,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 5: [2022-11-27 02:31:36,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 5: [2022-11-27 02:31:36,138] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 7: [2022-11-27 02:31:36,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:31:36,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 02:31:36,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 9: [2022-11-27 02:31:36,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:31:36,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-27 02:31:36,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 02:31:36,139] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:31:36,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 02:31:36,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 9: [2022-11-27 02:31:36,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 9: [2022-11-27 02:31:36,139] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 02:31:36,139] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 7: [2022-11-27 02:31:36,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:31:36,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 02:31:36,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 0: [2022-11-27 02:31:36,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:31:36,144] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-27 02:31:36,144] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 02:31:36,145] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 9: [2022-11-27 02:31:36,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:31:36,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 02:31:36,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 9: [2022-11-27 02:31:36,150] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:31:36,150] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 02:31:36,150] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 7: [2022-11-27 02:31:36,154] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:31:36,154] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 02:31:36,154] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 15: [2022-11-27 02:31:36,162] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-27 02:31:36,162] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 02:31:36,162] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 15: [2022-11-27 02:31:36,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:31:36,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 02:31:36,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 15: [2022-11-27 02:31:36,163] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:31:36,163] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 02:31:36,163] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 15: [2022-11-27 02:31:36,164] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:31:36,164] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 02:31:36,164] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 15: [2022-11-27 02:31:36,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-27 02:31:36,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:31:36,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:31:36,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:31:36,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 02:31:36,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 02:31:36,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 15: [2022-11-27 02:31:36,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 02:31:36,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 02:31:36,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 15: [2022-11-27 02:31:36,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 15: [2022-11-27 02:31:36,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 8: [2022-11-27 02:31:36,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-27 02:31:36,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:31:36,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:31:36,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 02:31:36,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 02:31:36,168] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 02:31:36,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 8: [2022-11-27 02:31:36,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 8: [2022-11-27 02:31:36,168] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 11: [2022-11-27 02:31:36,171] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:31:36,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:31:36,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-27 02:31:36,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 02:31:36,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 02:31:36,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 8: [2022-11-27 02:31:36,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 8: [2022-11-27 02:31:36,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:31:36,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 02:31:36,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 8: [2022-11-27 02:31:36,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:31:36,172] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 02:31:36,172] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 8: [2022-11-27 02:31:36,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-27 02:31:36,173] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 02:31:36,173] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 3: [2022-11-27 02:31:36,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:31:36,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 14: [2022-11-27 02:31:36,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:31:36,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 14: [2022-11-27 02:31:36,151] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 02:31:36,151] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 10: [2022-11-27 02:31:36,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:31:36,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 02:31:36,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 10: [2022-11-27 02:31:36,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-27 02:31:36,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 02:31:36,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 10: [2022-11-27 02:31:36,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:31:36,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 02:31:36,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 10: [2022-11-27 02:31:36,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:31:36,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 02:31:36,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 10: [2022-11-27 02:31:36,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:31:36,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 02:31:36,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 10: [2022-11-27 02:31:36,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-27 02:31:36,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:31:36,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:31:36,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 02:31:36,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 02:31:36,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 02:31:36,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 10: [2022-11-27 02:31:36,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 10: [2022-11-27 02:31:36,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 11: [2022-11-27 02:31:36,171] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 02:31:36,171] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 13: [2022-11-27 02:31:36,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-27 02:31:36,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 02:31:36,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 13: [2022-11-27 02:31:36,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:31:36,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 02:31:36,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 13: [2022-11-27 02:31:36,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:31:36,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 02:31:36,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 13: [2022-11-27 02:31:36,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:31:36,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 02:31:36,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 13: [2022-11-27 02:31:36,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-27 02:31:36,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 02:31:36,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 13: [2022-11-27 02:31:36,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:31:36,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:31:36,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:31:36,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 02:31:36,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 02:31:36,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 02:31:36,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 13: [2022-11-27 02:31:36,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 13: [2022-11-27 02:31:36,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
0: [2022-11-27 02:31:36,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 02:31:36,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 4: [2022-11-27 02:31:36,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:31:36,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:31:36,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 02:31:36,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:31:36,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:31:36,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:31:36,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:31:36,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:31:36,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-27 02:31:36,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 02:31:36,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 4: [2022-11-27 02:31:36,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 02:31:36,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 02:31:36,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 02:31:36,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 02:31:36,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 4: [2022-11-27 02:31:36,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 02:31:36,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step103000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 02:31:36,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 4: [2022-11-27 02:31:36,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 4: [2022-11-27 02:31:36,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 
4: [2022-11-27 02:31:36,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 4: [2022-11-27 02:31:36,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 4: [2022-11-27 02:31:36,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step103000 is ready now! 0: successfully saved checkpoint at iteration 103000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3788.18 15: iteration 103010/ 125429 | consumed samples: 26370560 | consumed tokens: 54006906880 | elapsed time per iteration (s): 1.48 | learning rate: 3.409E-05 | global batch size: 256 | lm loss: 1.912778E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 173.230 | TFLOPs: 28.63 | 15: iteration 103020/ 125429 | consumed samples: 26373120 | consumed tokens: 54012149760 | elapsed time per iteration (s): 1.06 | learning rate: 3.408E-05 | global batch size: 256 | lm loss: 1.909689E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.596 | TFLOPs: 40.09 | 15: iteration 103030/ 125429 | consumed samples: 26375680 | consumed tokens: 54017392640 | elapsed time per iteration (s): 1.19 | learning rate: 3.407E-05 | global batch size: 256 | lm loss: 1.891803E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.554 | TFLOPs: 35.62 | 15: iteration 103040/ 125429 | consumed samples: 26378240 | consumed tokens: 54022635520 | elapsed time per iteration (s): 1.03 | learning rate: 3.406E-05 | global batch size: 256 | lm loss: 1.897505E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.886 | TFLOPs: 41.13 | 15: iteration 103050/ 125429 | consumed samples: 26380800 | consumed tokens: 54027878400 | elapsed 
time per iteration (s): 1.06 | learning rate: 3.404E-05 | global batch size: 256 | lm loss: 1.909460E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.386 | TFLOPs: 39.73 | 15: iteration 103060/ 125429 | consumed samples: 26383360 | consumed tokens: 54033121280 | elapsed time per iteration (s): 1.03 | learning rate: 3.403E-05 | global batch size: 256 | lm loss: 1.905347E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.828 | TFLOPs: 41.12 | 15: iteration 103070/ 125429 | consumed samples: 26385920 | consumed tokens: 54038364160 | elapsed time per iteration (s): 1.06 | learning rate: 3.402E-05 | global batch size: 256 | lm loss: 1.906131E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.529 | TFLOPs: 39.91 | 15: iteration 103080/ 125429 | consumed samples: 26388480 | consumed tokens: 54043607040 | elapsed time per iteration (s): 1.04 | learning rate: 3.401E-05 | global batch size: 256 | lm loss: 1.941014E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.052 | TFLOPs: 40.50 | 15: iteration 103090/ 125429 | consumed samples: 26391040 | consumed tokens: 54048849920 | elapsed time per iteration (s): 1.05 | learning rate: 3.400E-05 | global batch size: 256 | lm loss: 1.913385E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.063 | TFLOPs: 40.33 | 15: iteration 103100/ 125429 | consumed samples: 26393600 | consumed tokens: 54054092800 | elapsed time per iteration (s): 1.04 | learning rate: 3.398E-05 | global batch size: 256 | lm loss: 1.945422E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.141 | 
TFLOPs: 40.68 | 15: iteration 103110/ 125429 | consumed samples: 26396160 | consumed tokens: 54059335680 | elapsed time per iteration (s): 1.03 | learning rate: 3.397E-05 | global batch size: 256 | lm loss: 1.909519E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.743 | TFLOPs: 41.27 | 15: iteration 103120/ 125429 | consumed samples: 26398720 | consumed tokens: 54064578560 | elapsed time per iteration (s): 1.18 | learning rate: 3.396E-05 | global batch size: 256 | lm loss: 1.927956E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.662 | TFLOPs: 35.80 | 15: iteration 103130/ 125429 | consumed samples: 26401280 | consumed tokens: 54069821440 | elapsed time per iteration (s): 1.18 | learning rate: 3.395E-05 | global batch size: 256 | lm loss: 1.919191E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.514 | TFLOPs: 35.95 | 15: iteration 103140/ 125429 | consumed samples: 26403840 | consumed tokens: 54075064320 | elapsed time per iteration (s): 1.05 | learning rate: 3.393E-05 | global batch size: 256 | lm loss: 1.941151E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.062 | TFLOPs: 40.33 | 15: iteration 103150/ 125429 | consumed samples: 26406400 | consumed tokens: 54080307200 | elapsed time per iteration (s): 1.18 | learning rate: 3.392E-05 | global batch size: 256 | lm loss: 1.886045E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.985 | TFLOPs: 35.86 | 15: iteration 103160/ 125429 | consumed samples: 26408960 | consumed tokens: 54085550080 | elapsed time per iteration (s): 1.02 | learning rate: 3.391E-05 | global batch size: 256 | lm loss: 1.919049E+00 | grad norm: 0.158 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.372 | TFLOPs: 41.54 | 15: iteration 103170/ 125429 | consumed samples: 26411520 | consumed tokens: 54090792960 | elapsed time per iteration (s): 1.02 | learning rate: 3.390E-05 | global batch size: 256 | lm loss: 1.886722E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.323 | TFLOPs: 41.37 | 15: iteration 103180/ 125429 | consumed samples: 26414080 | consumed tokens: 54096035840 | elapsed time per iteration (s): 1.03 | learning rate: 3.389E-05 | global batch size: 256 | lm loss: 1.914989E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.802 | TFLOPs: 41.12 | 15: iteration 103190/ 125429 | consumed samples: 26416640 | consumed tokens: 54101278720 | elapsed time per iteration (s): 1.02 | learning rate: 3.387E-05 | global batch size: 256 | lm loss: 1.911267E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.570 | TFLOPs: 41.41 | 15: iteration 103200/ 125429 | consumed samples: 26419200 | consumed tokens: 54106521600 | elapsed time per iteration (s): 1.05 | learning rate: 3.386E-05 | global batch size: 256 | lm loss: 1.904491E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.769 | TFLOPs: 40.45 | 15: iteration 103210/ 125429 | consumed samples: 26421760 | consumed tokens: 54111764480 | elapsed time per iteration (s): 1.04 | learning rate: 3.385E-05 | global batch size: 256 | lm loss: 1.931077E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.830 | TFLOPs: 40.79 | 15: iteration 103220/ 125429 | consumed samples: 26424320 | consumed tokens: 54117007360 | elapsed time per 
iteration (s): 1.04 | learning rate: 3.384E-05 | global batch size: 256 | lm loss: 1.898170E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.420 | TFLOPs: 40.56 | 15: iteration 103230/ 125429 | consumed samples: 26426880 | consumed tokens: 54122250240 | elapsed time per iteration (s): 1.04 | learning rate: 3.383E-05 | global batch size: 256 | lm loss: 1.893723E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.292 | TFLOPs: 40.87 | 15: iteration 103240/ 125429 | consumed samples: 26429440 | consumed tokens: 54127493120 | elapsed time per iteration (s): 1.03 | learning rate: 3.381E-05 | global batch size: 256 | lm loss: 1.916568E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.790 | TFLOPs: 40.95 | 15: iteration 103250/ 125429 | consumed samples: 26432000 | consumed tokens: 54132736000 | elapsed time per iteration (s): 1.04 | learning rate: 3.380E-05 | global batch size: 256 | lm loss: 1.916777E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.070 | TFLOPs: 40.50 | 15: iteration 103260/ 125429 | consumed samples: 26434560 | consumed tokens: 54137978880 | elapsed time per iteration (s): 1.04 | learning rate: 3.379E-05 | global batch size: 256 | lm loss: 1.913501E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.630 | TFLOPs: 40.59 | 15: iteration 103270/ 125429 | consumed samples: 26437120 | consumed tokens: 54143221760 | elapsed time per iteration (s): 1.15 | learning rate: 3.378E-05 | global batch size: 256 | lm loss: 1.907827E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.973 | TFLOPs: 
36.68 | 15: iteration 103280/ 125429 | consumed samples: 26439680 | consumed tokens: 54148464640 | elapsed time per iteration (s): 1.05 | learning rate: 3.376E-05 | global batch size: 256 | lm loss: 1.928008E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.118 | TFLOPs: 40.34 | 15: iteration 103290/ 125429 | consumed samples: 26442240 | consumed tokens: 54153707520 | elapsed time per iteration (s): 1.05 | learning rate: 3.375E-05 | global batch size: 256 | lm loss: 1.887150E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.037 | TFLOPs: 40.33 | 15: iteration 103300/ 125429 | consumed samples: 26444800 | consumed tokens: 54158950400 | elapsed time per iteration (s): 1.03 | learning rate: 3.374E-05 | global batch size: 256 | lm loss: 1.914224E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.883 | TFLOPs: 40.96 | 15: iteration 103310/ 125429 | consumed samples: 26447360 | consumed tokens: 54164193280 | elapsed time per iteration (s): 1.04 | learning rate: 3.373E-05 | global batch size: 256 | lm loss: 1.898993E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.280 | TFLOPs: 40.53 | 15: iteration 103320/ 125429 | consumed samples: 26449920 | consumed tokens: 54169436160 | elapsed time per iteration (s): 1.04 | learning rate: 3.372E-05 | global batch size: 256 | lm loss: 1.891907E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.260 | TFLOPs: 40.70 | 15: iteration 103330/ 125429 | consumed samples: 26452480 | consumed tokens: 54174679040 | elapsed time per iteration (s): 1.03 | learning rate: 3.370E-05 | global batch size: 256 | lm loss: 1.911106E+00 | grad norm: 0.155 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.422 | TFLOPs: 41.05 | 15: iteration 103340/ 125429 | consumed samples: 26455040 | consumed tokens: 54179921920 | elapsed time per iteration (s): 1.07 | learning rate: 3.369E-05 | global batch size: 256 | lm loss: 1.927167E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.900 | TFLOPs: 39.65 | 15: iteration 103350/ 125429 | consumed samples: 26457600 | consumed tokens: 54185164800 | elapsed time per iteration (s): 1.06 | learning rate: 3.368E-05 | global batch size: 256 | lm loss: 1.904608E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.175 | TFLOPs: 39.86 | 15: iteration 103360/ 125429 | consumed samples: 26460160 | consumed tokens: 54190407680 | elapsed time per iteration (s): 1.04 | learning rate: 3.367E-05 | global batch size: 256 | lm loss: 1.913860E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.595 | TFLOPs: 40.59 | 15: iteration 103370/ 125429 | consumed samples: 26462720 | consumed tokens: 54195650560 | elapsed time per iteration (s): 1.03 | learning rate: 3.366E-05 | global batch size: 256 | lm loss: 1.923998E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.399 | TFLOPs: 41.05 | 15: iteration 103380/ 125429 | consumed samples: 26465280 | consumed tokens: 54200893440 | elapsed time per iteration (s): 1.05 | learning rate: 3.364E-05 | global batch size: 256 | lm loss: 1.900633E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.012 | TFLOPs: 40.32 | 15: iteration 103390/ 125429 | consumed samples: 26467840 | consumed tokens: 54206136320 | elapsed time per 
iteration (s): 1.06 | learning rate: 3.363E-05 | global batch size: 256 | lm loss: 1.939132E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.081 | TFLOPs: 39.84 | 15: iteration 103400/ 125429 | consumed samples: 26470400 | consumed tokens: 54211379200 | elapsed time per iteration (s): 1.05 | learning rate: 3.362E-05 | global batch size: 256 | lm loss: 1.920196E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.231 | TFLOPs: 40.36 | 15: iteration 103410/ 125429 | consumed samples: 26472960 | consumed tokens: 54216622080 | elapsed time per iteration (s): 1.06 | learning rate: 3.361E-05 | global batch size: 256 | lm loss: 1.921580E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.245 | TFLOPs: 40.03 | 15: iteration 103420/ 125429 | consumed samples: 26475520 | consumed tokens: 54221864960 | elapsed time per iteration (s): 1.05 | learning rate: 3.360E-05 | global batch size: 256 | lm loss: 1.892229E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.064 | TFLOPs: 40.17 | 15: iteration 103430/ 125429 | consumed samples: 26478080 | consumed tokens: 54227107840 | elapsed time per iteration (s): 1.05 | learning rate: 3.358E-05 | global batch size: 256 | lm loss: 1.897353E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.355 | TFLOPs: 40.22 | 15: iteration 103440/ 125429 | consumed samples: 26480640 | consumed tokens: 54232350720 | elapsed time per iteration (s): 1.06 | learning rate: 3.357E-05 | global batch size: 256 | lm loss: 1.916445E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.774 | TFLOPs: 
39.79 | 15: iteration 103450/ 125429 | consumed samples: 26483200 | consumed tokens: 54237593600 | elapsed time per iteration (s): 1.06 | learning rate: 3.356E-05 | global batch size: 256 | lm loss: 1.914312E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.841 | TFLOPs: 39.80 | 15: iteration 103460/ 125429 | consumed samples: 26485760 | consumed tokens: 54242836480 | elapsed time per iteration (s): 1.07 | learning rate: 3.355E-05 | global batch size: 256 | lm loss: 1.918457E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.230 | TFLOPs: 39.70 | 15: iteration 103470/ 125429 | consumed samples: 26488320 | consumed tokens: 54248079360 | elapsed time per iteration (s): 1.04 | learning rate: 3.354E-05 | global batch size: 256 | lm loss: 1.899526E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.996 | TFLOPs: 40.65 | 15: iteration 103480/ 125429 | consumed samples: 26490880 | consumed tokens: 54253322240 | elapsed time per iteration (s): 1.04 | learning rate: 3.352E-05 | global batch size: 256 | lm loss: 1.903016E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.157 | TFLOPs: 40.68 | 15: iteration 103490/ 125429 | consumed samples: 26493440 | consumed tokens: 54258565120 | elapsed time per iteration (s): 1.07 | learning rate: 3.351E-05 | global batch size: 256 | lm loss: 1.892084E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.522 | TFLOPs: 39.42 | 15: iteration 103500/ 125429 | consumed samples: 26496000 | consumed tokens: 54263808000 | elapsed time per iteration (s): 1.04 | learning rate: 3.350E-05 | global batch size: 256 | lm loss: 1.938104E+00 | grad norm: 0.149 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.286 | TFLOPs: 40.70 | 15: iteration 103510/ 125429 | consumed samples: 26498560 | consumed tokens: 54269050880 | elapsed time per iteration (s): 1.03 | learning rate: 3.349E-05 | global batch size: 256 | lm loss: 1.897412E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.530 | TFLOPs: 41.07 | 15: iteration 103520/ 125429 | consumed samples: 26501120 | consumed tokens: 54274293760 | elapsed time per iteration (s): 1.05 | learning rate: 3.348E-05 | global batch size: 256 | lm loss: 1.927045E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.945 | TFLOPs: 40.15 | 15: iteration 103530/ 125429 | consumed samples: 26503680 | consumed tokens: 54279536640 | elapsed time per iteration (s): 1.05 | learning rate: 3.346E-05 | global batch size: 256 | lm loss: 1.912423E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.746 | TFLOPs: 40.28 | 15: iteration 103540/ 125429 | consumed samples: 26506240 | consumed tokens: 54284779520 | elapsed time per iteration (s): 1.07 | learning rate: 3.345E-05 | global batch size: 256 | lm loss: 1.893091E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.175 | TFLOPs: 39.53 | 15: iteration 103550/ 125429 | consumed samples: 26508800 | consumed tokens: 54290022400 | elapsed time per iteration (s): 1.04 | learning rate: 3.344E-05 | global batch size: 256 | lm loss: 1.892927E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.584 | TFLOPs: 40.75 | 15: iteration 103560/ 125429 | consumed samples: 26511360 | consumed tokens: 54295265280 | elapsed time per 
iteration (s): 1.04 | learning rate: 3.343E-05 | global batch size: 256 | lm loss: 1.896840E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.161 | TFLOPs: 40.51 | 15: iteration 103570/ 125429 | consumed samples: 26513920 | consumed tokens: 54300508160 | elapsed time per iteration (s): 1.07 | learning rate: 3.342E-05 | global batch size: 256 | lm loss: 1.909897E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.200 | TFLOPs: 39.36 | 15: iteration 103580/ 125429 | consumed samples: 26516480 | consumed tokens: 54305751040 | elapsed time per iteration (s): 1.03 | learning rate: 3.340E-05 | global batch size: 256 | lm loss: 1.929104E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.852 | TFLOPs: 40.96 | 15: iteration 103590/ 125429 | consumed samples: 26519040 | consumed tokens: 54310993920 | elapsed time per iteration (s): 1.02 | learning rate: 3.339E-05 | global batch size: 256 | lm loss: 1.902350E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.517 | TFLOPs: 41.40 | 15: iteration 103600/ 125429 | consumed samples: 26521600 | consumed tokens: 54316236800 | elapsed time per iteration (s): 1.03 | learning rate: 3.338E-05 | global batch size: 256 | lm loss: 1.945301E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.370 | TFLOPs: 40.88 | 15: iteration 103610/ 125429 | consumed samples: 26524160 | consumed tokens: 54321479680 | elapsed time per iteration (s): 1.07 | learning rate: 3.337E-05 | global batch size: 256 | lm loss: 1.907523E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.180 | TFLOPs: 
39.53 | 15: iteration 103620/ 125429 | consumed samples: 26526720 | consumed tokens: 54326722560 | elapsed time per iteration (s): 1.05 | learning rate: 3.336E-05 | global batch size: 256 | lm loss: 1.899800E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.291 | TFLOPs: 40.21 | 15: iteration 103630/ 125429 | consumed samples: 26529280 | consumed tokens: 54331965440 | elapsed time per iteration (s): 1.05 | learning rate: 3.334E-05 | global batch size: 256 | lm loss: 1.921926E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.756 | TFLOPs: 40.12 | 15: iteration 103640/ 125429 | consumed samples: 26531840 | consumed tokens: 54337208320 | elapsed time per iteration (s): 1.03 | learning rate: 3.333E-05 | global batch size: 256 | lm loss: 1.891808E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.703 | TFLOPs: 40.93 | 15: iteration 103650/ 125429 | consumed samples: 26534400 | consumed tokens: 54342451200 | elapsed time per iteration (s): 1.07 | learning rate: 3.332E-05 | global batch size: 256 | lm loss: 1.892926E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.568 | TFLOPs: 39.59 | 15: iteration 103660/ 125429 | consumed samples: 26536960 | consumed tokens: 54347694080 | elapsed time per iteration (s): 1.03 | learning rate: 3.331E-05 | global batch size: 256 | lm loss: 1.899641E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.611 | TFLOPs: 40.92 | 15: iteration 103670/ 125429 | consumed samples: 26539520 | consumed tokens: 54352936960 | elapsed time per iteration (s): 1.06 | learning rate: 3.330E-05 | global batch size: 256 | lm loss: 1.910609E+00 | grad norm: 0.150 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.259 | TFLOPs: 40.04 | 15: iteration 103680/ 125429 | consumed samples: 26542080 | consumed tokens: 54358179840 | elapsed time per iteration (s): 1.04 | learning rate: 3.328E-05 | global batch size: 256 | lm loss: 1.921558E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.225 | TFLOPs: 40.69 | 15: iteration 103690/ 125429 | consumed samples: 26544640 | consumed tokens: 54363422720 | elapsed time per iteration (s): 1.08 | learning rate: 3.327E-05 | global batch size: 256 | lm loss: 1.913456E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.026 | TFLOPs: 39.17 | 15: iteration 103700/ 125429 | consumed samples: 26547200 | consumed tokens: 54368665600 | elapsed time per iteration (s): 1.06 | learning rate: 3.326E-05 | global batch size: 256 | lm loss: 1.883446E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.883 | TFLOPs: 39.81 | 15: iteration 103710/ 125429 | consumed samples: 26549760 | consumed tokens: 54373908480 | elapsed time per iteration (s): 1.04 | learning rate: 3.325E-05 | global batch size: 256 | lm loss: 1.896218E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.534 | TFLOPs: 40.74 | 15: iteration 103720/ 125429 | consumed samples: 26552320 | consumed tokens: 54379151360 | elapsed time per iteration (s): 1.06 | learning rate: 3.324E-05 | global batch size: 256 | lm loss: 1.873844E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.146 | TFLOPs: 40.02 | 15: iteration 103730/ 125429 | consumed samples: 26554880 | consumed tokens: 54384394240 | elapsed time per 
iteration (s): 1.04 | learning rate: 3.322E-05 | global batch size: 256 | lm loss: 1.897679E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.060 | TFLOPs: 40.66 | 15: iteration 103740/ 125429 | consumed samples: 26557440 | consumed tokens: 54389637120 | elapsed time per iteration (s): 1.15 | learning rate: 3.321E-05 | global batch size: 256 | lm loss: 1.887447E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 223.271 | TFLOPs: 36.90 | 15: iteration 103750/ 125429 | consumed samples: 26560000 | consumed tokens: 54394880000 | elapsed time per iteration (s): 1.09 | learning rate: 3.320E-05 | global batch size: 256 | lm loss: 1.932625E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.422 | TFLOPs: 38.74 | 15: iteration 103760/ 125429 | consumed samples: 26562560 | consumed tokens: 54400122880 | elapsed time per iteration (s): 1.05 | learning rate: 3.319E-05 | global batch size: 256 | lm loss: 1.891458E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.755 | TFLOPs: 40.28 | 15: iteration 103770/ 125429 | consumed samples: 26565120 | consumed tokens: 54405365760 | elapsed time per iteration (s): 1.04 | learning rate: 3.318E-05 | global batch size: 256 | lm loss: 1.880434E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.144 | TFLOPs: 40.51 | 15: iteration 103780/ 125429 | consumed samples: 26567680 | consumed tokens: 54410608640 | elapsed time per iteration (s): 1.12 | learning rate: 3.317E-05 | global batch size: 256 | lm loss: 1.932100E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.776 | TFLOPs: 
37.64 | 15: iteration 103790/ 125429 | consumed samples: 26570240 | consumed tokens: 54415851520 | elapsed time per iteration (s): 1.09 | learning rate: 3.315E-05 | global batch size: 256 | lm loss: 1.881236E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.737 | TFLOPs: 38.96 | 15: iteration 103800/ 125429 | consumed samples: 26572800 | consumed tokens: 54421094400 | elapsed time per iteration (s): 1.05 | learning rate: 3.314E-05 | global batch size: 256 | lm loss: 1.923158E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.849 | TFLOPs: 40.30 | 15: iteration 103810/ 125429 | consumed samples: 26575360 | consumed tokens: 54426337280 | elapsed time per iteration (s): 1.09 | learning rate: 3.313E-05 | global batch size: 256 | lm loss: 1.901671E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.914 | TFLOPs: 38.82 | 15: iteration 103820/ 125429 | consumed samples: 26577920 | consumed tokens: 54431580160 | elapsed time per iteration (s): 1.07 | learning rate: 3.312E-05 | global batch size: 256 | lm loss: 1.911941E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.152 | TFLOPs: 39.36 | 15: iteration 103830/ 125429 | consumed samples: 26580480 | consumed tokens: 54436823040 | elapsed time per iteration (s): 1.07 | learning rate: 3.311E-05 | global batch size: 256 | lm loss: 1.924138E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.847 | TFLOPs: 39.47 | 15: iteration 103840/ 125429 | consumed samples: 26583040 | consumed tokens: 54442065920 | elapsed time per iteration (s): 1.04 | learning rate: 3.309E-05 | global batch size: 256 | lm loss: 1.893883E+00 | grad norm: 0.155 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.695 | TFLOPs: 40.60 | 15: iteration 103850/ 125429 | consumed samples: 26585600 | consumed tokens: 54447308800 | elapsed time per iteration (s): 1.03 | learning rate: 3.308E-05 | global batch size: 256 | lm loss: 1.913861E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.133 | TFLOPs: 41.01 | 15: iteration 103860/ 125429 | consumed samples: 26588160 | consumed tokens: 54452551680 | elapsed time per iteration (s): 1.02 | learning rate: 3.307E-05 | global batch size: 256 | lm loss: 1.886378E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.869 | TFLOPs: 41.29 | 15: iteration 103870/ 125429 | consumed samples: 26590720 | consumed tokens: 54457794560 | elapsed time per iteration (s): 1.06 | learning rate: 3.306E-05 | global batch size: 256 | lm loss: 1.899404E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.124 | TFLOPs: 40.01 | 15: iteration 103880/ 125429 | consumed samples: 26593280 | consumed tokens: 54463037440 | elapsed time per iteration (s): 1.02 | learning rate: 3.305E-05 | global batch size: 256 | lm loss: 1.897591E+00 | grad norm: 0.144 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.799 | TFLOPs: 41.28 | 15: iteration 103890/ 125429 | consumed samples: 26595840 | consumed tokens: 54468280320 | elapsed time per iteration (s): 1.05 | learning rate: 3.304E-05 | global batch size: 256 | lm loss: 1.939856E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.980 | TFLOPs: 40.32 | 15: iteration 103900/ 125429 | consumed samples: 26598400 | consumed tokens: 54473523200 | elapsed time per 
iteration (s): 1.10 | learning rate: 3.302E-05 | global batch size: 256 | lm loss: 1.909734E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.341 | TFLOPs: 38.40 | 15: iteration 103910/ 125429 | consumed samples: 26600960 | consumed tokens: 54478766080 | elapsed time per iteration (s): 1.02 | learning rate: 3.301E-05 | global batch size: 256 | lm loss: 1.893651E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.975 | TFLOPs: 41.64 | 15: iteration 103920/ 125429 | consumed samples: 26603520 | consumed tokens: 54484008960 | elapsed time per iteration (s): 1.02 | learning rate: 3.300E-05 | global batch size: 256 | lm loss: 1.882395E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.177 | TFLOPs: 41.51 | 15: iteration 103930/ 125429 | consumed samples: 26606080 | consumed tokens: 54489251840 | elapsed time per iteration (s): 1.04 | learning rate: 3.299E-05 | global batch size: 256 | lm loss: 1.907543E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.904 | TFLOPs: 40.64 | 15: iteration 103940/ 125429 | consumed samples: 26608640 | consumed tokens: 54494494720 | elapsed time per iteration (s): 1.06 | learning rate: 3.298E-05 | global batch size: 256 | lm loss: 1.909654E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.629 | TFLOPs: 39.77 | 15: iteration 103950/ 125429 | consumed samples: 26611200 | consumed tokens: 54499737600 | elapsed time per iteration (s): 1.13 | learning rate: 3.296E-05 | global batch size: 256 | lm loss: 1.935139E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.220 | TFLOPs: 
37.55 | 15: iteration 103960/ 125429 | consumed samples: 26613760 | consumed tokens: 54504980480 | elapsed time per iteration (s): 1.02 | learning rate: 3.295E-05 | global batch size: 256 | lm loss: 1.895627E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.329 | TFLOPs: 41.37 | 15: iteration 103970/ 125429 | consumed samples: 26616320 | consumed tokens: 54510223360 | elapsed time per iteration (s): 1.05 | learning rate: 3.294E-05 | global batch size: 256 | lm loss: 1.905380E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.222 | TFLOPs: 40.19 | 15: iteration 103980/ 125429 | consumed samples: 26618880 | consumed tokens: 54515466240 | elapsed time per iteration (s): 1.05 | learning rate: 3.293E-05 | global batch size: 256 | lm loss: 1.878061E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.004 | TFLOPs: 40.32 | 15: iteration 103990/ 125429 | consumed samples: 26621440 | consumed tokens: 54520709120 | elapsed time per iteration (s): 1.08 | learning rate: 3.292E-05 | global batch size: 256 | lm loss: 1.895094E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.756 | TFLOPs: 39.13 | 0: [2022-11-27 02:49:13,704] [INFO] [logging.py:68:log_dist] [Rank 0] step=104000, skipped=0, lr=[3.290589046302177e-05, 3.290589046302177e-05, 3.290589046302177e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 104000/ 125429 | consumed samples: 26624000 | consumed tokens: 54525952000 | elapsed time per iteration (s): 1.05 | learning rate: 3.291E-05 | global batch size: 256 | lm loss: 1.903359E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.394 | TFLOPs: 40.39 | 0: steps: 104000 
loss: 1.9049 iter time (s): 1.052 samples/sec: 243.448 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 104000 | lm loss value: 1.842513E+00 | lm loss PPL: 6.312383E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 104000 to checkpoints_1b5 0: [2022-11-27 02:49:14,066] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step104000 is begin to save! 0: [2022-11-27 02:49:14,076] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_01-model_00-model_states.pt... 0: [2022-11-27 02:49:14,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_01-model_00-model_states.pt. 0: [2022-11-27 02:49:14,351] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_03-model_00-model_states.pt... 0: [2022-11-27 02:49:14,459] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_03-model_00-model_states.pt. 0: [2022-11-27 02:49:14,460] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_04-model_00-model_states.pt... 0: [2022-11-27 02:49:14,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_04-model_00-model_states.pt. 0: [2022-11-27 02:49:14,575] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_05-model_00-model_states.pt... 0: [2022-11-27 02:49:14,686] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_05-model_00-model_states.pt. 0: [2022-11-27 02:49:14,687] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_06-model_00-model_states.pt... 
0: [2022-11-27 02:49:14,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_06-model_00-model_states.pt. 0: [2022-11-27 02:49:14,799] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_07-model_00-model_states.pt... 0: [2022-11-27 02:49:14,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_07-model_00-model_states.pt. 0: [2022-11-27 02:49:14,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_08-model_00-model_states.pt... 0: [2022-11-27 02:49:15,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_08-model_00-model_states.pt. 0: [2022-11-27 02:49:15,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_09-model_00-model_states.pt... 0: [2022-11-27 02:49:15,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_09-model_00-model_states.pt. 0: [2022-11-27 02:49:15,124] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_10-model_00-model_states.pt... 0: [2022-11-27 02:49:15,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_10-model_00-model_states.pt. 0: [2022-11-27 02:49:15,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_11-model_00-model_states.pt... 0: [2022-11-27 02:49:15,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_11-model_00-model_states.pt. 0: [2022-11-27 02:49:15,340] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_12-model_00-model_states.pt... 
0: [2022-11-27 02:49:15,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_12-model_00-model_states.pt. 0: [2022-11-27 02:49:15,446] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_13-model_00-model_states.pt... 0: [2022-11-27 02:49:15,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_13-model_00-model_states.pt. 0: [2022-11-27 02:49:15,554] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_14-model_00-model_states.pt... 0: [2022-11-27 02:49:15,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_14-model_00-model_states.pt. 0: [2022-11-27 02:49:15,662] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_15-model_00-model_states.pt... 0: [2022-11-27 02:49:15,768] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_15-model_00-model_states.pt. 0: [2022-11-27 02:49:15,768] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_16-model_00-model_states.pt... 0: [2022-11-27 02:49:15,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_16-model_00-model_states.pt. 0: [2022-11-27 02:49:15,875] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_17-model_00-model_states.pt... 0: [2022-11-27 02:49:15,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_17-model_00-model_states.pt. 0: [2022-11-27 02:49:15,981] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_18-model_00-model_states.pt... 
0: [2022-11-27 02:49:16,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_18-model_00-model_states.pt. 0: [2022-11-27 02:49:16,086] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_19-model_00-model_states.pt... 0: [2022-11-27 02:49:16,194] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_19-model_00-model_states.pt. 0: [2022-11-27 02:49:16,195] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_20-model_00-model_states.pt... 0: [2022-11-27 02:49:16,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_20-model_00-model_states.pt. 0: [2022-11-27 02:49:16,301] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_21-model_00-model_states.pt... 0: [2022-11-27 02:49:16,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_21-model_00-model_states.pt. 0: [2022-11-27 02:49:16,409] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_22-model_00-model_states.pt... 0: [2022-11-27 02:49:16,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_22-model_00-model_states.pt. 0: [2022-11-27 02:49:16,518] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_23-model_00-model_states.pt... 0: [2022-11-27 02:49:16,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_23-model_00-model_states.pt. 0: [2022-11-27 02:49:16,624] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_24-model_00-model_states.pt... 
0: [2022-11-27 02:49:16,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_24-model_00-model_states.pt. 0: [2022-11-27 02:49:16,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_25-model_00-model_states.pt... 0: [2022-11-27 02:49:16,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_25-model_00-model_states.pt. 0: [2022-11-27 02:49:16,838] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_26-model_00-model_states.pt... 0: [2022-11-27 02:49:16,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_26-model_00-model_states.pt. 0: [2022-11-27 02:49:16,941] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_27-model_00-model_states.pt... 0: [2022-11-27 02:49:17,054] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_27-model_00-model_states.pt. 0: [2022-11-27 02:49:17,055] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_28-model_00-model_states.pt... 0: [2022-11-27 02:49:17,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_28-model_00-model_states.pt. 0: [2022-11-27 02:49:17,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_29-model_00-model_states.pt... 0: [2022-11-27 02:49:17,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_29-model_00-model_states.pt. 0: [2022-11-27 02:49:17,268] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_30-model_00-model_states.pt... 
0: [2022-11-27 02:49:17,373] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_30-model_00-model_states.pt. 0: [2022-11-27 02:49:17,373] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/layer_32-model_00-model_states.pt... 0: [2022-11-27 02:49:17,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/layer_32-model_00-model_states.pt. 0: [2022-11-27 02:49:17,378] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step104000/mp_rank_00_model_states.pt 0: [2022-11-27 02:49:17,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/mp_rank_00_model_states.pt... 0: [2022-11-27 02:49:17,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/mp_rank_00_model_states.pt. 0: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
8: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 
1: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
5: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
9: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 
6: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 
12: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
8: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
14: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
2: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 8: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 14: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
2: [2022-11-27 02:49:17,422] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step104000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 10: [2022-11-27 02:49:17,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:49:17,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 02:49:17,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 6: [2022-11-27 02:49:17,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:49:17,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 02:49:17,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 0: [2022-11-27 02:49:17,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:49:17,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 02:49:17,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 8: [2022-11-27 02:49:17,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-27 02:49:17,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 02:49:17,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 9: [2022-11-27 02:49:17,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:49:17,593] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 02:49:17,593] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 11: [2022-11-27 02:49:17,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:49:17,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 02:49:17,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 5: [2022-11-27 02:49:17,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:49:17,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:49:17,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 02:49:17,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 02:49:17,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 10: [2022-11-27 02:49:17,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:49:17,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 9: [2022-11-27 02:49:17,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:49:17,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:49:17,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 9: [2022-11-27 02:49:17,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 10: [2022-11-27 02:49:17,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 10: [2022-11-27 02:49:17,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 9: [2022-11-27 02:49:17,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
10: [2022-11-27 02:49:17,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 5: [2022-11-27 02:49:17,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:49:17,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 02:49:17,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:49:17,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 5: [2022-11-27 02:49:17,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 02:49:17,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 10: [2022-11-27 02:49:17,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:49:17,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 02:49:17,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 3: [2022-11-27 02:49:17,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-27 02:49:17,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 02:49:17,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 3: [2022-11-27 02:49:17,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:49:17,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:49:17,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 02:49:17,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 02:49:17,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 3: [2022-11-27 02:49:17,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 3: [2022-11-27 02:49:17,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:49:17,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 02:49:17,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 10: [2022-11-27 02:49:17,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
7: [2022-11-27 02:49:17,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:49:17,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 7: [2022-11-27 02:49:17,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 02:49:17,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 11: [2022-11-27 02:49:17,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:49:17,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 11: [2022-11-27 02:49:17,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 02:49:17,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 9: [2022-11-27 02:49:17,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:49:17,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 02:49:17,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 9: [2022-11-27 02:49:17,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-27 02:49:17,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 02:49:17,603] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 9: [2022-11-27 02:49:17,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:49:17,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 02:49:17,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 3: [2022-11-27 02:49:17,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:49:17,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 02:49:17,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 6: [2022-11-27 02:49:17,604] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:49:17,604] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 02:49:17,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 6: [2022-11-27 02:49:17,609] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-27 02:49:17,609] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 02:49:17,609] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 1: [2022-11-27 02:49:17,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:49:17,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:49:17,611] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:49:17,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 02:49:17,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 02:49:17,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 02:49:17,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 1: [2022-11-27 02:49:17,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 1: [2022-11-27 02:49:17,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 7: [2022-11-27 02:49:17,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
1: [2022-11-27 02:49:17,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:49:17,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 1: [2022-11-27 02:49:17,612] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 7: [2022-11-27 02:49:17,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 1: [2022-11-27 02:49:17,612] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 13: [2022-11-27 02:49:17,599] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:49:17,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:49:17,597] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 02:49:17,597] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 8: [2022-11-27 02:49:17,602] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:49:17,602] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 02:49:17,602] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
8: [2022-11-27 02:49:17,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:49:17,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 02:49:17,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 8: [2022-11-27 02:49:17,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:49:17,607] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 02:49:17,607] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 8: [2022-11-27 02:49:17,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:49:17,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 02:49:17,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 7: [2022-11-27 02:49:17,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:49:17,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
7: [2022-11-27 02:49:17,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 1: [2022-11-27 02:49:17,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 7: [2022-11-27 02:49:17,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 1: [2022-11-27 02:49:17,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 1: [2022-11-27 02:49:17,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:49:17,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 02:49:17,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 5: [2022-11-27 02:49:17,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:49:17,613] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 02:49:17,613] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 5: [2022-11-27 02:49:17,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:49:17,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 02:49:17,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 7: [2022-11-27 02:49:17,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:49:17,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 02:49:17,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 8: [2022-11-27 02:49:17,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:49:17,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 02:49:17,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 8: [2022-11-27 02:49:17,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 02:49:17,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 02:49:17,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 3: [2022-11-27 02:49:17,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-27 02:49:17,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 02:49:17,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 10: [2022-11-27 02:49:17,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:49:17,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:49:17,617] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 02:49:17,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 02:49:17,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 02:49:17,617] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 02:49:17,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 10: [2022-11-27 02:49:17,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 10: [2022-11-27 02:49:17,617] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 6: [2022-11-27 02:49:17,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-27 02:49:17,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 02:49:17,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 6: [2022-11-27 02:49:17,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:49:17,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:49:17,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 02:49:17,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 02:49:17,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 6: [2022-11-27 02:49:17,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 11: [2022-11-27 02:49:17,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:49:17,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-27 02:49:17,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 02:49:17,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 02:49:17,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 11: [2022-11-27 02:49:17,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 11: [2022-11-27 02:49:17,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:49:17,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 02:49:17,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 11: [2022-11-27 02:49:17,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:49:17,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 02:49:17,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 5: [2022-11-27 02:49:17,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-27 02:49:17,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 02:49:17,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 9: [2022-11-27 02:49:17,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:49:17,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 02:49:17,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 13: [2022-11-27 02:49:17,600] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 02:49:17,600] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 13: [2022-11-27 02:49:17,601] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:49:17,601] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 02:49:17,601] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 13: [2022-11-27 02:49:17,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-27 02:49:17,603] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 02:49:17,604] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 13: [2022-11-27 02:49:17,616] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:49:17,616] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 02:49:17,616] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 13: [2022-11-27 02:49:17,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:49:17,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 02:49:17,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 13: [2022-11-27 02:49:17,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:49:17,621] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 02:49:17,621] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 0: [2022-11-27 02:49:17,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-27 02:49:17,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 02:49:17,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 0: [2022-11-27 02:49:17,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:49:17,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 02:49:17,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 11: [2022-11-27 02:49:17,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:49:17,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 0: [2022-11-27 02:49:17,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 11: [2022-11-27 02:49:17,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 0: [2022-11-27 02:49:17,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 02:49:17,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 6: [2022-11-27 02:49:17,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-27 02:49:17,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 02:49:17,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 11: [2022-11-27 02:49:17,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:49:17,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:49:17,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 02:49:17,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 7: [2022-11-27 02:49:17,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:49:17,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:49:17,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 02:49:17,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 02:49:17,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 7: [2022-11-27 02:49:17,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
7: [2022-11-27 02:49:17,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:49:17,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 02:49:17,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 02:49:17,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 02:49:17,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 7: [2022-11-27 02:49:17,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 3: [2022-11-27 02:49:17,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:49:17,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 02:49:17,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 5: [2022-11-27 02:49:17,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 02:49:17,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 02:49:17,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
11: [2022-11-27 02:49:17,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 02:49:17,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 3: [2022-11-27 02:49:17,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 02:49:17,637] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 02:49:17,637] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 6: [2022-11-27 02:49:17,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 02:49:17,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 02:49:17,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 9: [2022-11-27 02:49:17,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 02:49:17,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 02:49:17,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 1: [2022-11-27 02:49:17,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-27 02:49:17,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 02:49:17,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 1: [2022-11-27 02:49:17,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 02:49:17,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 02:49:17,646] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 13: [2022-11-27 02:49:17,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:49:17,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 02:49:17,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 13: [2022-11-27 02:49:17,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 02:49:17,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 02:49:17,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 15: [2022-11-27 02:49:17,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-27 02:49:17,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:49:17,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:49:17,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:49:17,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 02:49:17,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 02:49:17,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 02:49:17,658] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 02:49:17,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 15: [2022-11-27 02:49:17,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 15: [2022-11-27 02:49:17,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 15: [2022-11-27 02:49:17,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 15: [2022-11-27 02:49:17,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-27 02:49:17,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 02:49:17,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 15: [2022-11-27 02:49:17,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:49:17,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 02:49:17,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 15: [2022-11-27 02:49:17,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:49:17,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 02:49:17,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 02:49:17,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 15: [2022-11-27 02:49:17,660] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 02:49:17,660] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 4: [2022-11-27 02:49:17,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-27 02:49:17,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:49:17,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:49:17,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:49:17,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:49:17,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:49:17,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 02:49:17,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:49:17,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 02:49:17,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 02:49:17,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 02:49:17,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
4: [2022-11-27 02:49:17,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 4: [2022-11-27 02:49:17,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 02:49:17,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 4: [2022-11-27 02:49:17,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 4: [2022-11-27 02:49:17,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 02:49:17,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 02:49:17,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 4: [2022-11-27 02:49:17,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 4: [2022-11-27 02:49:17,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 4: [2022-11-27 02:49:17,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 02:49:17,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 02:49:17,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 0: [2022-11-27 02:49:17,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-27 02:49:17,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 02:49:17,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:49:17,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:49:17,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 0: [2022-11-27 02:49:17,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 02:49:17,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 02:49:17,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 0: [2022-11-27 02:49:17,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 02:49:17,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 0: [2022-11-27 02:49:17,808] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 02:49:17,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 2: [2022-11-27 02:49:17,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-27 02:49:17,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 02:49:17,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:49:17,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 2: [2022-11-27 02:49:17,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 02:49:17,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 2: [2022-11-27 02:49:17,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:49:17,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:49:17,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:49:17,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:49:17,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 02:49:17,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-27 02:49:17,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 02:49:17,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 02:49:17,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 02:49:17,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 2: [2022-11-27 02:49:17,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 2: [2022-11-27 02:49:17,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 02:49:17,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 2: [2022-11-27 02:49:17,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 02:49:17,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 02:49:17,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 2: [2022-11-27 02:49:17,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 2: [2022-11-27 02:49:17,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
14: [2022-11-27 02:49:17,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:49:17,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:49:17,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:49:17,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:49:17,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:49:17,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 02:49:17,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 02:49:17,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 02:49:17,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 02:49:17,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 02:49:17,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
14: [2022-11-27 02:49:17,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 14: [2022-11-27 02:49:17,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 14: [2022-11-27 02:49:17,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 14: [2022-11-27 02:49:17,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 14: [2022-11-27 02:49:17,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:49:17,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:49:17,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 02:49:17,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 02:49:17,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 02:49:17,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 02:49:17,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 14: [2022-11-27 02:49:17,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 14: [2022-11-27 02:49:17,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
12: [2022-11-27 02:49:17,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:49:17,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:49:17,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:49:17,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:49:17,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:49:17,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:49:17,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 02:49:17,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 02:49:17,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 02:49:17,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 02:49:17,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
12: [2022-11-27 02:49:17,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 12: [2022-11-27 02:49:17,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 12: [2022-11-27 02:49:17,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 02:49:17,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 12: [2022-11-27 02:49:17,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 02:49:17,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 12: [2022-11-27 02:49:17,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 12: [2022-11-27 02:49:17,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:49:17,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 02:49:17,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 02:49:17,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step104000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 02:49:17,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 12: [2022-11-27 02:49:17,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step104000 is ready now! 
0: successfully saved checkpoint at iteration 104000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3893.64
15: iteration 104010/ 125429 | consumed samples: 26626560 | consumed tokens: 54531194880 | elapsed time per iteration (s): 1.46 | learning rate: 3.289E-05 | global batch size: 256 | lm loss: 1.915404E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 175.645 | TFLOPs: 29.03 |
15: iteration 104020/ 125429 | consumed samples: 26629120 | consumed tokens: 54536437760 | elapsed time per iteration (s): 1.06 | learning rate: 3.288E-05 | global batch size: 256 | lm loss: 1.921861E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.009 | TFLOPs: 39.99 |
15: iteration 104030/ 125429 | consumed samples: 26631680 | consumed tokens: 54541680640 | elapsed time per iteration (s): 1.04 | learning rate: 3.287E-05 | global batch size: 256 | lm loss: 1.882555E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.788 | TFLOPs: 40.62 |
15: iteration 104040/ 125429 | consumed samples: 26634240 | consumed tokens: 54546923520 | elapsed time per iteration (s): 1.06 | learning rate: 3.286E-05 | global batch size: 256 | lm loss: 1.890324E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.800 | TFLOPs: 39.79 |
15: iteration 104050/ 125429 | consumed samples: 26636800 | consumed tokens: 54552166400 | elapsed time per iteration (s): 1.04 | learning rate: 3.285E-05 | global batch size: 256 | lm loss: 1.902219E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.240 | TFLOPs: 40.86 |
15: iteration 104060/ 125429 | consumed samples: 26639360 | consumed tokens: 54557409280 | elapsed time per iteration (s): 1.02 | learning rate: 3.284E-05 | global batch size: 256 | lm loss: 1.891176E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.729 | TFLOPs: 41.43 |
15: iteration 104070/ 125429 | consumed samples: 26641920 | consumed tokens: 54562652160 | elapsed time per iteration (s): 1.04 | learning rate: 3.282E-05 | global batch size: 256 | lm loss: 1.914057E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.780 | TFLOPs: 40.78 |
15: iteration 104080/ 125429 | consumed samples: 26644480 | consumed tokens: 54567895040 | elapsed time per iteration (s): 1.10 | learning rate: 3.281E-05 | global batch size: 256 | lm loss: 1.874805E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.285 | TFLOPs: 38.55 |
15: iteration 104090/ 125429 | consumed samples: 26647040 | consumed tokens: 54573137920 | elapsed time per iteration (s): 1.04 | learning rate: 3.280E-05 | global batch size: 256 | lm loss: 1.913024E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.143 | TFLOPs: 40.51 |
15: iteration 104100/ 125429 | consumed samples: 26649600 | consumed tokens: 54578380800 | elapsed time per iteration (s): 1.04 | learning rate: 3.279E-05 | global batch size: 256 | lm loss: 1.921256E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.003 | TFLOPs: 40.49 |
15: iteration 104110/ 125429 | consumed samples: 26652160 | consumed tokens: 54583623680 | elapsed time per iteration (s): 1.04 | learning rate: 3.278E-05 | global batch size: 256 | lm loss: 1.888032E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.441 | TFLOPs: 40.73 |
15: iteration 104120/ 125429 | consumed samples: 26654720 | consumed tokens: 54588866560 | elapsed time per iteration (s): 1.03 | learning rate: 3.277E-05 | global batch size: 256 | lm loss: 1.909663E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.778 | TFLOPs: 40.95 |
15: iteration 104130/ 125429 | consumed samples: 26657280 | consumed tokens: 54594109440 | elapsed time per iteration (s): 1.04 | learning rate: 3.275E-05 | global batch size: 256 | lm loss: 1.906849E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.668 | TFLOPs: 40.60 |
15: iteration 104140/ 125429 | consumed samples: 26659840 | consumed tokens: 54599352320 | elapsed time per iteration (s): 1.05 | learning rate: 3.274E-05 | global batch size: 256 | lm loss: 1.896875E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.215 | TFLOPs: 40.19 |
15: iteration 104150/ 125429 | consumed samples: 26662400 | consumed tokens: 54604595200 | elapsed time per iteration (s): 1.04 | learning rate: 3.273E-05 | global batch size: 256 | lm loss: 1.909016E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.576 | TFLOPs: 40.58 |
15: iteration 104160/ 125429 | consumed samples: 26664960 | consumed tokens: 54609838080 | elapsed time per iteration (s): 1.05 | learning rate: 3.272E-05 | global batch size: 256 | lm loss: 1.912989E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.279 | TFLOPs: 40.20 |
15: iteration 104170/ 125429 | consumed samples: 26667520 | consumed tokens: 54615080960 | elapsed time per iteration (s): 1.06 | learning rate: 3.271E-05 | global batch size: 256 | lm loss: 1.883465E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.089 | TFLOPs: 39.84 |
15: iteration 104180/ 125429 | consumed samples: 26670080 | consumed tokens: 54620323840 | elapsed time per iteration (s): 1.12 | learning rate: 3.270E-05 | global batch size: 256 | lm loss: 1.903706E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.235 | TFLOPs: 37.72 |
15: iteration 104190/ 125429 | consumed samples: 26672640 | consumed tokens: 54625566720 | elapsed time per iteration (s): 1.07 | learning rate: 3.268E-05 | global batch size: 256 | lm loss: 1.896973E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.930 | TFLOPs: 39.49 |
15: iteration 104200/ 125429 | consumed samples: 26675200 | consumed tokens: 54630809600 | elapsed time per iteration (s): 1.04 | learning rate: 3.267E-05 | global batch size: 256 | lm loss: 1.925523E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.761 | TFLOPs: 40.61 |
15: iteration 104210/ 125429 | consumed samples: 26677760 | consumed tokens: 54636052480 | elapsed time per iteration (s): 1.03 | learning rate: 3.266E-05 | global batch size: 256 | lm loss: 1.907482E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.531 | TFLOPs: 41.07 |
15: iteration 104220/ 125429 | consumed samples: 26680320 | consumed tokens: 54641295360 | elapsed time per iteration (s): 1.06 | learning rate: 3.265E-05 | global batch size: 256 | lm loss: 1.908888E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.896 | TFLOPs: 39.98 |
15: iteration 104230/ 125429 | consumed samples: 26682880 | consumed tokens: 54646538240 | elapsed time per iteration (s): 1.08 | learning rate: 3.264E-05 | global batch size: 256 | lm loss: 1.877300E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.107 | TFLOPs: 39.18 |
15: iteration 104240/ 125429 | consumed samples: 26685440 | consumed tokens: 54651781120 | elapsed time per iteration (s): 1.06 | learning rate: 3.263E-05 | global batch size: 256 | lm loss: 1.873862E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.355 | TFLOPs: 40.05 |
15: iteration 104250/ 125429 | consumed samples: 26688000 | consumed tokens: 54657024000 | elapsed time per iteration (s): 1.04 | learning rate: 3.261E-05 | global batch size: 256 | lm loss: 1.908653E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.835 | TFLOPs: 40.63 |
15: iteration 104260/ 125429 | consumed samples: 26690560 | consumed tokens: 54662266880 | elapsed time per iteration (s): 1.05 | learning rate: 3.260E-05 | global batch size: 256 | lm loss: 1.905989E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.849 | TFLOPs: 40.46 |
15: iteration 104270/ 125429 | consumed samples: 26693120 | consumed tokens: 54667509760 | elapsed time per iteration (s): 1.07 | learning rate: 3.259E-05 | global batch size: 256 | lm loss: 1.888214E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.592 | TFLOPs: 39.43 |
15: iteration 104280/ 125429 | consumed samples: 26695680 | consumed tokens: 54672752640 | elapsed time per iteration (s): 1.07 | learning rate: 3.258E-05 | global batch size: 256 | lm loss: 1.897036E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.484 | TFLOPs: 39.41 |
15: iteration 104290/ 125429 | consumed samples: 26698240 | consumed tokens: 54677995520 | elapsed time per iteration (s): 1.35 | learning rate: 3.257E-05 | global batch size: 256 | lm loss: 1.878701E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 189.460 | TFLOPs: 31.31 |
15: iteration 104300/ 125429 | consumed samples: 26700800 | consumed tokens: 54683238400 | elapsed time per iteration (s): 1.04 | learning rate: 3.256E-05 | global batch size: 256 | lm loss: 1.908381E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.018 | TFLOPs: 40.49 |
15: iteration 104310/ 125429 | consumed samples: 26703360 | consumed tokens: 54688481280 | elapsed time per iteration (s): 1.04 | learning rate: 3.254E-05 | global batch size: 256 | lm loss: 1.920135E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.666 | TFLOPs: 40.60 |
15: iteration 104320/ 125429 | consumed samples: 26705920 | consumed tokens: 54693724160 | elapsed time per iteration (s): 1.03 | learning rate: 3.253E-05 | global batch size: 256 | lm loss: 1.915203E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.636 | TFLOPs: 41.09 |
15: iteration 104330/ 125429 | consumed samples: 26708480 | consumed tokens: 54698967040 | elapsed time per iteration (s): 1.04 | learning rate: 3.252E-05 | global batch size: 256 | lm loss: 1.911515E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.057 | TFLOPs: 40.66 |
15: iteration 104340/ 125429 | consumed samples: 26711040 | consumed tokens: 54704209920 | elapsed time per iteration (s): 1.06 | learning rate: 3.251E-05 | global batch size: 256 | lm loss: 1.897668E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.573 | TFLOPs: 39.76 |
15: iteration 104350/ 125429 | consumed samples: 26713600 | consumed tokens: 54709452800 | elapsed time per iteration (s): 1.04 | learning rate: 3.250E-05 | global batch size: 256 | lm loss: 1.898966E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.285 | TFLOPs: 40.87 |
15: iteration 104360/ 125429 | consumed samples: 26716160 | consumed tokens: 54714695680 | elapsed time per iteration (s): 1.02 | learning rate: 3.249E-05 | global batch size: 256 | lm loss: 1.900090E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.753 | TFLOPs: 41.44 |
15: iteration 104370/ 125429 | consumed samples: 26718720 | consumed tokens: 54719938560 | elapsed time per iteration (s): 1.08 | learning rate: 3.247E-05 | global batch size: 256 | lm loss: 1.938654E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.298 | TFLOPs: 39.05 |
15: iteration 104380/ 125429 | consumed samples: 26721280 | consumed tokens: 54725181440 | elapsed time per iteration (s): 1.05 | learning rate: 3.246E-05 | global batch size: 256 | lm loss: 1.908746E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.596 | TFLOPs: 40.42 |
15: iteration 104390/ 125429 | consumed samples: 26723840 | consumed tokens: 54730424320 | elapsed time per iteration (s): 1.12 | learning rate: 3.245E-05 | global batch size: 256 | lm loss: 1.906460E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.741 | TFLOPs: 37.80 |
15: iteration 104400/ 125429 | consumed samples: 26726400 | consumed tokens: 54735667200 | elapsed time per iteration (s): 1.02 | learning 
rate: 3.244E-05 | global batch size: 256 | lm loss: 1.887169E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.582 | TFLOPs: 41.41 | 15: iteration 104410/ 125429 | consumed samples: 26728960 | consumed tokens: 54740910080 | elapsed time per iteration (s): 1.03 | learning rate: 3.243E-05 | global batch size: 256 | lm loss: 1.915947E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.028 | TFLOPs: 40.99 | 15: iteration 104420/ 125429 | consumed samples: 26731520 | consumed tokens: 54746152960 | elapsed time per iteration (s): 1.04 | learning rate: 3.242E-05 | global batch size: 256 | lm loss: 1.886513E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.937 | TFLOPs: 40.64 | 15: iteration 104430/ 125429 | consumed samples: 26734080 | consumed tokens: 54751395840 | elapsed time per iteration (s): 1.03 | learning rate: 3.241E-05 | global batch size: 256 | lm loss: 1.890636E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.215 | TFLOPs: 41.02 | 15: iteration 104440/ 125429 | consumed samples: 26736640 | consumed tokens: 54756638720 | elapsed time per iteration (s): 1.02 | learning rate: 3.239E-05 | global batch size: 256 | lm loss: 1.932367E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.148 | TFLOPs: 41.34 | 15: iteration 104450/ 125429 | consumed samples: 26739200 | consumed tokens: 54761881600 | elapsed time per iteration (s): 1.02 | learning rate: 3.238E-05 | global batch size: 256 | lm loss: 1.910121E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.996 | TFLOPs: 41.31 | 15: iteration 104460/ 
125429 | consumed samples: 26741760 | consumed tokens: 54767124480 | elapsed time per iteration (s): 1.04 | learning rate: 3.237E-05 | global batch size: 256 | lm loss: 1.903921E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.144 | TFLOPs: 40.84 | 15: iteration 104470/ 125429 | consumed samples: 26744320 | consumed tokens: 54772367360 | elapsed time per iteration (s): 1.05 | learning rate: 3.236E-05 | global batch size: 256 | lm loss: 1.900684E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.000 | TFLOPs: 40.16 | 15: iteration 104480/ 125429 | consumed samples: 26746880 | consumed tokens: 54777610240 | elapsed time per iteration (s): 1.03 | learning rate: 3.235E-05 | global batch size: 256 | lm loss: 1.918246E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.021 | TFLOPs: 40.99 | 15: iteration 104490/ 125429 | consumed samples: 26749440 | consumed tokens: 54782853120 | elapsed time per iteration (s): 1.08 | learning rate: 3.234E-05 | global batch size: 256 | lm loss: 1.900340E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.874 | TFLOPs: 39.15 | 15: iteration 104500/ 125429 | consumed samples: 26752000 | consumed tokens: 54788096000 | elapsed time per iteration (s): 1.03 | learning rate: 3.232E-05 | global batch size: 256 | lm loss: 1.880361E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.651 | TFLOPs: 41.26 | 15: iteration 104510/ 125429 | consumed samples: 26754560 | consumed tokens: 54793338880 | elapsed time per iteration (s): 1.02 | learning rate: 3.231E-05 | global batch size: 256 | lm loss: 1.907048E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 249.893 | TFLOPs: 41.30 | 15: iteration 104520/ 125429 | consumed samples: 26757120 | consumed tokens: 54798581760 | elapsed time per iteration (s): 1.09 | learning rate: 3.230E-05 | global batch size: 256 | lm loss: 1.906270E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.023 | TFLOPs: 38.67 | 15: iteration 104530/ 125429 | consumed samples: 26759680 | consumed tokens: 54803824640 | elapsed time per iteration (s): 1.02 | learning rate: 3.229E-05 | global batch size: 256 | lm loss: 1.902236E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.061 | TFLOPs: 41.32 | 15: iteration 104540/ 125429 | consumed samples: 26762240 | consumed tokens: 54809067520 | elapsed time per iteration (s): 1.10 | learning rate: 3.228E-05 | global batch size: 256 | lm loss: 1.925978E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.716 | TFLOPs: 38.46 | 15: iteration 104550/ 125429 | consumed samples: 26764800 | consumed tokens: 54814310400 | elapsed time per iteration (s): 1.04 | learning rate: 3.227E-05 | global batch size: 256 | lm loss: 1.916207E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.069 | TFLOPs: 40.50 | 15: iteration 104560/ 125429 | consumed samples: 26767360 | consumed tokens: 54819553280 | elapsed time per iteration (s): 1.04 | learning rate: 3.226E-05 | global batch size: 256 | lm loss: 1.919087E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.806 | TFLOPs: 40.79 | 15: iteration 104570/ 125429 | consumed samples: 26769920 | consumed tokens: 54824796160 | elapsed time per iteration (s): 1.03 | learning rate: 
3.224E-05 | global batch size: 256 | lm loss: 1.877874E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.887 | TFLOPs: 40.97 | 15: iteration 104580/ 125429 | consumed samples: 26772480 | consumed tokens: 54830039040 | elapsed time per iteration (s): 1.11 | learning rate: 3.223E-05 | global batch size: 256 | lm loss: 1.915276E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.342 | TFLOPs: 38.07 | 15: iteration 104590/ 125429 | consumed samples: 26775040 | consumed tokens: 54835281920 | elapsed time per iteration (s): 1.09 | learning rate: 3.222E-05 | global batch size: 256 | lm loss: 1.901813E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.889 | TFLOPs: 38.98 | 15: iteration 104600/ 125429 | consumed samples: 26777600 | consumed tokens: 54840524800 | elapsed time per iteration (s): 1.05 | learning rate: 3.221E-05 | global batch size: 256 | lm loss: 1.889266E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.640 | TFLOPs: 40.43 | 15: iteration 104610/ 125429 | consumed samples: 26780160 | consumed tokens: 54845767680 | elapsed time per iteration (s): 1.05 | learning rate: 3.220E-05 | global batch size: 256 | lm loss: 1.873253E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.391 | TFLOPs: 40.22 | 15: iteration 104620/ 125429 | consumed samples: 26782720 | consumed tokens: 54851010560 | elapsed time per iteration (s): 1.04 | learning rate: 3.219E-05 | global batch size: 256 | lm loss: 1.927212E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.937 | TFLOPs: 40.81 | 15: iteration 104630/ 125429 | 
consumed samples: 26785280 | consumed tokens: 54856253440 | elapsed time per iteration (s): 1.08 | learning rate: 3.218E-05 | global batch size: 256 | lm loss: 1.902859E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.436 | TFLOPs: 39.07 | 15: iteration 104640/ 125429 | consumed samples: 26787840 | consumed tokens: 54861496320 | elapsed time per iteration (s): 1.04 | learning rate: 3.216E-05 | global batch size: 256 | lm loss: 1.900956E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.155 | TFLOPs: 40.84 | 15: iteration 104650/ 125429 | consumed samples: 26790400 | consumed tokens: 54866739200 | elapsed time per iteration (s): 1.06 | learning rate: 3.215E-05 | global batch size: 256 | lm loss: 1.872465E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.610 | TFLOPs: 40.09 | 15: iteration 104660/ 125429 | consumed samples: 26792960 | consumed tokens: 54871982080 | elapsed time per iteration (s): 1.04 | learning rate: 3.214E-05 | global batch size: 256 | lm loss: 1.894832E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.649 | TFLOPs: 40.60 | 15: iteration 104670/ 125429 | consumed samples: 26795520 | consumed tokens: 54877224960 | elapsed time per iteration (s): 1.03 | learning rate: 3.213E-05 | global batch size: 256 | lm loss: 1.914728E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.491 | TFLOPs: 40.90 | 15: iteration 104680/ 125429 | consumed samples: 26798080 | consumed tokens: 54882467840 | elapsed time per iteration (s): 1.04 | learning rate: 3.212E-05 | global batch size: 256 | lm loss: 1.916093E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 246.630 | TFLOPs: 40.76 | 15: iteration 104690/ 125429 | consumed samples: 26800640 | consumed tokens: 54887710720 | elapsed time per iteration (s): 1.02 | learning rate: 3.211E-05 | global batch size: 256 | lm loss: 1.885250E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.301 | TFLOPs: 41.36 | 15: iteration 104700/ 125429 | consumed samples: 26803200 | consumed tokens: 54892953600 | elapsed time per iteration (s): 1.07 | learning rate: 3.210E-05 | global batch size: 256 | lm loss: 1.905983E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.325 | TFLOPs: 39.72 | 15: iteration 104710/ 125429 | consumed samples: 26805760 | consumed tokens: 54898196480 | elapsed time per iteration (s): 1.03 | learning rate: 3.208E-05 | global batch size: 256 | lm loss: 1.915825E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.361 | TFLOPs: 41.04 | 15: iteration 104720/ 125429 | consumed samples: 26808320 | consumed tokens: 54903439360 | elapsed time per iteration (s): 1.03 | learning rate: 3.207E-05 | global batch size: 256 | lm loss: 1.925051E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.814 | TFLOPs: 40.95 | 15: iteration 104730/ 125429 | consumed samples: 26810880 | consumed tokens: 54908682240 | elapsed time per iteration (s): 1.09 | learning rate: 3.206E-05 | global batch size: 256 | lm loss: 1.936572E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.547 | TFLOPs: 38.76 | 15: iteration 104740/ 125429 | consumed samples: 26813440 | consumed tokens: 54913925120 | elapsed time per iteration (s): 1.05 | learning rate: 
3.205E-05 | global batch size: 256 | lm loss: 1.909446E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.345 | TFLOPs: 40.38 | 15: iteration 104750/ 125429 | consumed samples: 26816000 | consumed tokens: 54919168000 | elapsed time per iteration (s): 1.03 | learning rate: 3.204E-05 | global batch size: 256 | lm loss: 1.867241E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.388 | TFLOPs: 41.05 | 15: iteration 104760/ 125429 | consumed samples: 26818560 | consumed tokens: 54924410880 | elapsed time per iteration (s): 1.04 | learning rate: 3.203E-05 | global batch size: 256 | lm loss: 1.895873E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.314 | TFLOPs: 40.54 | 15: iteration 104770/ 125429 | consumed samples: 26821120 | consumed tokens: 54929653760 | elapsed time per iteration (s): 1.07 | learning rate: 3.202E-05 | global batch size: 256 | lm loss: 1.904663E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.669 | TFLOPs: 39.44 | 15: iteration 104780/ 125429 | consumed samples: 26823680 | consumed tokens: 54934896640 | elapsed time per iteration (s): 1.08 | learning rate: 3.200E-05 | global batch size: 256 | lm loss: 1.893612E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.903 | TFLOPs: 39.32 | 15: iteration 104790/ 125429 | consumed samples: 26826240 | consumed tokens: 54940139520 | elapsed time per iteration (s): 1.03 | learning rate: 3.199E-05 | global batch size: 256 | lm loss: 1.905762E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.462 | TFLOPs: 40.90 | 15: iteration 104800/ 125429 | 
consumed samples: 26828800 | consumed tokens: 54945382400 | elapsed time per iteration (s): 1.05 | learning rate: 3.198E-05 | global batch size: 256 | lm loss: 1.901152E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.807 | TFLOPs: 40.13 | 15: iteration 104810/ 125429 | consumed samples: 26831360 | consumed tokens: 54950625280 | elapsed time per iteration (s): 1.04 | learning rate: 3.197E-05 | global batch size: 256 | lm loss: 1.923992E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.931 | TFLOPs: 40.64 | 15: iteration 104820/ 125429 | consumed samples: 26833920 | consumed tokens: 54955868160 | elapsed time per iteration (s): 1.03 | learning rate: 3.196E-05 | global batch size: 256 | lm loss: 1.935836E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.957 | TFLOPs: 41.14 | 15: iteration 104830/ 125429 | consumed samples: 26836480 | consumed tokens: 54961111040 | elapsed time per iteration (s): 1.02 | learning rate: 3.195E-05 | global batch size: 256 | lm loss: 1.885557E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.832 | TFLOPs: 41.45 | 15: iteration 104840/ 125429 | consumed samples: 26839040 | consumed tokens: 54966353920 | elapsed time per iteration (s): 1.02 | learning rate: 3.194E-05 | global batch size: 256 | lm loss: 1.943018E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.237 | TFLOPs: 41.52 | 15: iteration 104850/ 125429 | consumed samples: 26841600 | consumed tokens: 54971596800 | elapsed time per iteration (s): 1.04 | learning rate: 3.193E-05 | global batch size: 256 | lm loss: 1.910621E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 245.818 | TFLOPs: 40.62 | 15: iteration 104860/ 125429 | consumed samples: 26844160 | consumed tokens: 54976839680 | elapsed time per iteration (s): 1.06 | learning rate: 3.191E-05 | global batch size: 256 | lm loss: 1.922118E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.570 | TFLOPs: 40.09 | 15: iteration 104870/ 125429 | consumed samples: 26846720 | consumed tokens: 54982082560 | elapsed time per iteration (s): 1.05 | learning rate: 3.190E-05 | global batch size: 256 | lm loss: 1.887038E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.865 | TFLOPs: 40.47 | 15: iteration 104880/ 125429 | consumed samples: 26849280 | consumed tokens: 54987325440 | elapsed time per iteration (s): 1.07 | learning rate: 3.189E-05 | global batch size: 256 | lm loss: 1.927304E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.787 | TFLOPs: 39.63 | 15: iteration 104890/ 125429 | consumed samples: 26851840 | consumed tokens: 54992568320 | elapsed time per iteration (s): 1.03 | learning rate: 3.188E-05 | global batch size: 256 | lm loss: 1.946202E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.685 | TFLOPs: 40.93 | 15: iteration 104900/ 125429 | consumed samples: 26854400 | consumed tokens: 54997811200 | elapsed time per iteration (s): 1.04 | learning rate: 3.187E-05 | global batch size: 256 | lm loss: 1.892597E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.129 | TFLOPs: 40.84 | 15: iteration 104910/ 125429 | consumed samples: 26856960 | consumed tokens: 55003054080 | elapsed time per iteration (s): 1.03 | learning rate: 
3.186E-05 | global batch size: 256 | lm loss: 1.894971E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.114 | TFLOPs: 41.00 | 15: iteration 104920/ 125429 | consumed samples: 26859520 | consumed tokens: 55008296960 | elapsed time per iteration (s): 1.02 | learning rate: 3.185E-05 | global batch size: 256 | lm loss: 1.899186E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.686 | TFLOPs: 41.59 | 15: iteration 104930/ 125429 | consumed samples: 26862080 | consumed tokens: 55013539840 | elapsed time per iteration (s): 1.03 | learning rate: 3.183E-05 | global batch size: 256 | lm loss: 1.934098E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.136 | TFLOPs: 41.01 | 15: iteration 104940/ 125429 | consumed samples: 26864640 | consumed tokens: 55018782720 | elapsed time per iteration (s): 1.05 | learning rate: 3.182E-05 | global batch size: 256 | lm loss: 1.877416E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.404 | TFLOPs: 40.39 | 15: iteration 104950/ 125429 | consumed samples: 26867200 | consumed tokens: 55024025600 | elapsed time per iteration (s): 1.06 | learning rate: 3.181E-05 | global batch size: 256 | lm loss: 1.894345E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.532 | TFLOPs: 39.75 | 15: iteration 104960/ 125429 | consumed samples: 26869760 | consumed tokens: 55029268480 | elapsed time per iteration (s): 1.05 | learning rate: 3.180E-05 | global batch size: 256 | lm loss: 1.916490E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.563 | TFLOPs: 40.25 | 15: iteration 104970/ 125429 | 
consumed samples: 26872320 | consumed tokens: 55034511360 | elapsed time per iteration (s): 1.04 | learning rate: 3.179E-05 | global batch size: 256 | lm loss: 1.887764E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.040 | TFLOPs: 40.83 | 15: iteration 104980/ 125429 | consumed samples: 26874880 | consumed tokens: 55039754240 | elapsed time per iteration (s): 1.07 | learning rate: 3.178E-05 | global batch size: 256 | lm loss: 1.915560E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.301 | TFLOPs: 39.71 | 15: iteration 104990/ 125429 | consumed samples: 26877440 | consumed tokens: 55044997120 | elapsed time per iteration (s): 1.10 | learning rate: 3.177E-05 | global batch size: 256 | lm loss: 1.919539E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.885 | TFLOPs: 38.49 | 15: iteration 105000/ 125429 | consumed samples: 26880000 | consumed tokens: 55050240000 | elapsed time per iteration (s): 1.02 | learning rate: 3.176E-05 | global batch size: 256 | lm loss: 1.879648E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.499 | TFLOPs: 41.56 | 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 105000 | lm loss value: 1.811885E+00 | lm loss PPL: 6.121974E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 105000 to checkpoints_1b5 0: [2022-11-27 03:06:50,319] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step105000 is begin to save! 
0: [2022-11-27 03:06:50,344] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_01-model_00-model_states.pt... 0: [2022-11-27 03:06:50,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_01-model_00-model_states.pt. 0: [2022-11-27 03:06:50,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_03-model_00-model_states.pt... 0: [2022-11-27 03:06:50,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_03-model_00-model_states.pt. 0: [2022-11-27 03:06:50,678] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_04-model_00-model_states.pt... 0: [2022-11-27 03:06:50,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_04-model_00-model_states.pt. 0: [2022-11-27 03:06:50,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_05-model_00-model_states.pt... 0: [2022-11-27 03:06:50,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_05-model_00-model_states.pt. 0: [2022-11-27 03:06:50,886] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_06-model_00-model_states.pt... 0: [2022-11-27 03:06:50,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_06-model_00-model_states.pt. 0: [2022-11-27 03:06:50,990] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_07-model_00-model_states.pt... 0: [2022-11-27 03:06:51,094] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_07-model_00-model_states.pt. 
0: [2022-11-27 03:06:51,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_08-model_00-model_states.pt... 0: [2022-11-27 03:06:51,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_08-model_00-model_states.pt. 0: [2022-11-27 03:06:51,197] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_09-model_00-model_states.pt... 0: [2022-11-27 03:06:51,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_09-model_00-model_states.pt. 0: [2022-11-27 03:06:51,304] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_10-model_00-model_states.pt... 0: [2022-11-27 03:06:51,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_10-model_00-model_states.pt. 0: [2022-11-27 03:06:51,407] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_11-model_00-model_states.pt... 0: [2022-11-27 03:06:51,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_11-model_00-model_states.pt. 0: [2022-11-27 03:06:51,511] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_12-model_00-model_states.pt... 0: [2022-11-27 03:06:51,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_12-model_00-model_states.pt. 0: [2022-11-27 03:06:51,614] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_13-model_00-model_states.pt... 0: [2022-11-27 03:06:51,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_13-model_00-model_states.pt. 
0: [2022-11-27 03:06:51,726] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_14-model_00-model_states.pt... 0: [2022-11-27 03:06:51,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_14-model_00-model_states.pt. 0: [2022-11-27 03:06:51,836] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_15-model_00-model_states.pt... 0: [2022-11-27 03:06:51,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_15-model_00-model_states.pt. 0: [2022-11-27 03:06:51,944] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_16-model_00-model_states.pt... 0: [2022-11-27 03:06:52,052] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_16-model_00-model_states.pt. 0: [2022-11-27 03:06:52,052] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_17-model_00-model_states.pt... 0: [2022-11-27 03:06:52,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_17-model_00-model_states.pt. 0: [2022-11-27 03:06:52,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_18-model_00-model_states.pt... 0: [2022-11-27 03:06:52,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_18-model_00-model_states.pt. 0: [2022-11-27 03:06:52,273] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_19-model_00-model_states.pt... 0: [2022-11-27 03:06:52,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_19-model_00-model_states.pt. 
0: [2022-11-27 03:06:52,378] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_20-model_00-model_states.pt... 0: [2022-11-27 03:06:52,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_20-model_00-model_states.pt. 0: [2022-11-27 03:06:52,487] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_21-model_00-model_states.pt... 0: [2022-11-27 03:06:52,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_21-model_00-model_states.pt. 0: [2022-11-27 03:06:52,594] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_22-model_00-model_states.pt... 0: [2022-11-27 03:06:52,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_22-model_00-model_states.pt. 0: [2022-11-27 03:06:52,706] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_23-model_00-model_states.pt... 0: [2022-11-27 03:06:52,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_23-model_00-model_states.pt. 0: [2022-11-27 03:06:52,815] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_24-model_00-model_states.pt... 0: [2022-11-27 03:06:52,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_24-model_00-model_states.pt. 0: [2022-11-27 03:06:52,921] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_25-model_00-model_states.pt... 0: [2022-11-27 03:06:53,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_25-model_00-model_states.pt. 
0: [2022-11-27 03:06:53,031] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_26-model_00-model_states.pt... 0: [2022-11-27 03:06:53,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_26-model_00-model_states.pt. 0: [2022-11-27 03:06:53,143] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_27-model_00-model_states.pt... 0: [2022-11-27 03:06:53,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_27-model_00-model_states.pt. 0: [2022-11-27 03:06:53,247] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_28-model_00-model_states.pt... 0: [2022-11-27 03:06:53,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_28-model_00-model_states.pt. 0: [2022-11-27 03:06:53,350] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_29-model_00-model_states.pt... 0: [2022-11-27 03:06:53,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_29-model_00-model_states.pt. 0: [2022-11-27 03:06:53,458] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_30-model_00-model_states.pt... 0: [2022-11-27 03:06:53,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_30-model_00-model_states.pt. 0: [2022-11-27 03:06:53,563] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/layer_32-model_00-model_states.pt... 0: [2022-11-27 03:06:53,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/layer_32-model_00-model_states.pt. 
0: [2022-11-27 03:06:53,569] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step105000/mp_rank_00_model_states.pt 0: [2022-11-27 03:06:53,569] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/mp_rank_00_model_states.pt... 0: [2022-11-27 03:06:53,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/mp_rank_00_model_states.pt. 0: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
1: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
5: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
9: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
12: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
8: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
10: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
8: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:06:53,611] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step105000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:06:53,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
5: [2022-11-27 03:06:53,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:06:53,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 03:06:53,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 5: [2022-11-27 03:06:53,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:06:53,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 03:06:53,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 1: [2022-11-27 03:06:53,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:06:53,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:06:53,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 03:06:53,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 7: [2022-11-27 03:06:53,781] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:06:53,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-27 03:06:53,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 03:06:53,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 14: [2022-11-27 03:06:53,774] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:06:53,774] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 03:06:53,774] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 5: [2022-11-27 03:06:53,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:06:53,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 03:06:53,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 6: [2022-11-27 03:06:53,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:06:53,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 03:06:53,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 6: [2022-11-27 03:06:53,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-27 03:06:53,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 03:06:53,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 7: [2022-11-27 03:06:53,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 03:06:53,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 6: [2022-11-27 03:06:53,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:06:53,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 03:06:53,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 5: [2022-11-27 03:06:53,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:06:53,788] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 03:06:53,788] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 8: [2022-11-27 03:06:53,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-27 03:06:53,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 03:06:53,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 6: [2022-11-27 03:06:53,791] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:06:53,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 03:06:53,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 6: [2022-11-27 03:06:53,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:06:53,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 03:06:53,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 0: [2022-11-27 03:06:53,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:06:53,792] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 03:06:53,792] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 0: [2022-11-27 03:06:53,792] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-27 03:06:53,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 03:06:53,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 0: [2022-11-27 03:06:53,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:06:53,793] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 03:06:53,793] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 7: [2022-11-27 03:06:53,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:06:53,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 03:06:53,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 0: [2022-11-27 03:06:53,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:06:53,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 03:06:53,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 5: [2022-11-27 03:06:53,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-27 03:06:53,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:06:53,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 03:06:53,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 03:06:53,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 5: [2022-11-27 03:06:53,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 7: [2022-11-27 03:06:53,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:06:53,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 03:06:53,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 6: [2022-11-27 03:06:53,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:06:53,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 03:06:53,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
1: [2022-11-27 03:06:53,781] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 03:06:53,781] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 1: [2022-11-27 03:06:53,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:06:53,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 03:06:53,786] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 1: [2022-11-27 03:06:53,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:06:53,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:06:53,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 03:06:53,795] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 03:06:53,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 1: [2022-11-27 03:06:53,795] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 14: [2022-11-27 03:06:53,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-27 03:06:53,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 03:06:53,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 14: [2022-11-27 03:06:53,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:06:53,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 03:06:53,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 14: [2022-11-27 03:06:53,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:06:53,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 03:06:53,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 7: [2022-11-27 03:06:53,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:06:53,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 03:06:53,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 2: [2022-11-27 03:06:53,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-27 03:06:53,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 03:06:53,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 2: [2022-11-27 03:06:53,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:06:53,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 03:06:53,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 2: [2022-11-27 03:06:53,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:06:53,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:06:53,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:06:53,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
2: [2022-11-27 03:06:53,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 7: [2022-11-27 03:06:53,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 2: [2022-11-27 03:06:53,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:06:53,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 03:06:53,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 7: [2022-11-27 03:06:53,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 2: [2022-11-27 03:06:53,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 2: [2022-11-27 03:06:53,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 03:06:53,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 7: [2022-11-27 03:06:53,801] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 03:06:53,801] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 14: [2022-11-27 03:06:53,803] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-27 03:06:53,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 03:06:53,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 7: [2022-11-27 03:06:53,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:06:53,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 03:06:53,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 4: [2022-11-27 03:06:53,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:06:53,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 03:06:53,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 10: [2022-11-27 03:06:53,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:06:53,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-27 03:06:53,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 03:06:53,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 03:06:53,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 10: [2022-11-27 03:06:53,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 10: [2022-11-27 03:06:53,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:06:53,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 03:06:53,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 2: [2022-11-27 03:06:53,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:06:53,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:06:53,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 03:06:53,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 03:06:53,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
2: [2022-11-27 03:06:53,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 1: [2022-11-27 03:06:53,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:06:53,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:06:53,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 03:06:53,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 2: [2022-11-27 03:06:53,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:06:53,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 03:06:53,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 5: [2022-11-27 03:06:53,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:06:53,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 03:06:53,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
15: [2022-11-27 03:06:53,772] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 03:06:53,772] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 15: [2022-11-27 03:06:53,773] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:06:53,773] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 03:06:53,773] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 15: [2022-11-27 03:06:53,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:06:53,782] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 03:06:53,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 15: [2022-11-27 03:06:53,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:06:53,783] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 03:06:53,783] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 15: [2022-11-27 03:06:53,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-27 03:06:53,790] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 03:06:53,790] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 15: [2022-11-27 03:06:53,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:06:53,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 03:06:53,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 15: [2022-11-27 03:06:53,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:06:53,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 03:06:53,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 15: [2022-11-27 03:06:53,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:06:53,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 03:06:53,809] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 14: [2022-11-27 03:06:53,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-27 03:06:53,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 03:06:53,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 14: [2022-11-27 03:06:53,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:06:53,804] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 03:06:53,804] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 8: [2022-11-27 03:06:53,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:06:53,809] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 03:06:53,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 8: [2022-11-27 03:06:53,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:06:53,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 03:06:53,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 8: [2022-11-27 03:06:53,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-27 03:06:53,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 03:06:53,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 8: [2022-11-27 03:06:53,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:06:53,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 03:06:53,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 8: [2022-11-27 03:06:53,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:06:53,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 03:06:53,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 8: [2022-11-27 03:06:53,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:06:53,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 03:06:53,828] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 8: [2022-11-27 03:06:53,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-27 03:06:53,828] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 03:06:53,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 4: [2022-11-27 03:06:53,816] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:06:53,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 03:06:53,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 4: [2022-11-27 03:06:53,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:06:53,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 03:06:53,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 4: [2022-11-27 03:06:53,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:06:53,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 03:06:53,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 4: [2022-11-27 03:06:53,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 
4: [2022-11-27 03:06:53,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 03:06:53,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 4: [2022-11-27 03:06:53,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:06:53,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 03:06:53,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 4: [2022-11-27 03:06:53,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:06:53,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 03:06:53,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 4: [2022-11-27 03:06:53,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:06:53,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 03:06:53,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 10: [2022-11-27 03:06:53,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-27 03:06:53,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 03:06:53,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 0: [2022-11-27 03:06:53,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:06:53,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 03:06:53,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 0: [2022-11-27 03:06:53,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:06:53,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:06:53,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 03:06:53,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 1: [2022-11-27 03:06:53,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 03:06:53,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 1: [2022-11-27 03:06:53,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-27 03:06:53,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 03:06:53,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 1: [2022-11-27 03:06:53,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:06:53,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 03:06:53,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 10: [2022-11-27 03:06:53,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:06:53,846] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 03:06:53,846] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 10: [2022-11-27 03:06:53,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:06:53,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 03:06:53,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 10: [2022-11-27 03:06:53,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-27 03:06:53,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 03:06:53,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 10: [2022-11-27 03:06:53,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:06:53,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 03:06:53,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 1: [2022-11-27 03:06:53,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:06:53,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 03:06:53,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 0: [2022-11-27 03:06:53,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:06:53,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 03:06:53,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 6: [2022-11-27 03:06:53,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-27 03:06:53,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 03:06:53,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 6: [2022-11-27 03:06:53,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:06:53,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 03:06:53,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 3: [2022-11-27 03:06:53,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:06:53,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:06:53,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:06:53,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:06:53,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:06:53,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:06:53,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-27 03:06:53,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:06:53,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 03:06:53,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 03:06:53,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 03:06:53,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 03:06:53,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 03:06:53,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 03:06:53,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 03:06:53,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 03:06:53,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 3: [2022-11-27 03:06:53,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
3: [2022-11-27 03:06:53,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 3: [2022-11-27 03:06:53,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 3: [2022-11-27 03:06:53,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 3: [2022-11-27 03:06:53,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 3: [2022-11-27 03:06:53,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 3: [2022-11-27 03:06:53,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 0: [2022-11-27 03:06:53,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 03:06:53,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 12: [2022-11-27 03:06:54,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:06:54,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:06:54,000] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-27 03:06:54,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 03:06:54,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 03:06:54,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 03:06:54,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 12: [2022-11-27 03:06:54,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 12: [2022-11-27 03:06:54,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 12: [2022-11-27 03:06:54,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:06:54,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 03:06:54,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 12: [2022-11-27 03:06:54,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:06:54,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 03:06:54,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
12: [2022-11-27 03:06:54,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:06:54,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 03:06:54,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 12: [2022-11-27 03:06:54,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:06:54,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:06:54,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 03:06:54,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 12: [2022-11-27 03:06:54,034] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 03:06:54,034] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 11: [2022-11-27 03:06:54,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:06:54,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:06:54,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-27 03:06:54,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:06:54,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:06:54,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:06:54,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 03:06:54,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 03:06:54,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:06:54,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 03:06:54,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 03:06:54,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 03:06:54,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 11: [2022-11-27 03:06:54,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 11: [2022-11-27 03:06:54,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
11: [2022-11-27 03:06:54,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 11: [2022-11-27 03:06:54,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 03:06:54,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 11: [2022-11-27 03:06:54,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 03:06:54,050] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:06:54,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 11: [2022-11-27 03:06:54,050] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 11: [2022-11-27 03:06:54,050] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 03:06:54,051] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 9: [2022-11-27 03:06:54,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:06:54,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:06:54,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-27 03:06:54,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 03:06:54,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 03:06:54,102] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 03:06:54,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 9: [2022-11-27 03:06:54,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:06:54,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 9: [2022-11-27 03:06:54,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 9: [2022-11-27 03:06:54,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 03:06:54,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 9: [2022-11-27 03:06:54,104] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:06:54,104] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 03:06:54,105] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
9: [2022-11-27 03:06:54,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:06:54,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 03:06:54,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 9: [2022-11-27 03:06:54,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:06:54,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 03:06:54,106] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 9: [2022-11-27 03:06:54,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:06:54,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 03:06:54,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 13: [2022-11-27 03:06:54,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:06:54,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-27 03:06:54,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 03:06:54,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 03:06:54,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:06:54,135] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:06:54,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 13: [2022-11-27 03:06:54,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 13: [2022-11-27 03:06:54,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 03:06:54,135] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 03:06:54,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 13: [2022-11-27 03:06:54,135] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 13: [2022-11-27 03:06:54,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:06:54,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-27 03:06:54,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:06:54,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:06:54,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 03:06:54,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 03:06:54,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 13: [2022-11-27 03:06:54,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 03:06:54,136] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step105000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 03:06:54,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 13: [2022-11-27 03:06:54,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 13: [2022-11-27 03:06:54,136] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step105000 is ready now! 
0: successfully saved checkpoint at iteration 105000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3857.84 15: iteration 105010/ 125429 | consumed samples: 26882560 | consumed tokens: 55055482880 | elapsed time per iteration (s): 1.50 | learning rate: 3.174E-05 | global batch size: 256 | lm loss: 1.921671E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 170.337 | TFLOPs: 28.15 | 15: iteration 105020/ 125429 | consumed samples: 26885120 | consumed tokens: 55060725760 | elapsed time per iteration (s): 1.02 | learning rate: 3.173E-05 | global batch size: 256 | lm loss: 1.904928E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.278 | TFLOPs: 41.36 | 15: iteration 105030/ 125429 | consumed samples: 26887680 | consumed tokens: 55065968640 | elapsed time per iteration (s): 1.03 | learning rate: 3.172E-05 | global batch size: 256 | lm loss: 1.903893E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.980 | TFLOPs: 41.15 | 15: iteration 105040/ 125429 | consumed samples: 26890240 | consumed tokens: 55071211520 | elapsed time per iteration (s): 1.02 | learning rate: 3.171E-05 | global batch size: 256 | lm loss: 1.889471E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.877 | TFLOPs: 41.29 | 15: iteration 105050/ 125429 | consumed samples: 26892800 | consumed tokens: 55076454400 | elapsed time per iteration (s): 1.04 | learning rate: 3.170E-05 | global batch size: 256 | lm loss: 1.897440E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.963 | TFLOPs: 40.81 | 15: iteration 105060/ 125429 | consumed samples: 26895360 | consumed tokens: 55081697280 | elapsed time per iteration (s): 
1.04 | learning rate: 3.169E-05 | global batch size: 256 | lm loss: 1.919934E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.632 | TFLOPs: 40.76 | 15: iteration 105070/ 125429 | consumed samples: 26897920 | consumed tokens: 55086940160 | elapsed time per iteration (s): 1.03 | learning rate: 3.168E-05 | global batch size: 256 | lm loss: 1.906818E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.661 | TFLOPs: 41.26 | 15: iteration 105080/ 125429 | consumed samples: 26900480 | consumed tokens: 55092183040 | elapsed time per iteration (s): 1.05 | learning rate: 3.167E-05 | global batch size: 256 | lm loss: 1.907953E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.381 | TFLOPs: 40.39 | 15: iteration 105090/ 125429 | consumed samples: 26903040 | consumed tokens: 55097425920 | elapsed time per iteration (s): 1.02 | learning rate: 3.165E-05 | global batch size: 256 | lm loss: 1.906850E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.170 | TFLOPs: 41.34 | 15: iteration 105100/ 125429 | consumed samples: 26905600 | consumed tokens: 55102668800 | elapsed time per iteration (s): 1.03 | learning rate: 3.164E-05 | global batch size: 256 | lm loss: 1.879906E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.420 | TFLOPs: 41.05 | 15: iteration 105110/ 125429 | consumed samples: 26908160 | consumed tokens: 55107911680 | elapsed time per iteration (s): 1.03 | learning rate: 3.163E-05 | global batch size: 256 | lm loss: 1.888643E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.667 | TFLOPs: 41.09 | 15: 
iteration 105120/ 125429 | consumed samples: 26910720 | consumed tokens: 55113154560 | elapsed time per iteration (s): 1.03 | learning rate: 3.162E-05 | global batch size: 256 | lm loss: 1.911103E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.747 | TFLOPs: 41.27 | 15: iteration 105130/ 125429 | consumed samples: 26913280 | consumed tokens: 55118397440 | elapsed time per iteration (s): 1.05 | learning rate: 3.161E-05 | global batch size: 256 | lm loss: 1.919606E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.884 | TFLOPs: 40.14 | 15: iteration 105140/ 125429 | consumed samples: 26915840 | consumed tokens: 55123640320 | elapsed time per iteration (s): 1.05 | learning rate: 3.160E-05 | global batch size: 256 | lm loss: 1.898735E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.426 | TFLOPs: 40.39 | 15: iteration 105150/ 125429 | consumed samples: 26918400 | consumed tokens: 55128883200 | elapsed time per iteration (s): 1.04 | learning rate: 3.159E-05 | global batch size: 256 | lm loss: 1.862062E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.705 | TFLOPs: 40.60 | 15: iteration 105160/ 125429 | consumed samples: 26920960 | consumed tokens: 55134126080 | elapsed time per iteration (s): 1.03 | learning rate: 3.158E-05 | global batch size: 256 | lm loss: 1.905772E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.278 | TFLOPs: 41.03 | 15: iteration 105170/ 125429 | consumed samples: 26923520 | consumed tokens: 55139368960 | elapsed time per iteration (s): 1.04 | learning rate: 3.157E-05 | global batch size: 256 | lm loss: 1.912991E+00 | grad norm: 0.167 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.008 | TFLOPs: 40.65 | 15: iteration 105180/ 125429 | consumed samples: 26926080 | consumed tokens: 55144611840 | elapsed time per iteration (s): 1.05 | learning rate: 3.155E-05 | global batch size: 256 | lm loss: 1.922117E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.803 | TFLOPs: 40.29 | 15: iteration 105190/ 125429 | consumed samples: 26928640 | consumed tokens: 55149854720 | elapsed time per iteration (s): 1.05 | learning rate: 3.154E-05 | global batch size: 256 | lm loss: 1.908205E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.804 | TFLOPs: 40.13 | 15: iteration 105200/ 125429 | consumed samples: 26931200 | consumed tokens: 55155097600 | elapsed time per iteration (s): 1.06 | learning rate: 3.153E-05 | global batch size: 256 | lm loss: 1.903260E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.735 | TFLOPs: 39.95 | 15: iteration 105210/ 125429 | consumed samples: 26933760 | consumed tokens: 55160340480 | elapsed time per iteration (s): 1.19 | learning rate: 3.152E-05 | global batch size: 256 | lm loss: 1.901451E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.582 | TFLOPs: 35.46 | 15: iteration 105220/ 125429 | consumed samples: 26936320 | consumed tokens: 55165583360 | elapsed time per iteration (s): 1.03 | learning rate: 3.151E-05 | global batch size: 256 | lm loss: 1.929051E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.430 | TFLOPs: 41.05 | 15: iteration 105230/ 125429 | consumed samples: 26938880 | consumed tokens: 55170826240 | elapsed time per iteration (s): 1.05 | 
learning rate: 3.150E-05 | global batch size: 256 | lm loss: 1.923267E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.760 | TFLOPs: 40.28 | 15: iteration 105240/ 125429 | consumed samples: 26941440 | consumed tokens: 55176069120 | elapsed time per iteration (s): 1.03 | learning rate: 3.149E-05 | global batch size: 256 | lm loss: 1.913152E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.420 | TFLOPs: 40.89 | 15: iteration 105250/ 125429 | consumed samples: 26944000 | consumed tokens: 55181312000 | elapsed time per iteration (s): 1.08 | learning rate: 3.148E-05 | global batch size: 256 | lm loss: 1.914881E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.622 | TFLOPs: 39.10 | 15: iteration 105260/ 125429 | consumed samples: 26946560 | consumed tokens: 55186554880 | elapsed time per iteration (s): 1.04 | learning rate: 3.146E-05 | global batch size: 256 | lm loss: 1.894041E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.100 | TFLOPs: 40.67 | 15: iteration 105270/ 125429 | consumed samples: 26949120 | consumed tokens: 55191797760 | elapsed time per iteration (s): 1.04 | learning rate: 3.145E-05 | global batch size: 256 | lm loss: 1.899857E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.277 | TFLOPs: 40.86 | 15: iteration 105280/ 125429 | consumed samples: 26951680 | consumed tokens: 55197040640 | elapsed time per iteration (s): 1.04 | learning rate: 3.144E-05 | global batch size: 256 | lm loss: 1.880498E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.535 | TFLOPs: 40.58 | 15: iteration 
105290/ 125429 | consumed samples: 26954240 | consumed tokens: 55202283520 | elapsed time per iteration (s): 1.04 | learning rate: 3.143E-05 | global batch size: 256 | lm loss: 1.907796E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.895 | TFLOPs: 40.64 | 15: iteration 105300/ 125429 | consumed samples: 26956800 | consumed tokens: 55207526400 | elapsed time per iteration (s): 1.04 | learning rate: 3.142E-05 | global batch size: 256 | lm loss: 1.915893E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.829 | TFLOPs: 40.63 | 15: iteration 105310/ 125429 | consumed samples: 26959360 | consumed tokens: 55212769280 | elapsed time per iteration (s): 1.04 | learning rate: 3.141E-05 | global batch size: 256 | lm loss: 1.893932E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.387 | TFLOPs: 40.55 | 15: iteration 105320/ 125429 | consumed samples: 26961920 | consumed tokens: 55218012160 | elapsed time per iteration (s): 1.05 | learning rate: 3.140E-05 | global batch size: 256 | lm loss: 1.872635E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.566 | TFLOPs: 40.42 | 15: iteration 105330/ 125429 | consumed samples: 26964480 | consumed tokens: 55223255040 | elapsed time per iteration (s): 1.06 | learning rate: 3.139E-05 | global batch size: 256 | lm loss: 1.872626E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.505 | TFLOPs: 39.91 | 15: iteration 105340/ 125429 | consumed samples: 26967040 | consumed tokens: 55228497920 | elapsed time per iteration (s): 1.03 | learning rate: 3.138E-05 | global batch size: 256 | lm loss: 1.913810E+00 | grad norm: 0.160 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.695 | TFLOPs: 41.10 | 15: iteration 105350/ 125429 | consumed samples: 26969600 | consumed tokens: 55233740800 | elapsed time per iteration (s): 1.11 | learning rate: 3.137E-05 | global batch size: 256 | lm loss: 1.899372E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.289 | TFLOPs: 38.22 | 15: iteration 105360/ 125429 | consumed samples: 26972160 | consumed tokens: 55238983680 | elapsed time per iteration (s): 1.02 | learning rate: 3.135E-05 | global batch size: 256 | lm loss: 1.905655E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.748 | TFLOPs: 41.44 | 15: iteration 105370/ 125429 | consumed samples: 26974720 | consumed tokens: 55244226560 | elapsed time per iteration (s): 1.04 | learning rate: 3.134E-05 | global batch size: 256 | lm loss: 1.898467E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.030 | TFLOPs: 40.49 | 15: iteration 105380/ 125429 | consumed samples: 26977280 | consumed tokens: 55249469440 | elapsed time per iteration (s): 1.05 | learning rate: 3.133E-05 | global batch size: 256 | lm loss: 1.904823E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.164 | TFLOPs: 40.18 | 15: iteration 105390/ 125429 | consumed samples: 26979840 | consumed tokens: 55254712320 | elapsed time per iteration (s): 1.02 | learning rate: 3.132E-05 | global batch size: 256 | lm loss: 1.914095E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.311 | TFLOPs: 41.37 | 15: iteration 105400/ 125429 | consumed samples: 26982400 | consumed tokens: 55259955200 | elapsed time per iteration (s): 1.04 | learning 
rate: 3.131E-05 | global batch size: 256 | lm loss: 1.910981E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.048 | TFLOPs: 40.50 | 15: iteration 105410/ 125429 | consumed samples: 26984960 | consumed tokens: 55265198080 | elapsed time per iteration (s): 1.04 | learning rate: 3.130E-05 | global batch size: 256 | lm loss: 1.912465E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.842 | TFLOPs: 40.63 | 15: iteration 105420/ 125429 | consumed samples: 26987520 | consumed tokens: 55270440960 | elapsed time per iteration (s): 1.05 | learning rate: 3.129E-05 | global batch size: 256 | lm loss: 1.915258E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.153 | TFLOPs: 40.35 | 15: iteration 105430/ 125429 | consumed samples: 26990080 | consumed tokens: 55275683840 | elapsed time per iteration (s): 1.04 | learning rate: 3.128E-05 | global batch size: 256 | lm loss: 1.917732E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.938 | TFLOPs: 40.81 | 15: iteration 105440/ 125429 | consumed samples: 26992640 | consumed tokens: 55280926720 | elapsed time per iteration (s): 1.05 | learning rate: 3.127E-05 | global batch size: 256 | lm loss: 1.898545E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.530 | TFLOPs: 40.25 | 15: iteration 105450/ 125429 | consumed samples: 26995200 | consumed tokens: 55286169600 | elapsed time per iteration (s): 1.05 | learning rate: 3.125E-05 | global batch size: 256 | lm loss: 1.892959E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.335 | TFLOPs: 40.38 | 15: iteration 105460/ 
125429 | consumed samples: 26997760 | consumed tokens: 55291412480 | elapsed time per iteration (s): 1.04 | learning rate: 3.124E-05 | global batch size: 256 | lm loss: 1.932426E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.254 | TFLOPs: 40.53 | 15: iteration 105470/ 125429 | consumed samples: 27000320 | consumed tokens: 55296655360 | elapsed time per iteration (s): 1.04 | learning rate: 3.123E-05 | global batch size: 256 | lm loss: 1.905280E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.845 | TFLOPs: 40.63 | 15: iteration 105480/ 125429 | consumed samples: 27002880 | consumed tokens: 55301898240 | elapsed time per iteration (s): 1.03 | learning rate: 3.122E-05 | global batch size: 256 | lm loss: 1.898387E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.962 | TFLOPs: 40.98 | 15: iteration 105490/ 125429 | consumed samples: 27005440 | consumed tokens: 55307141120 | elapsed time per iteration (s): 1.05 | learning rate: 3.121E-05 | global batch size: 256 | lm loss: 1.906340E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.322 | TFLOPs: 40.38 | 15: iteration 105500/ 125429 | consumed samples: 27008000 | consumed tokens: 55312384000 | elapsed time per iteration (s): 1.07 | learning rate: 3.120E-05 | global batch size: 256 | lm loss: 1.900931E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.574 | TFLOPs: 39.59 | 15: iteration 105510/ 125429 | consumed samples: 27010560 | consumed tokens: 55317626880 | elapsed time per iteration (s): 1.03 | learning rate: 3.119E-05 | global batch size: 256 | lm loss: 1.883319E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 249.074 | TFLOPs: 41.16 | 15: iteration 105520/ 125429 | consumed samples: 27013120 | consumed tokens: 55322869760 | elapsed time per iteration (s): 1.04 | learning rate: 3.118E-05 | global batch size: 256 | lm loss: 1.923182E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.485 | TFLOPs: 40.57 | 15: iteration 105530/ 125429 | consumed samples: 27015680 | consumed tokens: 55328112640 | elapsed time per iteration (s): 1.03 | learning rate: 3.117E-05 | global batch size: 256 | lm loss: 1.880567E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.664 | TFLOPs: 41.26 | 15: iteration 105540/ 125429 | consumed samples: 27018240 | consumed tokens: 55333355520 | elapsed time per iteration (s): 1.18 | learning rate: 3.116E-05 | global batch size: 256 | lm loss: 1.922120E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.255 | TFLOPs: 35.90 | 15: iteration 105550/ 125429 | consumed samples: 27020800 | consumed tokens: 55338598400 | elapsed time per iteration (s): 1.03 | learning rate: 3.114E-05 | global batch size: 256 | lm loss: 1.885126E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.477 | TFLOPs: 40.90 | 15: iteration 105560/ 125429 | consumed samples: 27023360 | consumed tokens: 55343841280 | elapsed time per iteration (s): 1.05 | learning rate: 3.113E-05 | global batch size: 256 | lm loss: 1.904675E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.226 | TFLOPs: 40.36 | 15: iteration 105570/ 125429 | consumed samples: 27025920 | consumed tokens: 55349084160 | elapsed time per iteration (s): 1.04 | learning rate: 
3.112E-05 | global batch size: 256 | lm loss: 1.910846E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.902 | TFLOPs: 40.64 | 15: iteration 105580/ 125429 | consumed samples: 27028480 | consumed tokens: 55354327040 | elapsed time per iteration (s): 1.03 | learning rate: 3.111E-05 | global batch size: 256 | lm loss: 1.932333E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.682 | TFLOPs: 40.93 | 15: iteration 105590/ 125429 | consumed samples: 27031040 | consumed tokens: 55359569920 | elapsed time per iteration (s): 1.02 | learning rate: 3.110E-05 | global batch size: 256 | lm loss: 1.921040E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.268 | TFLOPs: 41.36 | 15: iteration 105600/ 125429 | consumed samples: 27033600 | consumed tokens: 55364812800 | elapsed time per iteration (s): 1.05 | learning rate: 3.109E-05 | global batch size: 256 | lm loss: 1.889619E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.145 | TFLOPs: 40.35 | 15: iteration 105610/ 125429 | consumed samples: 27036160 | consumed tokens: 55370055680 | elapsed time per iteration (s): 1.03 | learning rate: 3.108E-05 | global batch size: 256 | lm loss: 1.920670E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.666 | TFLOPs: 41.09 | 15: iteration 105620/ 125429 | consumed samples: 27038720 | consumed tokens: 55375298560 | elapsed time per iteration (s): 1.04 | learning rate: 3.107E-05 | global batch size: 256 | lm loss: 1.904588E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.651 | TFLOPs: 40.76 | 15: iteration 105630/ 125429 | 
consumed samples: 27041280 | consumed tokens: 55380541440 | elapsed time per iteration (s): 1.06 | learning rate: 3.106E-05 | global batch size: 256 | lm loss: 1.901883E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.778 | TFLOPs: 39.96 | 15: iteration 105640/ 125429 | consumed samples: 27043840 | consumed tokens: 55385784320 | elapsed time per iteration (s): 1.08 | learning rate: 3.105E-05 | global batch size: 256 | lm loss: 1.902985E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.787 | TFLOPs: 39.30 | 15: iteration 105650/ 125429 | consumed samples: 27046400 | consumed tokens: 55391027200 | elapsed time per iteration (s): 1.03 | learning rate: 3.104E-05 | global batch size: 256 | lm loss: 1.903042E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.256 | TFLOPs: 41.19 | 15: iteration 105660/ 125429 | consumed samples: 27048960 | consumed tokens: 55396270080 | elapsed time per iteration (s): 1.04 | learning rate: 3.102E-05 | global batch size: 256 | lm loss: 1.898786E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.778 | TFLOPs: 40.62 | 15: iteration 105670/ 125429 | consumed samples: 27051520 | consumed tokens: 55401512960 | elapsed time per iteration (s): 1.03 | learning rate: 3.101E-05 | global batch size: 256 | lm loss: 1.928547E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.551 | TFLOPs: 41.24 | 15: iteration 105680/ 125429 | consumed samples: 27054080 | consumed tokens: 55406755840 | elapsed time per iteration (s): 1.10 | learning rate: 3.100E-05 | global batch size: 256 | lm loss: 1.905344E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 233.229 | TFLOPs: 38.54 | 15: iteration 105690/ 125429 | consumed samples: 27056640 | consumed tokens: 55411998720 | elapsed time per iteration (s): 1.05 | learning rate: 3.099E-05 | global batch size: 256 | lm loss: 1.902484E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.710 | TFLOPs: 40.27 | 15: iteration 105700/ 125429 | consumed samples: 27059200 | consumed tokens: 55417241600 | elapsed time per iteration (s): 1.05 | learning rate: 3.098E-05 | global batch size: 256 | lm loss: 1.896605E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.753 | TFLOPs: 40.28 | 15: iteration 105710/ 125429 | consumed samples: 27061760 | consumed tokens: 55422484480 | elapsed time per iteration (s): 1.02 | learning rate: 3.097E-05 | global batch size: 256 | lm loss: 1.912814E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.458 | TFLOPs: 41.39 | 15: iteration 105720/ 125429 | consumed samples: 27064320 | consumed tokens: 55427727360 | elapsed time per iteration (s): 1.04 | learning rate: 3.096E-05 | global batch size: 256 | lm loss: 1.902805E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.902 | TFLOPs: 40.80 | 15: iteration 105730/ 125429 | consumed samples: 27066880 | consumed tokens: 55432970240 | elapsed time per iteration (s): 1.06 | learning rate: 3.095E-05 | global batch size: 256 | lm loss: 1.886568E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.070 | TFLOPs: 39.84 | 15: iteration 105740/ 125429 | consumed samples: 27069440 | consumed tokens: 55438213120 | elapsed time per iteration (s): 1.06 | learning rate: 
3.094E-05 | global batch size: 256 | lm loss: 1.884242E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.144 | TFLOPs: 39.85 | 15: iteration 105750/ 125429 | consumed samples: 27072000 | consumed tokens: 55443456000 | elapsed time per iteration (s): 1.19 | learning rate: 3.093E-05 | global batch size: 256 | lm loss: 1.920475E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.076 | TFLOPs: 35.54 | 15: iteration 105760/ 125429 | consumed samples: 27074560 | consumed tokens: 55448698880 | elapsed time per iteration (s): 1.05 | learning rate: 3.092E-05 | global batch size: 256 | lm loss: 1.885064E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.867 | TFLOPs: 40.30 | 15: iteration 105770/ 125429 | consumed samples: 27077120 | consumed tokens: 55453941760 | elapsed time per iteration (s): 1.02 | learning rate: 3.090E-05 | global batch size: 256 | lm loss: 1.903563E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.264 | TFLOPs: 41.36 | 15: iteration 105780/ 125429 | consumed samples: 27079680 | consumed tokens: 55459184640 | elapsed time per iteration (s): 1.04 | learning rate: 3.089E-05 | global batch size: 256 | lm loss: 1.898444E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.898 | TFLOPs: 40.80 | 15: iteration 105790/ 125429 | consumed samples: 27082240 | consumed tokens: 55464427520 | elapsed time per iteration (s): 1.03 | learning rate: 3.088E-05 | global batch size: 256 | lm loss: 1.904030E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.811 | TFLOPs: 41.12 | 15: iteration 105800/ 125429 | 
consumed samples: 27084800 | consumed tokens: 55469670400 | elapsed time per iteration (s): 1.03 | learning rate: 3.087E-05 | global batch size: 256 | lm loss: 1.889413E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.687 | TFLOPs: 41.10 | 15: iteration 105810/ 125429 | consumed samples: 27087360 | consumed tokens: 55474913280 | elapsed time per iteration (s): 1.04 | learning rate: 3.086E-05 | global batch size: 256 | lm loss: 1.896435E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.850 | TFLOPs: 40.63 | 15: iteration 105820/ 125429 | consumed samples: 27089920 | consumed tokens: 55480156160 | elapsed time per iteration (s): 1.03 | learning rate: 3.085E-05 | global batch size: 256 | lm loss: 1.911160E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.740 | TFLOPs: 40.94 | 15: iteration 105830/ 125429 | consumed samples: 27092480 | consumed tokens: 55485399040 | elapsed time per iteration (s): 1.19 | learning rate: 3.084E-05 | global batch size: 256 | lm loss: 1.885952E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.683 | TFLOPs: 35.64 | 15: iteration 105840/ 125429 | consumed samples: 27095040 | consumed tokens: 55490641920 | elapsed time per iteration (s): 1.02 | learning rate: 3.083E-05 | global batch size: 256 | lm loss: 1.914943E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.061 | TFLOPs: 41.32 | 15: iteration 105850/ 125429 | consumed samples: 27097600 | consumed tokens: 55495884800 | elapsed time per iteration (s): 1.02 | learning rate: 3.082E-05 | global batch size: 256 | lm loss: 1.894377E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 250.412 | TFLOPs: 41.38 | 15: iteration 105860/ 125429 | consumed samples: 27100160 | consumed tokens: 55501127680 | elapsed time per iteration (s): 1.06 | learning rate: 3.081E-05 | global batch size: 256 | lm loss: 1.887247E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.009 | TFLOPs: 39.99 | 15: iteration 105870/ 125429 | consumed samples: 27102720 | consumed tokens: 55506370560 | elapsed time per iteration (s): 1.07 | learning rate: 3.080E-05 | global batch size: 256 | lm loss: 1.899066E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.839 | TFLOPs: 39.64 | 15: iteration 105880/ 125429 | consumed samples: 27105280 | consumed tokens: 55511613440 | elapsed time per iteration (s): 1.02 | learning rate: 3.079E-05 | global batch size: 256 | lm loss: 1.907306E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.680 | TFLOPs: 41.59 | 15: iteration 105890/ 125429 | consumed samples: 27107840 | consumed tokens: 55516856320 | elapsed time per iteration (s): 1.07 | learning rate: 3.077E-05 | global batch size: 256 | lm loss: 1.899135E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.776 | TFLOPs: 39.46 | 15: iteration 105900/ 125429 | consumed samples: 27110400 | consumed tokens: 55522099200 | elapsed time per iteration (s): 1.07 | learning rate: 3.076E-05 | global batch size: 256 | lm loss: 1.903920E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.795 | TFLOPs: 39.63 | 15: iteration 105910/ 125429 | consumed samples: 27112960 | consumed tokens: 55527342080 | elapsed time per iteration (s): 1.05 | learning rate: 
3.075E-05 | global batch size: 256 | lm loss: 1.900843E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.824 | TFLOPs: 40.46 | 15: iteration 105920/ 125429 | consumed samples: 27115520 | consumed tokens: 55532584960 | elapsed time per iteration (s): 1.06 | learning rate: 3.074E-05 | global batch size: 256 | lm loss: 1.890066E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.254 | TFLOPs: 40.03 | 15: iteration 105930/ 125429 | consumed samples: 27118080 | consumed tokens: 55537827840 | elapsed time per iteration (s): 1.04 | learning rate: 3.073E-05 | global batch size: 256 | lm loss: 1.886658E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.260 | TFLOPs: 40.86 | 15: iteration 105940/ 125429 | consumed samples: 27120640 | consumed tokens: 55543070720 | elapsed time per iteration (s): 1.06 | learning rate: 3.072E-05 | global batch size: 256 | lm loss: 1.906080E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.971 | TFLOPs: 39.82 | 15: iteration 105950/ 125429 | consumed samples: 27123200 | consumed tokens: 55548313600 | elapsed time per iteration (s): 1.07 | learning rate: 3.071E-05 | global batch size: 256 | lm loss: 1.919097E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.256 | TFLOPs: 39.70 | 15: iteration 105960/ 125429 | consumed samples: 27125760 | consumed tokens: 55553556480 | elapsed time per iteration (s): 1.08 | learning rate: 3.070E-05 | global batch size: 256 | lm loss: 1.874585E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.611 | TFLOPs: 39.10 | 15: iteration 105970/ 125429 | 
consumed samples: 27128320 | consumed tokens: 55558799360 | elapsed time per iteration (s): 1.05 | learning rate: 3.069E-05 | global batch size: 256 | lm loss: 1.920512E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.817 | TFLOPs: 40.46 | 15: iteration 105980/ 125429 | consumed samples: 27130880 | consumed tokens: 55564042240 | elapsed time per iteration (s): 1.04 | learning rate: 3.068E-05 | global batch size: 256 | lm loss: 1.900249E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.420 | TFLOPs: 40.72 | 15: iteration 105990/ 125429 | consumed samples: 27133440 | consumed tokens: 55569285120 | elapsed time per iteration (s): 1.03 | learning rate: 3.067E-05 | global batch size: 256 | lm loss: 1.888180E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.538 | TFLOPs: 41.24 | 0: [2022-11-27 03:24:23,454] [INFO] [logging.py:68:log_dist] [Rank 0] step=106000, skipped=0, lr=[3.065581769402972e-05, 3.065581769402972e-05, 3.065581769402972e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 106000/ 125429 | consumed samples: 27136000 | consumed tokens: 55574528000 | elapsed time per iteration (s): 1.03 | learning rate: 3.066E-05 | global batch size: 256 | lm loss: 1.902666E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.914 | TFLOPs: 41.14 | 0: steps: 106000 loss: 1.8700 iter time (s): 1.048 samples/sec: 244.240 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 106000 | lm loss value: 1.829319E+00 | lm loss PPL: 6.229642E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 106000 to 
checkpoints_1b5 0: [2022-11-27 03:24:23,802] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step106000 is begin to save! 0: [2022-11-27 03:24:23,808] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_01-model_00-model_states.pt... 0: [2022-11-27 03:24:24,069] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_01-model_00-model_states.pt. 0: [2022-11-27 03:24:24,070] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_03-model_00-model_states.pt... 0: [2022-11-27 03:24:24,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_03-model_00-model_states.pt. 0: [2022-11-27 03:24:24,174] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_04-model_00-model_states.pt... 0: [2022-11-27 03:24:24,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_04-model_00-model_states.pt. 0: [2022-11-27 03:24:24,286] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_05-model_00-model_states.pt... 0: [2022-11-27 03:24:24,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_05-model_00-model_states.pt. 0: [2022-11-27 03:24:24,395] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_06-model_00-model_states.pt... 0: [2022-11-27 03:24:24,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_06-model_00-model_states.pt. 0: [2022-11-27 03:24:24,507] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 03:24:24,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_07-model_00-model_states.pt. 0: [2022-11-27 03:24:24,619] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_08-model_00-model_states.pt... 0: [2022-11-27 03:24:24,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_08-model_00-model_states.pt. 0: [2022-11-27 03:24:24,726] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_09-model_00-model_states.pt... 0: [2022-11-27 03:24:24,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_09-model_00-model_states.pt. 0: [2022-11-27 03:24:24,836] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_10-model_00-model_states.pt... 0: [2022-11-27 03:24:24,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_10-model_00-model_states.pt. 0: [2022-11-27 03:24:24,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_11-model_00-model_states.pt... 0: [2022-11-27 03:24:25,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_11-model_00-model_states.pt. 0: [2022-11-27 03:24:25,067] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_12-model_00-model_states.pt... 0: [2022-11-27 03:24:25,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_12-model_00-model_states.pt. 0: [2022-11-27 03:24:25,183] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 03:24:25,296] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_13-model_00-model_states.pt. 0: [2022-11-27 03:24:25,296] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_14-model_00-model_states.pt... 0: [2022-11-27 03:24:25,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_14-model_00-model_states.pt. 0: [2022-11-27 03:24:25,410] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_15-model_00-model_states.pt... 0: [2022-11-27 03:24:25,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_15-model_00-model_states.pt. 0: [2022-11-27 03:24:25,523] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_16-model_00-model_states.pt... 0: [2022-11-27 03:24:25,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_16-model_00-model_states.pt. 0: [2022-11-27 03:24:25,642] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_17-model_00-model_states.pt... 0: [2022-11-27 03:24:25,757] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_17-model_00-model_states.pt. 0: [2022-11-27 03:24:25,757] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_18-model_00-model_states.pt... 0: [2022-11-27 03:24:25,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_18-model_00-model_states.pt. 0: [2022-11-27 03:24:25,874] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 03:24:25,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_19-model_00-model_states.pt. 0: [2022-11-27 03:24:25,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_20-model_00-model_states.pt... 0: [2022-11-27 03:24:26,108] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_20-model_00-model_states.pt. 0: [2022-11-27 03:24:26,109] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_21-model_00-model_states.pt... 0: [2022-11-27 03:24:26,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_21-model_00-model_states.pt. 0: [2022-11-27 03:24:26,225] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_22-model_00-model_states.pt... 0: [2022-11-27 03:24:26,337] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_22-model_00-model_states.pt. 0: [2022-11-27 03:24:26,338] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_23-model_00-model_states.pt... 0: [2022-11-27 03:24:26,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_23-model_00-model_states.pt. 0: [2022-11-27 03:24:26,450] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_24-model_00-model_states.pt... 0: [2022-11-27 03:24:26,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_24-model_00-model_states.pt. 0: [2022-11-27 03:24:26,562] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 03:24:26,673] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_25-model_00-model_states.pt. 0: [2022-11-27 03:24:26,674] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_26-model_00-model_states.pt... 0: [2022-11-27 03:24:26,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_26-model_00-model_states.pt. 0: [2022-11-27 03:24:26,787] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_27-model_00-model_states.pt... 0: [2022-11-27 03:24:26,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_27-model_00-model_states.pt. 0: [2022-11-27 03:24:26,900] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_28-model_00-model_states.pt... 0: [2022-11-27 03:24:27,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_28-model_00-model_states.pt. 0: [2022-11-27 03:24:27,007] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_29-model_00-model_states.pt... 0: [2022-11-27 03:24:27,113] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_29-model_00-model_states.pt. 0: [2022-11-27 03:24:27,114] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_30-model_00-model_states.pt... 0: [2022-11-27 03:24:27,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_30-model_00-model_states.pt. 0: [2022-11-27 03:24:27,229] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/layer_32-model_00-model_states.pt... 
0: [2022-11-27 03:24:27,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/layer_32-model_00-model_states.pt. 0: [2022-11-27 03:24:27,231] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step106000/mp_rank_00_model_states.pt 0: [2022-11-27 03:24:27,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/mp_rank_00_model_states.pt... 0: [2022-11-27 03:24:27,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/mp_rank_00_model_states.pt. 0: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
5: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
9: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
6: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 
2: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 
3: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
10: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
12: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:24:27,274] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step106000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:24:27,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-27 03:24:27,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 03:24:27,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 7: [2022-11-27 03:24:27,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:24:27,441] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 03:24:27,441] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 2: [2022-11-27 03:24:27,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:24:27,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 0: [2022-11-27 03:24:27,443] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:24:27,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 11: [2022-11-27 03:24:27,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:24:27,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-27 03:24:27,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 03:24:27,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 1: [2022-11-27 03:24:27,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:24:27,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 03:24:27,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 8: [2022-11-27 03:24:27,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:24:27,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:24:27,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 0: [2022-11-27 03:24:27,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 11: [2022-11-27 03:24:27,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 0: [2022-11-27 03:24:27,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 3: [2022-11-27 03:24:27,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-27 03:24:27,447] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 8: [2022-11-27 03:24:27,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 3: [2022-11-27 03:24:27,447] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 8: [2022-11-27 03:24:27,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 3: [2022-11-27 03:24:27,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:24:27,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 03:24:27,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 7: [2022-11-27 03:24:27,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:24:27,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 03:24:27,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 2: [2022-11-27 03:24:27,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-27 03:24:27,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 03:24:27,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 2: [2022-11-27 03:24:27,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:24:27,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 12: [2022-11-27 03:24:27,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:24:27,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:24:27,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 12: [2022-11-27 03:24:27,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 03:24:27,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 03:24:27,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 12: [2022-11-27 03:24:27,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 3: [2022-11-27 03:24:27,455] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-27 03:24:27,455] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 03:24:27,455] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 3: [2022-11-27 03:24:27,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:24:27,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 03:24:27,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 6: [2022-11-27 03:24:27,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:24:27,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 9: [2022-11-27 03:24:27,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:24:27,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 9: [2022-11-27 03:24:27,456] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:24:27,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 03:24:27,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
9: [2022-11-27 03:24:27,456] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 03:24:27,456] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 6: [2022-11-27 03:24:27,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:24:27,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 03:24:27,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 6: [2022-11-27 03:24:27,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:24:27,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:24:27,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 9: [2022-11-27 03:24:27,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:24:27,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 6: [2022-11-27 03:24:27,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 7: [2022-11-27 03:24:27,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
9: [2022-11-27 03:24:27,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 03:24:27,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 1: [2022-11-27 03:24:27,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:24:27,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 03:24:27,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 1: [2022-11-27 03:24:27,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:24:27,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:24:27,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 03:24:27,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 03:24:27,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 1: [2022-11-27 03:24:27,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 2: [2022-11-27 03:24:27,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-27 03:24:27,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 03:24:27,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 2: [2022-11-27 03:24:27,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:24:27,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 03:24:27,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 4: [2022-11-27 03:24:27,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:24:27,460] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 03:24:27,460] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 11: [2022-11-27 03:24:27,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:24:27,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:24:27,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:24:27,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-27 03:24:27,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:24:27,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:24:27,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 15: [2022-11-27 03:24:27,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 03:24:27,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 12: [2022-11-27 03:24:27,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 15: [2022-11-27 03:24:27,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 7: [2022-11-27 03:24:27,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 15: [2022-11-27 03:24:27,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 15: [2022-11-27 03:24:27,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 15: [2022-11-27 03:24:27,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 7: [2022-11-27 03:24:27,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
12: [2022-11-27 03:24:27,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:24:27,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 03:24:27,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 12: [2022-11-27 03:24:27,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:24:27,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 03:24:27,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 12: [2022-11-27 03:24:27,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:24:27,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 03:24:27,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 15: [2022-11-27 03:24:27,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:24:27,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:24:27,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-27 03:24:27,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 03:24:27,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 03:24:27,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 10: [2022-11-27 03:24:27,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 7: [2022-11-27 03:24:27,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:24:27,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:24:27,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 4: [2022-11-27 03:24:27,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 7: [2022-11-27 03:24:27,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 4: [2022-11-27 03:24:27,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 0: [2022-11-27 03:24:27,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:24:27,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
13: [2022-11-27 03:24:27,450] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:24:27,453] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:24:27,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 03:24:27,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 03:24:27,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 0: [2022-11-27 03:24:27,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 11: [2022-11-27 03:24:27,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 03:24:27,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 8: [2022-11-27 03:24:27,453] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 12: [2022-11-27 03:24:27,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:24:27,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:24:27,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
8: [2022-11-27 03:24:27,453] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 11: [2022-11-27 03:24:27,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 03:24:27,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 8: [2022-11-27 03:24:27,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:24:27,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 11: [2022-11-27 03:24:27,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 8: [2022-11-27 03:24:27,457] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 11: [2022-11-27 03:24:27,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:24:27,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:24:27,457] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 6: [2022-11-27 03:24:27,467] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
11: [2022-11-27 03:24:27,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 03:24:27,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 8: [2022-11-27 03:24:27,458] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:24:27,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 11: [2022-11-27 03:24:27,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 8: [2022-11-27 03:24:27,458] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 03:24:27,458] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 6: [2022-11-27 03:24:27,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 8: [2022-11-27 03:24:27,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:24:27,461] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 03:24:27,461] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 6: [2022-11-27 03:24:27,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
8: [2022-11-27 03:24:27,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:24:27,467] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 8: [2022-11-27 03:24:27,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 03:24:27,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 8: [2022-11-27 03:24:27,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:24:27,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 12: [2022-11-27 03:24:27,467] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 8: [2022-11-27 03:24:27,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 11: [2022-11-27 03:24:27,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:24:27,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 03:24:27,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 6: [2022-11-27 03:24:27,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-27 03:24:27,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 03:24:27,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 3: [2022-11-27 03:24:27,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:24:27,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 03:24:27,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 7: [2022-11-27 03:24:27,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:24:27,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 03:24:27,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 1: [2022-11-27 03:24:27,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:24:27,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-27 03:24:27,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 03:24:27,472] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 03:24:27,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 1: [2022-11-27 03:24:27,472] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 15: [2022-11-27 03:24:27,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 03:24:27,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 15: [2022-11-27 03:24:27,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:24:27,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 03:24:27,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 15: [2022-11-27 03:24:27,466] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:24:27,466] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 03:24:27,466] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
15: [2022-11-27 03:24:27,468] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:24:27,468] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 03:24:27,468] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 15: [2022-11-27 03:24:27,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:24:27,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:24:27,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 03:24:27,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 10: [2022-11-27 03:24:27,473] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 03:24:27,473] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 2: [2022-11-27 03:24:27,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:24:27,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 03:24:27,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
12: [2022-11-27 03:24:27,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:24:27,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 6: [2022-11-27 03:24:27,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:24:27,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 6: [2022-11-27 03:24:27,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 9: [2022-11-27 03:24:27,474] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:24:27,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 9: [2022-11-27 03:24:27,474] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 03:24:27,474] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 6: [2022-11-27 03:24:27,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:24:27,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 03:24:27,475] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
6: [2022-11-27 03:24:27,475] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:24:27,475] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 03:24:27,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 4: [2022-11-27 03:24:27,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:24:27,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:24:27,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 03:24:27,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 03:24:27,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 4: [2022-11-27 03:24:27,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 4: [2022-11-27 03:24:27,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:24:27,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 03:24:27,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
7: [2022-11-27 03:24:27,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:24:27,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 03:24:27,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 10: [2022-11-27 03:24:27,478] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:24:27,478] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 03:24:27,478] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 10: [2022-11-27 03:24:27,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:24:27,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 03:24:27,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 10: [2022-11-27 03:24:27,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:24:27,479] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 03:24:27,479] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
10: [2022-11-27 03:24:27,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:24:27,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 03:24:27,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 10: [2022-11-27 03:24:27,480] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:24:27,480] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 03:24:27,480] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 3: [2022-11-27 03:24:27,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:24:27,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 11: [2022-11-27 03:24:27,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:24:27,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 3: [2022-11-27 03:24:27,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-27 03:24:27,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 03:24:27,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 4: [2022-11-27 03:24:27,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:24:27,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 03:24:27,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 2: [2022-11-27 03:24:27,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:24:27,484] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 03:24:27,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 9: [2022-11-27 03:24:27,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:24:27,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-27 03:24:27,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 03:24:27,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 03:24:27,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 9: [2022-11-27 03:24:27,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 2: [2022-11-27 03:24:27,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:24:27,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 03:24:27,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 9: [2022-11-27 03:24:27,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:24:27,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 3: [2022-11-27 03:24:27,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:24:27,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
3: [2022-11-27 03:24:27,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 03:24:27,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 9: [2022-11-27 03:24:27,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:24:27,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 03:24:27,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 1: [2022-11-27 03:24:27,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:24:27,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 03:24:27,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 1: [2022-11-27 03:24:27,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:24:27,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 03:24:27,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 4: [2022-11-27 03:24:27,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-27 03:24:27,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 03:24:27,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 13: [2022-11-27 03:24:27,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 8: [2022-11-27 03:24:27,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:24:27,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 13: [2022-11-27 03:24:27,465] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:24:27,465] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 03:24:27,465] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 13: [2022-11-27 03:24:27,470] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:24:27,470] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 03:24:27,470] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 13: [2022-11-27 03:24:27,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-27 03:24:27,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 03:24:27,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 13: [2022-11-27 03:24:27,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:24:27,471] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 03:24:27,471] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 8: [2022-11-27 03:24:27,469] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 13: [2022-11-27 03:24:27,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:24:27,469] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 13: [2022-11-27 03:24:27,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 03:24:27,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 13: [2022-11-27 03:24:27,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-27 03:24:27,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 03:24:27,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:24:27,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 13: [2022-11-27 03:24:27,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 03:24:27,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 11: [2022-11-27 03:24:27,482] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 03:24:27,482] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 0: [2022-11-27 03:24:27,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:24:27,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 03:24:27,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 0: [2022-11-27 03:24:27,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-27 03:24:27,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 03:24:27,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 0: [2022-11-27 03:24:27,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:24:27,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 03:24:27,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 14: [2022-11-27 03:24:27,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:24:27,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:24:27,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:24:27,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-27 03:24:27,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 03:24:27,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 03:24:27,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 03:24:27,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 14: [2022-11-27 03:24:27,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 03:24:27,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 14: [2022-11-27 03:24:27,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 14: [2022-11-27 03:24:27,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 14: [2022-11-27 03:24:27,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:24:27,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 03:24:27,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 14: [2022-11-27 03:24:27,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-27 03:24:27,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 03:24:27,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 14: [2022-11-27 03:24:27,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:24:27,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 03:24:27,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 14: [2022-11-27 03:24:27,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:24:27,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 03:24:27,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 4: [2022-11-27 03:24:27,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:24:27,522] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 03:24:27,522] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
0: [2022-11-27 03:24:27,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 03:24:27,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 5: [2022-11-27 03:24:27,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:24:27,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:24:27,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:24:27,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:24:27,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:24:27,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:24:27,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:24:27,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-27 03:24:27,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 03:24:27,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 03:24:27,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 03:24:27,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 03:24:27,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 03:24:27,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 03:24:27,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 03:24:27,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step106000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 03:24:27,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 5: [2022-11-27 03:24:27,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 5: [2022-11-27 03:24:27,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 5: [2022-11-27 03:24:27,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 
5: [2022-11-27 03:24:27,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 5: [2022-11-27 03:24:27,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 5: [2022-11-27 03:24:27,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 5: [2022-11-27 03:24:27,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step106000 is ready now! 0: successfully saved checkpoint at iteration 106000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3890.03 15: iteration 106010/ 125429 | consumed samples: 27138560 | consumed tokens: 55579770880 | elapsed time per iteration (s): 1.44 | learning rate: 3.065E-05 | global batch size: 256 | lm loss: 1.891925E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.775 | TFLOPs: 29.38 | 15: iteration 106020/ 125429 | consumed samples: 27141120 | consumed tokens: 55585013760 | elapsed time per iteration (s): 1.05 | learning rate: 3.063E-05 | global batch size: 256 | lm loss: 1.881874E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.671 | TFLOPs: 40.43 | 15: iteration 106030/ 125429 | consumed samples: 27143680 | consumed tokens: 55590256640 | elapsed time per iteration (s): 1.03 | learning rate: 3.062E-05 | global batch size: 256 | lm loss: 1.888279E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.927 | TFLOPs: 41.14 | 15: iteration 106040/ 125429 | consumed samples: 27146240 | consumed tokens: 55595499520 | elapsed time per iteration (s): 1.06 | learning rate: 3.061E-05 | global batch size: 256 | lm loss: 1.899833E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 
241.123 | TFLOPs: 39.85 | 15: iteration 106050/ 125429 | consumed samples: 27148800 | consumed tokens: 55600742400 | elapsed time per iteration (s): 1.05 | learning rate: 3.060E-05 | global batch size: 256 | lm loss: 1.906110E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.640 | TFLOPs: 40.43 | 15: iteration 106060/ 125429 | consumed samples: 27151360 | consumed tokens: 55605985280 | elapsed time per iteration (s): 1.03 | learning rate: 3.059E-05 | global batch size: 256 | lm loss: 1.898148E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.304 | TFLOPs: 41.20 | 15: iteration 106070/ 125429 | consumed samples: 27153920 | consumed tokens: 55611228160 | elapsed time per iteration (s): 1.08 | learning rate: 3.058E-05 | global batch size: 256 | lm loss: 1.903535E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.706 | TFLOPs: 39.28 | 15: iteration 106080/ 125429 | consumed samples: 27156480 | consumed tokens: 55616471040 | elapsed time per iteration (s): 1.08 | learning rate: 3.057E-05 | global batch size: 256 | lm loss: 1.893262E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.453 | TFLOPs: 39.24 | 15: iteration 106090/ 125429 | consumed samples: 27159040 | consumed tokens: 55621713920 | elapsed time per iteration (s): 1.15 | learning rate: 3.056E-05 | global batch size: 256 | lm loss: 1.897776E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 222.813 | TFLOPs: 36.82 | 15: iteration 106100/ 125429 | consumed samples: 27161600 | consumed tokens: 55626956800 | elapsed time per iteration (s): 1.19 | learning rate: 3.055E-05 | global batch size: 256 | lm loss: 1.897748E+00 | grad 
norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.005 | TFLOPs: 35.53 | 15: iteration 106110/ 125429 | consumed samples: 27164160 | consumed tokens: 55632199680 | elapsed time per iteration (s): 1.12 | learning rate: 3.054E-05 | global batch size: 256 | lm loss: 1.892416E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.997 | TFLOPs: 37.68 | 15: iteration 106120/ 125429 | consumed samples: 27166720 | consumed tokens: 55637442560 | elapsed time per iteration (s): 1.07 | learning rate: 3.053E-05 | global batch size: 256 | lm loss: 1.908548E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.144 | TFLOPs: 39.52 | 15: iteration 106130/ 125429 | consumed samples: 27169280 | consumed tokens: 55642685440 | elapsed time per iteration (s): 1.04 | learning rate: 3.052E-05 | global batch size: 256 | lm loss: 1.905203E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.251 | TFLOPs: 40.69 | 15: iteration 106140/ 125429 | consumed samples: 27171840 | consumed tokens: 55647928320 | elapsed time per iteration (s): 1.05 | learning rate: 3.051E-05 | global batch size: 256 | lm loss: 1.894854E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.332 | TFLOPs: 40.38 | 15: iteration 106150/ 125429 | consumed samples: 27174400 | consumed tokens: 55653171200 | elapsed time per iteration (s): 1.06 | learning rate: 3.050E-05 | global batch size: 256 | lm loss: 1.890388E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.461 | TFLOPs: 40.07 | 15: iteration 106160/ 125429 | consumed samples: 27176960 | consumed tokens: 55658414080 | elapsed 
time per iteration (s): 1.04 | learning rate: 3.048E-05 | global batch size: 256 | lm loss: 1.924836E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.549 | TFLOPs: 40.58 | 15: iteration 106170/ 125429 | consumed samples: 27179520 | consumed tokens: 55663656960 | elapsed time per iteration (s): 1.18 | learning rate: 3.047E-05 | global batch size: 256 | lm loss: 1.909755E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.600 | TFLOPs: 35.79 | 15: iteration 106180/ 125429 | consumed samples: 27182080 | consumed tokens: 55668899840 | elapsed time per iteration (s): 1.08 | learning rate: 3.046E-05 | global batch size: 256 | lm loss: 1.902834E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.172 | TFLOPs: 39.19 | 15: iteration 106190/ 125429 | consumed samples: 27184640 | consumed tokens: 55674142720 | elapsed time per iteration (s): 1.08 | learning rate: 3.045E-05 | global batch size: 256 | lm loss: 1.913653E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.081 | TFLOPs: 39.34 | 15: iteration 106200/ 125429 | consumed samples: 27187200 | consumed tokens: 55679385600 | elapsed time per iteration (s): 1.05 | learning rate: 3.044E-05 | global batch size: 256 | lm loss: 1.900549E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.585 | TFLOPs: 40.25 | 15: iteration 106210/ 125429 | consumed samples: 27189760 | consumed tokens: 55684628480 | elapsed time per iteration (s): 1.02 | learning rate: 3.043E-05 | global batch size: 256 | lm loss: 1.908743E+00 | grad norm: 0.146 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.372 | 
TFLOPs: 41.38 | 15: iteration 106220/ 125429 | consumed samples: 27192320 | consumed tokens: 55689871360 | elapsed time per iteration (s): 1.04 | learning rate: 3.042E-05 | global batch size: 256 | lm loss: 1.888212E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.425 | TFLOPs: 40.72 | 15: iteration 106230/ 125429 | consumed samples: 27194880 | consumed tokens: 55695114240 | elapsed time per iteration (s): 1.05 | learning rate: 3.041E-05 | global batch size: 256 | lm loss: 1.910576E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.827 | TFLOPs: 40.29 | 15: iteration 106240/ 125429 | consumed samples: 27197440 | consumed tokens: 55700357120 | elapsed time per iteration (s): 1.03 | learning rate: 3.040E-05 | global batch size: 256 | lm loss: 1.895441E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.253 | TFLOPs: 41.19 | 15: iteration 106250/ 125429 | consumed samples: 27200000 | consumed tokens: 55705600000 | elapsed time per iteration (s): 1.08 | learning rate: 3.039E-05 | global batch size: 256 | lm loss: 1.886260E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.358 | TFLOPs: 39.23 | 15: iteration 106260/ 125429 | consumed samples: 27202560 | consumed tokens: 55710842880 | elapsed time per iteration (s): 1.06 | learning rate: 3.038E-05 | global batch size: 256 | lm loss: 1.912999E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.521 | TFLOPs: 40.08 | 15: iteration 106270/ 125429 | consumed samples: 27205120 | consumed tokens: 55716085760 | elapsed time per iteration (s): 1.02 | learning rate: 3.037E-05 | global batch size: 256 | lm loss: 1.911666E+00 | grad norm: 0.156 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.361 | TFLOPs: 41.54 | 15: iteration 106280/ 125429 | consumed samples: 27207680 | consumed tokens: 55721328640 | elapsed time per iteration (s): 1.07 | learning rate: 3.036E-05 | global batch size: 256 | lm loss: 1.906397E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.229 | TFLOPs: 39.37 | 15: iteration 106290/ 125429 | consumed samples: 27210240 | consumed tokens: 55726571520 | elapsed time per iteration (s): 1.05 | learning rate: 3.035E-05 | global batch size: 256 | lm loss: 1.900137E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.189 | TFLOPs: 40.19 | 15: iteration 106300/ 125429 | consumed samples: 27212800 | consumed tokens: 55731814400 | elapsed time per iteration (s): 1.06 | learning rate: 3.034E-05 | global batch size: 256 | lm loss: 1.921566E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.439 | TFLOPs: 40.06 | 15: iteration 106310/ 125429 | consumed samples: 27215360 | consumed tokens: 55737057280 | elapsed time per iteration (s): 1.03 | learning rate: 3.033E-05 | global batch size: 256 | lm loss: 1.909347E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.907 | TFLOPs: 41.13 | 15: iteration 106320/ 125429 | consumed samples: 27217920 | consumed tokens: 55742300160 | elapsed time per iteration (s): 1.04 | learning rate: 3.031E-05 | global batch size: 256 | lm loss: 1.923181E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.967 | TFLOPs: 40.65 | 15: iteration 106330/ 125429 | consumed samples: 27220480 | consumed tokens: 55747543040 | elapsed time per 
iteration (s): 1.05 | learning rate: 3.030E-05 | global batch size: 256 | lm loss: 1.890683E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.610 | TFLOPs: 40.42 | 15: iteration 106340/ 125429 | consumed samples: 27223040 | consumed tokens: 55752785920 | elapsed time per iteration (s): 1.44 | learning rate: 3.029E-05 | global batch size: 256 | lm loss: 1.885164E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 178.205 | TFLOPs: 29.45 | 15: iteration 106350/ 125429 | consumed samples: 27225600 | consumed tokens: 55758028800 | elapsed time per iteration (s): 1.10 | learning rate: 3.028E-05 | global batch size: 256 | lm loss: 1.922213E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.455 | TFLOPs: 38.58 | 15: iteration 106360/ 125429 | consumed samples: 27228160 | consumed tokens: 55763271680 | elapsed time per iteration (s): 1.06 | learning rate: 3.027E-05 | global batch size: 256 | lm loss: 1.891027E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.458 | TFLOPs: 39.90 | 15: iteration 106370/ 125429 | consumed samples: 27230720 | consumed tokens: 55768514560 | elapsed time per iteration (s): 1.05 | learning rate: 3.026E-05 | global batch size: 256 | lm loss: 1.882033E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.028 | TFLOPs: 40.33 | 15: iteration 106380/ 125429 | consumed samples: 27233280 | consumed tokens: 55773757440 | elapsed time per iteration (s): 1.03 | learning rate: 3.025E-05 | global batch size: 256 | lm loss: 1.901955E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.746 | TFLOPs: 
41.27 | 15: iteration 106390/ 125429 | consumed samples: 27235840 | consumed tokens: 55779000320 | elapsed time per iteration (s): 1.04 | learning rate: 3.024E-05 | global batch size: 256 | lm loss: 1.902696E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.378 | TFLOPs: 40.55 | 15: iteration 106400/ 125429 | consumed samples: 27238400 | consumed tokens: 55784243200 | elapsed time per iteration (s): 1.02 | learning rate: 3.023E-05 | global batch size: 256 | lm loss: 1.927763E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.809 | TFLOPs: 41.28 | 15: iteration 106410/ 125429 | consumed samples: 27240960 | consumed tokens: 55789486080 | elapsed time per iteration (s): 1.03 | learning rate: 3.022E-05 | global batch size: 256 | lm loss: 1.891624E+00 | grad norm: 0.147 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.943 | TFLOPs: 40.97 | 15: iteration 106420/ 125429 | consumed samples: 27243520 | consumed tokens: 55794728960 | elapsed time per iteration (s): 1.19 | learning rate: 3.021E-05 | global batch size: 256 | lm loss: 1.903127E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.212 | TFLOPs: 35.57 | 15: iteration 106430/ 125429 | consumed samples: 27246080 | consumed tokens: 55799971840 | elapsed time per iteration (s): 1.07 | learning rate: 3.020E-05 | global batch size: 256 | lm loss: 1.880859E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.085 | TFLOPs: 39.51 | 15: iteration 106440/ 125429 | consumed samples: 27248640 | consumed tokens: 55805214720 | elapsed time per iteration (s): 1.04 | learning rate: 3.019E-05 | global batch size: 256 | lm loss: 1.860566E+00 | grad norm: 0.156 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.094 | TFLOPs: 40.67 | 15: iteration 106450/ 125429 | consumed samples: 27251200 | consumed tokens: 55810457600 | elapsed time per iteration (s): 1.04 | learning rate: 3.018E-05 | global batch size: 256 | lm loss: 1.903796E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.592 | TFLOPs: 40.75 | 15: iteration 106460/ 125429 | consumed samples: 27253760 | consumed tokens: 55815700480 | elapsed time per iteration (s): 1.03 | learning rate: 3.017E-05 | global batch size: 256 | lm loss: 1.880414E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.546 | TFLOPs: 40.91 | 15: iteration 106470/ 125429 | consumed samples: 27256320 | consumed tokens: 55820943360 | elapsed time per iteration (s): 1.06 | learning rate: 3.016E-05 | global batch size: 256 | lm loss: 1.891579E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.649 | TFLOPs: 39.77 | 15: iteration 106480/ 125429 | consumed samples: 27258880 | consumed tokens: 55826186240 | elapsed time per iteration (s): 1.03 | learning rate: 3.015E-05 | global batch size: 256 | lm loss: 1.900797E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.870 | TFLOPs: 40.96 | 15: iteration 106490/ 125429 | consumed samples: 27261440 | consumed tokens: 55831429120 | elapsed time per iteration (s): 1.03 | learning rate: 3.014E-05 | global batch size: 256 | lm loss: 1.911828E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.856 | TFLOPs: 40.96 | 15: iteration 106500/ 125429 | consumed samples: 27264000 | consumed tokens: 55836672000 | elapsed time per 
iteration (s): 1.04 | learning rate: 3.012E-05 | global batch size: 256 | lm loss: 1.913436E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.220 | TFLOPs: 40.52 | 15: iteration 106510/ 125429 | consumed samples: 27266560 | consumed tokens: 55841914880 | elapsed time per iteration (s): 1.03 | learning rate: 3.011E-05 | global batch size: 256 | lm loss: 1.911827E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.625 | TFLOPs: 40.92 | 15: iteration 106520/ 125429 | consumed samples: 27269120 | consumed tokens: 55847157760 | elapsed time per iteration (s): 1.03 | learning rate: 3.010E-05 | global batch size: 256 | lm loss: 1.916259E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.141 | TFLOPs: 41.01 | 15: iteration 106530/ 125429 | consumed samples: 27271680 | consumed tokens: 55852400640 | elapsed time per iteration (s): 1.06 | learning rate: 3.009E-05 | global batch size: 256 | lm loss: 1.880513E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.836 | TFLOPs: 39.97 | 15: iteration 106540/ 125429 | consumed samples: 27274240 | consumed tokens: 55857643520 | elapsed time per iteration (s): 1.03 | learning rate: 3.008E-05 | global batch size: 256 | lm loss: 1.884565E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.654 | TFLOPs: 41.26 | 15: iteration 106550/ 125429 | consumed samples: 27276800 | consumed tokens: 55862886400 | elapsed time per iteration (s): 1.21 | learning rate: 3.007E-05 | global batch size: 256 | lm loss: 1.912586E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 211.676 | TFLOPs: 
34.98 | 15: iteration 106560/ 125429 | consumed samples: 27279360 | consumed tokens: 55868129280 | elapsed time per iteration (s): 1.03 | learning rate: 3.006E-05 | global batch size: 256 | lm loss: 1.901244E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.407 | TFLOPs: 41.22 | 15: iteration 106570/ 125429 | consumed samples: 27281920 | consumed tokens: 55873372160 | elapsed time per iteration (s): 1.03 | learning rate: 3.005E-05 | global batch size: 256 | lm loss: 1.899832E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.504 | TFLOPs: 41.23 | 15: iteration 106580/ 125429 | consumed samples: 27284480 | consumed tokens: 55878615040 | elapsed time per iteration (s): 1.06 | learning rate: 3.004E-05 | global batch size: 256 | lm loss: 1.929490E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.561 | TFLOPs: 39.75 | 15: iteration 106590/ 125429 | consumed samples: 27287040 | consumed tokens: 55883857920 | elapsed time per iteration (s): 1.08 | learning rate: 3.003E-05 | global batch size: 256 | lm loss: 1.894162E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.975 | TFLOPs: 39.16 | 15: iteration 106600/ 125429 | consumed samples: 27289600 | consumed tokens: 55889100800 | elapsed time per iteration (s): 1.05 | learning rate: 3.002E-05 | global batch size: 256 | lm loss: 1.881438E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.457 | TFLOPs: 40.40 | 15: iteration 106610/ 125429 | consumed samples: 27292160 | consumed tokens: 55894343680 | elapsed time per iteration (s): 1.03 | learning rate: 3.001E-05 | global batch size: 256 | lm loss: 1.917172E+00 | grad norm: 0.152 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.311 | TFLOPs: 41.04 | 15: iteration 106620/ 125429 | consumed samples: 27294720 | consumed tokens: 55899586560 | elapsed time per iteration (s): 1.08 | learning rate: 3.000E-05 | global batch size: 256 | lm loss: 1.891262E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.837 | TFLOPs: 39.30 | 15: iteration 106630/ 125429 | consumed samples: 27297280 | consumed tokens: 55904829440 | elapsed time per iteration (s): 1.04 | learning rate: 2.999E-05 | global batch size: 256 | lm loss: 1.894079E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.337 | TFLOPs: 40.54 | 15: iteration 106640/ 125429 | consumed samples: 27299840 | consumed tokens: 55910072320 | elapsed time per iteration (s): 1.03 | learning rate: 2.998E-05 | global batch size: 256 | lm loss: 1.899129E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.410 | TFLOPs: 41.05 | 15: iteration 106650/ 125429 | consumed samples: 27302400 | consumed tokens: 55915315200 | elapsed time per iteration (s): 1.07 | learning rate: 2.997E-05 | global batch size: 256 | lm loss: 1.887676E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.741 | TFLOPs: 39.45 | 15: iteration 106660/ 125429 | consumed samples: 27304960 | consumed tokens: 55920558080 | elapsed time per iteration (s): 1.04 | learning rate: 2.996E-05 | global batch size: 256 | lm loss: 1.919569E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.500 | TFLOPs: 40.74 | 15: iteration 106670/ 125429 | consumed samples: 27307520 | consumed tokens: 55925800960 | elapsed time per 
iteration (s): 1.05 | learning rate: 2.995E-05 | global batch size: 256 | lm loss: 1.867116E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.847 | TFLOPs: 40.30 | 15: iteration 106680/ 125429 | consumed samples: 27310080 | consumed tokens: 55931043840 | elapsed time per iteration (s): 1.03 | learning rate: 2.994E-05 | global batch size: 256 | lm loss: 1.887109E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.721 | TFLOPs: 41.27 | 15: iteration 106690/ 125429 | consumed samples: 27312640 | consumed tokens: 55936286720 | elapsed time per iteration (s): 1.06 | learning rate: 2.993E-05 | global batch size: 256 | lm loss: 1.908902E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.348 | TFLOPs: 39.88 | 15: iteration 106700/ 125429 | consumed samples: 27315200 | consumed tokens: 55941529600 | elapsed time per iteration (s): 1.04 | learning rate: 2.992E-05 | global batch size: 256 | lm loss: 1.900019E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.189 | TFLOPs: 40.52 | 15: iteration 106710/ 125429 | consumed samples: 27317760 | consumed tokens: 55946772480 | elapsed time per iteration (s): 1.21 | learning rate: 2.991E-05 | global batch size: 256 | lm loss: 1.878716E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 212.420 | TFLOPs: 35.10 | 15: iteration 106720/ 125429 | consumed samples: 27320320 | consumed tokens: 55952015360 | elapsed time per iteration (s): 1.04 | learning rate: 2.990E-05 | global batch size: 256 | lm loss: 1.928547E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.525 | TFLOPs: 
40.74 | 15: iteration 106730/ 125429 | consumed samples: 27322880 | consumed tokens: 55957258240 | elapsed time per iteration (s): 1.05 | learning rate: 2.988E-05 | global batch size: 256 | lm loss: 1.892878E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.121 | TFLOPs: 40.18 | 15: iteration 106740/ 125429 | consumed samples: 27325440 | consumed tokens: 55962501120 | elapsed time per iteration (s): 1.08 | learning rate: 2.987E-05 | global batch size: 256 | lm loss: 1.903177E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.996 | TFLOPs: 39.17 | 15: iteration 106750/ 125429 | consumed samples: 27328000 | consumed tokens: 55967744000 | elapsed time per iteration (s): 1.09 | learning rate: 2.986E-05 | global batch size: 256 | lm loss: 1.908574E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.912 | TFLOPs: 38.99 | 15: iteration 106760/ 125429 | consumed samples: 27330560 | consumed tokens: 55972986880 | elapsed time per iteration (s): 1.06 | learning rate: 2.985E-05 | global batch size: 256 | lm loss: 1.903531E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.071 | TFLOPs: 40.00 | 15: iteration 106770/ 125429 | consumed samples: 27333120 | consumed tokens: 55978229760 | elapsed time per iteration (s): 1.04 | learning rate: 2.984E-05 | global batch size: 256 | lm loss: 1.892306E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.779 | TFLOPs: 40.78 | 15: iteration 106780/ 125429 | consumed samples: 27335680 | consumed tokens: 55983472640 | elapsed time per iteration (s): 1.02 | learning rate: 2.983E-05 | global batch size: 256 | lm loss: 1.916858E+00 | grad norm: 0.154 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.806 | TFLOPs: 41.28 | 15: iteration 106790/ 125429 | consumed samples: 27338240 | consumed tokens: 55988715520 | elapsed time per iteration (s): 1.03 | learning rate: 2.982E-05 | global batch size: 256 | lm loss: 1.881717E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.588 | TFLOPs: 41.25 | 15: iteration 106800/ 125429 | consumed samples: 27340800 | consumed tokens: 55993958400 | elapsed time per iteration (s): 1.04 | learning rate: 2.981E-05 | global batch size: 256 | lm loss: 1.888788E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.372 | TFLOPs: 40.71 | 15: iteration 106810/ 125429 | consumed samples: 27343360 | consumed tokens: 55999201280 | elapsed time per iteration (s): 1.21 | learning rate: 2.980E-05 | global batch size: 256 | lm loss: 1.904707E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 211.315 | TFLOPs: 34.92 | 15: iteration 106820/ 125429 | consumed samples: 27345920 | consumed tokens: 56004444160 | elapsed time per iteration (s): 1.18 | learning rate: 2.979E-05 | global batch size: 256 | lm loss: 1.933507E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.357 | TFLOPs: 35.75 | 15: iteration 106830/ 125429 | consumed samples: 27348480 | consumed tokens: 56009687040 | elapsed time per iteration (s): 1.21 | learning rate: 2.978E-05 | global batch size: 256 | lm loss: 1.921348E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 211.273 | TFLOPs: 34.91 | 15: iteration 106840/ 125429 | consumed samples: 27351040 | consumed tokens: 56014929920 | elapsed time per 
iteration (s): 1.03 | learning rate: 2.977E-05 | global batch size: 256 | lm loss: 1.892783E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.717 | TFLOPs: 41.27 | 15: iteration 106850/ 125429 | consumed samples: 27353600 | consumed tokens: 56020172800 | elapsed time per iteration (s): 1.08 | learning rate: 2.976E-05 | global batch size: 256 | lm loss: 1.935929E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.596 | TFLOPs: 39.26 | 15: iteration 106860/ 125429 | consumed samples: 27356160 | consumed tokens: 56025415680 | elapsed time per iteration (s): 1.04 | learning rate: 2.975E-05 | global batch size: 256 | lm loss: 1.899841E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.330 | TFLOPs: 40.87 | 15: iteration 106870/ 125429 | consumed samples: 27358720 | consumed tokens: 56030658560 | elapsed time per iteration (s): 1.07 | learning rate: 2.974E-05 | global batch size: 256 | lm loss: 1.911557E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.491 | TFLOPs: 39.41 | 15: iteration 106880/ 125429 | consumed samples: 27361280 | consumed tokens: 56035901440 | elapsed time per iteration (s): 1.06 | learning rate: 2.973E-05 | global batch size: 256 | lm loss: 1.882962E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.432 | TFLOPs: 39.73 | 15: iteration 106890/ 125429 | consumed samples: 27363840 | consumed tokens: 56041144320 | elapsed time per iteration (s): 1.06 | learning rate: 2.972E-05 | global batch size: 256 | lm loss: 1.875860E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.604 | TFLOPs: 
39.76 | 15: iteration 106900/ 125429 | consumed samples: 27366400 | consumed tokens: 56046387200 | elapsed time per iteration (s): 1.04 | learning rate: 2.971E-05 | global batch size: 256 | lm loss: 1.891842E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.427 | TFLOPs: 40.56 | 15: iteration 106910/ 125429 | consumed samples: 27368960 | consumed tokens: 56051630080 | elapsed time per iteration (s): 1.08 | learning rate: 2.970E-05 | global batch size: 256 | lm loss: 1.879455E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.145 | TFLOPs: 39.02 | 15: iteration 106920/ 125429 | consumed samples: 27371520 | consumed tokens: 56056872960 | elapsed time per iteration (s): 1.06 | learning rate: 2.969E-05 | global batch size: 256 | lm loss: 1.880694E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.607 | TFLOPs: 40.09 | 15: iteration 106930/ 125429 | consumed samples: 27374080 | consumed tokens: 56062115840 | elapsed time per iteration (s): 1.05 | learning rate: 2.968E-05 | global batch size: 256 | lm loss: 1.901052E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.739 | TFLOPs: 40.44 | 15: iteration 106940/ 125429 | consumed samples: 27376640 | consumed tokens: 56067358720 | elapsed time per iteration (s): 1.05 | learning rate: 2.967E-05 | global batch size: 256 | lm loss: 1.938670E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.247 | TFLOPs: 40.20 | 15: iteration 106950/ 125429 | consumed samples: 27379200 | consumed tokens: 56072601600 | elapsed time per iteration (s): 1.02 | learning rate: 2.966E-05 | global batch size: 256 | lm loss: 1.890493E+00 | grad norm: 0.168 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.368 | TFLOPs: 41.38 | 15: iteration 106960/ 125429 | consumed samples: 27381760 | consumed tokens: 56077844480 | elapsed time per iteration (s): 1.03 | learning rate: 2.965E-05 | global batch size: 256 | lm loss: 1.933706E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.247 | TFLOPs: 41.19 | 15: iteration 106970/ 125429 | consumed samples: 27384320 | consumed tokens: 56083087360 | elapsed time per iteration (s): 1.03 | learning rate: 2.964E-05 | global batch size: 256 | lm loss: 1.892912E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.936 | TFLOPs: 40.97 | 15: iteration 106980/ 125429 | consumed samples: 27386880 | consumed tokens: 56088330240 | elapsed time per iteration (s): 1.04 | learning rate: 2.963E-05 | global batch size: 256 | lm loss: 1.870398E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.492 | TFLOPs: 40.57 | 15: iteration 106990/ 125429 | consumed samples: 27389440 | consumed tokens: 56093573120 | elapsed time per iteration (s): 1.07 | learning rate: 2.962E-05 | global batch size: 256 | lm loss: 1.916856E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.131 | TFLOPs: 39.52 | 15: iteration 107000/ 125429 | consumed samples: 27392000 | consumed tokens: 56098816000 | elapsed time per iteration (s): 1.05 | learning rate: 2.961E-05 | global batch size: 256 | lm loss: 1.910253E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.869 | TFLOPs: 40.14 | 15: -------------------------------------------------------------------------------------------- 15: valid loss at 
iteration 107000 | lm loss value: 1.861541E+00 | lm loss PPL: 6.433646E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 107000 to checkpoints_1b5 0: [2022-11-27 03:42:13,941] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step107000 is begin to save! 0: [2022-11-27 03:42:13,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_01-model_00-model_states.pt... 0: [2022-11-27 03:42:14,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_01-model_00-model_states.pt. 0: [2022-11-27 03:42:14,194] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_03-model_00-model_states.pt... 0: [2022-11-27 03:42:14,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_03-model_00-model_states.pt. 0: [2022-11-27 03:42:14,291] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_04-model_00-model_states.pt... 0: [2022-11-27 03:42:14,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_04-model_00-model_states.pt. 0: [2022-11-27 03:42:14,396] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_05-model_00-model_states.pt... 0: [2022-11-27 03:42:14,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_05-model_00-model_states.pt. 0: [2022-11-27 03:42:14,504] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_06-model_00-model_states.pt... 0: [2022-11-27 03:42:14,603] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_06-model_00-model_states.pt. 
0: [2022-11-27 03:42:14,603] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_07-model_00-model_states.pt... 0: [2022-11-27 03:42:14,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_07-model_00-model_states.pt. 0: [2022-11-27 03:42:14,710] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_08-model_00-model_states.pt... 0: [2022-11-27 03:42:14,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_08-model_00-model_states.pt. 0: [2022-11-27 03:42:14,810] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_09-model_00-model_states.pt... 0: [2022-11-27 03:42:14,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_09-model_00-model_states.pt. 0: [2022-11-27 03:42:14,916] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_10-model_00-model_states.pt... 0: [2022-11-27 03:42:15,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_10-model_00-model_states.pt. 0: [2022-11-27 03:42:15,020] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_11-model_00-model_states.pt... 0: [2022-11-27 03:42:15,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_11-model_00-model_states.pt. 0: [2022-11-27 03:42:15,124] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_12-model_00-model_states.pt... 0: [2022-11-27 03:42:15,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_12-model_00-model_states.pt. 
0: [2022-11-27 03:42:15,226] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_13-model_00-model_states.pt... 0: [2022-11-27 03:42:15,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_13-model_00-model_states.pt. 0: [2022-11-27 03:42:15,332] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_14-model_00-model_states.pt... 0: [2022-11-27 03:42:15,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_14-model_00-model_states.pt. 0: [2022-11-27 03:42:15,439] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_15-model_00-model_states.pt... 0: [2022-11-27 03:42:15,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_15-model_00-model_states.pt. 0: [2022-11-27 03:42:15,546] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_16-model_00-model_states.pt... 0: [2022-11-27 03:42:15,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_16-model_00-model_states.pt. 0: [2022-11-27 03:42:15,656] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_17-model_00-model_states.pt... 0: [2022-11-27 03:42:15,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_17-model_00-model_states.pt. 0: [2022-11-27 03:42:15,764] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_18-model_00-model_states.pt... 0: [2022-11-27 03:42:15,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_18-model_00-model_states.pt. 
0: [2022-11-27 03:42:15,873] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_19-model_00-model_states.pt... 0: [2022-11-27 03:42:15,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_19-model_00-model_states.pt. 0: [2022-11-27 03:42:15,984] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_20-model_00-model_states.pt... 0: [2022-11-27 03:42:16,093] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_20-model_00-model_states.pt. 0: [2022-11-27 03:42:16,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_21-model_00-model_states.pt... 0: [2022-11-27 03:42:16,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_21-model_00-model_states.pt. 0: [2022-11-27 03:42:16,204] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_22-model_00-model_states.pt... 0: [2022-11-27 03:42:16,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_22-model_00-model_states.pt. 0: [2022-11-27 03:42:16,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_23-model_00-model_states.pt... 0: [2022-11-27 03:42:16,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_23-model_00-model_states.pt. 0: [2022-11-27 03:42:16,423] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_24-model_00-model_states.pt... 0: [2022-11-27 03:42:16,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_24-model_00-model_states.pt. 
0: [2022-11-27 03:42:16,530] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_25-model_00-model_states.pt... 0: [2022-11-27 03:42:16,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_25-model_00-model_states.pt. 0: [2022-11-27 03:42:16,641] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_26-model_00-model_states.pt... 0: [2022-11-27 03:42:16,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_26-model_00-model_states.pt. 0: [2022-11-27 03:42:16,753] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_27-model_00-model_states.pt... 0: [2022-11-27 03:42:16,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_27-model_00-model_states.pt. 0: [2022-11-27 03:42:16,863] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_28-model_00-model_states.pt... 0: [2022-11-27 03:42:16,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_28-model_00-model_states.pt. 0: [2022-11-27 03:42:16,972] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_29-model_00-model_states.pt... 0: [2022-11-27 03:42:17,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_29-model_00-model_states.pt. 0: [2022-11-27 03:42:17,075] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_30-model_00-model_states.pt... 0: [2022-11-27 03:42:17,182] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_30-model_00-model_states.pt. 
0: [2022-11-27 03:42:17,182] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/layer_32-model_00-model_states.pt... 0: [2022-11-27 03:42:17,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/layer_32-model_00-model_states.pt. 0: [2022-11-27 03:42:17,187] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step107000/mp_rank_00_model_states.pt 0: [2022-11-27 03:42:17,187] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/mp_rank_00_model_states.pt... 0: [2022-11-27 03:42:17,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/mp_rank_00_model_states.pt. 0: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 
4: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 
0: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
10: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
14: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
6: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
2: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
3: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
11: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
3: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 1: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 15: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 14: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 13: [2022-11-27 03:42:17,231] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step107000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
6: [2022-11-27 03:42:17,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:42:17,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 03:42:17,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 2: [2022-11-27 03:42:17,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:42:17,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 03:42:17,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 6: [2022-11-27 03:42:17,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:42:17,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 03:42:17,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 1: [2022-11-27 03:42:17,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:42:17,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 03:42:17,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
11: [2022-11-27 03:42:17,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:42:17,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:42:17,397] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 03:42:17,397] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 5: [2022-11-27 03:42:17,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:42:17,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 03:42:17,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 14: [2022-11-27 03:42:17,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:42:17,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 03:42:17,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 5: [2022-11-27 03:42:17,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-27 03:42:17,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 03:42:17,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 7: [2022-11-27 03:42:17,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:42:17,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 03:42:17,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 12: [2022-11-27 03:42:17,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:42:17,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:42:17,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 5: [2022-11-27 03:42:17,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 12: [2022-11-27 03:42:17,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 5: [2022-11-27 03:42:17,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 0: [2022-11-27 03:42:17,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-27 03:42:17,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:42:17,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 03:42:17,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 03:42:17,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 0: [2022-11-27 03:42:17,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 12: [2022-11-27 03:42:17,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:42:17,402] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 03:42:17,402] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 14: [2022-11-27 03:42:17,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:42:17,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 03:42:17,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 2: [2022-11-27 03:42:17,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-27 03:42:17,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 03:42:17,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:42:17,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 5: [2022-11-27 03:42:17,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:42:17,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 5: [2022-11-27 03:42:17,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 2: [2022-11-27 03:42:17,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 5: [2022-11-27 03:42:17,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 1: [2022-11-27 03:42:17,395] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:42:17,395] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 03:42:17,395] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 1: [2022-11-27 03:42:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-27 03:42:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 03:42:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 1: [2022-11-27 03:42:17,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:42:17,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 03:42:17,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 11: [2022-11-27 03:42:17,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 03:42:17,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 11: [2022-11-27 03:42:17,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:42:17,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 9: [2022-11-27 03:42:17,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:42:17,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:42:17,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
11: [2022-11-27 03:42:17,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 9: [2022-11-27 03:42:17,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 03:42:17,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 11: [2022-11-27 03:42:17,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:42:17,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 11: [2022-11-27 03:42:17,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 9: [2022-11-27 03:42:17,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 9: [2022-11-27 03:42:17,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 9: [2022-11-27 03:42:17,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 11: [2022-11-27 03:42:17,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 14: [2022-11-27 03:42:17,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-27 03:42:17,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 03:42:17,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 9: [2022-11-27 03:42:17,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:42:17,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 11: [2022-11-27 03:42:17,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:42:17,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 9: [2022-11-27 03:42:17,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:42:17,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 12: [2022-11-27 03:42:17,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:42:17,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 12: [2022-11-27 03:42:17,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 03:42:17,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
6: [2022-11-27 03:42:17,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:42:17,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:42:17,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 2: [2022-11-27 03:42:17,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:42:17,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 03:42:17,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 03:42:17,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 2: [2022-11-27 03:42:17,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 2: [2022-11-27 03:42:17,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:42:17,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 03:42:17,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 6: [2022-11-27 03:42:17,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
3: [2022-11-27 03:42:17,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:42:17,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 03:42:17,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 1: [2022-11-27 03:42:17,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:42:17,410] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 03:42:17,410] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 6: [2022-11-27 03:42:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:42:17,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 03:42:17,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 14: [2022-11-27 03:42:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:42:17,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 03:42:17,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
14: [2022-11-27 03:42:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:42:17,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 03:42:17,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 2: [2022-11-27 03:42:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:42:17,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 03:42:17,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 5: [2022-11-27 03:42:17,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:42:17,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 03:42:17,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 0: [2022-11-27 03:42:17,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:42:17,414] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
0: [2022-11-27 03:42:17,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 3: [2022-11-27 03:42:17,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 0: [2022-11-27 03:42:17,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 3: [2022-11-27 03:42:17,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 5: [2022-11-27 03:42:17,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:42:17,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 03:42:17,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 7: [2022-11-27 03:42:17,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:42:17,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:42:17,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 1: [2022-11-27 03:42:17,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 7: [2022-11-27 03:42:17,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
1: [2022-11-27 03:42:17,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 7: [2022-11-27 03:42:17,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:42:17,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 03:42:17,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 11: [2022-11-27 03:42:17,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 03:42:17,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 11: [2022-11-27 03:42:17,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:42:17,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 03:42:17,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 11: [2022-11-27 03:42:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:42:17,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 03:42:17,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
11: [2022-11-27 03:42:17,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 03:42:17,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 03:42:17,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 3: [2022-11-27 03:42:17,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:42:17,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 03:42:17,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 6: [2022-11-27 03:42:17,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:42:17,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 03:42:17,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 6: [2022-11-27 03:42:17,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 03:42:17,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 03:42:17,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
5: [2022-11-27 03:42:17,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:42:17,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 03:42:17,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 03:42:17,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 03:42:17,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 5: [2022-11-27 03:42:17,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 4: [2022-11-27 03:42:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:42:17,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:42:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 03:42:17,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 03:42:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 4: [2022-11-27 03:42:17,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
4: [2022-11-27 03:42:17,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:42:17,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 03:42:17,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 4: [2022-11-27 03:42:17,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:42:17,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 03:42:17,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 4: [2022-11-27 03:42:17,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:42:17,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 03:42:17,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 11: [2022-11-27 03:42:17,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:42:17,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-27 03:42:17,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 03:42:17,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 2: [2022-11-27 03:42:17,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 03:42:17,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 03:42:17,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 7: [2022-11-27 03:42:17,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:42:17,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:42:17,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 03:42:17,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 03:42:17,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 7: [2022-11-27 03:42:17,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 3: [2022-11-27 03:42:17,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-27 03:42:17,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 03:42:17,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 9: [2022-11-27 03:42:17,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:42:17,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:42:17,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 7: [2022-11-27 03:42:17,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 9: [2022-11-27 03:42:17,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 7: [2022-11-27 03:42:17,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 12: [2022-11-27 03:42:17,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:42:17,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:42:17,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 03:42:17,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
12: [2022-11-27 03:42:17,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 03:42:17,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 7: [2022-11-27 03:42:17,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:42:17,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 7: [2022-11-27 03:42:17,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 03:42:17,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 14: [2022-11-27 03:42:17,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:42:17,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 03:42:17,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 14: [2022-11-27 03:42:17,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:42:17,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 03:42:17,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
12: [2022-11-27 03:42:17,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:42:17,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 03:42:17,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 1: [2022-11-27 03:42:17,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 03:42:17,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 3: [2022-11-27 03:42:17,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:42:17,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:42:17,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 03:42:17,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 12: [2022-11-27 03:42:17,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:42:17,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 3: [2022-11-27 03:42:17,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
12: [2022-11-27 03:42:17,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 03:42:17,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 12: [2022-11-27 03:42:17,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 03:42:17,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 03:42:17,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 15: [2022-11-27 03:42:17,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:42:17,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:42:17,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 03:42:17,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 03:42:17,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:42:17,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 15: [2022-11-27 03:42:17,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
15: [2022-11-27 03:42:17,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 03:42:17,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 4: [2022-11-27 03:42:17,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:42:17,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 03:42:17,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 1: [2022-11-27 03:42:17,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 03:42:17,430] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 03:42:17,430] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 13: [2022-11-27 03:42:17,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:42:17,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 03:42:17,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 13: [2022-11-27 03:42:17,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-27 03:42:17,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 03:42:17,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 13: [2022-11-27 03:42:17,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:42:17,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 03:42:17,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 13: [2022-11-27 03:42:17,419] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:42:17,419] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 03:42:17,419] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 13: [2022-11-27 03:42:17,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:42:17,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 03:42:17,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 13: [2022-11-27 03:42:17,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-27 03:42:17,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 03:42:17,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 13: [2022-11-27 03:42:17,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:42:17,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 03:42:17,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 13: [2022-11-27 03:42:17,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 03:42:17,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 03:42:17,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 10: [2022-11-27 03:42:17,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:42:17,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 03:42:17,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 10: [2022-11-27 03:42:17,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-27 03:42:17,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:42:17,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 03:42:17,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 03:42:17,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 10: [2022-11-27 03:42:17,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 10: [2022-11-27 03:42:17,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:42:17,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 03:42:17,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 14: [2022-11-27 03:42:17,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 03:42:17,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 9: [2022-11-27 03:42:17,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 03:42:17,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-27 03:42:17,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 03:42:17,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 03:42:17,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 9: [2022-11-27 03:42:17,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 4: [2022-11-27 03:42:17,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:42:17,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:42:17,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:42:17,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 0: [2022-11-27 03:42:17,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 03:42:17,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 4: [2022-11-27 03:42:17,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 0: [2022-11-27 03:42:17,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
0: [2022-11-27 03:42:17,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 11: [2022-11-27 03:42:17,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 03:42:17,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 4: [2022-11-27 03:42:17,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 03:42:17,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 03:42:17,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 15: [2022-11-27 03:42:17,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:42:17,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:42:17,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 6: [2022-11-27 03:42:17,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:42:17,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
6: [2022-11-27 03:42:17,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 03:42:17,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 10: [2022-11-27 03:42:17,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:42:17,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 03:42:17,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 10: [2022-11-27 03:42:17,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:42:17,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 03:42:17,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 03:42:17,439] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 03:42:17,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 10: [2022-11-27 03:42:17,439] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 7: [2022-11-27 03:42:17,442] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-27 03:42:17,443] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 03:42:17,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 14: [2022-11-27 03:42:17,433] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 0: [2022-11-27 03:42:17,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:42:17,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:42:17,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 03:42:17,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 0: [2022-11-27 03:42:17,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 03:42:17,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 03:42:17,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 15: [2022-11-27 03:42:17,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 03:42:17,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
15: [2022-11-27 03:42:17,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:42:17,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 03:42:17,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:42:17,438] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:42:17,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 15: [2022-11-27 03:42:17,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 03:42:17,438] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 03:42:17,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 15: [2022-11-27 03:42:17,438] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 15: [2022-11-27 03:42:17,446] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 03:42:17,446] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 03:42:17,446] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
3: [2022-11-27 03:42:17,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 03:42:17,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 03:42:17,483] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 8: [2022-11-27 03:42:17,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:42:17,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:42:17,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:42:17,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 03:42:17,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 03:42:17,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 03:42:17,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 8: [2022-11-27 03:42:17,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 8: [2022-11-27 03:42:17,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 
8: [2022-11-27 03:42:17,549] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:42:17,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 03:42:17,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 0: [2022-11-27 03:42:17,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 03:42:17,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 8: [2022-11-27 03:42:17,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:42:17,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 03:42:17,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 8: [2022-11-27 03:42:17,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:42:17,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 03:42:17,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 8: [2022-11-27 03:42:17,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-27 03:42:17,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 03:42:17,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 8: [2022-11-27 03:42:17,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 03:42:17,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step107000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 03:42:17,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step107000 is ready now! 0: successfully saved checkpoint at iteration 107000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3724.94 15: iteration 107010/ 125429 | consumed samples: 27394560 | consumed tokens: 56104058880 | elapsed time per iteration (s): 1.44 | learning rate: 2.960E-05 | global batch size: 256 | lm loss: 1.873787E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.888 | TFLOPs: 29.40 | 15: iteration 107020/ 125429 | consumed samples: 27397120 | consumed tokens: 56109301760 | elapsed time per iteration (s): 1.07 | learning rate: 2.959E-05 | global batch size: 256 | lm loss: 1.922894E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.737 | TFLOPs: 39.62 | 15: iteration 107030/ 125429 | consumed samples: 27399680 | consumed tokens: 56114544640 | elapsed time per iteration (s): 1.06 | learning rate: 2.958E-05 | global batch size: 256 | lm loss: 1.909953E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.488 | TFLOPs: 40.07 | 15: iteration 107040/ 125429 | consumed 
samples: 27402240 | consumed tokens: 56119787520 | elapsed time per iteration (s): 1.05 | learning rate: 2.957E-05 | global batch size: 256 | lm loss: 1.890860E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.564 | TFLOPs: 40.25 | 15: iteration 107050/ 125429 | consumed samples: 27404800 | consumed tokens: 56125030400 | elapsed time per iteration (s): 1.03 | learning rate: 2.956E-05 | global batch size: 256 | lm loss: 1.891727E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.700 | TFLOPs: 41.10 | 15: iteration 107060/ 125429 | consumed samples: 27407360 | consumed tokens: 56130273280 | elapsed time per iteration (s): 1.08 | learning rate: 2.955E-05 | global batch size: 256 | lm loss: 1.915498E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.102 | TFLOPs: 39.18 | 15: iteration 107070/ 125429 | consumed samples: 27409920 | consumed tokens: 56135516160 | elapsed time per iteration (s): 1.07 | learning rate: 2.954E-05 | global batch size: 256 | lm loss: 1.883388E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.366 | TFLOPs: 39.39 | 15: iteration 107080/ 125429 | consumed samples: 27412480 | consumed tokens: 56140759040 | elapsed time per iteration (s): 1.07 | learning rate: 2.952E-05 | global batch size: 256 | lm loss: 1.910084E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.062 | TFLOPs: 39.67 | 15: iteration 107090/ 125429 | consumed samples: 27415040 | consumed tokens: 56146001920 | elapsed time per iteration (s): 1.07 | learning rate: 2.951E-05 | global batch size: 256 | lm loss: 1.902898E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | 
number of nan iterations: 0 | samples per second: 238.641 | TFLOPs: 39.44 | 15: iteration 107100/ 125429 | consumed samples: 27417600 | consumed tokens: 56151244800 | elapsed time per iteration (s): 1.05 | learning rate: 2.950E-05 | global batch size: 256 | lm loss: 1.934125E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.005 | TFLOPs: 40.32 | 15: iteration 107110/ 125429 | consumed samples: 27420160 | consumed tokens: 56156487680 | elapsed time per iteration (s): 1.04 | learning rate: 2.949E-05 | global batch size: 256 | lm loss: 1.916026E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.925 | TFLOPs: 40.64 | 15: iteration 107120/ 125429 | consumed samples: 27422720 | consumed tokens: 56161730560 | elapsed time per iteration (s): 1.07 | learning rate: 2.948E-05 | global batch size: 256 | lm loss: 1.910371E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.991 | TFLOPs: 39.66 | 15: iteration 107130/ 125429 | consumed samples: 27425280 | consumed tokens: 56166973440 | elapsed time per iteration (s): 1.06 | learning rate: 2.947E-05 | global batch size: 256 | lm loss: 1.877092E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.851 | TFLOPs: 39.80 | 15: iteration 107140/ 125429 | consumed samples: 27427840 | consumed tokens: 56172216320 | elapsed time per iteration (s): 1.05 | learning rate: 2.946E-05 | global batch size: 256 | lm loss: 1.903619E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.886 | TFLOPs: 40.30 | 15: iteration 107150/ 125429 | consumed samples: 27430400 | consumed tokens: 56177459200 | elapsed time per iteration (s): 1.05 | learning rate: 2.945E-05 | global 
batch size: 256 | lm loss: 1.902912E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.878 | TFLOPs: 40.47 | 15: iteration 107160/ 125429 | consumed samples: 27432960 | consumed tokens: 56182702080 | elapsed time per iteration (s): 1.04 | learning rate: 2.944E-05 | global batch size: 256 | lm loss: 1.884397E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.168 | TFLOPs: 40.85 | 15: iteration 107170/ 125429 | consumed samples: 27435520 | consumed tokens: 56187944960 | elapsed time per iteration (s): 1.03 | learning rate: 2.943E-05 | global batch size: 256 | lm loss: 1.902199E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.745 | TFLOPs: 40.94 | 15: iteration 107180/ 125429 | consumed samples: 27438080 | consumed tokens: 56193187840 | elapsed time per iteration (s): 1.05 | learning rate: 2.942E-05 | global batch size: 256 | lm loss: 1.876795E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.505 | TFLOPs: 40.41 | 15: iteration 107190/ 125429 | consumed samples: 27440640 | consumed tokens: 56198430720 | elapsed time per iteration (s): 1.02 | learning rate: 2.941E-05 | global batch size: 256 | lm loss: 1.934575E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.126 | TFLOPs: 41.50 | 15: iteration 107200/ 125429 | consumed samples: 27443200 | consumed tokens: 56203673600 | elapsed time per iteration (s): 1.03 | learning rate: 2.940E-05 | global batch size: 256 | lm loss: 1.891165E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.025 | TFLOPs: 41.15 | 15: iteration 107210/ 125429 | consumed samples: 
27445760 | consumed tokens: 56208916480 | elapsed time per iteration (s): 1.03 | learning rate: 2.939E-05 | global batch size: 256 | lm loss: 1.896354E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.547 | TFLOPs: 40.91 | 15: iteration 107220/ 125429 | consumed samples: 27448320 | consumed tokens: 56214159360 | elapsed time per iteration (s): 1.04 | learning rate: 2.938E-05 | global batch size: 256 | lm loss: 1.909568E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.096 | TFLOPs: 40.50 | 15: iteration 107230/ 125429 | consumed samples: 27450880 | consumed tokens: 56219402240 | elapsed time per iteration (s): 1.03 | learning rate: 2.937E-05 | global batch size: 256 | lm loss: 1.889019E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.007 | TFLOPs: 40.99 | 15: iteration 107240/ 125429 | consumed samples: 27453440 | consumed tokens: 56224645120 | elapsed time per iteration (s): 1.04 | learning rate: 2.936E-05 | global batch size: 256 | lm loss: 1.908400E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.216 | TFLOPs: 40.85 | 15: iteration 107250/ 125429 | consumed samples: 27456000 | consumed tokens: 56229888000 | elapsed time per iteration (s): 1.04 | learning rate: 2.935E-05 | global batch size: 256 | lm loss: 1.904136E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.563 | TFLOPs: 40.75 | 15: iteration 107260/ 125429 | consumed samples: 27458560 | consumed tokens: 56235130880 | elapsed time per iteration (s): 1.06 | learning rate: 2.934E-05 | global batch size: 256 | lm loss: 1.903479E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 241.544 | TFLOPs: 39.92 | 15: iteration 107270/ 125429 | consumed samples: 27461120 | consumed tokens: 56240373760 | elapsed time per iteration (s): 1.07 | learning rate: 2.933E-05 | global batch size: 256 | lm loss: 1.900811E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.715 | TFLOPs: 39.61 | 15: iteration 107280/ 125429 | consumed samples: 27463680 | consumed tokens: 56245616640 | elapsed time per iteration (s): 1.03 | learning rate: 2.932E-05 | global batch size: 256 | lm loss: 1.874371E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.605 | TFLOPs: 41.08 | 15: iteration 107290/ 125429 | consumed samples: 27466240 | consumed tokens: 56250859520 | elapsed time per iteration (s): 1.04 | learning rate: 2.931E-05 | global batch size: 256 | lm loss: 1.864558E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.184 | TFLOPs: 40.68 | 15: iteration 107300/ 125429 | consumed samples: 27468800 | consumed tokens: 56256102400 | elapsed time per iteration (s): 1.05 | learning rate: 2.930E-05 | global batch size: 256 | lm loss: 1.888287E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.126 | TFLOPs: 40.34 | 15: iteration 107310/ 125429 | consumed samples: 27471360 | consumed tokens: 56261345280 | elapsed time per iteration (s): 1.03 | learning rate: 2.929E-05 | global batch size: 256 | lm loss: 1.879008E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.057 | TFLOPs: 40.99 | 15: iteration 107320/ 125429 | consumed samples: 27473920 | consumed tokens: 56266588160 | elapsed time per iteration (s): 1.06 | learning rate: 2.928E-05 | global batch 
size: 256 | lm loss: 1.903258E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.081 | TFLOPs: 39.84 | 15: iteration 107330/ 125429 | consumed samples: 27476480 | consumed tokens: 56271831040 | elapsed time per iteration (s): 1.06 | learning rate: 2.927E-05 | global batch size: 256 | lm loss: 1.898102E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.785 | TFLOPs: 39.96 | 15: iteration 107340/ 125429 | consumed samples: 27479040 | consumed tokens: 56277073920 | elapsed time per iteration (s): 1.04 | learning rate: 2.926E-05 | global batch size: 256 | lm loss: 1.904012E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.411 | TFLOPs: 40.56 | 15: iteration 107350/ 125429 | consumed samples: 27481600 | consumed tokens: 56282316800 | elapsed time per iteration (s): 1.02 | learning rate: 2.925E-05 | global batch size: 256 | lm loss: 1.912232E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.085 | TFLOPs: 41.49 | 15: iteration 107360/ 125429 | consumed samples: 27484160 | consumed tokens: 56287559680 | elapsed time per iteration (s): 1.04 | learning rate: 2.924E-05 | global batch size: 256 | lm loss: 1.899543E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.129 | TFLOPs: 40.67 | 15: iteration 107370/ 125429 | consumed samples: 27486720 | consumed tokens: 56292802560 | elapsed time per iteration (s): 1.02 | learning rate: 2.923E-05 | global batch size: 256 | lm loss: 1.899380E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.317 | TFLOPs: 41.37 | 15: iteration 107380/ 125429 | consumed samples: 27489280 
| consumed tokens: 56298045440 | elapsed time per iteration (s): 1.08 | learning rate: 2.922E-05 | global batch size: 256 | lm loss: 1.900952E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.608 | TFLOPs: 39.10 | 15: iteration 107390/ 125429 | consumed samples: 27491840 | consumed tokens: 56303288320 | elapsed time per iteration (s): 1.03 | learning rate: 2.921E-05 | global batch size: 256 | lm loss: 1.903827E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.888 | TFLOPs: 41.13 | 15: iteration 107400/ 125429 | consumed samples: 27494400 | consumed tokens: 56308531200 | elapsed time per iteration (s): 1.04 | learning rate: 2.920E-05 | global batch size: 256 | lm loss: 1.934463E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.046 | TFLOPs: 40.50 | 15: iteration 107410/ 125429 | consumed samples: 27496960 | consumed tokens: 56313774080 | elapsed time per iteration (s): 1.09 | learning rate: 2.919E-05 | global batch size: 256 | lm loss: 1.890140E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.475 | TFLOPs: 38.91 | 15: iteration 107420/ 125429 | consumed samples: 27499520 | consumed tokens: 56319016960 | elapsed time per iteration (s): 1.05 | learning rate: 2.918E-05 | global batch size: 256 | lm loss: 1.900216E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.282 | TFLOPs: 40.37 | 15: iteration 107430/ 125429 | consumed samples: 27502080 | consumed tokens: 56324259840 | elapsed time per iteration (s): 1.04 | learning rate: 2.917E-05 | global batch size: 256 | lm loss: 1.940585E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 245.504 | TFLOPs: 40.57 | 15: iteration 107440/ 125429 | consumed samples: 27504640 | consumed tokens: 56329502720 | elapsed time per iteration (s): 1.04 | learning rate: 2.916E-05 | global batch size: 256 | lm loss: 1.891867E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.504 | TFLOPs: 40.57 | 15: iteration 107450/ 125429 | consumed samples: 27507200 | consumed tokens: 56334745600 | elapsed time per iteration (s): 1.06 | learning rate: 2.915E-05 | global batch size: 256 | lm loss: 1.918555E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.497 | TFLOPs: 39.91 | 15: iteration 107460/ 125429 | consumed samples: 27509760 | consumed tokens: 56339988480 | elapsed time per iteration (s): 1.03 | learning rate: 2.914E-05 | global batch size: 256 | lm loss: 1.890513E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.384 | TFLOPs: 40.88 | 15: iteration 107470/ 125429 | consumed samples: 27512320 | consumed tokens: 56345231360 | elapsed time per iteration (s): 1.05 | learning rate: 2.913E-05 | global batch size: 256 | lm loss: 1.895588E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.457 | TFLOPs: 40.40 | 15: iteration 107480/ 125429 | consumed samples: 27514880 | consumed tokens: 56350474240 | elapsed time per iteration (s): 1.03 | learning rate: 2.912E-05 | global batch size: 256 | lm loss: 1.917104E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.326 | TFLOPs: 41.20 | 15: iteration 107490/ 125429 | consumed samples: 27517440 | consumed tokens: 56355717120 | elapsed time per iteration (s): 1.05 | learning rate: 2.911E-05 | global batch size: 
256 | lm loss: 1.883290E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.185 | TFLOPs: 40.19 | 15: iteration 107500/ 125429 | consumed samples: 27520000 | consumed tokens: 56360960000 | elapsed time per iteration (s): 1.09 | learning rate: 2.910E-05 | global batch size: 256 | lm loss: 1.904706E+00 | grad norm: 0.148 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.947 | TFLOPs: 38.83 | 15: iteration 107510/ 125429 | consumed samples: 27522560 | consumed tokens: 56366202880 | elapsed time per iteration (s): 1.11 | learning rate: 2.909E-05 | global batch size: 256 | lm loss: 1.883748E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.665 | TFLOPs: 38.12 | 15: iteration 107520/ 125429 | consumed samples: 27525120 | consumed tokens: 56371445760 | elapsed time per iteration (s): 1.04 | learning rate: 2.908E-05 | global batch size: 256 | lm loss: 1.907505E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.946 | TFLOPs: 40.64 | 15: iteration 107530/ 125429 | consumed samples: 27527680 | consumed tokens: 56376688640 | elapsed time per iteration (s): 1.03 | learning rate: 2.907E-05 | global batch size: 256 | lm loss: 1.888852E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.828 | TFLOPs: 41.12 | 15: iteration 107540/ 125429 | consumed samples: 27530240 | consumed tokens: 56381931520 | elapsed time per iteration (s): 1.03 | learning rate: 2.906E-05 | global batch size: 256 | lm loss: 1.903800E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.876 | TFLOPs: 41.13 | 15: iteration 107550/ 125429 | consumed samples: 27532800 | 
consumed tokens: 56387174400 | elapsed time per iteration (s): 1.06 | learning rate: 2.905E-05 | global batch size: 256 | lm loss: 1.910980E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.623 | TFLOPs: 39.93 | 15: iteration 107560/ 125429 | consumed samples: 27535360 | consumed tokens: 56392417280 | elapsed time per iteration (s): 1.02 | learning rate: 2.904E-05 | global batch size: 256 | lm loss: 1.908738E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.800 | TFLOPs: 41.28 | 15: iteration 107570/ 125429 | consumed samples: 27537920 | consumed tokens: 56397660160 | elapsed time per iteration (s): 1.04 | learning rate: 2.903E-05 | global batch size: 256 | lm loss: 1.909268E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.729 | TFLOPs: 40.61 | 15: iteration 107580/ 125429 | consumed samples: 27540480 | consumed tokens: 56402903040 | elapsed time per iteration (s): 1.06 | learning rate: 2.902E-05 | global batch size: 256 | lm loss: 1.894309E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.904 | TFLOPs: 39.81 | 15: iteration 107590/ 125429 | consumed samples: 27543040 | consumed tokens: 56408145920 | elapsed time per iteration (s): 1.04 | learning rate: 2.901E-05 | global batch size: 256 | lm loss: 1.889244E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.813 | TFLOPs: 40.62 | 15: iteration 107600/ 125429 | consumed samples: 27545600 | consumed tokens: 56413388800 | elapsed time per iteration (s): 1.05 | learning rate: 2.900E-05 | global batch size: 256 | lm loss: 1.902186E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 244.701 | TFLOPs: 40.44 | 15: iteration 107610/ 125429 | consumed samples: 27548160 | consumed tokens: 56418631680 | elapsed time per iteration (s): 1.06 | learning rate: 2.899E-05 | global batch size: 256 | lm loss: 1.910815E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.414 | TFLOPs: 40.06 | 15: iteration 107620/ 125429 | consumed samples: 27550720 | consumed tokens: 56423874560 | elapsed time per iteration (s): 1.12 | learning rate: 2.898E-05 | global batch size: 256 | lm loss: 1.920728E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 229.583 | TFLOPs: 37.94 | 15: iteration 107630/ 125429 | consumed samples: 27553280 | consumed tokens: 56429117440 | elapsed time per iteration (s): 1.05 | learning rate: 2.897E-05 | global batch size: 256 | lm loss: 1.876656E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.805 | TFLOPs: 40.13 | 15: iteration 107640/ 125429 | consumed samples: 27555840 | consumed tokens: 56434360320 | elapsed time per iteration (s): 1.03 | learning rate: 2.896E-05 | global batch size: 256 | lm loss: 1.921608E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.902 | TFLOPs: 40.97 | 15: iteration 107650/ 125429 | consumed samples: 27558400 | consumed tokens: 56439603200 | elapsed time per iteration (s): 1.05 | learning rate: 2.895E-05 | global batch size: 256 | lm loss: 1.894261E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.580 | TFLOPs: 40.42 | 15: iteration 107660/ 125429 | consumed samples: 27560960 | consumed tokens: 56444846080 | elapsed time per iteration (s): 1.05 | learning rate: 2.894E-05 | global batch size: 
256 | lm loss: 1.903119E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.664 | TFLOPs: 40.43 | 15: iteration 107670/ 125429 | consumed samples: 27563520 | consumed tokens: 56450088960 | elapsed time per iteration (s): 1.09 | learning rate: 2.893E-05 | global batch size: 256 | lm loss: 1.912920E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.603 | TFLOPs: 38.94 | 15: iteration 107680/ 125429 | consumed samples: 27566080 | consumed tokens: 56455331840 | elapsed time per iteration (s): 1.03 | learning rate: 2.892E-05 | global batch size: 256 | lm loss: 1.875864E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.945 | TFLOPs: 41.14 | 15: iteration 107690/ 125429 | consumed samples: 27568640 | consumed tokens: 56460574720 | elapsed time per iteration (s): 1.07 | learning rate: 2.891E-05 | global batch size: 256 | lm loss: 1.936667E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.892 | TFLOPs: 39.64 | 15: iteration 107700/ 125429 | consumed samples: 27571200 | consumed tokens: 56465817600 | elapsed time per iteration (s): 1.04 | learning rate: 2.890E-05 | global batch size: 256 | lm loss: 1.908730E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.121 | TFLOPs: 40.84 | 15: iteration 107710/ 125429 | consumed samples: 27573760 | consumed tokens: 56471060480 | elapsed time per iteration (s): 1.05 | learning rate: 2.889E-05 | global batch size: 256 | lm loss: 1.886827E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.846 | TFLOPs: 40.46 | 15: iteration 107720/ 125429 | consumed samples: 27576320 | 
consumed tokens: 56476303360 | elapsed time per iteration (s): 1.03 | learning rate: 2.888E-05 | global batch size: 256 | lm loss: 1.870496E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.118 | TFLOPs: 41.17 | 15: iteration 107730/ 125429 | consumed samples: 27578880 | consumed tokens: 56481546240 | elapsed time per iteration (s): 1.05 | learning rate: 2.887E-05 | global batch size: 256 | lm loss: 1.902155E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.655 | TFLOPs: 40.43 | 15: iteration 107740/ 125429 | consumed samples: 27581440 | consumed tokens: 56486789120 | elapsed time per iteration (s): 1.05 | learning rate: 2.886E-05 | global batch size: 256 | lm loss: 1.905303E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.647 | TFLOPs: 40.43 | 15: iteration 107750/ 125429 | consumed samples: 27584000 | consumed tokens: 56492032000 | elapsed time per iteration (s): 1.04 | learning rate: 2.885E-05 | global batch size: 256 | lm loss: 1.909784E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.586 | TFLOPs: 40.75 | 15: iteration 107760/ 125429 | consumed samples: 27586560 | consumed tokens: 56497274880 | elapsed time per iteration (s): 1.08 | learning rate: 2.884E-05 | global batch size: 256 | lm loss: 1.917109E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.704 | TFLOPs: 39.12 | 15: iteration 107770/ 125429 | consumed samples: 27589120 | consumed tokens: 56502517760 | elapsed time per iteration (s): 1.05 | learning rate: 2.883E-05 | global batch size: 256 | lm loss: 1.915056E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 244.469 | TFLOPs: 40.40 | 15: iteration 107780/ 125429 | consumed samples: 27591680 | consumed tokens: 56507760640 | elapsed time per iteration (s): 1.08 | learning rate: 2.882E-05 | global batch size: 256 | lm loss: 1.894153E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.025 | TFLOPs: 39.00 | 15: iteration 107790/ 125429 | consumed samples: 27594240 | consumed tokens: 56513003520 | elapsed time per iteration (s): 1.03 | learning rate: 2.881E-05 | global batch size: 256 | lm loss: 1.912371E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.694 | TFLOPs: 41.26 | 15: iteration 107800/ 125429 | consumed samples: 27596800 | consumed tokens: 56518246400 | elapsed time per iteration (s): 1.02 | learning rate: 2.880E-05 | global batch size: 256 | lm loss: 1.910297E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.344 | TFLOPs: 41.37 | 15: iteration 107810/ 125429 | consumed samples: 27599360 | consumed tokens: 56523489280 | elapsed time per iteration (s): 1.04 | learning rate: 2.879E-05 | global batch size: 256 | lm loss: 1.905809E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.112 | TFLOPs: 40.84 | 15: iteration 107820/ 125429 | consumed samples: 27601920 | consumed tokens: 56528732160 | elapsed time per iteration (s): 1.03 | learning rate: 2.878E-05 | global batch size: 256 | lm loss: 1.901315E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.951 | TFLOPs: 41.14 | 15: iteration 107830/ 125429 | consumed samples: 27604480 | consumed tokens: 56533975040 | elapsed time per iteration (s): 1.05 | learning rate: 2.877E-05 | global batch size: 
256 | lm loss: 1.886886E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.631 | TFLOPs: 40.26 | 15: iteration 107840/ 125429 | consumed samples: 27607040 | consumed tokens: 56539217920 | elapsed time per iteration (s): 1.06 | learning rate: 2.877E-05 | global batch size: 256 | lm loss: 1.911740E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.831 | TFLOPs: 39.96 | 15: iteration 107850/ 125429 | consumed samples: 27609600 | consumed tokens: 56544460800 | elapsed time per iteration (s): 1.03 | learning rate: 2.876E-05 | global batch size: 256 | lm loss: 1.912046E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.875 | TFLOPs: 40.96 | 15: iteration 107860/ 125429 | consumed samples: 27612160 | consumed tokens: 56549703680 | elapsed time per iteration (s): 1.04 | learning rate: 2.875E-05 | global batch size: 256 | lm loss: 1.854563E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.029 | TFLOPs: 40.49 | 15: iteration 107870/ 125429 | consumed samples: 27614720 | consumed tokens: 56554946560 | elapsed time per iteration (s): 1.04 | learning rate: 2.874E-05 | global batch size: 256 | lm loss: 1.898625E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.108 | TFLOPs: 40.84 | 15: iteration 107880/ 125429 | consumed samples: 27617280 | consumed tokens: 56560189440 | elapsed time per iteration (s): 1.03 | learning rate: 2.873E-05 | global batch size: 256 | lm loss: 1.901585E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.561 | TFLOPs: 40.91 | 15: iteration 107890/ 125429 | consumed samples: 27619840 | 
consumed tokens: 56565432320 | elapsed time per iteration (s): 1.05 | learning rate: 2.872E-05 | global batch size: 256 | lm loss: 1.902569E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.166 | TFLOPs: 40.35 | 15: iteration 107900/ 125429 | consumed samples: 27622400 | consumed tokens: 56570675200 | elapsed time per iteration (s): 2.45 | learning rate: 2.871E-05 | global batch size: 256 | lm loss: 1.906952E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 104.665 | TFLOPs: 17.30 | 15: iteration 107910/ 125429 | consumed samples: 27624960 | consumed tokens: 56575918080 | elapsed time per iteration (s): 1.03 | learning rate: 2.870E-05 | global batch size: 256 | lm loss: 1.905173E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.519 | TFLOPs: 41.23 | 15: iteration 107920/ 125429 | consumed samples: 27627520 | consumed tokens: 56581160960 | elapsed time per iteration (s): 1.04 | learning rate: 2.869E-05 | global batch size: 256 | lm loss: 1.895441E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.143 | TFLOPs: 40.68 | 15: iteration 107930/ 125429 | consumed samples: 27630080 | consumed tokens: 56586403840 | elapsed time per iteration (s): 1.03 | learning rate: 2.868E-05 | global batch size: 256 | lm loss: 1.892895E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.323 | TFLOPs: 41.20 | 15: iteration 107940/ 125429 | consumed samples: 27632640 | consumed tokens: 56591646720 | elapsed time per iteration (s): 1.03 | learning rate: 2.867E-05 | global batch size: 256 | lm loss: 1.875846E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 249.223 | TFLOPs: 41.19 | 15: iteration 107950/ 125429 | consumed samples: 27635200 | consumed tokens: 56596889600 | elapsed time per iteration (s): 1.05 | learning rate: 2.866E-05 | global batch size: 256 | lm loss: 1.895860E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.563 | TFLOPs: 40.42 | 15: iteration 107960/ 125429 | consumed samples: 27637760 | consumed tokens: 56602132480 | elapsed time per iteration (s): 1.03 | learning rate: 2.865E-05 | global batch size: 256 | lm loss: 1.888063E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.513 | TFLOPs: 41.07 | 15: iteration 107970/ 125429 | consumed samples: 27640320 | consumed tokens: 56607375360 | elapsed time per iteration (s): 1.05 | learning rate: 2.864E-05 | global batch size: 256 | lm loss: 1.913215E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.246 | TFLOPs: 40.20 | 15: iteration 107980/ 125429 | consumed samples: 27642880 | consumed tokens: 56612618240 | elapsed time per iteration (s): 1.05 | learning rate: 2.863E-05 | global batch size: 256 | lm loss: 1.862551E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.790 | TFLOPs: 40.29 | 15: iteration 107990/ 125429 | consumed samples: 27645440 | consumed tokens: 56617861120 | elapsed time per iteration (s): 1.04 | learning rate: 2.862E-05 | global batch size: 256 | lm loss: 1.881251E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.809 | TFLOPs: 40.79 | 0: [2022-11-27 03:59:59,029] [INFO] [logging.py:68:log_dist] [Rank 0] step=108000, skipped=0, lr=[2.8608847503752837e-05, 2.8608847503752837e-05, 2.8608847503752837e-05], 
mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 108000/ 125429 | consumed samples: 27648000 | consumed tokens: 56623104000 | elapsed time per iteration (s): 1.06 | learning rate: 2.861E-05 | global batch size: 256 | lm loss: 1.919972E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.860 | TFLOPs: 39.80 | 0: steps: 108000 loss: 1.9743 iter time (s): 1.061 samples/sec: 241.282 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 108000 | lm loss value: 1.880392E+00 | lm loss PPL: 6.556072E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 108000 to checkpoints_1b5 0: [2022-11-27 03:59:59,380] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step108000 is begin to save! 0: [2022-11-27 03:59:59,389] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_01-model_00-model_states.pt... 0: [2022-11-27 03:59:59,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_01-model_00-model_states.pt. 0: [2022-11-27 03:59:59,648] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_03-model_00-model_states.pt... 0: [2022-11-27 03:59:59,756] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_03-model_00-model_states.pt. 0: [2022-11-27 03:59:59,757] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_04-model_00-model_states.pt... 0: [2022-11-27 03:59:59,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_04-model_00-model_states.pt. 
0: [2022-11-27 03:59:59,870] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_05-model_00-model_states.pt... 0: [2022-11-27 03:59:59,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_05-model_00-model_states.pt. 0: [2022-11-27 03:59:59,977] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_06-model_00-model_states.pt... 0: [2022-11-27 04:00:00,082] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_06-model_00-model_states.pt. 0: [2022-11-27 04:00:00,082] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_07-model_00-model_states.pt... 0: [2022-11-27 04:00:00,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_07-model_00-model_states.pt. 0: [2022-11-27 04:00:00,185] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_08-model_00-model_states.pt... 0: [2022-11-27 04:00:00,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_08-model_00-model_states.pt. 0: [2022-11-27 04:00:00,291] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_09-model_00-model_states.pt... 0: [2022-11-27 04:00:00,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_09-model_00-model_states.pt. 0: [2022-11-27 04:00:00,395] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_10-model_00-model_states.pt... 0: [2022-11-27 04:00:00,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_10-model_00-model_states.pt. 
0: [2022-11-27 04:00:00,502] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_11-model_00-model_states.pt... 0: [2022-11-27 04:00:00,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_11-model_00-model_states.pt. 0: [2022-11-27 04:00:00,606] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_12-model_00-model_states.pt... 0: [2022-11-27 04:00:00,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_12-model_00-model_states.pt. 0: [2022-11-27 04:00:00,711] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_13-model_00-model_states.pt... 0: [2022-11-27 04:00:00,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_13-model_00-model_states.pt. 0: [2022-11-27 04:00:00,820] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_14-model_00-model_states.pt... 0: [2022-11-27 04:00:00,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_14-model_00-model_states.pt. 0: [2022-11-27 04:00:00,923] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_15-model_00-model_states.pt... 0: [2022-11-27 04:00:01,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_15-model_00-model_states.pt. 0: [2022-11-27 04:00:01,032] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_16-model_00-model_states.pt... 0: [2022-11-27 04:00:01,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_16-model_00-model_states.pt. 
0: [2022-11-27 04:00:01,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_17-model_00-model_states.pt... 0: [2022-11-27 04:00:01,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_17-model_00-model_states.pt. 0: [2022-11-27 04:00:01,243] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_18-model_00-model_states.pt... 0: [2022-11-27 04:00:01,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_18-model_00-model_states.pt. 0: [2022-11-27 04:00:01,346] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_19-model_00-model_states.pt... 0: [2022-11-27 04:00:01,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_19-model_00-model_states.pt. 0: [2022-11-27 04:00:01,451] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_20-model_00-model_states.pt... 0: [2022-11-27 04:00:01,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_20-model_00-model_states.pt. 0: [2022-11-27 04:00:01,557] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_21-model_00-model_states.pt... 0: [2022-11-27 04:00:01,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_21-model_00-model_states.pt. 0: [2022-11-27 04:00:01,662] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_22-model_00-model_states.pt... 0: [2022-11-27 04:00:01,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_22-model_00-model_states.pt. 
0: [2022-11-27 04:00:01,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_23-model_00-model_states.pt... 0: [2022-11-27 04:00:01,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_23-model_00-model_states.pt. 0: [2022-11-27 04:00:01,876] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_24-model_00-model_states.pt... 0: [2022-11-27 04:00:01,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_24-model_00-model_states.pt. 0: [2022-11-27 04:00:01,977] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_25-model_00-model_states.pt... 0: [2022-11-27 04:00:02,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_25-model_00-model_states.pt. 0: [2022-11-27 04:00:02,079] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_26-model_00-model_states.pt... 0: [2022-11-27 04:00:02,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_26-model_00-model_states.pt. 0: [2022-11-27 04:00:02,186] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_27-model_00-model_states.pt... 0: [2022-11-27 04:00:02,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_27-model_00-model_states.pt. 0: [2022-11-27 04:00:02,293] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_28-model_00-model_states.pt... 0: [2022-11-27 04:00:02,397] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_28-model_00-model_states.pt. 
0: [2022-11-27 04:00:02,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_29-model_00-model_states.pt... 0: [2022-11-27 04:00:02,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_29-model_00-model_states.pt. 0: [2022-11-27 04:00:02,506] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_30-model_00-model_states.pt... 0: [2022-11-27 04:00:02,605] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_30-model_00-model_states.pt. 0: [2022-11-27 04:00:02,605] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/layer_32-model_00-model_states.pt... 0: [2022-11-27 04:00:02,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/layer_32-model_00-model_states.pt. 0: [2022-11-27 04:00:02,613] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step108000/mp_rank_00_model_states.pt 0: [2022-11-27 04:00:02,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/mp_rank_00_model_states.pt... 0: [2022-11-27 04:00:02,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/mp_rank_00_model_states.pt. 0: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
0: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 
15: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
13: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 
12: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
8: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 
15: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
14: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
6: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
7: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:00:02,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step108000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:00:02,817] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:00:02,817] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 04:00:02,817] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 10: [2022-11-27 04:00:02,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:00:02,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-27 04:00:02,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 04:00:02,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 04:00:02,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 10: [2022-11-27 04:00:02,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 11: [2022-11-27 04:00:02,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:00:02,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:00:02,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 04:00:02,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 11: [2022-11-27 04:00:02,824] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 04:00:02,824] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 13: [2022-11-27 04:00:02,824] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
13: [2022-11-27 04:00:02,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 04:00:02,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 13: [2022-11-27 04:00:02,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:00:02,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 04:00:02,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 5: [2022-11-27 04:00:02,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:00:02,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 04:00:02,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 6: [2022-11-27 04:00:02,828] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:00:02,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 04:00:02,829] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 7: [2022-11-27 04:00:02,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-27 04:00:02,829] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 04:00:02,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 6: [2022-11-27 04:00:02,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:00:02,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 04:00:02,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 5: [2022-11-27 04:00:02,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:00:02,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 04:00:02,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 11: [2022-11-27 04:00:02,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:00:02,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:00:02,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 04:00:02,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
0: [2022-11-27 04:00:02,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:00:02,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:00:02,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:00:02,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 04:00:02,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 0: [2022-11-27 04:00:02,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 04:00:02,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 04:00:02,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 0: [2022-11-27 04:00:02,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 7: [2022-11-27 04:00:02,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:00:02,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 04:00:02,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
9: [2022-11-27 04:00:02,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:00:02,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 04:00:02,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 7: [2022-11-27 04:00:02,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:00:02,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 04:00:02,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 13: [2022-11-27 04:00:02,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:00:02,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 04:00:02,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 10: [2022-11-27 04:00:02,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:00:02,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-27 04:00:02,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 04:00:02,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 04:00:02,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 10: [2022-11-27 04:00:02,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 0: [2022-11-27 04:00:02,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:00:02,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 04:00:02,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 6: [2022-11-27 04:00:02,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:00:02,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 04:00:02,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 5: [2022-11-27 04:00:02,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-27 04:00:02,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 04:00:02,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 7: [2022-11-27 04:00:02,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:00:02,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 04:00:02,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 5: [2022-11-27 04:00:02,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:00:02,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 04:00:02,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 2: [2022-11-27 04:00:02,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:00:02,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 04:00:02,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 10: [2022-11-27 04:00:02,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-27 04:00:02,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 04:00:02,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 10: [2022-11-27 04:00:02,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:00:02,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 04:00:02,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 5: [2022-11-27 04:00:02,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:00:02,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 04:00:02,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 2: [2022-11-27 04:00:02,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:00:02,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 04:00:02,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 9: [2022-11-27 04:00:02,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-27 04:00:02,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 04:00:02,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 10: [2022-11-27 04:00:02,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:00:02,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:00:02,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 7: [2022-11-27 04:00:02,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 10: [2022-11-27 04:00:02,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 7: [2022-11-27 04:00:02,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 3: [2022-11-27 04:00:02,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:00:02,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:00:02,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:00:02,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-27 04:00:02,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 04:00:02,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 04:00:02,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 04:00:02,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 04:00:02,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 3: [2022-11-27 04:00:02,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 3: [2022-11-27 04:00:02,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 3: [2022-11-27 04:00:02,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 9: [2022-11-27 04:00:02,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:00:02,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 04:00:02,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 5: [2022-11-27 04:00:02,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-27 04:00:02,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 04:00:02,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 9: [2022-11-27 04:00:02,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:00:02,847] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 04:00:02,847] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 3: [2022-11-27 04:00:02,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:00:02,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 04:00:02,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 6: [2022-11-27 04:00:02,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:00:02,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-27 04:00:02,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 04:00:02,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 04:00:02,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 6: [2022-11-27 04:00:02,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 9: [2022-11-27 04:00:02,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:00:02,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 04:00:02,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 5: [2022-11-27 04:00:02,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:00:02,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:00:02,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 9: [2022-11-27 04:00:02,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 5: [2022-11-27 04:00:02,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
9: [2022-11-27 04:00:02,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 13: [2022-11-27 04:00:02,840] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:00:02,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:00:02,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:00:02,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 3: [2022-11-27 04:00:02,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 8: [2022-11-27 04:00:02,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 3: [2022-11-27 04:00:02,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 8: [2022-11-27 04:00:02,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:00:02,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 04:00:02,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 8: [2022-11-27 04:00:02,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-27 04:00:02,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 04:00:02,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 8: [2022-11-27 04:00:02,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:00:02,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 04:00:02,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 8: [2022-11-27 04:00:02,845] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:00:02,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 04:00:02,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 0: [2022-11-27 04:00:02,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:00:02,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:00:02,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 04:00:02,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
9: [2022-11-27 04:00:02,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 3: [2022-11-27 04:00:02,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:00:02,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 3: [2022-11-27 04:00:02,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 04:00:02,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 3: [2022-11-27 04:00:02,851] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:00:02,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 04:00:02,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 2: [2022-11-27 04:00:02,852] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:00:02,852] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 04:00:02,852] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
13: [2022-11-27 04:00:02,840] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 04:00:02,840] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 2: [2022-11-27 04:00:02,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:00:02,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 04:00:02,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 11: [2022-11-27 04:00:02,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 04:00:02,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 11: [2022-11-27 04:00:02,832] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:00:02,832] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 04:00:02,832] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 11: [2022-11-27 04:00:02,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-27 04:00:02,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 04:00:02,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 11: [2022-11-27 04:00:02,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:00:02,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 04:00:02,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 11: [2022-11-27 04:00:02,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:00:02,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 04:00:02,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 11: [2022-11-27 04:00:02,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:00:02,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 04:00:02,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 5: [2022-11-27 04:00:02,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-27 04:00:02,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 04:00:02,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 13: [2022-11-27 04:00:02,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:00:02,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:00:02,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 04:00:02,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 8: [2022-11-27 04:00:02,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:00:02,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 04:00:02,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 8: [2022-11-27 04:00:02,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:00:02,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 04:00:02,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
7: [2022-11-27 04:00:02,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:00:02,874] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 04:00:02,874] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 8: [2022-11-27 04:00:02,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:00:02,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 04:00:02,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 6: [2022-11-27 04:00:02,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:00:02,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 04:00:02,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 6: [2022-11-27 04:00:02,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:00:02,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 04:00:02,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
2: [2022-11-27 04:00:02,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:00:02,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:00:02,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 04:00:02,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 04:00:02,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 2: [2022-11-27 04:00:02,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 2: [2022-11-27 04:00:02,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:00:02,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 04:00:02,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 13: [2022-11-27 04:00:02,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 04:00:02,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 13: [2022-11-27 04:00:02,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-27 04:00:02,854] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 04:00:02,854] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 13: [2022-11-27 04:00:02,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:00:02,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:00:02,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 04:00:02,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 04:00:02,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 13: [2022-11-27 04:00:02,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 11: [2022-11-27 04:00:02,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:00:02,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:00:02,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 04:00:02,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
7: [2022-11-27 04:00:02,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:00:02,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 04:00:02,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 11: [2022-11-27 04:00:02,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 04:00:02,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 1: [2022-11-27 04:00:02,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:00:02,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:00:02,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:00:02,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 04:00:02,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 04:00:02,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 04:00:02,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
1: [2022-11-27 04:00:02,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 1: [2022-11-27 04:00:02,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 1: [2022-11-27 04:00:02,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:00:02,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:00:02,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:00:02,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:00:02,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:00:02,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 04:00:02,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 04:00:02,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 04:00:02,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 1: [2022-11-27 04:00:02,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
1: [2022-11-27 04:00:02,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 04:00:02,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 04:00:02,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 1: [2022-11-27 04:00:02,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 1: [2022-11-27 04:00:02,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 14: [2022-11-27 04:00:02,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:00:02,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:00:02,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:00:02,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-27 04:00:02,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 04:00:02,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 04:00:02,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 04:00:02,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:00:02,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 04:00:02,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 14: [2022-11-27 04:00:02,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 14: [2022-11-27 04:00:02,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 14: [2022-11-27 04:00:02,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 14: [2022-11-27 04:00:02,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 04:00:02,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 14: [2022-11-27 04:00:02,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-27 04:00:02,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 04:00:02,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 14: [2022-11-27 04:00:02,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:00:02,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:00:02,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 04:00:02,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 04:00:02,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 14: [2022-11-27 04:00:02,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 4: [2022-11-27 04:00:02,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:00:02,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:00:02,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:00:02,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-27 04:00:02,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 04:00:02,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 04:00:02,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 04:00:02,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 04:00:02,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 4: [2022-11-27 04:00:02,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 4: [2022-11-27 04:00:02,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 4: [2022-11-27 04:00:02,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 4: [2022-11-27 04:00:02,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:00:02,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:00:02,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-27 04:00:02,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 04:00:02,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 04:00:02,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 04:00:02,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 4: [2022-11-27 04:00:02,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 4: [2022-11-27 04:00:02,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:00:02,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 4: [2022-11-27 04:00:02,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 04:00:02,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 12: [2022-11-27 04:00:02,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:00:02,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:00:02,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 
12: [2022-11-27 04:00:02,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:00:02,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:00:02,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:00:02,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:00:02,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:00:02,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 04:00:02,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 04:00:02,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 04:00:02,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 04:00:02,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 04:00:02,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 04:00:02,979] 
[INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 04:00:02,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 04:00:02,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 12: [2022-11-27 04:00:02,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 12: [2022-11-27 04:00:02,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 12: [2022-11-27 04:00:02,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 12: [2022-11-27 04:00:02,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 12: [2022-11-27 04:00:02,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 12: [2022-11-27 04:00:02,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 12: [2022-11-27 04:00:02,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 0: [2022-11-27 04:00:02,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:00:02,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:00:02,999] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
0: [2022-11-27 04:00:03,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 04:00:03,000] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 04:00:03,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 0: [2022-11-27 04:00:03,000] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 0: [2022-11-27 04:00:03,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 04:00:03,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 15: [2022-11-27 04:00:03,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:00:03,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:00:03,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:00:03,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:00:03,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:00:03,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-27 04:00:03,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 04:00:03,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 04:00:03,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 04:00:03,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 04:00:03,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 04:00:03,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 04:00:03,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 15: [2022-11-27 04:00:03,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 15: [2022-11-27 04:00:03,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 15: [2022-11-27 04:00:03,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 15: [2022-11-27 04:00:03,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 15: [2022-11-27 04:00:03,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 
15: [2022-11-27 04:00:03,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:00:03,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 04:00:03,035] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:00:03,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 15: [2022-11-27 04:00:03,035] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step108000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 04:00:03,035] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step108000 is ready now! 0: successfully saved checkpoint at iteration 108000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3696.45 15: iteration 108010/ 125429 | consumed samples: 27650560 | consumed tokens: 56628346880 | elapsed time per iteration (s): 1.43 | learning rate: 2.860E-05 | global batch size: 256 | lm loss: 1.906295E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 178.987 | TFLOPs: 29.58 | 15: iteration 108020/ 125429 | consumed samples: 27653120 | consumed tokens: 56633589760 | elapsed time per iteration (s): 1.04 | learning rate: 2.859E-05 | global batch size: 256 | lm loss: 1.907613E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.878 | TFLOPs: 40.80 | 15: iteration 108030/ 125429 | consumed samples: 27655680 | consumed tokens: 56638832640 | elapsed time per iteration (s): 1.16 | learning rate: 2.858E-05 | global batch size: 256 | lm loss: 1.872421E+00 | grad 
norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 220.964 | TFLOPs: 36.52 | 15: iteration 108040/ 125429 | consumed samples: 27658240 | consumed tokens: 56644075520 | elapsed time per iteration (s): 1.04 | learning rate: 2.857E-05 | global batch size: 256 | lm loss: 1.896576E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.108 | TFLOPs: 40.84 | 15: iteration 108050/ 125429 | consumed samples: 27660800 | consumed tokens: 56649318400 | elapsed time per iteration (s): 1.02 | learning rate: 2.856E-05 | global batch size: 256 | lm loss: 1.909457E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.957 | TFLOPs: 41.31 | 15: iteration 108060/ 125429 | consumed samples: 27663360 | consumed tokens: 56654561280 | elapsed time per iteration (s): 1.04 | learning rate: 2.855E-05 | global batch size: 256 | lm loss: 1.876112E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.822 | TFLOPs: 40.79 | 15: iteration 108070/ 125429 | consumed samples: 27665920 | consumed tokens: 56659804160 | elapsed time per iteration (s): 1.05 | learning rate: 2.854E-05 | global batch size: 256 | lm loss: 1.902961E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.950 | TFLOPs: 40.48 | 15: iteration 108080/ 125429 | consumed samples: 27668480 | consumed tokens: 56665047040 | elapsed time per iteration (s): 1.03 | learning rate: 2.853E-05 | global batch size: 256 | lm loss: 1.890312E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.857 | TFLOPs: 41.13 | 15: iteration 108090/ 125429 | consumed samples: 27671040 | consumed tokens: 56670289920 | elapsed 
time per iteration (s): 1.04 | learning rate: 2.852E-05 | global batch size: 256 | lm loss: 1.906346E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.987 | TFLOPs: 40.82 | 15: iteration 108100/ 125429 | consumed samples: 27673600 | consumed tokens: 56675532800 | elapsed time per iteration (s): 1.02 | learning rate: 2.851E-05 | global batch size: 256 | lm loss: 1.915223E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.062 | TFLOPs: 41.32 | 15: iteration 108110/ 125429 | consumed samples: 27676160 | consumed tokens: 56680775680 | elapsed time per iteration (s): 1.07 | learning rate: 2.850E-05 | global batch size: 256 | lm loss: 1.896030E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.819 | TFLOPs: 39.47 | 15: iteration 108120/ 125429 | consumed samples: 27678720 | consumed tokens: 56686018560 | elapsed time per iteration (s): 1.06 | learning rate: 2.849E-05 | global batch size: 256 | lm loss: 1.890361E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.395 | TFLOPs: 39.89 | 15: iteration 108130/ 125429 | consumed samples: 27681280 | consumed tokens: 56691261440 | elapsed time per iteration (s): 1.04 | learning rate: 2.848E-05 | global batch size: 256 | lm loss: 1.892936E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.708 | TFLOPs: 40.61 | 15: iteration 108140/ 125429 | consumed samples: 27683840 | consumed tokens: 56696504320 | elapsed time per iteration (s): 1.05 | learning rate: 2.847E-05 | global batch size: 256 | lm loss: 1.894625E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.060 | 
TFLOPs: 40.33 | 15: iteration 108150/ 125429 | consumed samples: 27686400 | consumed tokens: 56701747200 | elapsed time per iteration (s): 1.04 | learning rate: 2.846E-05 | global batch size: 256 | lm loss: 1.897150E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.096 | TFLOPs: 40.67 | 15: iteration 108160/ 125429 | consumed samples: 27688960 | consumed tokens: 56706990080 | elapsed time per iteration (s): 1.05 | learning rate: 2.845E-05 | global batch size: 256 | lm loss: 1.902550E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.684 | TFLOPs: 40.44 | 15: iteration 108170/ 125429 | consumed samples: 27691520 | consumed tokens: 56712232960 | elapsed time per iteration (s): 1.05 | learning rate: 2.844E-05 | global batch size: 256 | lm loss: 1.894894E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.041 | TFLOPs: 40.16 | 15: iteration 108180/ 125429 | consumed samples: 27694080 | consumed tokens: 56717475840 | elapsed time per iteration (s): 1.05 | learning rate: 2.843E-05 | global batch size: 256 | lm loss: 1.895940E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.150 | TFLOPs: 40.35 | 15: iteration 108190/ 125429 | consumed samples: 27696640 | consumed tokens: 56722718720 | elapsed time per iteration (s): 1.06 | learning rate: 2.843E-05 | global batch size: 256 | lm loss: 1.891131E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.580 | TFLOPs: 40.09 | 15: iteration 108200/ 125429 | consumed samples: 27699200 | consumed tokens: 56727961600 | elapsed time per iteration (s): 1.05 | learning rate: 2.842E-05 | global batch size: 256 | lm loss: 1.919081E+00 | grad norm: 0.157 
| num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.921 | TFLOPs: 40.31 | 15: iteration 108210/ 125429 | consumed samples: 27701760 | consumed tokens: 56733204480 | elapsed time per iteration (s): 1.03 | learning rate: 2.841E-05 | global batch size: 256 | lm loss: 1.897422E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.237 | TFLOPs: 41.02 | 15: iteration 108220/ 125429 | consumed samples: 27704320 | consumed tokens: 56738447360 | elapsed time per iteration (s): 1.04 | learning rate: 2.840E-05 | global batch size: 256 | lm loss: 1.903704E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.386 | TFLOPs: 40.55 | 15: iteration 108230/ 125429 | consumed samples: 27706880 | consumed tokens: 56743690240 | elapsed time per iteration (s): 1.05 | learning rate: 2.839E-05 | global batch size: 256 | lm loss: 1.886529E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.646 | TFLOPs: 40.43 | 15: iteration 108240/ 125429 | consumed samples: 27709440 | consumed tokens: 56748933120 | elapsed time per iteration (s): 1.05 | learning rate: 2.838E-05 | global batch size: 256 | lm loss: 1.882961E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.367 | TFLOPs: 40.38 | 15: iteration 108250/ 125429 | consumed samples: 27712000 | consumed tokens: 56754176000 | elapsed time per iteration (s): 1.04 | learning rate: 2.837E-05 | global batch size: 256 | lm loss: 1.910193E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.474 | TFLOPs: 40.57 | 15: iteration 108260/ 125429 | consumed samples: 27714560 | consumed tokens: 56759418880 | elapsed time per 
iteration (s): 1.07 | learning rate: 2.836E-05 | global batch size: 256 | lm loss: 1.882920E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.583 | TFLOPs: 39.59 | 15: iteration 108270/ 125429 | consumed samples: 27717120 | consumed tokens: 56764661760 | elapsed time per iteration (s): 1.04 | learning rate: 2.835E-05 | global batch size: 256 | lm loss: 1.898306E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.027 | TFLOPs: 40.66 | 15: iteration 108280/ 125429 | consumed samples: 27719680 | consumed tokens: 56769904640 | elapsed time per iteration (s): 1.03 | learning rate: 2.834E-05 | global batch size: 256 | lm loss: 1.925293E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.805 | TFLOPs: 41.12 | 15: iteration 108290/ 125429 | consumed samples: 27722240 | consumed tokens: 56775147520 | elapsed time per iteration (s): 1.06 | learning rate: 2.833E-05 | global batch size: 256 | lm loss: 1.901404E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.666 | TFLOPs: 39.77 | 15: iteration 108300/ 125429 | consumed samples: 27724800 | consumed tokens: 56780390400 | elapsed time per iteration (s): 1.05 | learning rate: 2.832E-05 | global batch size: 256 | lm loss: 1.906338E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.533 | TFLOPs: 40.25 | 15: iteration 108310/ 125429 | consumed samples: 27727360 | consumed tokens: 56785633280 | elapsed time per iteration (s): 1.06 | learning rate: 2.831E-05 | global batch size: 256 | lm loss: 1.891101E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.630 | TFLOPs: 
39.93 | 15: iteration 108320/ 125429 | consumed samples: 27729920 | consumed tokens: 56790876160 | elapsed time per iteration (s): 1.05 | learning rate: 2.830E-05 | global batch size: 256 | lm loss: 1.903667E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.547 | TFLOPs: 40.41 | 15: iteration 108330/ 125429 | consumed samples: 27732480 | consumed tokens: 56796119040 | elapsed time per iteration (s): 1.03 | learning rate: 2.829E-05 | global batch size: 256 | lm loss: 1.877708E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.515 | TFLOPs: 40.90 | 15: iteration 108340/ 125429 | consumed samples: 27735040 | consumed tokens: 56801361920 | elapsed time per iteration (s): 1.03 | learning rate: 2.828E-05 | global batch size: 256 | lm loss: 1.897071E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.479 | TFLOPs: 41.06 | 15: iteration 108350/ 125429 | consumed samples: 27737600 | consumed tokens: 56806604800 | elapsed time per iteration (s): 1.03 | learning rate: 2.827E-05 | global batch size: 256 | lm loss: 1.908529E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.770 | TFLOPs: 40.95 | 15: iteration 108360/ 125429 | consumed samples: 27740160 | consumed tokens: 56811847680 | elapsed time per iteration (s): 1.04 | learning rate: 2.826E-05 | global batch size: 256 | lm loss: 1.919389E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.008 | TFLOPs: 40.65 | 15: iteration 108370/ 125429 | consumed samples: 27742720 | consumed tokens: 56817090560 | elapsed time per iteration (s): 1.19 | learning rate: 2.825E-05 | global batch size: 256 | lm loss: 1.919991E+00 | grad norm: 0.155 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.298 | TFLOPs: 35.41 | 15: iteration 108380/ 125429 | consumed samples: 27745280 | consumed tokens: 56822333440 | elapsed time per iteration (s): 1.04 | learning rate: 2.824E-05 | global batch size: 256 | lm loss: 1.915970E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.420 | TFLOPs: 40.72 | 15: iteration 108390/ 125429 | consumed samples: 27747840 | consumed tokens: 56827576320 | elapsed time per iteration (s): 1.04 | learning rate: 2.823E-05 | global batch size: 256 | lm loss: 1.906650E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.198 | TFLOPs: 40.69 | 15: iteration 108400/ 125429 | consumed samples: 27750400 | consumed tokens: 56832819200 | elapsed time per iteration (s): 1.04 | learning rate: 2.822E-05 | global batch size: 256 | lm loss: 1.900385E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.573 | TFLOPs: 40.58 | 15: iteration 108410/ 125429 | consumed samples: 27752960 | consumed tokens: 56838062080 | elapsed time per iteration (s): 1.06 | learning rate: 2.821E-05 | global batch size: 256 | lm loss: 1.907109E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.499 | TFLOPs: 39.74 | 15: iteration 108420/ 125429 | consumed samples: 27755520 | consumed tokens: 56843304960 | elapsed time per iteration (s): 1.05 | learning rate: 2.821E-05 | global batch size: 256 | lm loss: 1.897773E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.967 | TFLOPs: 40.15 | 15: iteration 108430/ 125429 | consumed samples: 27758080 | consumed tokens: 56848547840 | elapsed time per 
iteration (s): 1.05 | learning rate: 2.820E-05 | global batch size: 256 | lm loss: 1.904926E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.887 | TFLOPs: 40.30 | 15: iteration 108440/ 125429 | consumed samples: 27760640 | consumed tokens: 56853790720 | elapsed time per iteration (s): 1.04 | learning rate: 2.819E-05 | global batch size: 256 | lm loss: 1.896415E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.426 | TFLOPs: 40.72 | 15: iteration 108450/ 125429 | consumed samples: 27763200 | consumed tokens: 56859033600 | elapsed time per iteration (s): 1.05 | learning rate: 2.818E-05 | global batch size: 256 | lm loss: 1.890517E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.093 | TFLOPs: 40.17 | 15: iteration 108460/ 125429 | consumed samples: 27765760 | consumed tokens: 56864276480 | elapsed time per iteration (s): 1.04 | learning rate: 2.817E-05 | global batch size: 256 | lm loss: 1.895575E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.014 | TFLOPs: 40.49 | 15: iteration 108470/ 125429 | consumed samples: 27768320 | consumed tokens: 56869519360 | elapsed time per iteration (s): 1.08 | learning rate: 2.816E-05 | global batch size: 256 | lm loss: 1.909222E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.658 | TFLOPs: 39.11 | 15: iteration 108480/ 125429 | consumed samples: 27770880 | consumed tokens: 56874762240 | elapsed time per iteration (s): 1.03 | learning rate: 2.815E-05 | global batch size: 256 | lm loss: 1.918730E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.102 | TFLOPs: 
41.00 | 15: iteration 108490/ 125429 | consumed samples: 27773440 | consumed tokens: 56880005120 | elapsed time per iteration (s): 1.04 | learning rate: 2.814E-05 | global batch size: 256 | lm loss: 1.894195E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.049 | TFLOPs: 40.50 | 15: iteration 108500/ 125429 | consumed samples: 27776000 | consumed tokens: 56885248000 | elapsed time per iteration (s): 1.06 | learning rate: 2.813E-05 | global batch size: 256 | lm loss: 1.915309E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.596 | TFLOPs: 40.09 | 15: iteration 108510/ 125429 | consumed samples: 27778560 | consumed tokens: 56890490880 | elapsed time per iteration (s): 1.04 | learning rate: 2.812E-05 | global batch size: 256 | lm loss: 1.924533E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.205 | TFLOPs: 40.52 | 15: iteration 108520/ 125429 | consumed samples: 27781120 | consumed tokens: 56895733760 | elapsed time per iteration (s): 1.06 | learning rate: 2.811E-05 | global batch size: 256 | lm loss: 1.877738E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.508 | TFLOPs: 39.91 | 15: iteration 108530/ 125429 | consumed samples: 27783680 | consumed tokens: 56900976640 | elapsed time per iteration (s): 1.04 | learning rate: 2.810E-05 | global batch size: 256 | lm loss: 1.897027E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.476 | TFLOPs: 40.57 | 15: iteration 108540/ 125429 | consumed samples: 27786240 | consumed tokens: 56906219520 | elapsed time per iteration (s): 1.03 | learning rate: 2.809E-05 | global batch size: 256 | lm loss: 1.880322E+00 | grad norm: 0.153 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.237 | TFLOPs: 41.02 | 15: iteration 108550/ 125429 | consumed samples: 27788800 | consumed tokens: 56911462400 | elapsed time per iteration (s): 1.09 | learning rate: 2.808E-05 | global batch size: 256 | lm loss: 1.931199E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.532 | TFLOPs: 38.76 | 15: iteration 108560/ 125429 | consumed samples: 27791360 | consumed tokens: 56916705280 | elapsed time per iteration (s): 1.04 | learning rate: 2.807E-05 | global batch size: 256 | lm loss: 1.914037E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.923 | TFLOPs: 40.64 | 15: iteration 108570/ 125429 | consumed samples: 27793920 | consumed tokens: 56921948160 | elapsed time per iteration (s): 1.05 | learning rate: 2.806E-05 | global batch size: 256 | lm loss: 1.883468E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.266 | TFLOPs: 40.37 | 15: iteration 108580/ 125429 | consumed samples: 27796480 | consumed tokens: 56927191040 | elapsed time per iteration (s): 1.03 | learning rate: 2.805E-05 | global batch size: 256 | lm loss: 1.889721E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.613 | TFLOPs: 40.92 | 15: iteration 108590/ 125429 | consumed samples: 27799040 | consumed tokens: 56932433920 | elapsed time per iteration (s): 1.02 | learning rate: 2.804E-05 | global batch size: 256 | lm loss: 1.931403E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.196 | TFLOPs: 41.35 | 15: iteration 108600/ 125429 | consumed samples: 27801600 | consumed tokens: 56937676800 | elapsed time per 
iteration (s): 1.04 | learning rate: 2.804E-05 | global batch size: 256 | lm loss: 1.928789E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.799 | TFLOPs: 40.79 | 15: iteration 108610/ 125429 | consumed samples: 27804160 | consumed tokens: 56942919680 | elapsed time per iteration (s): 1.03 | learning rate: 2.803E-05 | global batch size: 256 | lm loss: 1.898837E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.573 | TFLOPs: 40.91 | 15: iteration 108620/ 125429 | consumed samples: 27806720 | consumed tokens: 56948162560 | elapsed time per iteration (s): 1.05 | learning rate: 2.802E-05 | global batch size: 256 | lm loss: 1.913802E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.754 | TFLOPs: 40.45 | 15: iteration 108630/ 125429 | consumed samples: 27809280 | consumed tokens: 56953405440 | elapsed time per iteration (s): 1.05 | learning rate: 2.801E-05 | global batch size: 256 | lm loss: 1.921086E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.209 | TFLOPs: 40.19 | 15: iteration 108640/ 125429 | consumed samples: 27811840 | consumed tokens: 56958648320 | elapsed time per iteration (s): 1.02 | learning rate: 2.800E-05 | global batch size: 256 | lm loss: 1.892390E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.066 | TFLOPs: 41.33 | 15: iteration 108650/ 125429 | consumed samples: 27814400 | consumed tokens: 56963891200 | elapsed time per iteration (s): 1.04 | learning rate: 2.799E-05 | global batch size: 256 | lm loss: 1.918489E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.205 | TFLOPs: 
40.85 | 15: iteration 108660/ 125429 | consumed samples: 27816960 | consumed tokens: 56969134080 | elapsed time per iteration (s): 1.03 | learning rate: 2.798E-05 | global batch size: 256 | lm loss: 1.899457E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.098 | TFLOPs: 41.17 | 15: iteration 108670/ 125429 | consumed samples: 27819520 | consumed tokens: 56974376960 | elapsed time per iteration (s): 1.03 | learning rate: 2.797E-05 | global batch size: 256 | lm loss: 1.897066E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.264 | TFLOPs: 41.03 | 15: iteration 108680/ 125429 | consumed samples: 27822080 | consumed tokens: 56979619840 | elapsed time per iteration (s): 1.05 | learning rate: 2.796E-05 | global batch size: 256 | lm loss: 1.888133E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.709 | TFLOPs: 40.11 | 15: iteration 108690/ 125429 | consumed samples: 27824640 | consumed tokens: 56984862720 | elapsed time per iteration (s): 1.03 | learning rate: 2.795E-05 | global batch size: 256 | lm loss: 1.911984E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.183 | TFLOPs: 41.01 | 15: iteration 108700/ 125429 | consumed samples: 27827200 | consumed tokens: 56990105600 | elapsed time per iteration (s): 1.03 | learning rate: 2.794E-05 | global batch size: 256 | lm loss: 1.897548E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.654 | TFLOPs: 41.09 | 15: iteration 108710/ 125429 | consumed samples: 27829760 | consumed tokens: 56995348480 | elapsed time per iteration (s): 1.03 | learning rate: 2.793E-05 | global batch size: 256 | lm loss: 1.911996E+00 | grad norm: 0.155 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.918 | TFLOPs: 40.97 | 15: iteration 108720/ 125429 | consumed samples: 27832320 | consumed tokens: 57000591360 | elapsed time per iteration (s): 1.04 | learning rate: 2.792E-05 | global batch size: 256 | lm loss: 1.892965E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.956 | TFLOPs: 40.81 | 15: iteration 108730/ 125429 | consumed samples: 27834880 | consumed tokens: 57005834240 | elapsed time per iteration (s): 1.03 | learning rate: 2.791E-05 | global batch size: 256 | lm loss: 1.891451E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.138 | TFLOPs: 41.01 | 15: iteration 108740/ 125429 | consumed samples: 27837440 | consumed tokens: 57011077120 | elapsed time per iteration (s): 1.04 | learning rate: 2.790E-05 | global batch size: 256 | lm loss: 1.890558E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.840 | TFLOPs: 40.63 | 15: iteration 108750/ 125429 | consumed samples: 27840000 | consumed tokens: 57016320000 | elapsed time per iteration (s): 1.05 | learning rate: 2.789E-05 | global batch size: 256 | lm loss: 1.924062E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.874 | TFLOPs: 40.14 | 15: iteration 108760/ 125429 | consumed samples: 27842560 | consumed tokens: 57021562880 | elapsed time per iteration (s): 1.03 | learning rate: 2.789E-05 | global batch size: 256 | lm loss: 1.921099E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.515 | TFLOPs: 40.90 | 15: iteration 108770/ 125429 | consumed samples: 27845120 | consumed tokens: 57026805760 | elapsed time per 
iteration (s): 1.04 | learning rate: 2.788E-05 | global batch size: 256 | lm loss: 1.896009E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.334 | TFLOPs: 40.54 | 15: iteration 108780/ 125429 | consumed samples: 27847680 | consumed tokens: 57032048640 | elapsed time per iteration (s): 1.07 | learning rate: 2.787E-05 | global batch size: 256 | lm loss: 1.940315E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.372 | TFLOPs: 39.72 | 15: iteration 108790/ 125429 | consumed samples: 27850240 | consumed tokens: 57037291520 | elapsed time per iteration (s): 1.04 | learning rate: 2.786E-05 | global batch size: 256 | lm loss: 1.908614E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.979 | TFLOPs: 40.65 | 15: iteration 108800/ 125429 | consumed samples: 27852800 | consumed tokens: 57042534400 | elapsed time per iteration (s): 1.03 | learning rate: 2.785E-05 | global batch size: 256 | lm loss: 1.901731E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.917 | TFLOPs: 40.97 | 15: iteration 108810/ 125429 | consumed samples: 27855360 | consumed tokens: 57047777280 | elapsed time per iteration (s): 1.03 | learning rate: 2.784E-05 | global batch size: 256 | lm loss: 1.914032E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.809 | TFLOPs: 41.12 | 15: iteration 108820/ 125429 | consumed samples: 27857920 | consumed tokens: 57053020160 | elapsed time per iteration (s): 1.06 | learning rate: 2.783E-05 | global batch size: 256 | lm loss: 1.957735E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.950 | TFLOPs: 
39.82 | 15: iteration 108830/ 125429 | consumed samples: 27860480 | consumed tokens: 57058263040 | elapsed time per iteration (s): 1.03 | learning rate: 2.782E-05 | global batch size: 256 | lm loss: 1.884590E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.455 | TFLOPs: 41.06 | 15: iteration 108840/ 125429 | consumed samples: 27863040 | consumed tokens: 57063505920 | elapsed time per iteration (s): 1.05 | learning rate: 2.781E-05 | global batch size: 256 | lm loss: 1.892990E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.427 | TFLOPs: 40.39 | 15: iteration 108850/ 125429 | consumed samples: 27865600 | consumed tokens: 57068748800 | elapsed time per iteration (s): 1.04 | learning rate: 2.780E-05 | global batch size: 256 | lm loss: 1.884530E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.158 | TFLOPs: 40.68 | 15: iteration 108860/ 125429 | consumed samples: 27868160 | consumed tokens: 57073991680 | elapsed time per iteration (s): 1.02 | learning rate: 2.779E-05 | global batch size: 256 | lm loss: 1.886545E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.852 | TFLOPs: 41.29 | 15: iteration 108870/ 125429 | consumed samples: 27870720 | consumed tokens: 57079234560 | elapsed time per iteration (s): 1.03 | learning rate: 2.778E-05 | global batch size: 256 | lm loss: 1.920963E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.016 | TFLOPs: 40.99 | 15: iteration 108880/ 125429 | consumed samples: 27873280 | consumed tokens: 57084477440 | elapsed time per iteration (s): 1.04 | learning rate: 2.777E-05 | global batch size: 256 | lm loss: 1.917035E+00 | grad norm: 0.162 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.027 | TFLOPs: 40.49 | 15: iteration 108890/ 125429 | consumed samples: 27875840 | consumed tokens: 57089720320 | elapsed time per iteration (s): 1.20 | learning rate: 2.776E-05 | global batch size: 256 | lm loss: 1.897651E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 213.939 | TFLOPs: 35.36 | 15: iteration 108900/ 125429 | consumed samples: 27878400 | consumed tokens: 57094963200 | elapsed time per iteration (s): 1.03 | learning rate: 2.776E-05 | global batch size: 256 | lm loss: 1.881073E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.506 | TFLOPs: 41.07 | 15: iteration 108910/ 125429 | consumed samples: 27880960 | consumed tokens: 57100206080 | elapsed time per iteration (s): 1.06 | learning rate: 2.775E-05 | global batch size: 256 | lm loss: 1.908410E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.450 | TFLOPs: 39.74 | 15: iteration 108920/ 125429 | consumed samples: 27883520 | consumed tokens: 57105448960 | elapsed time per iteration (s): 1.05 | learning rate: 2.774E-05 | global batch size: 256 | lm loss: 1.903416E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.553 | TFLOPs: 40.41 | 15: iteration 108930/ 125429 | consumed samples: 27886080 | consumed tokens: 57110691840 | elapsed time per iteration (s): 1.03 | learning rate: 2.773E-05 | global batch size: 256 | lm loss: 1.892681E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.062 | TFLOPs: 40.99 | 15: iteration 108940/ 125429 | consumed samples: 27888640 | consumed tokens: 57115934720 | elapsed time per 
iteration (s): 1.09 | learning rate: 2.772E-05 | global batch size: 256 | lm loss: 1.869502E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.593 | TFLOPs: 38.93 | 15: iteration 108950/ 125429 | consumed samples: 27891200 | consumed tokens: 57121177600 | elapsed time per iteration (s): 1.03 | learning rate: 2.771E-05 | global batch size: 256 | lm loss: 1.931771E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.967 | TFLOPs: 40.98 | 15: iteration 108960/ 125429 | consumed samples: 27893760 | consumed tokens: 57126420480 | elapsed time per iteration (s): 1.09 | learning rate: 2.770E-05 | global batch size: 256 | lm loss: 1.902609E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.663 | TFLOPs: 38.78 | 15: iteration 108970/ 125429 | consumed samples: 27896320 | consumed tokens: 57131663360 | elapsed time per iteration (s): 1.03 | learning rate: 2.769E-05 | global batch size: 256 | lm loss: 1.898545E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.377 | TFLOPs: 41.21 | 15: iteration 108980/ 125429 | consumed samples: 27898880 | consumed tokens: 57136906240 | elapsed time per iteration (s): 1.02 | learning rate: 2.768E-05 | global batch size: 256 | lm loss: 1.893847E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.383 | TFLOPs: 41.54 | 15: iteration 108990/ 125429 | consumed samples: 27901440 | consumed tokens: 57142149120 | elapsed time per iteration (s): 1.03 | learning rate: 2.767E-05 | global batch size: 256 | lm loss: 1.919361E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.605 | TFLOPs: 
41.08 | 15: iteration 109000/ 125429 | consumed samples: 27904000 | consumed tokens: 57147392000 | elapsed time per iteration (s): 1.05 | learning rate: 2.766E-05 | global batch size: 256 | lm loss: 1.900091E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.230 | TFLOPs: 40.36 | 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 109000 | lm loss value: 1.910082E+00 | lm loss PPL: 6.753643E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 109000 to checkpoints_1b5 0: [2022-11-27 04:17:31,009] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step109000 is begin to save! 0: [2022-11-27 04:17:31,017] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_01-model_00-model_states.pt... 0: [2022-11-27 04:17:31,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_01-model_00-model_states.pt. 0: [2022-11-27 04:17:31,249] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_03-model_00-model_states.pt... 0: [2022-11-27 04:17:31,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_03-model_00-model_states.pt. 0: [2022-11-27 04:17:31,356] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_04-model_00-model_states.pt... 0: [2022-11-27 04:17:31,460] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_04-model_00-model_states.pt. 0: [2022-11-27 04:17:31,461] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_05-model_00-model_states.pt... 
0: [2022-11-27 04:17:31,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_05-model_00-model_states.pt. 0: [2022-11-27 04:17:31,565] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_06-model_00-model_states.pt... 0: [2022-11-27 04:17:31,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_06-model_00-model_states.pt. 0: [2022-11-27 04:17:31,683] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_07-model_00-model_states.pt... 0: [2022-11-27 04:17:31,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_07-model_00-model_states.pt. 0: [2022-11-27 04:17:31,794] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_08-model_00-model_states.pt... 0: [2022-11-27 04:17:31,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_08-model_00-model_states.pt. 0: [2022-11-27 04:17:31,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_09-model_00-model_states.pt... 0: [2022-11-27 04:17:32,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_09-model_00-model_states.pt. 0: [2022-11-27 04:17:32,013] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_10-model_00-model_states.pt... 0: [2022-11-27 04:17:32,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_10-model_00-model_states.pt. 0: [2022-11-27 04:17:32,127] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_11-model_00-model_states.pt... 
0: [2022-11-27 04:17:32,240] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_11-model_00-model_states.pt. 0: [2022-11-27 04:17:32,241] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_12-model_00-model_states.pt... 0: [2022-11-27 04:17:32,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_12-model_00-model_states.pt. 0: [2022-11-27 04:17:32,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_13-model_00-model_states.pt... 0: [2022-11-27 04:17:32,471] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_13-model_00-model_states.pt. 0: [2022-11-27 04:17:32,471] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_14-model_00-model_states.pt... 0: [2022-11-27 04:17:32,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_14-model_00-model_states.pt. 0: [2022-11-27 04:17:32,579] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_15-model_00-model_states.pt... 0: [2022-11-27 04:17:32,690] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_15-model_00-model_states.pt. 0: [2022-11-27 04:17:32,690] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_16-model_00-model_states.pt... 0: [2022-11-27 04:17:32,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_16-model_00-model_states.pt. 0: [2022-11-27 04:17:32,802] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_17-model_00-model_states.pt... 
0: [2022-11-27 04:17:32,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_17-model_00-model_states.pt. 0: [2022-11-27 04:17:32,909] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_18-model_00-model_states.pt... 0: [2022-11-27 04:17:33,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_18-model_00-model_states.pt. 0: [2022-11-27 04:17:33,018] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_19-model_00-model_states.pt... 0: [2022-11-27 04:17:33,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_19-model_00-model_states.pt. 0: [2022-11-27 04:17:33,126] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_20-model_00-model_states.pt... 0: [2022-11-27 04:17:33,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_20-model_00-model_states.pt. 0: [2022-11-27 04:17:33,233] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_21-model_00-model_states.pt... 0: [2022-11-27 04:17:33,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_21-model_00-model_states.pt. 0: [2022-11-27 04:17:33,341] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_22-model_00-model_states.pt... 0: [2022-11-27 04:17:33,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_22-model_00-model_states.pt. 0: [2022-11-27 04:17:33,452] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_23-model_00-model_states.pt... 
0: [2022-11-27 04:17:33,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_23-model_00-model_states.pt. 0: [2022-11-27 04:17:33,560] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_24-model_00-model_states.pt... 0: [2022-11-27 04:17:33,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_24-model_00-model_states.pt. 0: [2022-11-27 04:17:33,668] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_25-model_00-model_states.pt... 0: [2022-11-27 04:17:33,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_25-model_00-model_states.pt. 0: [2022-11-27 04:17:33,782] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_26-model_00-model_states.pt... 0: [2022-11-27 04:17:33,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_26-model_00-model_states.pt. 0: [2022-11-27 04:17:33,891] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_27-model_00-model_states.pt... 0: [2022-11-27 04:17:34,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_27-model_00-model_states.pt. 0: [2022-11-27 04:17:34,029] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_28-model_00-model_states.pt... 0: [2022-11-27 04:17:34,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_28-model_00-model_states.pt. 0: [2022-11-27 04:17:34,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_29-model_00-model_states.pt... 
0: [2022-11-27 04:17:34,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_29-model_00-model_states.pt. 0: [2022-11-27 04:17:34,240] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_30-model_00-model_states.pt... 0: [2022-11-27 04:17:34,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_30-model_00-model_states.pt. 0: [2022-11-27 04:17:34,349] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/layer_32-model_00-model_states.pt... 0: [2022-11-27 04:17:34,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/layer_32-model_00-model_states.pt. 0: [2022-11-27 04:17:34,354] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step109000/mp_rank_00_model_states.pt 0: [2022-11-27 04:17:34,354] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/mp_rank_00_model_states.pt... 0: [2022-11-27 04:17:34,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/mp_rank_00_model_states.pt. 0: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 
0: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
15: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
8: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
10: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
9: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 
4: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
6: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 
11: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 
15: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 
13: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
15: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:17:34,397] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step109000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:17:34,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:17:34,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 04:17:34,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 6: [2022-11-27 04:17:34,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:17:34,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 04:17:34,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 7: [2022-11-27 04:17:34,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:17:34,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 04:17:34,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
5: [2022-11-27 04:17:34,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:17:34,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 04:17:34,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 2: [2022-11-27 04:17:34,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:17:34,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 04:17:34,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 7: [2022-11-27 04:17:34,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:17:34,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 04:17:34,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 12: [2022-11-27 04:17:34,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:17:34,557] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 04:17:34,557] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
12: [2022-11-27 04:17:34,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:17:34,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 04:17:34,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 10: [2022-11-27 04:17:34,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:17:34,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 04:17:34,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 15: [2022-11-27 04:17:34,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:17:34,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:17:34,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:17:34,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 04:17:34,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
10: [2022-11-27 04:17:34,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 04:17:34,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 2: [2022-11-27 04:17:34,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:17:34,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 04:17:34,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 7: [2022-11-27 04:17:34,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:17:34,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 04:17:34,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 10: [2022-11-27 04:17:34,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:17:34,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 04:17:34,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 6: [2022-11-27 04:17:34,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-27 04:17:34,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 04:17:34,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 1: [2022-11-27 04:17:34,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:17:34,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:17:34,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 04:17:34,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 8: [2022-11-27 04:17:34,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:17:34,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 04:17:34,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 6: [2022-11-27 04:17:34,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:17:34,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 04:17:34,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
12: [2022-11-27 04:17:34,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:17:34,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 04:17:34,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 14: [2022-11-27 04:17:34,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:17:34,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:17:34,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:17:34,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 04:17:34,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 04:17:34,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 8: [2022-11-27 04:17:34,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 14: [2022-11-27 04:17:34,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 8: [2022-11-27 04:17:34,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
0: [2022-11-27 04:17:34,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:17:34,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 04:17:34,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 0: [2022-11-27 04:17:34,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:17:34,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 04:17:34,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 4: [2022-11-27 04:17:34,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:17:34,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 04:17:34,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 15: [2022-11-27 04:17:34,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 04:17:34,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 15: [2022-11-27 04:17:34,562] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-27 04:17:34,562] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 04:17:34,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 15: [2022-11-27 04:17:34,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:17:34,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:17:34,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 04:17:34,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 15: [2022-11-27 04:17:34,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 04:17:34,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 12: [2022-11-27 04:17:34,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:17:34,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 04:17:34,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 4: [2022-11-27 04:17:34,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-27 04:17:34,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 0: [2022-11-27 04:17:34,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:17:34,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 0: [2022-11-27 04:17:34,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 04:17:34,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 4: [2022-11-27 04:17:34,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:17:34,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 0: [2022-11-27 04:17:34,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:17:34,569] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 04:17:34,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 0: [2022-11-27 04:17:34,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:17:34,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-27 04:17:34,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 0: [2022-11-27 04:17:34,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:17:34,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 7: [2022-11-27 04:17:34,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 0: [2022-11-27 04:17:34,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 10: [2022-11-27 04:17:34,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:17:34,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:17:34,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 10: [2022-11-27 04:17:34,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 04:17:34,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 6: [2022-11-27 04:17:34,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
1: [2022-11-27 04:17:34,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 04:17:34,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 12: [2022-11-27 04:17:34,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:17:34,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 2: [2022-11-27 04:17:34,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:17:34,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 2: [2022-11-27 04:17:34,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 04:17:34,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 1: [2022-11-27 04:17:34,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:17:34,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:17:34,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
1: [2022-11-27 04:17:34,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 04:17:34,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 2: [2022-11-27 04:17:34,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 1: [2022-11-27 04:17:34,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 1: [2022-11-27 04:17:34,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 2: [2022-11-27 04:17:34,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 5: [2022-11-27 04:17:34,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:17:34,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 04:17:34,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 6: [2022-11-27 04:17:34,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:17:34,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 04:17:34,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
9: [2022-11-27 04:17:34,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:17:34,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:17:34,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:17:34,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 04:17:34,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 04:17:34,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 04:17:34,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 9: [2022-11-27 04:17:34,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 9: [2022-11-27 04:17:34,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 0: [2022-11-27 04:17:34,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:17:34,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 04:17:34,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
4: [2022-11-27 04:17:34,569] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 4: [2022-11-27 04:17:34,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:17:34,571] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 04:17:34,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 12: [2022-11-27 04:17:34,574] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:17:34,574] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 04:17:34,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 14: [2022-11-27 04:17:34,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:17:34,575] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 04:17:34,575] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 2: [2022-11-27 04:17:34,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-27 04:17:34,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 04:17:34,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 7: [2022-11-27 04:17:34,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:17:34,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 04:17:34,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 15: [2022-11-27 04:17:34,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:17:34,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:17:34,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 04:17:34,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 5: [2022-11-27 04:17:34,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:17:34,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 04:17:34,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
11: [2022-11-27 04:17:34,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:17:34,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:17:34,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 04:17:34,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 4: [2022-11-27 04:17:34,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:17:34,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 04:17:34,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 14: [2022-11-27 04:17:34,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:17:34,580] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:17:34,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 04:17:34,580] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 04:17:34,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
14: [2022-11-27 04:17:34,580] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 4: [2022-11-27 04:17:34,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:17:34,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 04:17:34,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 10: [2022-11-27 04:17:34,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:17:34,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:17:34,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 04:17:34,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 04:17:34,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 10: [2022-11-27 04:17:34,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 7: [2022-11-27 04:17:34,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-27 04:17:34,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 14: [2022-11-27 04:17:34,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:17:34,583] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 7: [2022-11-27 04:17:34,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 14: [2022-11-27 04:17:34,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 1: [2022-11-27 04:17:34,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:17:34,583] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:17:34,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:17:34,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 04:17:34,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 9: [2022-11-27 04:17:34,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-27 04:17:34,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 04:17:34,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 8: [2022-11-27 04:17:34,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:17:34,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 04:17:34,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 8: [2022-11-27 04:17:34,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 11: [2022-11-27 04:17:34,579] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:17:34,571] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 11: [2022-11-27 04:17:34,579] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 8: [2022-11-27 04:17:34,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:17:34,579] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
8: [2022-11-27 04:17:34,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 11: [2022-11-27 04:17:34,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:17:34,574] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 11: [2022-11-27 04:17:34,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 04:17:34,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 11: [2022-11-27 04:17:34,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:17:34,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 04:17:34,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 1: [2022-11-27 04:17:34,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 04:17:34,584] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 04:17:34,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 1: [2022-11-27 04:17:34,584] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
6: [2022-11-27 04:17:34,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:17:34,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 04:17:34,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 14: [2022-11-27 04:17:34,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:17:34,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 04:17:34,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 7: [2022-11-27 04:17:34,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:17:34,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 04:17:34,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 11: [2022-11-27 04:17:34,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:17:34,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 04:17:34,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
2: [2022-11-27 04:17:34,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:17:34,596] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 04:17:34,596] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 8: [2022-11-27 04:17:34,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:17:34,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 04:17:34,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 15: [2022-11-27 04:17:34,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 04:17:34,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 15: [2022-11-27 04:17:34,582] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:17:34,582] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 04:17:34,582] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 15: [2022-11-27 04:17:34,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-27 04:17:34,590] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 04:17:34,590] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 1: [2022-11-27 04:17:34,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:17:34,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 04:17:34,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 13: [2022-11-27 04:17:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:17:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:17:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:17:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:17:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:17:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:17:34,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-27 04:17:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 04:17:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 04:17:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 04:17:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 04:17:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 04:17:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 04:17:34,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 04:17:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 13: [2022-11-27 04:17:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 13: [2022-11-27 04:17:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 13: [2022-11-27 04:17:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 13: [2022-11-27 04:17:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
13: [2022-11-27 04:17:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 13: [2022-11-27 04:17:34,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 3: [2022-11-27 04:17:34,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:17:34,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:17:34,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:17:34,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:17:34,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:17:34,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-27 04:17:34,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 04:17:34,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 04:17:34,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 04:17:34,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 04:17:34,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 04:17:34,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 04:17:34,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 3: [2022-11-27 04:17:34,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 3: [2022-11-27 04:17:34,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 3: [2022-11-27 04:17:34,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 3: [2022-11-27 04:17:34,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 3: [2022-11-27 04:17:34,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
11: [2022-11-27 04:17:34,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:17:34,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:17:34,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 04:17:34,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 5: [2022-11-27 04:17:34,668] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:17:34,668] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 04:17:34,668] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 10: [2022-11-27 04:17:34,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:17:34,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 04:17:34,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 11: [2022-11-27 04:17:34,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 04:17:34,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
12: [2022-11-27 04:17:34,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:17:34,697] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 04:17:34,697] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 9: [2022-11-27 04:17:34,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:17:34,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 04:17:34,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 3: [2022-11-27 04:17:34,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:17:34,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 04:17:34,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 8: [2022-11-27 04:17:34,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:17:34,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 04:17:34,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 
0: [2022-11-27 04:17:34,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 04:17:34,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 0: [2022-11-27 04:17:34,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:17:34,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 04:17:34,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 14: [2022-11-27 04:17:34,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:17:34,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 04:17:34,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 2: [2022-11-27 04:17:34,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:17:34,769] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 04:17:34,769] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 13: [2022-11-27 04:17:34,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 
13: [2022-11-27 04:17:34,775] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 04:17:34,775] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 6: [2022-11-27 04:17:34,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:17:34,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 04:17:34,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 12: [2022-11-27 04:17:34,777] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:17:34,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 04:17:34,778] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 15: [2022-11-27 04:17:34,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:17:34,778] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 04:17:34,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 10: [2022-11-27 04:17:34,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-27 04:17:34,779] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 04:17:34,779] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 11: [2022-11-27 04:17:34,785] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:17:34,785] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 04:17:34,785] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 6: [2022-11-27 04:17:34,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:17:34,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 04:17:34,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 9: [2022-11-27 04:17:34,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:17:34,787] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 04:17:34,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 7: [2022-11-27 04:17:34,790] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-27 04:17:34,791] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 04:17:34,791] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 3: [2022-11-27 04:17:34,796] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:17:34,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 04:17:34,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 9: [2022-11-27 04:17:34,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:17:34,797] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 04:17:34,797] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 1: [2022-11-27 04:17:34,798] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:17:34,798] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 04:17:34,798] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 5: [2022-11-27 04:17:34,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-27 04:17:34,800] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 04:17:34,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 8: [2022-11-27 04:17:34,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:17:34,802] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 04:17:34,802] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 4: [2022-11-27 04:17:34,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:17:34,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 04:17:34,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 11: [2022-11-27 04:17:34,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:17:34,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 04:17:34,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 2: [2022-11-27 04:17:34,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-27 04:17:34,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step109000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 04:17:34,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step109000 is ready now! 0: successfully saved checkpoint at iteration 109000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3845.51 15: iteration 109010/ 125429 | consumed samples: 27906560 | consumed tokens: 57152634880 | elapsed time per iteration (s): 1.44 | learning rate: 2.765E-05 | global batch size: 256 | lm loss: 1.903965E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.438 | TFLOPs: 29.32 | 15: iteration 109020/ 125429 | consumed samples: 27909120 | consumed tokens: 57157877760 | elapsed time per iteration (s): 1.03 | learning rate: 2.764E-05 | global batch size: 256 | lm loss: 1.908696E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.280 | TFLOPs: 41.20 | 15: iteration 109030/ 125429 | consumed samples: 27911680 | consumed tokens: 57163120640 | elapsed time per iteration (s): 1.04 | learning rate: 2.764E-05 | global batch size: 256 | lm loss: 1.888907E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.185 | TFLOPs: 40.68 | 15: iteration 109040/ 125429 | consumed samples: 27914240 | consumed tokens: 57168363520 | elapsed time per iteration (s): 1.04 | learning rate: 2.763E-05 | global batch size: 256 | lm loss: 1.910817E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.067 | TFLOPs: 40.50 | 15: iteration 109050/ 125429 | consumed samples: 27916800 | consumed tokens: 57173606400 | elapsed time per iteration (s): 1.03 | learning rate: 2.762E-05 | global 
batch size: 256 | lm loss: 1.899403E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.080 | TFLOPs: 41.16 | 15: iteration 109060/ 125429 | consumed samples: 27919360 | consumed tokens: 57178849280 | elapsed time per iteration (s): 1.21 | learning rate: 2.761E-05 | global batch size: 256 | lm loss: 1.917568E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 212.094 | TFLOPs: 35.05 | 15: iteration 109070/ 125429 | consumed samples: 27921920 | consumed tokens: 57184092160 | elapsed time per iteration (s): 1.04 | learning rate: 2.760E-05 | global batch size: 256 | lm loss: 1.908388E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.634 | TFLOPs: 40.59 | 15: iteration 109080/ 125429 | consumed samples: 27924480 | consumed tokens: 57189335040 | elapsed time per iteration (s): 1.05 | learning rate: 2.759E-05 | global batch size: 256 | lm loss: 1.886365E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.953 | TFLOPs: 40.32 | 15: iteration 109090/ 125429 | consumed samples: 27927040 | consumed tokens: 57194577920 | elapsed time per iteration (s): 1.05 | learning rate: 2.758E-05 | global batch size: 256 | lm loss: 1.919732E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.603 | TFLOPs: 40.42 | 15: iteration 109100/ 125429 | consumed samples: 27929600 | consumed tokens: 57199820800 | elapsed time per iteration (s): 1.02 | learning rate: 2.757E-05 | global batch size: 256 | lm loss: 1.920804E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.932 | TFLOPs: 41.47 | 15: iteration 109110/ 125429 | consumed samples: 
27932160 | consumed tokens: 57205063680 | elapsed time per iteration (s): 1.04 | learning rate: 2.756E-05 | global batch size: 256 | lm loss: 1.863209E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.174 | TFLOPs: 40.52 | 15: iteration 109120/ 125429 | consumed samples: 27934720 | consumed tokens: 57210306560 | elapsed time per iteration (s): 1.03 | learning rate: 2.755E-05 | global batch size: 256 | lm loss: 1.900865E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.630 | TFLOPs: 41.09 | 15: iteration 109130/ 125429 | consumed samples: 27937280 | consumed tokens: 57215549440 | elapsed time per iteration (s): 1.05 | learning rate: 2.754E-05 | global batch size: 256 | lm loss: 1.882295E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.486 | TFLOPs: 40.40 | 15: iteration 109140/ 125429 | consumed samples: 27939840 | consumed tokens: 57220792320 | elapsed time per iteration (s): 1.04 | learning rate: 2.753E-05 | global batch size: 256 | lm loss: 1.917135E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.477 | TFLOPs: 40.73 | 15: iteration 109150/ 125429 | consumed samples: 27942400 | consumed tokens: 57226035200 | elapsed time per iteration (s): 1.06 | learning rate: 2.753E-05 | global batch size: 256 | lm loss: 1.875075E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.820 | TFLOPs: 39.80 | 15: iteration 109160/ 125429 | consumed samples: 27944960 | consumed tokens: 57231278080 | elapsed time per iteration (s): 1.07 | learning rate: 2.752E-05 | global batch size: 256 | lm loss: 1.914238E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of 
nan iterations: 0 | samples per second: 239.454 | TFLOPs: 39.57 | 15: iteration 109170/ 125429 | consumed samples: 27947520 | consumed tokens: 57236520960 | elapsed time per iteration (s): 1.05 | learning rate: 2.751E-05 | global batch size: 256 | lm loss: 1.904930E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.504 | TFLOPs: 40.24 | 15: iteration 109180/ 125429 | consumed samples: 27950080 | consumed tokens: 57241763840 | elapsed time per iteration (s): 1.04 | learning rate: 2.750E-05 | global batch size: 256 | lm loss: 1.879176E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.226 | TFLOPs: 40.69 | 15: iteration 109190/ 125429 | consumed samples: 27952640 | consumed tokens: 57247006720 | elapsed time per iteration (s): 1.08 | learning rate: 2.749E-05 | global batch size: 256 | lm loss: 1.903505E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.628 | TFLOPs: 39.27 | 15: iteration 109200/ 125429 | consumed samples: 27955200 | consumed tokens: 57252249600 | elapsed time per iteration (s): 1.03 | learning rate: 2.748E-05 | global batch size: 256 | lm loss: 1.879965E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.748 | TFLOPs: 40.94 | 15: iteration 109210/ 125429 | consumed samples: 27957760 | consumed tokens: 57257492480 | elapsed time per iteration (s): 1.03 | learning rate: 2.747E-05 | global batch size: 256 | lm loss: 1.900292E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.508 | TFLOPs: 40.90 | 15: iteration 109220/ 125429 | consumed samples: 27960320 | consumed tokens: 57262735360 | elapsed time per iteration (s): 1.08 | learning rate: 2.746E-05 | global batch 
size: 256 | lm loss: 1.915291E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.555 | TFLOPs: 39.26 | 15: iteration 109230/ 125429 | consumed samples: 27962880 | consumed tokens: 57267978240 | elapsed time per iteration (s): 1.05 | learning rate: 2.745E-05 | global batch size: 256 | lm loss: 1.908753E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.924 | TFLOPs: 40.15 | 15: iteration 109240/ 125429 | consumed samples: 27965440 | consumed tokens: 57273221120 | elapsed time per iteration (s): 1.06 | learning rate: 2.744E-05 | global batch size: 256 | lm loss: 1.898656E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.640 | TFLOPs: 39.77 | 15: iteration 109250/ 125429 | consumed samples: 27968000 | consumed tokens: 57278464000 | elapsed time per iteration (s): 1.08 | learning rate: 2.743E-05 | global batch size: 256 | lm loss: 1.916916E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.745 | TFLOPs: 39.12 | 15: iteration 109260/ 125429 | consumed samples: 27970560 | consumed tokens: 57283706880 | elapsed time per iteration (s): 1.06 | learning rate: 2.743E-05 | global batch size: 256 | lm loss: 1.897976E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.589 | TFLOPs: 39.76 | 15: iteration 109270/ 125429 | consumed samples: 27973120 | consumed tokens: 57288949760 | elapsed time per iteration (s): 1.06 | learning rate: 2.742E-05 | global batch size: 256 | lm loss: 1.914344E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.326 | TFLOPs: 39.88 | 15: iteration 109280/ 125429 | consumed samples: 27975680 
| consumed tokens: 57294192640 | elapsed time per iteration (s): 1.03 | learning rate: 2.741E-05 | global batch size: 256 | lm loss: 1.906728E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.785 | TFLOPs: 41.11 | 15: iteration 109290/ 125429 | consumed samples: 27978240 | consumed tokens: 57299435520 | elapsed time per iteration (s): 1.04 | learning rate: 2.740E-05 | global batch size: 256 | lm loss: 1.904626E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.024 | TFLOPs: 40.49 | 15: iteration 109300/ 125429 | consumed samples: 27980800 | consumed tokens: 57304678400 | elapsed time per iteration (s): 1.05 | learning rate: 2.739E-05 | global batch size: 256 | lm loss: 1.936829E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.139 | TFLOPs: 40.35 | 15: iteration 109310/ 125429 | consumed samples: 27983360 | consumed tokens: 57309921280 | elapsed time per iteration (s): 1.11 | learning rate: 2.738E-05 | global batch size: 256 | lm loss: 1.914715E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.069 | TFLOPs: 38.02 | 15: iteration 109320/ 125429 | consumed samples: 27985920 | consumed tokens: 57315164160 | elapsed time per iteration (s): 1.07 | learning rate: 2.737E-05 | global batch size: 256 | lm loss: 1.909011E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.579 | TFLOPs: 39.43 | 15: iteration 109330/ 125429 | consumed samples: 27988480 | consumed tokens: 57320407040 | elapsed time per iteration (s): 1.04 | learning rate: 2.736E-05 | global batch size: 256 | lm loss: 1.903188E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 246.437 | TFLOPs: 40.73 | 15: iteration 109340/ 125429 | consumed samples: 27991040 | consumed tokens: 57325649920 | elapsed time per iteration (s): 1.03 | learning rate: 2.735E-05 | global batch size: 256 | lm loss: 1.885834E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.659 | TFLOPs: 40.93 | 15: iteration 109350/ 125429 | consumed samples: 27993600 | consumed tokens: 57330892800 | elapsed time per iteration (s): 1.04 | learning rate: 2.734E-05 | global batch size: 256 | lm loss: 1.883337E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.180 | TFLOPs: 40.52 | 15: iteration 109360/ 125429 | consumed samples: 27996160 | consumed tokens: 57336135680 | elapsed time per iteration (s): 1.04 | learning rate: 2.734E-05 | global batch size: 256 | lm loss: 1.893367E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.847 | TFLOPs: 40.79 | 15: iteration 109370/ 125429 | consumed samples: 27998720 | consumed tokens: 57341378560 | elapsed time per iteration (s): 1.05 | learning rate: 2.733E-05 | global batch size: 256 | lm loss: 1.914578E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.106 | TFLOPs: 40.18 | 15: iteration 109380/ 125429 | consumed samples: 28001280 | consumed tokens: 57346621440 | elapsed time per iteration (s): 1.05 | learning rate: 2.732E-05 | global batch size: 256 | lm loss: 1.899444E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.586 | TFLOPs: 40.25 | 15: iteration 109390/ 125429 | consumed samples: 28003840 | consumed tokens: 57351864320 | elapsed time per iteration (s): 1.05 | learning rate: 2.731E-05 | global batch size: 
256 | lm loss: 1.914652E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.503 | TFLOPs: 40.41 | 15: iteration 109400/ 125429 | consumed samples: 28006400 | consumed tokens: 57357107200 | elapsed time per iteration (s): 1.03 | learning rate: 2.730E-05 | global batch size: 256 | lm loss: 1.882184E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.677 | TFLOPs: 40.93 | 15: iteration 109410/ 125429 | consumed samples: 28008960 | consumed tokens: 57362350080 | elapsed time per iteration (s): 1.07 | learning rate: 2.729E-05 | global batch size: 256 | lm loss: 1.883680E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.144 | TFLOPs: 39.52 | 15: iteration 109420/ 125429 | consumed samples: 28011520 | consumed tokens: 57367592960 | elapsed time per iteration (s): 1.19 | learning rate: 2.728E-05 | global batch size: 256 | lm loss: 1.909804E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.463 | TFLOPs: 35.61 | 15: iteration 109430/ 125429 | consumed samples: 28014080 | consumed tokens: 57372835840 | elapsed time per iteration (s): 1.03 | learning rate: 2.727E-05 | global batch size: 256 | lm loss: 1.881001E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.208 | TFLOPs: 41.02 | 15: iteration 109440/ 125429 | consumed samples: 28016640 | consumed tokens: 57378078720 | elapsed time per iteration (s): 1.08 | learning rate: 2.726E-05 | global batch size: 256 | lm loss: 1.873876E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.426 | TFLOPs: 39.24 | 15: iteration 109450/ 125429 | consumed samples: 28019200 | 
consumed tokens: 57383321600 | elapsed time per iteration (s): 1.05 | learning rate: 2.725E-05 | global batch size: 256 | lm loss: 1.897858E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.180 | TFLOPs: 40.35 | 15: iteration 109460/ 125429 | consumed samples: 28021760 | consumed tokens: 57388564480 | elapsed time per iteration (s): 1.08 | learning rate: 2.725E-05 | global batch size: 256 | lm loss: 1.899359E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.726 | TFLOPs: 39.29 | 15: iteration 109470/ 125429 | consumed samples: 28024320 | consumed tokens: 57393807360 | elapsed time per iteration (s): 1.08 | learning rate: 2.724E-05 | global batch size: 256 | lm loss: 1.923894E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.093 | TFLOPs: 39.18 | 15: iteration 109480/ 125429 | consumed samples: 28026880 | consumed tokens: 57399050240 | elapsed time per iteration (s): 1.09 | learning rate: 2.723E-05 | global batch size: 256 | lm loss: 1.896142E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.854 | TFLOPs: 38.98 | 15: iteration 109490/ 125429 | consumed samples: 28029440 | consumed tokens: 57404293120 | elapsed time per iteration (s): 1.06 | learning rate: 2.722E-05 | global batch size: 256 | lm loss: 1.876853E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.760 | TFLOPs: 39.79 | 15: iteration 109500/ 125429 | consumed samples: 28032000 | consumed tokens: 57409536000 | elapsed time per iteration (s): 1.04 | learning rate: 2.721E-05 | global batch size: 256 | lm loss: 1.901100E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 245.551 | TFLOPs: 40.58 | 15: iteration 109510/ 125429 | consumed samples: 28034560 | consumed tokens: 57414778880 | elapsed time per iteration (s): 1.20 | learning rate: 2.720E-05 | global batch size: 256 | lm loss: 1.900014E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 213.849 | TFLOPs: 35.34 | 15: iteration 109520/ 125429 | consumed samples: 28037120 | consumed tokens: 57420021760 | elapsed time per iteration (s): 1.05 | learning rate: 2.719E-05 | global batch size: 256 | lm loss: 1.917797E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.295 | TFLOPs: 40.37 | 15: iteration 109530/ 125429 | consumed samples: 28039680 | consumed tokens: 57425264640 | elapsed time per iteration (s): 1.08 | learning rate: 2.718E-05 | global batch size: 256 | lm loss: 1.898476E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.735 | TFLOPs: 39.29 | 15: iteration 109540/ 125429 | consumed samples: 28042240 | consumed tokens: 57430507520 | elapsed time per iteration (s): 1.02 | learning rate: 2.717E-05 | global batch size: 256 | lm loss: 1.904484E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.555 | TFLOPs: 41.41 | 15: iteration 109550/ 125429 | consumed samples: 28044800 | consumed tokens: 57435750400 | elapsed time per iteration (s): 1.05 | learning rate: 2.717E-05 | global batch size: 256 | lm loss: 1.888543E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.837 | TFLOPs: 40.46 | 15: iteration 109560/ 125429 | consumed samples: 28047360 | consumed tokens: 57440993280 | elapsed time per iteration (s): 1.10 | learning rate: 2.716E-05 | global batch size: 
256 | lm loss: 1.917067E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.559 | TFLOPs: 38.43 | 15: iteration 109570/ 125429 | consumed samples: 28049920 | consumed tokens: 57446236160 | elapsed time per iteration (s): 1.03 | learning rate: 2.715E-05 | global batch size: 256 | lm loss: 1.899659E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.452 | TFLOPs: 40.89 | 15: iteration 109580/ 125429 | consumed samples: 28052480 | consumed tokens: 57451479040 | elapsed time per iteration (s): 1.07 | learning rate: 2.714E-05 | global batch size: 256 | lm loss: 1.919647E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.977 | TFLOPs: 39.66 | 15: iteration 109590/ 125429 | consumed samples: 28055040 | consumed tokens: 57456721920 | elapsed time per iteration (s): 1.06 | learning rate: 2.713E-05 | global batch size: 256 | lm loss: 1.896434E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.518 | TFLOPs: 40.08 | 15: iteration 109600/ 125429 | consumed samples: 28057600 | consumed tokens: 57461964800 | elapsed time per iteration (s): 1.07 | learning rate: 2.712E-05 | global batch size: 256 | lm loss: 1.899648E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.302 | TFLOPs: 39.55 | 15: iteration 109610/ 125429 | consumed samples: 28060160 | consumed tokens: 57467207680 | elapsed time per iteration (s): 1.03 | learning rate: 2.711E-05 | global batch size: 256 | lm loss: 1.890362E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.759 | TFLOPs: 41.11 | 15: iteration 109620/ 125429 | consumed samples: 28062720 | 
consumed tokens: 57472450560 | elapsed time per iteration (s): 1.04 | learning rate: 2.710E-05 | global batch size: 256 | lm loss: 1.908392E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.809 | TFLOPs: 40.62 | 15: iteration 109630/ 125429 | consumed samples: 28065280 | consumed tokens: 57477693440 | elapsed time per iteration (s): 1.06 | learning rate: 2.709E-05 | global batch size: 256 | lm loss: 1.898077E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.334 | TFLOPs: 39.88 | 15: iteration 109640/ 125429 | consumed samples: 28067840 | consumed tokens: 57482936320 | elapsed time per iteration (s): 1.02 | learning rate: 2.709E-05 | global batch size: 256 | lm loss: 1.905419E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.877 | TFLOPs: 41.29 | 15: iteration 109650/ 125429 | consumed samples: 28070400 | consumed tokens: 57488179200 | elapsed time per iteration (s): 1.05 | learning rate: 2.708E-05 | global batch size: 256 | lm loss: 1.911957E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.407 | TFLOPs: 40.39 | 15: iteration 109660/ 125429 | consumed samples: 28072960 | consumed tokens: 57493422080 | elapsed time per iteration (s): 1.04 | learning rate: 2.707E-05 | global batch size: 256 | lm loss: 1.918898E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.807 | TFLOPs: 40.79 | 15: iteration 109670/ 125429 | consumed samples: 28075520 | consumed tokens: 57498664960 | elapsed time per iteration (s): 1.05 | learning rate: 2.706E-05 | global batch size: 256 | lm loss: 1.882350E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 243.609 | TFLOPs: 40.26 | 15: iteration 109680/ 125429 | consumed samples: 28078080 | consumed tokens: 57503907840 | elapsed time per iteration (s): 1.03 | learning rate: 2.705E-05 | global batch size: 256 | lm loss: 1.890307E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.496 | TFLOPs: 41.07 | 15: iteration 109690/ 125429 | consumed samples: 28080640 | consumed tokens: 57509150720 | elapsed time per iteration (s): 1.06 | learning rate: 2.704E-05 | global batch size: 256 | lm loss: 1.924458E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.493 | TFLOPs: 39.91 | 15: iteration 109700/ 125429 | consumed samples: 28083200 | consumed tokens: 57514393600 | elapsed time per iteration (s): 1.03 | learning rate: 2.703E-05 | global batch size: 256 | lm loss: 1.904988E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.213 | TFLOPs: 41.02 | 15: iteration 109710/ 125429 | consumed samples: 28085760 | consumed tokens: 57519636480 | elapsed time per iteration (s): 1.06 | learning rate: 2.702E-05 | global batch size: 256 | lm loss: 1.912744E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.691 | TFLOPs: 39.94 | 15: iteration 109720/ 125429 | consumed samples: 28088320 | consumed tokens: 57524879360 | elapsed time per iteration (s): 1.05 | learning rate: 2.701E-05 | global batch size: 256 | lm loss: 1.918925E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.549 | TFLOPs: 40.25 | 15: iteration 109730/ 125429 | consumed samples: 28090880 | consumed tokens: 57530122240 | elapsed time per iteration (s): 1.04 | learning rate: 2.701E-05 | global batch size: 
256 | lm loss: 1.880696E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.905 | TFLOPs: 40.80 | 15: iteration 109740/ 125429 | consumed samples: 28093440 | consumed tokens: 57535365120 | elapsed time per iteration (s): 1.19 | learning rate: 2.700E-05 | global batch size: 256 | lm loss: 1.904870E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.688 | TFLOPs: 35.64 | 15: iteration 109750/ 125429 | consumed samples: 28096000 | consumed tokens: 57540608000 | elapsed time per iteration (s): 1.03 | learning rate: 2.699E-05 | global batch size: 256 | lm loss: 1.910791E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.667 | TFLOPs: 40.93 | 15: iteration 109760/ 125429 | consumed samples: 28098560 | consumed tokens: 57545850880 | elapsed time per iteration (s): 1.02 | learning rate: 2.698E-05 | global batch size: 256 | lm loss: 1.887702E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.063 | TFLOPs: 41.49 | 15: iteration 109770/ 125429 | consumed samples: 28101120 | consumed tokens: 57551093760 | elapsed time per iteration (s): 1.03 | learning rate: 2.697E-05 | global batch size: 256 | lm loss: 1.869320E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.431 | TFLOPs: 41.06 | 15: iteration 109780/ 125429 | consumed samples: 28103680 | consumed tokens: 57556336640 | elapsed time per iteration (s): 1.20 | learning rate: 2.696E-05 | global batch size: 256 | lm loss: 1.868856E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 213.113 | TFLOPs: 35.22 | 15: iteration 109790/ 125429 | consumed samples: 28106240 | 
consumed tokens: 57561579520 | elapsed time per iteration (s): 1.05 | learning rate: 2.695E-05 | global batch size: 256 | lm loss: 1.914277E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.120 | TFLOPs: 40.18 | 15: iteration 109800/ 125429 | consumed samples: 28108800 | consumed tokens: 57566822400 | elapsed time per iteration (s): 1.03 | learning rate: 2.694E-05 | global batch size: 256 | lm loss: 1.922251E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.543 | TFLOPs: 40.91 | 15: iteration 109810/ 125429 | consumed samples: 28111360 | consumed tokens: 57572065280 | elapsed time per iteration (s): 1.04 | learning rate: 2.694E-05 | global batch size: 256 | lm loss: 1.894342E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.179 | TFLOPs: 40.52 | 15: iteration 109820/ 125429 | consumed samples: 28113920 | consumed tokens: 57577308160 | elapsed time per iteration (s): 1.03 | learning rate: 2.693E-05 | global batch size: 256 | lm loss: 1.913415E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.881 | TFLOPs: 40.96 | 15: iteration 109830/ 125429 | consumed samples: 28116480 | consumed tokens: 57582551040 | elapsed time per iteration (s): 1.03 | learning rate: 2.692E-05 | global batch size: 256 | lm loss: 1.887279E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.379 | TFLOPs: 41.05 | 15: iteration 109840/ 125429 | consumed samples: 28119040 | consumed tokens: 57587793920 | elapsed time per iteration (s): 1.05 | learning rate: 2.691E-05 | global batch size: 256 | lm loss: 1.891869E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan 
iterations: 0 | samples per second: 244.662 | TFLOPs: 40.43 | 15: iteration 109850/ 125429 | consumed samples: 28121600 | consumed tokens: 57593036800 | elapsed time per iteration (s): 1.02 | learning rate: 2.690E-05 | global batch size: 256 | lm loss: 1.876864E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.517 | TFLOPs: 41.40 | 15: iteration 109860/ 125429 | consumed samples: 28124160 | consumed tokens: 57598279680 | elapsed time per iteration (s): 1.03 | learning rate: 2.689E-05 | global batch size: 256 | lm loss: 1.920859E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.669 | TFLOPs: 40.93 | 15: iteration 109870/ 125429 | consumed samples: 28126720 | consumed tokens: 57603522560 | elapsed time per iteration (s): 1.03 | learning rate: 2.688E-05 | global batch size: 256 | lm loss: 1.878856E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.479 | TFLOPs: 41.06 | 15: iteration 109880/ 125429 | consumed samples: 28129280 | consumed tokens: 57608765440 | elapsed time per iteration (s): 1.03 | learning rate: 2.687E-05 | global batch size: 256 | lm loss: 1.883899E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.118 | TFLOPs: 41.00 | 15: iteration 109890/ 125429 | consumed samples: 28131840 | consumed tokens: 57614008320 | elapsed time per iteration (s): 1.02 | learning rate: 2.687E-05 | global batch size: 256 | lm loss: 1.892187E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.795 | TFLOPs: 41.28 | 15: iteration 109900/ 125429 | consumed samples: 28134400 | consumed tokens: 57619251200 | elapsed time per iteration (s): 1.05 | learning rate: 2.686E-05 | global batch size: 
256 | lm loss: 1.875632E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.642 | TFLOPs: 40.43 | 15: iteration 109910/ 125429 | consumed samples: 28136960 | consumed tokens: 57624494080 | elapsed time per iteration (s): 1.04 | learning rate: 2.685E-05 | global batch size: 256 | lm loss: 1.922448E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.126 | TFLOPs: 40.51 | 15: iteration 109920/ 125429 | consumed samples: 28139520 | consumed tokens: 57629736960 | elapsed time per iteration (s): 1.05 | learning rate: 2.684E-05 | global batch size: 256 | lm loss: 1.897969E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.696 | TFLOPs: 40.11 | 15: iteration 109930/ 125429 | consumed samples: 28142080 | consumed tokens: 57634979840 | elapsed time per iteration (s): 1.04 | learning rate: 2.683E-05 | global batch size: 256 | lm loss: 1.898991E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.182 | TFLOPs: 40.68 | 15: iteration 109940/ 125429 | consumed samples: 28144640 | consumed tokens: 57640222720 | elapsed time per iteration (s): 1.04 | learning rate: 2.682E-05 | global batch size: 256 | lm loss: 1.901683E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.187 | TFLOPs: 40.85 | 15: iteration 109950/ 125429 | consumed samples: 28147200 | consumed tokens: 57645465600 | elapsed time per iteration (s): 1.03 | learning rate: 2.681E-05 | global batch size: 256 | lm loss: 1.899538E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.347 | TFLOPs: 41.21 | 15: iteration 109960/ 125429 | consumed samples: 28149760 | 
consumed tokens: 57650708480 | elapsed time per iteration (s): 1.04 | learning rate: 2.680E-05 | global batch size: 256 | lm loss: 1.895873E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.326 | TFLOPs: 40.87 |
15: iteration 109970/ 125429 | consumed samples: 28152320 | consumed tokens: 57655951360 | elapsed time per iteration (s): 1.03 | learning rate: 2.680E-05 | global batch size: 256 | lm loss: 1.896399E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.879 | TFLOPs: 40.96 |
15: iteration 109980/ 125429 | consumed samples: 28154880 | consumed tokens: 57661194240 | elapsed time per iteration (s): 2.77 | learning rate: 2.679E-05 | global batch size: 256 | lm loss: 1.897503E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 92.483 | TFLOPs: 15.28 |
15: iteration 109990/ 125429 | consumed samples: 28157440 | consumed tokens: 57666437120 | elapsed time per iteration (s): 1.02 | learning rate: 2.678E-05 | global batch size: 256 | lm loss: 1.877703E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.490 | TFLOPs: 41.56 |
0: [2022-11-27 04:35:26,163] [INFO] [logging.py:68:log_dist] [Rank 0] step=110000, skipped=0, lr=[2.6770219657881785e-05, 2.6770219657881785e-05, 2.6770219657881785e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
15: iteration 110000/ 125429 | consumed samples: 28160000 | consumed tokens: 57671680000 | elapsed time per iteration (s): 1.03 | learning rate: 2.677E-05 | global batch size: 256 | lm loss: 1.894107E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.698 | TFLOPs: 41.10 |
0: steps: 110000 loss: 1.8438 iter time (s): 1.057 samples/sec: 242.213
15: --------------------------------------------------------------------------------------------
15: valid loss at iteration 110000 | lm loss value: 1.980232E+00 | lm loss PPL: 7.244423E+00 |
15: --------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 110000 to checkpoints_1b5
0: [2022-11-27 04:35:26,523] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step110000 is begin to save!
0: [2022-11-27 04:35:26,531] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_01-model_00-model_states.pt...
0: [2022-11-27 04:35:26,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_01-model_00-model_states.pt.
0: [2022-11-27 04:35:26,798] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_03-model_00-model_states.pt...
0: [2022-11-27 04:35:26,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_03-model_00-model_states.pt.
0: [2022-11-27 04:35:26,906] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_04-model_00-model_states.pt...
0: [2022-11-27 04:35:27,025] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_04-model_00-model_states.pt.
0: [2022-11-27 04:35:27,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_05-model_00-model_states.pt...
0: [2022-11-27 04:35:27,145] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_05-model_00-model_states.pt.
0: [2022-11-27 04:35:27,146] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_06-model_00-model_states.pt...
0: [2022-11-27 04:35:27,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_06-model_00-model_states.pt. 0: [2022-11-27 04:35:27,258] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_07-model_00-model_states.pt... 0: [2022-11-27 04:35:27,376] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_07-model_00-model_states.pt. 0: [2022-11-27 04:35:27,376] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_08-model_00-model_states.pt... 0: [2022-11-27 04:35:27,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_08-model_00-model_states.pt. 0: [2022-11-27 04:35:27,493] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_09-model_00-model_states.pt... 0: [2022-11-27 04:35:27,610] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_09-model_00-model_states.pt. 0: [2022-11-27 04:35:27,610] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_10-model_00-model_states.pt... 0: [2022-11-27 04:35:27,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_10-model_00-model_states.pt. 0: [2022-11-27 04:35:27,727] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_11-model_00-model_states.pt... 0: [2022-11-27 04:35:27,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_11-model_00-model_states.pt. 0: [2022-11-27 04:35:27,842] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_12-model_00-model_states.pt... 
0: [2022-11-27 04:35:27,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_12-model_00-model_states.pt. 0: [2022-11-27 04:35:27,954] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_13-model_00-model_states.pt... 0: [2022-11-27 04:35:28,068] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_13-model_00-model_states.pt. 0: [2022-11-27 04:35:28,069] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_14-model_00-model_states.pt... 0: [2022-11-27 04:35:28,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_14-model_00-model_states.pt. 0: [2022-11-27 04:35:28,185] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_15-model_00-model_states.pt... 0: [2022-11-27 04:35:28,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_15-model_00-model_states.pt. 0: [2022-11-27 04:35:28,301] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_16-model_00-model_states.pt... 0: [2022-11-27 04:35:28,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_16-model_00-model_states.pt. 0: [2022-11-27 04:35:28,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_17-model_00-model_states.pt... 0: [2022-11-27 04:35:28,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_17-model_00-model_states.pt. 0: [2022-11-27 04:35:28,534] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_18-model_00-model_states.pt... 
0: [2022-11-27 04:35:28,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_18-model_00-model_states.pt. 0: [2022-11-27 04:35:28,645] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_19-model_00-model_states.pt... 0: [2022-11-27 04:35:28,755] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_19-model_00-model_states.pt. 0: [2022-11-27 04:35:28,756] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_20-model_00-model_states.pt... 0: [2022-11-27 04:35:28,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_20-model_00-model_states.pt. 0: [2022-11-27 04:35:28,875] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_21-model_00-model_states.pt... 0: [2022-11-27 04:35:28,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_21-model_00-model_states.pt. 0: [2022-11-27 04:35:28,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_22-model_00-model_states.pt... 0: [2022-11-27 04:35:29,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_22-model_00-model_states.pt. 0: [2022-11-27 04:35:29,103] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_23-model_00-model_states.pt... 0: [2022-11-27 04:35:29,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_23-model_00-model_states.pt. 0: [2022-11-27 04:35:29,214] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_24-model_00-model_states.pt... 
0: [2022-11-27 04:35:29,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_24-model_00-model_states.pt. 0: [2022-11-27 04:35:29,326] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_25-model_00-model_states.pt... 0: [2022-11-27 04:35:29,439] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_25-model_00-model_states.pt. 0: [2022-11-27 04:35:29,440] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_26-model_00-model_states.pt... 0: [2022-11-27 04:35:29,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_26-model_00-model_states.pt. 0: [2022-11-27 04:35:29,550] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_27-model_00-model_states.pt... 0: [2022-11-27 04:35:29,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_27-model_00-model_states.pt. 0: [2022-11-27 04:35:29,662] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_28-model_00-model_states.pt... 0: [2022-11-27 04:35:29,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_28-model_00-model_states.pt. 0: [2022-11-27 04:35:29,773] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_29-model_00-model_states.pt... 0: [2022-11-27 04:35:29,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_29-model_00-model_states.pt. 0: [2022-11-27 04:35:29,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_30-model_00-model_states.pt... 
0: [2022-11-27 04:35:29,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_30-model_00-model_states.pt. 0: [2022-11-27 04:35:29,999] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/layer_32-model_00-model_states.pt... 0: [2022-11-27 04:35:30,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/layer_32-model_00-model_states.pt. 0: [2022-11-27 04:35:30,004] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step110000/mp_rank_00_model_states.pt 0: [2022-11-27 04:35:30,004] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/mp_rank_00_model_states.pt... 0: [2022-11-27 04:35:30,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/mp_rank_00_model_states.pt. 0: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
8: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 
5: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
9: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 
6: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
12: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
8: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
9: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 
2: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
11: [2022-11-27 04:35:30,046] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step110000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:35:30,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:35:30,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 04:35:30,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 6: [2022-11-27 04:35:30,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:35:30,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 04:35:30,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 1: [2022-11-27 04:35:30,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:35:30,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:35:30,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:35:30,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 04:35:30,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
6: [2022-11-27 04:35:30,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:35:30,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 04:35:30,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 4: [2022-11-27 04:35:30,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:35:30,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 04:35:30,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:35:30,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 4: [2022-11-27 04:35:30,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 04:35:30,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 1: [2022-11-27 04:35:30,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 04:35:30,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 1: [2022-11-27 04:35:30,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-27 04:35:30,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 04:35:30,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 1: [2022-11-27 04:35:30,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:35:30,215] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 04:35:30,215] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 6: [2022-11-27 04:35:30,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:35:30,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 04:35:30,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 4: [2022-11-27 04:35:30,220] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:35:30,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 04:35:30,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 4: [2022-11-27 04:35:30,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-27 04:35:30,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:35:30,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 04:35:30,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 15: [2022-11-27 04:35:30,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:35:30,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 4: [2022-11-27 04:35:30,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 10: [2022-11-27 04:35:30,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:35:30,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 15: [2022-11-27 04:35:30,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 10: [2022-11-27 04:35:30,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 15: [2022-11-27 04:35:30,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 12: [2022-11-27 04:35:30,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-27 04:35:30,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:35:30,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 04:35:30,222] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 04:35:30,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 12: [2022-11-27 04:35:30,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 1: [2022-11-27 04:35:30,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:35:30,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 04:35:30,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 7: [2022-11-27 04:35:30,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:35:30,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:35:30,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
7: [2022-11-27 04:35:30,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 04:35:30,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 10: [2022-11-27 04:35:30,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 7: [2022-11-27 04:35:30,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 7: [2022-11-27 04:35:30,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 10: [2022-11-27 04:35:30,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 7: [2022-11-27 04:35:30,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:35:30,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 04:35:30,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 6: [2022-11-27 04:35:30,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:35:30,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 04:35:30,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-27 04:35:30,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 6: [2022-11-27 04:35:30,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 04:35:30,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 7: [2022-11-27 04:35:30,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:35:30,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 04:35:30,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 2: [2022-11-27 04:35:30,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:35:30,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 04:35:30,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 2: [2022-11-27 04:35:30,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:35:30,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 04:35:30,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
2: [2022-11-27 04:35:30,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:35:30,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 04:35:30,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 15: [2022-11-27 04:35:30,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:35:30,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:35:30,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 04:35:30,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 04:35:30,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 15: [2022-11-27 04:35:30,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 14: [2022-11-27 04:35:30,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:35:30,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 04:35:30,221] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
14: [2022-11-27 04:35:30,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:35:30,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 04:35:30,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 14: [2022-11-27 04:35:30,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:35:30,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:35:30,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 04:35:30,229] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 04:35:30,229] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 14: [2022-11-27 04:35:30,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 1: [2022-11-27 04:35:30,223] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:35:30,223] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 04:35:30,223] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
1: [2022-11-27 04:35:30,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:35:30,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 04:35:30,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 1: [2022-11-27 04:35:30,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:35:30,232] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 04:35:30,232] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 7: [2022-11-27 04:35:30,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:35:30,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 04:35:30,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 13: [2022-11-27 04:35:30,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:35:30,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-27 04:35:30,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 04:35:30,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 8: [2022-11-27 04:35:30,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:35:30,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 04:35:30,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 8: [2022-11-27 04:35:30,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:35:30,219] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 04:35:30,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 8: [2022-11-27 04:35:30,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:35:30,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 13: [2022-11-27 04:35:30,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 8: [2022-11-27 04:35:30,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
13: [2022-11-27 04:35:30,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 13: [2022-11-27 04:35:30,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:35:30,221] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 04:35:30,222] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 13: [2022-11-27 04:35:30,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:35:30,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 04:35:30,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 13: [2022-11-27 04:35:30,224] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:35:30,224] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 04:35:30,224] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 1: [2022-11-27 04:35:30,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-27 04:35:30,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 04:35:30,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 7: [2022-11-27 04:35:30,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:35:30,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 04:35:30,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 14: [2022-11-27 04:35:30,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:35:30,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 04:35:30,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 3: [2022-11-27 04:35:30,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:35:30,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 04:35:30,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 3: [2022-11-27 04:35:30,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-27 04:35:30,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 04:35:30,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 3: [2022-11-27 04:35:30,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:35:30,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 04:35:30,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 3: [2022-11-27 04:35:30,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:35:30,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 04:35:30,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 10: [2022-11-27 04:35:30,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:35:30,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-27 04:35:30,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 04:35:30,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 04:35:30,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 10: [2022-11-27 04:35:30,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 12: [2022-11-27 04:35:30,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:35:30,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:35:30,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 04:35:30,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:35:30,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:35:30,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:35:30,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
14: [2022-11-27 04:35:30,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 12: [2022-11-27 04:35:30,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 04:35:30,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 04:35:30,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 14: [2022-11-27 04:35:30,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 12: [2022-11-27 04:35:30,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 6: [2022-11-27 04:35:30,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:35:30,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 04:35:30,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 7: [2022-11-27 04:35:30,241] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 04:35:30,241] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 12: [2022-11-27 04:35:30,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-27 04:35:30,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 04:35:30,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 15: [2022-11-27 04:35:30,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:35:30,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:35:30,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:35:30,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:35:30,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:35:30,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 04:35:30,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 3: [2022-11-27 04:35:30,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 04:35:30,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 12: [2022-11-27 04:35:30,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
12: [2022-11-27 04:35:30,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 10: [2022-11-27 04:35:30,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:35:30,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 04:35:30,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 10: [2022-11-27 04:35:30,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:35:30,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 04:35:30,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 3: [2022-11-27 04:35:30,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:35:30,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 04:35:30,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
15: [2022-11-27 04:35:30,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 04:35:30,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 04:35:30,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 15: [2022-11-27 04:35:30,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 3: [2022-11-27 04:35:30,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:35:30,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 04:35:30,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 3: [2022-11-27 04:35:30,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:35:30,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 04:35:30,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 6: [2022-11-27 04:35:30,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-27 04:35:30,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 04:35:30,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 0: [2022-11-27 04:35:30,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:35:30,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 04:35:30,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 0: [2022-11-27 04:35:30,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:35:30,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 04:35:30,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:35:30,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:35:30,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
0: [2022-11-27 04:35:30,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 04:35:30,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 04:35:30,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 0: [2022-11-27 04:35:30,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 10: [2022-11-27 04:35:30,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:35:30,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:35:30,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 04:35:30,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 04:35:30,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 10: [2022-11-27 04:35:30,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 6: [2022-11-27 04:35:30,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-27 04:35:30,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 04:35:30,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 15: [2022-11-27 04:35:30,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:35:30,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:35:30,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 04:35:30,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 4: [2022-11-27 04:35:30,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:35:30,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:35:30,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:35:30,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 04:35:30,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 4: [2022-11-27 04:35:30,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-27 04:35:30,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 04:35:30,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 2: [2022-11-27 04:35:30,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 04:35:30,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 4: [2022-11-27 04:35:30,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:35:30,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 04:35:30,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 2: [2022-11-27 04:35:30,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 2: [2022-11-27 04:35:30,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 11: [2022-11-27 04:35:30,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 04:35:30,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 11: [2022-11-27 04:35:30,214] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-27 04:35:30,214] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 04:35:30,214] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 11: [2022-11-27 04:35:30,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:35:30,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 04:35:30,227] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 11: [2022-11-27 04:35:30,227] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:35:30,227] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 04:35:30,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 11: [2022-11-27 04:35:30,228] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:35:30,228] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 04:35:30,228] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 11: [2022-11-27 04:35:30,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-27 04:35:30,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 04:35:30,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 11: [2022-11-27 04:35:30,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:35:30,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 04:35:30,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 15: [2022-11-27 04:35:30,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:35:30,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 04:35:30,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 04:35:30,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 15: [2022-11-27 04:35:30,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 15: [2022-11-27 04:35:30,250] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-27 04:35:30,250] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 04:35:30,250] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 0: [2022-11-27 04:35:30,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:35:30,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 04:35:30,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 0: [2022-11-27 04:35:30,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:35:30,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 04:35:30,263] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 0: [2022-11-27 04:35:30,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:35:30,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:35:30,264] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 04:35:30,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
2: [2022-11-27 04:35:30,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:35:30,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 04:35:30,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 2: [2022-11-27 04:35:30,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:35:30,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 04:35:30,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 14: [2022-11-27 04:35:30,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:35:30,247] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 04:35:30,247] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 13: [2022-11-27 04:35:30,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:35:30,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
13: [2022-11-27 04:35:30,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 04:35:30,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 13: [2022-11-27 04:35:30,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:35:30,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:35:30,246] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:35:30,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 13: [2022-11-27 04:35:30,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 11: [2022-11-27 04:35:30,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 8: [2022-11-27 04:35:30,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:35:30,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 04:35:30,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
8: [2022-11-27 04:35:30,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 13: [2022-11-27 04:35:30,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 8: [2022-11-27 04:35:30,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 8: [2022-11-27 04:35:30,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 13: [2022-11-27 04:35:30,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:35:30,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 13: [2022-11-27 04:35:30,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 8: [2022-11-27 04:35:30,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:35:30,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 8: [2022-11-27 04:35:30,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 04:35:30,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 8: [2022-11-27 04:35:30,244] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 
8: [2022-11-27 04:35:30,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 04:35:30,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 9: [2022-11-27 04:35:30,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:35:30,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:35:30,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:35:30,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:35:30,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:35:30,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:35:30,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 04:35:30,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 
9: [2022-11-27 04:35:30,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 04:35:30,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 04:35:30,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 9: [2022-11-27 04:35:30,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 9: [2022-11-27 04:35:30,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:35:30,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 04:35:30,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 9: [2022-11-27 04:35:30,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 04:35:30,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 04:35:30,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 04:35:30,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 04:35:30,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
9: [2022-11-27 04:35:30,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 9: [2022-11-27 04:35:30,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 9: [2022-11-27 04:35:30,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 9: [2022-11-27 04:35:30,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 0: [2022-11-27 04:35:30,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 04:35:30,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 5: [2022-11-27 04:35:30,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:35:30,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:35:30,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:35:30,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:35:30,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:35:30,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-27 04:35:30,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:35:30,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:35:30,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 04:35:30,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 04:35:30,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 04:35:30,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 04:35:30,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 04:35:30,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 04:35:30,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 04:35:30,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
5: [2022-11-27 04:35:30,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step110000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 04:35:30,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 5: [2022-11-27 04:35:30,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 5: [2022-11-27 04:35:30,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 5: [2022-11-27 04:35:30,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 5: [2022-11-27 04:35:30,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 5: [2022-11-27 04:35:30,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 5: [2022-11-27 04:35:30,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step110000 is ready now! 
0: successfully saved checkpoint at iteration 110000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3970.30 15: iteration 110010/ 125429 | consumed samples: 28162560 | consumed tokens: 57676922880 | elapsed time per iteration (s): 1.45 | learning rate: 2.676E-05 | global batch size: 256 | lm loss: 1.908420E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 176.822 | TFLOPs: 29.22 | 15: iteration 110020/ 125429 | consumed samples: 28165120 | consumed tokens: 57682165760 | elapsed time per iteration (s): 1.04 | learning rate: 2.675E-05 | global batch size: 256 | lm loss: 1.918654E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.899 | TFLOPs: 40.64 | 15: iteration 110030/ 125429 | consumed samples: 28167680 | consumed tokens: 57687408640 | elapsed time per iteration (s): 1.03 | learning rate: 2.674E-05 | global batch size: 256 | lm loss: 1.903165E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.028 | TFLOPs: 41.15 | 15: iteration 110040/ 125429 | consumed samples: 28170240 | consumed tokens: 57692651520 | elapsed time per iteration (s): 1.02 | learning rate: 2.674E-05 | global batch size: 256 | lm loss: 1.919765E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.841 | TFLOPs: 41.45 | 15: iteration 110050/ 125429 | consumed samples: 28172800 | consumed tokens: 57697894400 | elapsed time per iteration (s): 1.03 | learning rate: 2.673E-05 | global batch size: 256 | lm loss: 1.909873E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.309 | TFLOPs: 41.03 | 15: iteration 110060/ 125429 | consumed samples: 28175360 | consumed tokens: 57703137280 | elapsed time per iteration (s): 
1.03 | learning rate: 2.672E-05 | global batch size: 256 | lm loss: 1.888886E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.665 | TFLOPs: 40.93 | 15: iteration 110070/ 125429 | consumed samples: 28177920 | consumed tokens: 57708380160 | elapsed time per iteration (s): 1.03 | learning rate: 2.671E-05 | global batch size: 256 | lm loss: 1.885192E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.872 | TFLOPs: 41.13 | 15: iteration 110080/ 125429 | consumed samples: 28180480 | consumed tokens: 57713623040 | elapsed time per iteration (s): 1.06 | learning rate: 2.670E-05 | global batch size: 256 | lm loss: 1.890764E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.549 | TFLOPs: 39.92 | 15: iteration 110090/ 125429 | consumed samples: 28183040 | consumed tokens: 57718865920 | elapsed time per iteration (s): 1.41 | learning rate: 2.669E-05 | global batch size: 256 | lm loss: 1.898030E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 181.742 | TFLOPs: 30.03 | 15: iteration 110100/ 125429 | consumed samples: 28185600 | consumed tokens: 57724108800 | elapsed time per iteration (s): 1.04 | learning rate: 2.668E-05 | global batch size: 256 | lm loss: 1.882207E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.136 | TFLOPs: 40.68 | 15: iteration 110110/ 125429 | consumed samples: 28188160 | consumed tokens: 57729351680 | elapsed time per iteration (s): 1.05 | learning rate: 2.668E-05 | global batch size: 256 | lm loss: 1.877536E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.570 | TFLOPs: 40.25 | 15: 
iteration 110120/ 125429 | consumed samples: 28190720 | consumed tokens: 57734594560 | elapsed time per iteration (s): 1.02 | learning rate: 2.667E-05 | global batch size: 256 | lm loss: 1.875860E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.551 | TFLOPs: 41.57 | 15: iteration 110130/ 125429 | consumed samples: 28193280 | consumed tokens: 57739837440 | elapsed time per iteration (s): 1.04 | learning rate: 2.666E-05 | global batch size: 256 | lm loss: 1.889757E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.405 | TFLOPs: 40.55 | 15: iteration 110140/ 125429 | consumed samples: 28195840 | consumed tokens: 57745080320 | elapsed time per iteration (s): 1.06 | learning rate: 2.665E-05 | global batch size: 256 | lm loss: 1.913796E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.982 | TFLOPs: 39.99 | 15: iteration 110150/ 125429 | consumed samples: 28198400 | consumed tokens: 57750323200 | elapsed time per iteration (s): 1.10 | learning rate: 2.664E-05 | global batch size: 256 | lm loss: 1.851793E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.524 | TFLOPs: 38.43 | 15: iteration 110160/ 125429 | consumed samples: 28200960 | consumed tokens: 57755566080 | elapsed time per iteration (s): 1.04 | learning rate: 2.663E-05 | global batch size: 256 | lm loss: 1.894243E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.006 | TFLOPs: 40.65 | 15: iteration 110170/ 125429 | consumed samples: 28203520 | consumed tokens: 57760808960 | elapsed time per iteration (s): 1.04 | learning rate: 2.662E-05 | global batch size: 256 | lm loss: 1.894077E+00 | grad norm: 0.163 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.943 | TFLOPs: 40.81 | 15: iteration 110180/ 125429 | consumed samples: 28206080 | consumed tokens: 57766051840 | elapsed time per iteration (s): 1.03 | learning rate: 2.662E-05 | global batch size: 256 | lm loss: 1.911509E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.890 | TFLOPs: 41.13 | 15: iteration 110190/ 125429 | consumed samples: 28208640 | consumed tokens: 57771294720 | elapsed time per iteration (s): 1.04 | learning rate: 2.661E-05 | global batch size: 256 | lm loss: 1.903721E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.502 | TFLOPs: 40.57 | 15: iteration 110200/ 125429 | consumed samples: 28211200 | consumed tokens: 57776537600 | elapsed time per iteration (s): 1.03 | learning rate: 2.660E-05 | global batch size: 256 | lm loss: 1.896955E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.198 | TFLOPs: 41.18 | 15: iteration 110210/ 125429 | consumed samples: 28213760 | consumed tokens: 57781780480 | elapsed time per iteration (s): 1.04 | learning rate: 2.659E-05 | global batch size: 256 | lm loss: 1.887257E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.278 | TFLOPs: 40.86 | 15: iteration 110220/ 125429 | consumed samples: 28216320 | consumed tokens: 57787023360 | elapsed time per iteration (s): 1.03 | learning rate: 2.658E-05 | global batch size: 256 | lm loss: 1.905700E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.953 | TFLOPs: 41.14 | 15: iteration 110230/ 125429 | consumed samples: 28218880 | consumed tokens: 57792266240 | elapsed time per iteration (s): 1.03 | 
learning rate: 2.657E-05 | global batch size: 256 | lm loss: 1.903950E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.644 | TFLOPs: 41.09 | 15: iteration 110240/ 125429 | consumed samples: 28221440 | consumed tokens: 57797509120 | elapsed time per iteration (s): 1.05 | learning rate: 2.656E-05 | global batch size: 256 | lm loss: 1.899601E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.070 | TFLOPs: 40.33 | 15: iteration 110250/ 125429 | consumed samples: 28224000 | consumed tokens: 57802752000 | elapsed time per iteration (s): 1.04 | learning rate: 2.656E-05 | global batch size: 256 | lm loss: 1.902010E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.843 | TFLOPs: 40.63 | 15: iteration 110260/ 125429 | consumed samples: 28226560 | consumed tokens: 57807994880 | elapsed time per iteration (s): 1.08 | learning rate: 2.655E-05 | global batch size: 256 | lm loss: 1.884650E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.038 | TFLOPs: 39.34 | 15: iteration 110270/ 125429 | consumed samples: 28229120 | consumed tokens: 57813237760 | elapsed time per iteration (s): 1.05 | learning rate: 2.654E-05 | global batch size: 256 | lm loss: 1.894970E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.811 | TFLOPs: 40.46 | 15: iteration 110280/ 125429 | consumed samples: 28231680 | consumed tokens: 57818480640 | elapsed time per iteration (s): 1.03 | learning rate: 2.653E-05 | global batch size: 256 | lm loss: 1.894762E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.904 | TFLOPs: 41.13 | 15: iteration 
110290/ 125429 | consumed samples: 28234240 | consumed tokens: 57823723520 | elapsed time per iteration (s): 1.04 | learning rate: 2.652E-05 | global batch size: 256 | lm loss: 1.890122E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.831 | TFLOPs: 40.63 | 15: iteration 110300/ 125429 | consumed samples: 28236800 | consumed tokens: 57828966400 | elapsed time per iteration (s): 1.06 | learning rate: 2.651E-05 | global batch size: 256 | lm loss: 1.905793E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.202 | TFLOPs: 40.03 | 15: iteration 110310/ 125429 | consumed samples: 28239360 | consumed tokens: 57834209280 | elapsed time per iteration (s): 1.02 | learning rate: 2.650E-05 | global batch size: 256 | lm loss: 1.896821E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.932 | TFLOPs: 41.63 | 15: iteration 110320/ 125429 | consumed samples: 28241920 | consumed tokens: 57839452160 | elapsed time per iteration (s): 1.02 | learning rate: 2.650E-05 | global batch size: 256 | lm loss: 1.862837E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.442 | TFLOPs: 41.55 | 15: iteration 110330/ 125429 | consumed samples: 28244480 | consumed tokens: 57844695040 | elapsed time per iteration (s): 1.02 | learning rate: 2.649E-05 | global batch size: 256 | lm loss: 1.896673E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.425 | TFLOPs: 41.38 | 15: iteration 110340/ 125429 | consumed samples: 28247040 | consumed tokens: 57849937920 | elapsed time per iteration (s): 1.03 | learning rate: 2.648E-05 | global batch size: 256 | lm loss: 1.903042E+00 | grad norm: 0.166 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.630 | TFLOPs: 40.92 | 15: iteration 110350/ 125429 | consumed samples: 28249600 | consumed tokens: 57855180800 | elapsed time per iteration (s): 1.03 | learning rate: 2.647E-05 | global batch size: 256 | lm loss: 1.894862E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.238 | TFLOPs: 41.02 | 15: iteration 110360/ 125429 | consumed samples: 28252160 | consumed tokens: 57860423680 | elapsed time per iteration (s): 1.03 | learning rate: 2.646E-05 | global batch size: 256 | lm loss: 1.918622E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.298 | TFLOPs: 41.20 | 15: iteration 110370/ 125429 | consumed samples: 28254720 | consumed tokens: 57865666560 | elapsed time per iteration (s): 1.04 | learning rate: 2.645E-05 | global batch size: 256 | lm loss: 1.910061E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.378 | TFLOPs: 40.55 | 15: iteration 110380/ 125429 | consumed samples: 28257280 | consumed tokens: 57870909440 | elapsed time per iteration (s): 1.20 | learning rate: 2.644E-05 | global batch size: 256 | lm loss: 1.912223E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 213.220 | TFLOPs: 35.24 | 15: iteration 110390/ 125429 | consumed samples: 28259840 | consumed tokens: 57876152320 | elapsed time per iteration (s): 1.07 | learning rate: 2.644E-05 | global batch size: 256 | lm loss: 1.902945E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.956 | TFLOPs: 39.65 | 15: iteration 110400/ 125429 | consumed samples: 28262400 | consumed tokens: 57881395200 | elapsed time per iteration (s): 1.03 | learning 
rate: 2.643E-05 | global batch size: 256 | lm loss: 1.892597E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.088 | TFLOPs: 41.16 | 15: iteration 110410/ 125429 | consumed samples: 28264960 | consumed tokens: 57886638080 | elapsed time per iteration (s): 1.03 | learning rate: 2.642E-05 | global batch size: 256 | lm loss: 1.917042E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.794 | TFLOPs: 40.95 | 15: iteration 110420/ 125429 | consumed samples: 28267520 | consumed tokens: 57891880960 | elapsed time per iteration (s): 1.02 | learning rate: 2.641E-05 | global batch size: 256 | lm loss: 1.899451E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.165 | TFLOPs: 41.34 | 15: iteration 110430/ 125429 | consumed samples: 28270080 | consumed tokens: 57897123840 | elapsed time per iteration (s): 1.03 | learning rate: 2.640E-05 | global batch size: 256 | lm loss: 1.882240E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.667 | TFLOPs: 41.26 | 15: iteration 110440/ 125429 | consumed samples: 28272640 | consumed tokens: 57902366720 | elapsed time per iteration (s): 1.03 | learning rate: 2.639E-05 | global batch size: 256 | lm loss: 1.881793E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.737 | TFLOPs: 41.27 | 15: iteration 110450/ 125429 | consumed samples: 28275200 | consumed tokens: 57907609600 | elapsed time per iteration (s): 1.03 | learning rate: 2.639E-05 | global batch size: 256 | lm loss: 1.884928E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.647 | TFLOPs: 41.09 | 15: iteration 110460/ 
125429 | consumed samples: 28277760 | consumed tokens: 57912852480 | elapsed time per iteration (s): 1.03 | learning rate: 2.638E-05 | global batch size: 256 | lm loss: 1.910296E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.374 | TFLOPs: 41.21 | 15: iteration 110470/ 125429 | consumed samples: 28280320 | consumed tokens: 57918095360 | elapsed time per iteration (s): 1.04 | learning rate: 2.637E-05 | global batch size: 256 | lm loss: 1.881666E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.082 | TFLOPs: 40.83 | 15: iteration 110480/ 125429 | consumed samples: 28282880 | consumed tokens: 57923338240 | elapsed time per iteration (s): 1.18 | learning rate: 2.636E-05 | global batch size: 256 | lm loss: 1.898032E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.501 | TFLOPs: 35.78 | 15: iteration 110490/ 125429 | consumed samples: 28285440 | consumed tokens: 57928581120 | elapsed time per iteration (s): 1.18 | learning rate: 2.635E-05 | global batch size: 256 | lm loss: 1.878716E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.134 | TFLOPs: 35.72 | 15: iteration 110500/ 125429 | consumed samples: 28288000 | consumed tokens: 57933824000 | elapsed time per iteration (s): 1.23 | learning rate: 2.634E-05 | global batch size: 256 | lm loss: 1.895324E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 208.569 | TFLOPs: 34.47 | 15: iteration 110510/ 125429 | consumed samples: 28290560 | consumed tokens: 57939066880 | elapsed time per iteration (s): 1.05 | learning rate: 2.634E-05 | global batch size: 256 | lm loss: 1.879993E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 243.239 | TFLOPs: 40.20 | 15: iteration 110520/ 125429 | consumed samples: 28293120 | consumed tokens: 57944309760 | elapsed time per iteration (s): 1.04 | learning rate: 2.633E-05 | global batch size: 256 | lm loss: 1.865631E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.551 | TFLOPs: 40.74 | 15: iteration 110530/ 125429 | consumed samples: 28295680 | consumed tokens: 57949552640 | elapsed time per iteration (s): 1.03 | learning rate: 2.632E-05 | global batch size: 256 | lm loss: 1.883581E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.484 | TFLOPs: 41.23 | 15: iteration 110540/ 125429 | consumed samples: 28298240 | consumed tokens: 57954795520 | elapsed time per iteration (s): 1.04 | learning rate: 2.631E-05 | global batch size: 256 | lm loss: 1.895850E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.522 | TFLOPs: 40.74 | 15: iteration 110550/ 125429 | consumed samples: 28300800 | consumed tokens: 57960038400 | elapsed time per iteration (s): 1.06 | learning rate: 2.630E-05 | global batch size: 256 | lm loss: 1.881970E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.559 | TFLOPs: 39.75 | 15: iteration 110560/ 125429 | consumed samples: 28303360 | consumed tokens: 57965281280 | elapsed time per iteration (s): 1.03 | learning rate: 2.629E-05 | global batch size: 256 | lm loss: 1.912589E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.927 | TFLOPs: 41.14 | 15: iteration 110570/ 125429 | consumed samples: 28305920 | consumed tokens: 57970524160 | elapsed time per iteration (s): 1.04 | learning rate: 
2.629E-05 | global batch size: 256 | lm loss: 1.901997E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.108 | TFLOPs: 40.51 | 15: iteration 110580/ 125429 | consumed samples: 28308480 | consumed tokens: 57975767040 | elapsed time per iteration (s): 1.03 | learning rate: 2.628E-05 | global batch size: 256 | lm loss: 1.891161E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.413 | TFLOPs: 41.05 | 15: iteration 110590/ 125429 | consumed samples: 28311040 | consumed tokens: 57981009920 | elapsed time per iteration (s): 1.03 | learning rate: 2.627E-05 | global batch size: 256 | lm loss: 1.883655E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.960 | TFLOPs: 41.14 | 15: iteration 110600/ 125429 | consumed samples: 28313600 | consumed tokens: 57986252800 | elapsed time per iteration (s): 1.05 | learning rate: 2.626E-05 | global batch size: 256 | lm loss: 1.896519E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.929 | TFLOPs: 40.31 | 15: iteration 110610/ 125429 | consumed samples: 28316160 | consumed tokens: 57991495680 | elapsed time per iteration (s): 1.03 | learning rate: 2.625E-05 | global batch size: 256 | lm loss: 1.893831E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.912 | TFLOPs: 40.97 | 15: iteration 110620/ 125429 | consumed samples: 28318720 | consumed tokens: 57996738560 | elapsed time per iteration (s): 1.03 | learning rate: 2.624E-05 | global batch size: 256 | lm loss: 1.906397E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.625 | TFLOPs: 41.25 | 15: iteration 110630/ 125429 | 
consumed samples: 28321280 | consumed tokens: 58001981440 | elapsed time per iteration (s): 1.06 | learning rate: 2.623E-05 | global batch size: 256 | lm loss: 1.884212E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.395 | TFLOPs: 40.06 | 15: iteration 110640/ 125429 | consumed samples: 28323840 | consumed tokens: 58007224320 | elapsed time per iteration (s): 1.04 | learning rate: 2.623E-05 | global batch size: 256 | lm loss: 1.901155E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.310 | TFLOPs: 40.70 | 15: iteration 110650/ 125429 | consumed samples: 28326400 | consumed tokens: 58012467200 | elapsed time per iteration (s): 1.06 | learning rate: 2.622E-05 | global batch size: 256 | lm loss: 1.912843E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.220 | TFLOPs: 39.86 | 15: iteration 110660/ 125429 | consumed samples: 28328960 | consumed tokens: 58017710080 | elapsed time per iteration (s): 1.04 | learning rate: 2.621E-05 | global batch size: 256 | lm loss: 1.897655E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.098 | TFLOPs: 40.67 | 15: iteration 110670/ 125429 | consumed samples: 28331520 | consumed tokens: 58022952960 | elapsed time per iteration (s): 1.04 | learning rate: 2.620E-05 | global batch size: 256 | lm loss: 1.897547E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.207 | TFLOPs: 40.52 | 15: iteration 110680/ 125429 | consumed samples: 28334080 | consumed tokens: 58028195840 | elapsed time per iteration (s): 1.07 | learning rate: 2.619E-05 | global batch size: 256 | lm loss: 1.888539E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 238.352 | TFLOPs: 39.39 | 15: iteration 110690/ 125429 | consumed samples: 28336640 | consumed tokens: 58033438720 | elapsed time per iteration (s): 1.03 | learning rate: 2.619E-05 | global batch size: 256 | lm loss: 1.918238E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.563 | TFLOPs: 40.91 | 15: iteration 110700/ 125429 | consumed samples: 28339200 | consumed tokens: 58038681600 | elapsed time per iteration (s): 1.03 | learning rate: 2.618E-05 | global batch size: 256 | lm loss: 1.890194E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.135 | TFLOPs: 41.17 | 15: iteration 110710/ 125429 | consumed samples: 28341760 | consumed tokens: 58043924480 | elapsed time per iteration (s): 1.13 | learning rate: 2.617E-05 | global batch size: 256 | lm loss: 1.901229E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.812 | TFLOPs: 37.32 | 15: iteration 110720/ 125429 | consumed samples: 28344320 | consumed tokens: 58049167360 | elapsed time per iteration (s): 1.04 | learning rate: 2.616E-05 | global batch size: 256 | lm loss: 1.915953E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.627 | TFLOPs: 40.59 | 15: iteration 110730/ 125429 | consumed samples: 28346880 | consumed tokens: 58054410240 | elapsed time per iteration (s): 1.04 | learning rate: 2.615E-05 | global batch size: 256 | lm loss: 1.896140E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.549 | TFLOPs: 40.74 | 15: iteration 110740/ 125429 | consumed samples: 28349440 | consumed tokens: 58059653120 | elapsed time per iteration (s): 1.03 | learning rate: 
2.614E-05 | global batch size: 256 | lm loss: 1.906904E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.183 | TFLOPs: 41.01 | 15: iteration 110750/ 125429 | consumed samples: 28352000 | consumed tokens: 58064896000 | elapsed time per iteration (s): 1.03 | learning rate: 2.614E-05 | global batch size: 256 | lm loss: 1.877166E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.567 | TFLOPs: 40.91 | 15: iteration 110760/ 125429 | consumed samples: 28354560 | consumed tokens: 58070138880 | elapsed time per iteration (s): 1.06 | learning rate: 2.613E-05 | global batch size: 256 | lm loss: 1.894959E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.306 | TFLOPs: 39.88 | 15: iteration 110770/ 125429 | consumed samples: 28357120 | consumed tokens: 58075381760 | elapsed time per iteration (s): 1.04 | learning rate: 2.612E-05 | global batch size: 256 | lm loss: 1.884368E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.148 | TFLOPs: 40.51 | 15: iteration 110780/ 125429 | consumed samples: 28359680 | consumed tokens: 58080624640 | elapsed time per iteration (s): 1.04 | learning rate: 2.611E-05 | global batch size: 256 | lm loss: 1.888668E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.704 | TFLOPs: 40.60 | 15: iteration 110790/ 125429 | consumed samples: 28362240 | consumed tokens: 58085867520 | elapsed time per iteration (s): 1.05 | learning rate: 2.610E-05 | global batch size: 256 | lm loss: 1.909533E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.664 | TFLOPs: 40.43 | 15: iteration 110800/ 125429 | 
consumed samples: 28364800 | consumed tokens: 58091110400 | elapsed time per iteration (s): 1.05 | learning rate: 2.609E-05 | global batch size: 256 | lm loss: 1.873586E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.825 | TFLOPs: 40.29 | 15: iteration 110810/ 125429 | consumed samples: 28367360 | consumed tokens: 58096353280 | elapsed time per iteration (s): 1.04 | learning rate: 2.609E-05 | global batch size: 256 | lm loss: 1.896640E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.043 | TFLOPs: 40.50 | 15: iteration 110820/ 125429 | consumed samples: 28369920 | consumed tokens: 58101596160 | elapsed time per iteration (s): 1.02 | learning rate: 2.608E-05 | global batch size: 256 | lm loss: 1.873510E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.780 | TFLOPs: 41.28 | 15: iteration 110830/ 125429 | consumed samples: 28372480 | consumed tokens: 58106839040 | elapsed time per iteration (s): 1.03 | learning rate: 2.607E-05 | global batch size: 256 | lm loss: 1.900582E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.386 | TFLOPs: 41.05 | 15: iteration 110840/ 125429 | consumed samples: 28375040 | consumed tokens: 58112081920 | elapsed time per iteration (s): 1.04 | learning rate: 2.606E-05 | global batch size: 256 | lm loss: 1.908630E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.794 | TFLOPs: 40.62 | 15: iteration 110850/ 125429 | consumed samples: 28377600 | consumed tokens: 58117324800 | elapsed time per iteration (s): 1.05 | learning rate: 2.605E-05 | global batch size: 256 | lm loss: 1.907568E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 244.607 | TFLOPs: 40.42 | 15: iteration 110860/ 125429 | consumed samples: 28380160 | consumed tokens: 58122567680 | elapsed time per iteration (s): 1.07 | learning rate: 2.604E-05 | global batch size: 256 | lm loss: 1.906226E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.356 | TFLOPs: 39.56 | 15: iteration 110870/ 125429 | consumed samples: 28382720 | consumed tokens: 58127810560 | elapsed time per iteration (s): 1.05 | learning rate: 2.604E-05 | global batch size: 256 | lm loss: 1.881450E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.488 | TFLOPs: 40.40 | 15: iteration 110880/ 125429 | consumed samples: 28385280 | consumed tokens: 58133053440 | elapsed time per iteration (s): 1.07 | learning rate: 2.603E-05 | global batch size: 256 | lm loss: 1.901569E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.050 | TFLOPs: 39.50 | 15: iteration 110890/ 125429 | consumed samples: 28387840 | consumed tokens: 58138296320 | elapsed time per iteration (s): 1.03 | learning rate: 2.602E-05 | global batch size: 256 | lm loss: 1.894323E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.316 | TFLOPs: 41.20 | 15: iteration 110900/ 125429 | consumed samples: 28390400 | consumed tokens: 58143539200 | elapsed time per iteration (s): 1.06 | learning rate: 2.601E-05 | global batch size: 256 | lm loss: 1.884130E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.523 | TFLOPs: 39.75 | 15: iteration 110910/ 125429 | consumed samples: 28392960 | consumed tokens: 58148782080 | elapsed time per iteration (s): 1.05 | learning rate: 
2.600E-05 | global batch size: 256 | lm loss: 1.901511E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.093 | TFLOPs: 40.34 | 15: iteration 110920/ 125429 | consumed samples: 28395520 | consumed tokens: 58154024960 | elapsed time per iteration (s): 1.03 | learning rate: 2.600E-05 | global batch size: 256 | lm loss: 1.904994E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.559 | TFLOPs: 41.08 | 15: iteration 110930/ 125429 | consumed samples: 28398080 | consumed tokens: 58159267840 | elapsed time per iteration (s): 1.04 | learning rate: 2.599E-05 | global batch size: 256 | lm loss: 1.867339E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.572 | TFLOPs: 40.75 | 15: iteration 110940/ 125429 | consumed samples: 28400640 | consumed tokens: 58164510720 | elapsed time per iteration (s): 1.03 | learning rate: 2.598E-05 | global batch size: 256 | lm loss: 1.889873E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.484 | TFLOPs: 41.06 | 15: iteration 110950/ 125429 | consumed samples: 28403200 | consumed tokens: 58169753600 | elapsed time per iteration (s): 1.05 | learning rate: 2.597E-05 | global batch size: 256 | lm loss: 1.865803E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.104 | TFLOPs: 40.17 | 15: iteration 110960/ 125429 | consumed samples: 28405760 | consumed tokens: 58174996480 | elapsed time per iteration (s): 1.03 | learning rate: 2.596E-05 | global batch size: 256 | lm loss: 1.898296E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.205 | TFLOPs: 41.02 | 15: iteration 110970/ 125429 | 
consumed samples: 28408320 | consumed tokens: 58180239360 | elapsed time per iteration (s): 1.05 | learning rate: 2.595E-05 | global batch size: 256 | lm loss: 1.946884E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.958 | TFLOPs: 40.32 | 15: iteration 110980/ 125429 | consumed samples: 28410880 | consumed tokens: 58185482240 | elapsed time per iteration (s): 1.03 | learning rate: 2.595E-05 | global batch size: 256 | lm loss: 1.907115E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.919 | TFLOPs: 40.97 | 15: iteration 110990/ 125429 | consumed samples: 28413440 | consumed tokens: 58190725120 | elapsed time per iteration (s): 1.07 | learning rate: 2.594E-05 | global batch size: 256 | lm loss: 1.905843E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.139 | TFLOPs: 39.52 | 15: iteration 111000/ 125429 | consumed samples: 28416000 | consumed tokens: 58195968000 | elapsed time per iteration (s): 1.06 | learning rate: 2.593E-05 | global batch size: 256 | lm loss: 1.897709E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.476 | TFLOPs: 40.07 | 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 111000 | lm loss value: 1.845954E+00 | lm loss PPL: 6.334139E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 111000 to checkpoints_1b5 0: [2022-11-27 04:53:02,166] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step111000 is begin to save! 
0: [2022-11-27 04:53:02,176] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_01-model_00-model_states.pt... 0: [2022-11-27 04:53:02,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_01-model_00-model_states.pt. 0: [2022-11-27 04:53:02,436] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_03-model_00-model_states.pt... 0: [2022-11-27 04:53:02,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_03-model_00-model_states.pt. 0: [2022-11-27 04:53:02,544] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_04-model_00-model_states.pt... 0: [2022-11-27 04:53:02,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_04-model_00-model_states.pt. 0: [2022-11-27 04:53:02,659] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_05-model_00-model_states.pt... 0: [2022-11-27 04:53:02,767] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_05-model_00-model_states.pt. 0: [2022-11-27 04:53:02,768] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_06-model_00-model_states.pt... 0: [2022-11-27 04:53:02,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_06-model_00-model_states.pt. 0: [2022-11-27 04:53:02,876] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_07-model_00-model_states.pt... 0: [2022-11-27 04:53:02,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_07-model_00-model_states.pt. 
0: [2022-11-27 04:53:02,987] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_08-model_00-model_states.pt... 0: [2022-11-27 04:53:03,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_08-model_00-model_states.pt. 0: [2022-11-27 04:53:03,100] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_09-model_00-model_states.pt... 0: [2022-11-27 04:53:03,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_09-model_00-model_states.pt. 0: [2022-11-27 04:53:03,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_10-model_00-model_states.pt... 0: [2022-11-27 04:53:03,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_10-model_00-model_states.pt. 0: [2022-11-27 04:53:03,318] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_11-model_00-model_states.pt... 0: [2022-11-27 04:53:03,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_11-model_00-model_states.pt. 0: [2022-11-27 04:53:03,433] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_12-model_00-model_states.pt... 0: [2022-11-27 04:53:03,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_12-model_00-model_states.pt. 0: [2022-11-27 04:53:03,548] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_13-model_00-model_states.pt... 0: [2022-11-27 04:53:03,663] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_13-model_00-model_states.pt. 
0: [2022-11-27 04:53:03,663] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_14-model_00-model_states.pt... 0: [2022-11-27 04:53:03,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_14-model_00-model_states.pt. 0: [2022-11-27 04:53:03,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_15-model_00-model_states.pt... 0: [2022-11-27 04:53:03,892] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_15-model_00-model_states.pt. 0: [2022-11-27 04:53:03,893] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_16-model_00-model_states.pt... 0: [2022-11-27 04:53:04,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_16-model_00-model_states.pt. 0: [2022-11-27 04:53:04,007] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_17-model_00-model_states.pt... 0: [2022-11-27 04:53:04,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_17-model_00-model_states.pt. 0: [2022-11-27 04:53:04,122] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_18-model_00-model_states.pt... 0: [2022-11-27 04:53:04,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_18-model_00-model_states.pt. 0: [2022-11-27 04:53:04,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_19-model_00-model_states.pt... 0: [2022-11-27 04:53:04,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_19-model_00-model_states.pt. 
0: [2022-11-27 04:53:04,348] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_20-model_00-model_states.pt... 0: [2022-11-27 04:53:04,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_20-model_00-model_states.pt. 0: [2022-11-27 04:53:04,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_21-model_00-model_states.pt... 0: [2022-11-27 04:53:04,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_21-model_00-model_states.pt. 0: [2022-11-27 04:53:04,573] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_22-model_00-model_states.pt... 0: [2022-11-27 04:53:04,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_22-model_00-model_states.pt. 0: [2022-11-27 04:53:04,682] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_23-model_00-model_states.pt... 0: [2022-11-27 04:53:04,793] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_23-model_00-model_states.pt. 0: [2022-11-27 04:53:04,794] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_24-model_00-model_states.pt... 0: [2022-11-27 04:53:04,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_24-model_00-model_states.pt. 0: [2022-11-27 04:53:04,905] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_25-model_00-model_states.pt... 0: [2022-11-27 04:53:05,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_25-model_00-model_states.pt. 
0: [2022-11-27 04:53:05,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_26-model_00-model_states.pt... 0: [2022-11-27 04:53:05,125] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_26-model_00-model_states.pt. 0: [2022-11-27 04:53:05,126] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_27-model_00-model_states.pt... 0: [2022-11-27 04:53:05,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_27-model_00-model_states.pt. 0: [2022-11-27 04:53:05,234] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_28-model_00-model_states.pt... 0: [2022-11-27 04:53:05,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_28-model_00-model_states.pt. 0: [2022-11-27 04:53:05,346] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_29-model_00-model_states.pt... 0: [2022-11-27 04:53:05,454] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_29-model_00-model_states.pt. 0: [2022-11-27 04:53:05,454] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_30-model_00-model_states.pt... 0: [2022-11-27 04:53:05,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_30-model_00-model_states.pt. 0: [2022-11-27 04:53:05,568] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/layer_32-model_00-model_states.pt... 0: [2022-11-27 04:53:05,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/layer_32-model_00-model_states.pt. 
0: [2022-11-27 04:53:05,570] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step111000/mp_rank_00_model_states.pt 0: [2022-11-27 04:53:05,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/mp_rank_00_model_states.pt... 0: [2022-11-27 04:53:05,575] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/mp_rank_00_model_states.pt. 0: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
15: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
9: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 
7: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
13: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 
3: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
11: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 7: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
7: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 9: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 6: [2022-11-27 04:53:05,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step111000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 0: [2022-11-27 04:53:05,769] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
0: [2022-11-27 04:53:05,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:53:05,776] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 04:53:05,776] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 11: [2022-11-27 04:53:05,784] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:53:05,786] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:53:05,786] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 04:53:05,787] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 0: [2022-11-27 04:53:05,788] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:53:05,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 04:53:05,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 0: [2022-11-27 04:53:05,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-27 04:53:05,789] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 04:53:05,789] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 0: [2022-11-27 04:53:05,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:53:05,794] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 04:53:05,794] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 11: [2022-11-27 04:53:05,784] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 04:53:05,784] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 11: [2022-11-27 04:53:05,795] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:53:05,796] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 04:53:05,796] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 11: [2022-11-27 04:53:05,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 
11: [2022-11-27 04:53:05,799] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 04:53:05,800] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 0: [2022-11-27 04:53:05,805] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:53:05,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 04:53:05,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 11: [2022-11-27 04:53:05,806] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:53:05,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:53:05,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 04:53:05,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 04:53:05,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 11: [2022-11-27 04:53:05,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 1: [2022-11-27 04:53:05,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
11: [2022-11-27 04:53:05,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:53:05,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 04:53:05,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 11: [2022-11-27 04:53:05,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:53:05,811] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 04:53:05,811] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 5: [2022-11-27 04:53:05,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:53:05,813] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:53:05,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 04:53:05,813] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 5: [2022-11-27 04:53:05,813] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 04:53:05,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
5: [2022-11-27 04:53:05,814] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:53:05,814] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 04:53:05,814] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 3: [2022-11-27 04:53:05,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:53:05,815] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 04:53:05,815] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 0: [2022-11-27 04:53:05,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:53:05,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 0: [2022-11-27 04:53:05,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 9: [2022-11-27 04:53:05,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:53:05,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 0: [2022-11-27 04:53:05,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
9: [2022-11-27 04:53:05,818] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 04:53:05,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 9: [2022-11-27 04:53:05,818] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 11: [2022-11-27 04:53:05,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 04:53:05,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 04:53:05,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 3: [2022-11-27 04:53:05,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:53:05,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 04:53:05,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 1: [2022-11-27 04:53:05,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 04:53:05,808] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 1: [2022-11-27 04:53:05,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-27 04:53:05,812] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 04:53:05,812] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 1: [2022-11-27 04:53:05,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:53:05,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:53:05,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 04:53:05,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 9: [2022-11-27 04:53:05,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:53:05,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:53:05,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 04:53:05,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 3: [2022-11-27 04:53:05,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 04:53:05,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
3: [2022-11-27 04:53:05,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:53:05,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 04:53:05,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 2: [2022-11-27 04:53:05,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:53:05,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 04:53:05,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 5: [2022-11-27 04:53:05,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:53:05,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 04:53:05,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 4: [2022-11-27 04:53:05,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:53:05,823] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 04:53:05,823] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
4: [2022-11-27 04:53:05,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:53:05,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 04:53:05,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 4: [2022-11-27 04:53:05,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:53:05,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 04:53:05,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 3: [2022-11-27 04:53:05,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:53:05,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 04:53:05,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 7: [2022-11-27 04:53:05,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:53:05,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 04:53:05,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
2: [2022-11-27 04:53:05,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:53:05,830] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 04:53:05,830] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 3: [2022-11-27 04:53:05,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:53:05,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 12: [2022-11-27 04:53:05,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:53:05,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 12: [2022-11-27 04:53:05,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 04:53:05,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 6: [2022-11-27 04:53:05,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:53:05,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:53:05,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-27 04:53:05,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:53:05,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 04:53:05,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 04:53:05,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 04:53:05,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 04:53:05,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 6: [2022-11-27 04:53:05,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 6: [2022-11-27 04:53:05,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 1: [2022-11-27 04:53:05,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 6: [2022-11-27 04:53:05,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 1: [2022-11-27 04:53:05,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 1: [2022-11-27 04:53:05,820] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-27 04:53:05,820] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 04:53:05,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 1: [2022-11-27 04:53:05,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:53:05,821] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 04:53:05,821] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 1: [2022-11-27 04:53:05,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:53:05,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 6: [2022-11-27 04:53:05,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 1: [2022-11-27 04:53:05,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 6: [2022-11-27 04:53:05,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 1: [2022-11-27 04:53:05,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-27 04:53:05,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 6: [2022-11-27 04:53:05,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 1: [2022-11-27 04:53:05,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 1: [2022-11-27 04:53:05,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:53:05,807] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:53:05,807] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 04:53:05,807] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 14: [2022-11-27 04:53:05,810] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:53:05,810] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 04:53:05,810] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 14: [2022-11-27 04:53:05,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
14: [2022-11-27 04:53:05,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 04:53:05,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 14: [2022-11-27 04:53:05,819] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:53:05,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 04:53:05,820] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 14: [2022-11-27 04:53:05,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:53:05,822] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 04:53:05,822] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 14: [2022-11-27 04:53:05,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:53:05,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 04:53:05,825] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 14: [2022-11-27 04:53:05,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-27 04:53:05,827] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 04:53:05,827] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 14: [2022-11-27 04:53:05,831] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 04:53:05,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 04:53:05,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 1: [2022-11-27 04:53:05,831] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 04:53:05,831] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 9: [2022-11-27 04:53:05,833] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:53:05,833] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 04:53:05,833] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 12: [2022-11-27 04:53:05,834] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-27 04:53:05,834] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 04:53:05,834] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 9: [2022-11-27 04:53:05,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:53:05,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 04:53:05,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 9: [2022-11-27 04:53:05,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:53:05,837] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 04:53:05,837] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 13: [2022-11-27 04:53:05,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 9: [2022-11-27 04:53:05,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
13: [2022-11-27 04:53:05,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 9: [2022-11-27 04:53:05,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 13: [2022-11-27 04:53:05,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 9: [2022-11-27 04:53:05,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 3: [2022-11-27 04:53:05,838] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:53:05,838] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 04:53:05,838] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 2: [2022-11-27 04:53:05,839] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:53:05,839] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 04:53:05,839] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 7: [2022-11-27 04:53:05,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-27 04:53:05,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 04:53:05,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 5: [2022-11-27 04:53:05,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:53:05,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 04:53:05,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 5: [2022-11-27 04:53:05,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:53:05,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 04:53:05,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 3: [2022-11-27 04:53:05,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 04:53:05,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 04:53:05,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 12: [2022-11-27 04:53:05,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-27 04:53:05,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:53:05,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:53:05,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 04:53:05,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 04:53:05,842] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 04:53:05,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 12: [2022-11-27 04:53:05,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 12: [2022-11-27 04:53:05,842] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 2: [2022-11-27 04:53:05,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:53:05,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 04:53:05,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 12: [2022-11-27 04:53:05,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-27 04:53:05,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 04:53:05,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 13: [2022-11-27 04:53:05,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:53:05,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:53:05,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 04:53:05,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 04:53:05,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 13: [2022-11-27 04:53:05,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 13: [2022-11-27 04:53:05,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:53:05,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 04:53:05,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 13: [2022-11-27 04:53:05,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-27 04:53:05,848] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 04:53:05,848] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 12: [2022-11-27 04:53:05,854] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:53:05,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 04:53:05,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 13: [2022-11-27 04:53:05,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:53:05,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:53:05,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 04:53:05,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 2: [2022-11-27 04:53:05,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 04:53:05,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 2: [2022-11-27 04:53:05,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-27 04:53:05,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 04:53:05,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 2: [2022-11-27 04:53:05,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:53:05,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 04:53:05,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 2: [2022-11-27 04:53:05,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 04:53:05,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 04:53:05,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 12: [2022-11-27 04:53:05,856] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 04:53:05,856] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 04:53:05,856] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 13: [2022-11-27 04:53:05,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-27 04:53:05,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 04:53:05,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 04:53:05,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 04:53:05,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 13: [2022-11-27 04:53:05,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 4: [2022-11-27 04:53:05,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:53:05,825] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 04:53:05,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 4: [2022-11-27 04:53:05,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:53:05,826] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 04:53:05,826] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 4: [2022-11-27 04:53:05,835] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-27 04:53:05,835] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 04:53:05,835] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 4: [2022-11-27 04:53:05,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:53:05,836] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 04:53:05,836] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 4: [2022-11-27 04:53:05,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 04:53:05,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 04:53:05,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 7: [2022-11-27 04:53:05,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:53:05,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 04:53:05,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:53:05,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-27 04:53:05,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 7: [2022-11-27 04:53:05,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 04:53:05,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 7: [2022-11-27 04:53:05,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 04:53:05,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 8: [2022-11-27 04:53:05,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:53:05,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:53:05,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 04:53:05,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 04:53:05,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:53:05,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 8: [2022-11-27 04:53:05,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
8: [2022-11-27 04:53:05,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 04:53:05,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 8: [2022-11-27 04:53:05,871] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:53:05,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 04:53:05,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 7: [2022-11-27 04:53:05,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:53:05,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 04:53:05,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 7: [2022-11-27 04:53:05,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 04:53:05,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 04:53:05,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 5: [2022-11-27 04:53:05,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
7: [2022-11-27 04:53:05,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:53:05,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 7: [2022-11-27 04:53:05,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 5: [2022-11-27 04:53:05,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 7: [2022-11-27 04:53:05,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 5: [2022-11-27 04:53:05,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 04:53:05,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 04:53:05,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 8: [2022-11-27 04:53:05,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:53:05,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 04:53:05,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 8: [2022-11-27 04:53:05,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-27 04:53:05,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 04:53:05,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 8: [2022-11-27 04:53:05,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:53:05,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 04:53:05,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 6: [2022-11-27 04:53:05,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 8: [2022-11-27 04:53:05,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:53:05,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 04:53:05,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 6: [2022-11-27 04:53:05,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:53:05,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 04:53:05,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
6: [2022-11-27 04:53:05,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 04:53:05,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 04:53:05,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 8: [2022-11-27 04:53:05,885] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 04:53:05,885] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 10: [2022-11-27 04:53:05,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:53:05,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:53:05,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 04:53:05,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 04:53:05,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:53:05,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 10: [2022-11-27 04:53:05,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
10: [2022-11-27 04:53:05,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 04:53:05,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 0: [2022-11-27 04:53:05,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 04:53:05,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 10: [2022-11-27 04:53:05,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:53:05,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 04:53:05,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 10: [2022-11-27 04:53:05,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:53:05,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:53:05,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 04:53:05,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 04:53:05,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-27 04:53:05,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 04:53:05,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 10: [2022-11-27 04:53:05,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 10: [2022-11-27 04:53:05,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 04:53:05,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 04:53:05,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 10: [2022-11-27 04:53:05,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 15: [2022-11-27 04:53:05,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:53:05,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:53:05,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:53:05,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:53:05,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-27 04:53:05,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:53:05,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 04:53:05,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:53:05,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 15: [2022-11-27 04:53:05,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 04:53:05,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 04:53:05,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 04:53:05,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 04:53:05,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 04:53:05,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 04:53:05,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
15: [2022-11-27 04:53:05,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 15: [2022-11-27 04:53:05,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 15: [2022-11-27 04:53:05,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 15: [2022-11-27 04:53:05,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 15: [2022-11-27 04:53:05,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 15: [2022-11-27 04:53:05,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 04:53:05,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step111000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 04:53:05,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step111000 is ready now! 
0: successfully saved checkpoint at iteration 111000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3850.41 15: iteration 111010/ 125429 | consumed samples: 28418560 | consumed tokens: 58201210880 | elapsed time per iteration (s): 1.45 | learning rate: 2.592E-05 | global batch size: 256 | lm loss: 1.922323E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 176.046 | TFLOPs: 29.09 | 15: iteration 111020/ 125429 | consumed samples: 28421120 | consumed tokens: 58206453760 | elapsed time per iteration (s): 1.10 | learning rate: 2.591E-05 | global batch size: 256 | lm loss: 1.911150E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.652 | TFLOPs: 38.45 | 15: iteration 111030/ 125429 | consumed samples: 28423680 | consumed tokens: 58211696640 | elapsed time per iteration (s): 1.05 | learning rate: 2.591E-05 | global batch size: 256 | lm loss: 1.907941E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.917 | TFLOPs: 40.14 | 15: iteration 111040/ 125429 | consumed samples: 28426240 | consumed tokens: 58216939520 | elapsed time per iteration (s): 1.05 | learning rate: 2.590E-05 | global batch size: 256 | lm loss: 1.896302E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.673 | TFLOPs: 40.43 | 15: iteration 111050/ 125429 | consumed samples: 28428800 | consumed tokens: 58222182400 | elapsed time per iteration (s): 1.03 | learning rate: 2.589E-05 | global batch size: 256 | lm loss: 1.868616E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.536 | TFLOPs: 40.91 | 15: iteration 111060/ 125429 | consumed samples: 28431360 | consumed tokens: 58227425280 | elapsed time per iteration (s): 
1.03 | learning rate: 2.588E-05 | global batch size: 256 | lm loss: 1.871888E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.892 | TFLOPs: 40.97 | 15: iteration 111070/ 125429 | consumed samples: 28433920 | consumed tokens: 58232668160 | elapsed time per iteration (s): 1.05 | learning rate: 2.587E-05 | global batch size: 256 | lm loss: 1.888105E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.471 | TFLOPs: 40.40 | 15: iteration 111080/ 125429 | consumed samples: 28436480 | consumed tokens: 58237911040 | elapsed time per iteration (s): 1.04 | learning rate: 2.587E-05 | global batch size: 256 | lm loss: 1.891501E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.632 | TFLOPs: 40.76 | 15: iteration 111090/ 125429 | consumed samples: 28439040 | consumed tokens: 58243153920 | elapsed time per iteration (s): 1.06 | learning rate: 2.586E-05 | global batch size: 256 | lm loss: 1.903867E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.212 | TFLOPs: 40.03 | 15: iteration 111100/ 125429 | consumed samples: 28441600 | consumed tokens: 58248396800 | elapsed time per iteration (s): 2.88 | learning rate: 2.585E-05 | global batch size: 256 | lm loss: 1.923142E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 88.986 | TFLOPs: 14.71 | 15: iteration 111110/ 125429 | consumed samples: 28444160 | consumed tokens: 58253639680 | elapsed time per iteration (s): 1.02 | learning rate: 2.584E-05 | global batch size: 256 | lm loss: 1.908072E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.281 | TFLOPs: 41.36 | 15: 
iteration 111120/ 125429 | consumed samples: 28446720 | consumed tokens: 58258882560 | elapsed time per iteration (s): 1.02 | learning rate: 2.583E-05 | global batch size: 256 | lm loss: 1.853714E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.043 | TFLOPs: 41.49 | 15: iteration 111130/ 125429 | consumed samples: 28449280 | consumed tokens: 58264125440 | elapsed time per iteration (s): 1.04 | learning rate: 2.583E-05 | global batch size: 256 | lm loss: 1.903248E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.233 | TFLOPs: 40.53 | 15: iteration 111140/ 125429 | consumed samples: 28451840 | consumed tokens: 58269368320 | elapsed time per iteration (s): 1.02 | learning rate: 2.582E-05 | global batch size: 256 | lm loss: 1.907228E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.941 | TFLOPs: 41.30 | 15: iteration 111150/ 125429 | consumed samples: 28454400 | consumed tokens: 58274611200 | elapsed time per iteration (s): 1.04 | learning rate: 2.581E-05 | global batch size: 256 | lm loss: 1.889020E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.833 | TFLOPs: 40.79 | 15: iteration 111160/ 125429 | consumed samples: 28456960 | consumed tokens: 58279854080 | elapsed time per iteration (s): 1.05 | learning rate: 2.580E-05 | global batch size: 256 | lm loss: 1.871941E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.876 | TFLOPs: 40.14 | 15: iteration 111170/ 125429 | consumed samples: 28459520 | consumed tokens: 58285096960 | elapsed time per iteration (s): 1.04 | learning rate: 2.579E-05 | global batch size: 256 | lm loss: 1.921906E+00 | grad norm: 0.157 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.604 | TFLOPs: 40.75 | 15: iteration 111180/ 125429 | consumed samples: 28462080 | consumed tokens: 58290339840 | elapsed time per iteration (s): 1.05 | learning rate: 2.579E-05 | global batch size: 256 | lm loss: 1.886854E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.884 | TFLOPs: 40.47 | 15: iteration 111190/ 125429 | consumed samples: 28464640 | consumed tokens: 58295582720 | elapsed time per iteration (s): 1.05 | learning rate: 2.578E-05 | global batch size: 256 | lm loss: 1.874764E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.178 | TFLOPs: 40.35 | 15: iteration 111200/ 125429 | consumed samples: 28467200 | consumed tokens: 58300825600 | elapsed time per iteration (s): 1.07 | learning rate: 2.577E-05 | global batch size: 256 | lm loss: 1.907927E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.730 | TFLOPs: 39.45 | 15: iteration 111210/ 125429 | consumed samples: 28469760 | consumed tokens: 58306068480 | elapsed time per iteration (s): 1.07 | learning rate: 2.576E-05 | global batch size: 256 | lm loss: 1.862717E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.007 | TFLOPs: 39.50 | 15: iteration 111220/ 125429 | consumed samples: 28472320 | consumed tokens: 58311311360 | elapsed time per iteration (s): 1.08 | learning rate: 2.575E-05 | global batch size: 256 | lm loss: 1.913374E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.752 | TFLOPs: 39.13 | 15: iteration 111230/ 125429 | consumed samples: 28474880 | consumed tokens: 58316554240 | elapsed time per iteration (s): 1.03 | 
learning rate: 2.574E-05 | global batch size: 256 | lm loss: 1.886706E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.585 | TFLOPs: 41.25 | 15: iteration 111240/ 125429 | consumed samples: 28477440 | consumed tokens: 58321797120 | elapsed time per iteration (s): 1.03 | learning rate: 2.574E-05 | global batch size: 256 | lm loss: 1.925797E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.672 | TFLOPs: 40.93 | 15: iteration 111250/ 125429 | consumed samples: 28480000 | consumed tokens: 58327040000 | elapsed time per iteration (s): 1.03 | learning rate: 2.573E-05 | global batch size: 256 | lm loss: 1.887046E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.397 | TFLOPs: 40.88 | 15: iteration 111260/ 125429 | consumed samples: 28482560 | consumed tokens: 58332282880 | elapsed time per iteration (s): 1.04 | learning rate: 2.572E-05 | global batch size: 256 | lm loss: 1.890226E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.615 | TFLOPs: 40.76 | 15: iteration 111270/ 125429 | consumed samples: 28485120 | consumed tokens: 58337525760 | elapsed time per iteration (s): 1.04 | learning rate: 2.571E-05 | global batch size: 256 | lm loss: 1.884318E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.923 | TFLOPs: 40.81 | 15: iteration 111280/ 125429 | consumed samples: 28487680 | consumed tokens: 58342768640 | elapsed time per iteration (s): 1.03 | learning rate: 2.571E-05 | global batch size: 256 | lm loss: 1.886567E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.421 | TFLOPs: 41.05 | 15: iteration 
111290/ 125429 | consumed samples: 28490240 | consumed tokens: 58348011520 | elapsed time per iteration (s): 1.06 | learning rate: 2.570E-05 | global batch size: 256 | lm loss: 1.870869E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.106 | TFLOPs: 39.84 | 15: iteration 111300/ 125429 | consumed samples: 28492800 | consumed tokens: 58353254400 | elapsed time per iteration (s): 1.05 | learning rate: 2.569E-05 | global batch size: 256 | lm loss: 1.903486E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.238 | TFLOPs: 40.36 | 15: iteration 111310/ 125429 | consumed samples: 28495360 | consumed tokens: 58358497280 | elapsed time per iteration (s): 1.08 | learning rate: 2.568E-05 | global batch size: 256 | lm loss: 1.877228E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.003 | TFLOPs: 39.00 | 15: iteration 111320/ 125429 | consumed samples: 28497920 | consumed tokens: 58363740160 | elapsed time per iteration (s): 1.04 | learning rate: 2.567E-05 | global batch size: 256 | lm loss: 1.884349E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.033 | TFLOPs: 40.66 | 15: iteration 111330/ 125429 | consumed samples: 28500480 | consumed tokens: 58368983040 | elapsed time per iteration (s): 1.03 | learning rate: 2.567E-05 | global batch size: 256 | lm loss: 1.892465E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.119 | TFLOPs: 41.17 | 15: iteration 111340/ 125429 | consumed samples: 28503040 | consumed tokens: 58374225920 | elapsed time per iteration (s): 1.06 | learning rate: 2.566E-05 | global batch size: 256 | lm loss: 1.896248E+00 | grad norm: 0.161 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.384 | TFLOPs: 39.73 | 15: iteration 111350/ 125429 | consumed samples: 28505600 | consumed tokens: 58379468800 | elapsed time per iteration (s): 1.07 | learning rate: 2.565E-05 | global batch size: 256 | lm loss: 1.890675E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.222 | TFLOPs: 39.53 | 15: iteration 111360/ 125429 | consumed samples: 28508160 | consumed tokens: 58384711680 | elapsed time per iteration (s): 1.07 | learning rate: 2.564E-05 | global batch size: 256 | lm loss: 1.909107E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.249 | TFLOPs: 39.37 | 15: iteration 111370/ 125429 | consumed samples: 28510720 | consumed tokens: 58389954560 | elapsed time per iteration (s): 1.06 | learning rate: 2.563E-05 | global batch size: 256 | lm loss: 1.896130E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.810 | TFLOPs: 39.96 | 15: iteration 111380/ 125429 | consumed samples: 28513280 | consumed tokens: 58395197440 | elapsed time per iteration (s): 1.04 | learning rate: 2.563E-05 | global batch size: 256 | lm loss: 1.915495E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.234 | TFLOPs: 40.69 | 15: iteration 111390/ 125429 | consumed samples: 28515840 | consumed tokens: 58400440320 | elapsed time per iteration (s): 1.07 | learning rate: 2.562E-05 | global batch size: 256 | lm loss: 1.906362E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.127 | TFLOPs: 39.68 | 15: iteration 111400/ 125429 | consumed samples: 28518400 | consumed tokens: 58405683200 | elapsed time per iteration (s): 1.04 | learning 
rate: 2.561E-05 | global batch size: 256 | lm loss: 1.871033E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.254 | TFLOPs: 40.53 | 15: iteration 111410/ 125429 | consumed samples: 28520960 | consumed tokens: 58410926080 | elapsed time per iteration (s): 1.03 | learning rate: 2.560E-05 | global batch size: 256 | lm loss: 1.890638E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.124 | TFLOPs: 41.17 | 15: iteration 111420/ 125429 | consumed samples: 28523520 | consumed tokens: 58416168960 | elapsed time per iteration (s): 1.23 | learning rate: 2.559E-05 | global batch size: 256 | lm loss: 1.884721E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 208.152 | TFLOPs: 34.40 | 15: iteration 111430/ 125429 | consumed samples: 28526080 | consumed tokens: 58421411840 | elapsed time per iteration (s): 1.04 | learning rate: 2.559E-05 | global batch size: 256 | lm loss: 1.928248E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.668 | TFLOPs: 40.60 | 15: iteration 111440/ 125429 | consumed samples: 28528640 | consumed tokens: 58426654720 | elapsed time per iteration (s): 1.04 | learning rate: 2.558E-05 | global batch size: 256 | lm loss: 1.881227E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.331 | TFLOPs: 40.87 | 15: iteration 111450/ 125429 | consumed samples: 28531200 | consumed tokens: 58431897600 | elapsed time per iteration (s): 1.04 | learning rate: 2.557E-05 | global batch size: 256 | lm loss: 1.886275E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.088 | TFLOPs: 40.67 | 15: iteration 111460/ 
125429 | consumed samples: 28533760 | consumed tokens: 58437140480 | elapsed time per iteration (s): 1.04 | learning rate: 2.556E-05 | global batch size: 256 | lm loss: 1.885796E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.161 | TFLOPs: 40.51 | 15: iteration 111470/ 125429 | consumed samples: 28536320 | consumed tokens: 58442383360 | elapsed time per iteration (s): 1.03 | learning rate: 2.555E-05 | global batch size: 256 | lm loss: 1.908378E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.441 | TFLOPs: 41.06 | 15: iteration 111480/ 125429 | consumed samples: 28538880 | consumed tokens: 58447626240 | elapsed time per iteration (s): 1.05 | learning rate: 2.555E-05 | global batch size: 256 | lm loss: 1.922425E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.716 | TFLOPs: 40.44 | 15: iteration 111490/ 125429 | consumed samples: 28541440 | consumed tokens: 58452869120 | elapsed time per iteration (s): 1.05 | learning rate: 2.554E-05 | global batch size: 256 | lm loss: 1.891260E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.401 | TFLOPs: 40.22 | 15: iteration 111500/ 125429 | consumed samples: 28544000 | consumed tokens: 58458112000 | elapsed time per iteration (s): 1.04 | learning rate: 2.553E-05 | global batch size: 256 | lm loss: 1.927214E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.698 | TFLOPs: 40.77 | 15: iteration 111510/ 125429 | consumed samples: 28546560 | consumed tokens: 58463354880 | elapsed time per iteration (s): 1.07 | learning rate: 2.552E-05 | global batch size: 256 | lm loss: 1.917905E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 240.223 | TFLOPs: 39.70 | 15: iteration 111520/ 125429 | consumed samples: 28549120 | consumed tokens: 58468597760 | elapsed time per iteration (s): 1.05 | learning rate: 2.552E-05 | global batch size: 256 | lm loss: 1.885743E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.154 | TFLOPs: 40.35 | 15: iteration 111530/ 125429 | consumed samples: 28551680 | consumed tokens: 58473840640 | elapsed time per iteration (s): 1.03 | learning rate: 2.551E-05 | global batch size: 256 | lm loss: 1.881679E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.588 | TFLOPs: 40.92 | 15: iteration 111540/ 125429 | consumed samples: 28554240 | consumed tokens: 58479083520 | elapsed time per iteration (s): 1.07 | learning rate: 2.550E-05 | global batch size: 256 | lm loss: 1.887644E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.844 | TFLOPs: 39.47 | 15: iteration 111550/ 125429 | consumed samples: 28556800 | consumed tokens: 58484326400 | elapsed time per iteration (s): 1.06 | learning rate: 2.549E-05 | global batch size: 256 | lm loss: 1.894917E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.556 | TFLOPs: 39.75 | 15: iteration 111560/ 125429 | consumed samples: 28559360 | consumed tokens: 58489569280 | elapsed time per iteration (s): 1.03 | learning rate: 2.548E-05 | global batch size: 256 | lm loss: 1.892500E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.049 | TFLOPs: 40.99 | 15: iteration 111570/ 125429 | consumed samples: 28561920 | consumed tokens: 58494812160 | elapsed time per iteration (s): 1.03 | learning rate: 
2.548E-05 | global batch size: 256 | lm loss: 1.919319E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.381 | TFLOPs: 40.88 | 15: iteration 111580/ 125429 | consumed samples: 28564480 | consumed tokens: 58500055040 | elapsed time per iteration (s): 1.06 | learning rate: 2.547E-05 | global batch size: 256 | lm loss: 1.883113E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.610 | TFLOPs: 39.93 | 15: iteration 111590/ 125429 | consumed samples: 28567040 | consumed tokens: 58505297920 | elapsed time per iteration (s): 1.04 | learning rate: 2.546E-05 | global batch size: 256 | lm loss: 1.893071E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.497 | TFLOPs: 40.74 | 15: iteration 111600/ 125429 | consumed samples: 28569600 | consumed tokens: 58510540800 | elapsed time per iteration (s): 1.03 | learning rate: 2.545E-05 | global batch size: 256 | lm loss: 1.904974E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.391 | TFLOPs: 40.88 | 15: iteration 111610/ 125429 | consumed samples: 28572160 | consumed tokens: 58515783680 | elapsed time per iteration (s): 1.04 | learning rate: 2.544E-05 | global batch size: 256 | lm loss: 1.902150E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.775 | TFLOPs: 40.78 | 15: iteration 111620/ 125429 | consumed samples: 28574720 | consumed tokens: 58521026560 | elapsed time per iteration (s): 1.02 | learning rate: 2.544E-05 | global batch size: 256 | lm loss: 1.894841E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.108 | TFLOPs: 41.50 | 15: iteration 111630/ 125429 | 
consumed samples: 28577280 | consumed tokens: 58526269440 | elapsed time per iteration (s): 1.09 | learning rate: 2.543E-05 | global batch size: 256 | lm loss: 1.878141E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.928 | TFLOPs: 38.66 | 15: iteration 111640/ 125429 | consumed samples: 28579840 | consumed tokens: 58531512320 | elapsed time per iteration (s): 1.06 | learning rate: 2.542E-05 | global batch size: 256 | lm loss: 1.905048E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.625 | TFLOPs: 39.77 | 15: iteration 111650/ 125429 | consumed samples: 28582400 | consumed tokens: 58536755200 | elapsed time per iteration (s): 1.06 | learning rate: 2.541E-05 | global batch size: 256 | lm loss: 1.891108E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.265 | TFLOPs: 40.04 | 15: iteration 111660/ 125429 | consumed samples: 28584960 | consumed tokens: 58541998080 | elapsed time per iteration (s): 1.04 | learning rate: 2.541E-05 | global batch size: 256 | lm loss: 1.851685E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.990 | TFLOPs: 40.65 | 15: iteration 111670/ 125429 | consumed samples: 28587520 | consumed tokens: 58547240960 | elapsed time per iteration (s): 1.04 | learning rate: 2.540E-05 | global batch size: 256 | lm loss: 1.890787E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.718 | TFLOPs: 40.77 | 15: iteration 111680/ 125429 | consumed samples: 28590080 | consumed tokens: 58552483840 | elapsed time per iteration (s): 1.07 | learning rate: 2.539E-05 | global batch size: 256 | lm loss: 1.914212E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 240.143 | TFLOPs: 39.69 | 15: iteration 111690/ 125429 | consumed samples: 28592640 | consumed tokens: 58557726720 | elapsed time per iteration (s): 1.05 | learning rate: 2.538E-05 | global batch size: 256 | lm loss: 1.889322E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.558 | TFLOPs: 40.42 | 15: iteration 111700/ 125429 | consumed samples: 28595200 | consumed tokens: 58562969600 | elapsed time per iteration (s): 1.06 | learning rate: 2.537E-05 | global batch size: 256 | lm loss: 1.882951E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.025 | TFLOPs: 39.83 | 15: iteration 111710/ 125429 | consumed samples: 28597760 | consumed tokens: 58568212480 | elapsed time per iteration (s): 1.10 | learning rate: 2.537E-05 | global batch size: 256 | lm loss: 1.885672E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.691 | TFLOPs: 38.45 | 15: iteration 111720/ 125429 | consumed samples: 28600320 | consumed tokens: 58573455360 | elapsed time per iteration (s): 1.08 | learning rate: 2.536E-05 | global batch size: 256 | lm loss: 1.895486E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.095 | TFLOPs: 39.18 | 15: iteration 111730/ 125429 | consumed samples: 28602880 | consumed tokens: 58578698240 | elapsed time per iteration (s): 1.04 | learning rate: 2.535E-05 | global batch size: 256 | lm loss: 1.904897E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.553 | TFLOPs: 40.74 | 15: iteration 111740/ 125429 | consumed samples: 28605440 | consumed tokens: 58583941120 | elapsed time per iteration (s): 1.08 | learning rate: 
2.534E-05 | global batch size: 256 | lm loss: 1.901219E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.098 | TFLOPs: 39.35 | 15: iteration 111750/ 125429 | consumed samples: 28608000 | consumed tokens: 58589184000 | elapsed time per iteration (s): 1.06 | learning rate: 2.534E-05 | global batch size: 256 | lm loss: 1.912760E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.872 | TFLOPs: 39.97 | 15: iteration 111760/ 125429 | consumed samples: 28610560 | consumed tokens: 58594426880 | elapsed time per iteration (s): 1.03 | learning rate: 2.533E-05 | global batch size: 256 | lm loss: 1.894925E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.721 | TFLOPs: 40.94 | 15: iteration 111770/ 125429 | consumed samples: 28613120 | consumed tokens: 58599669760 | elapsed time per iteration (s): 1.06 | learning rate: 2.532E-05 | global batch size: 256 | lm loss: 1.931991E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.500 | TFLOPs: 40.08 | 15: iteration 111780/ 125429 | consumed samples: 28615680 | consumed tokens: 58604912640 | elapsed time per iteration (s): 1.03 | learning rate: 2.531E-05 | global batch size: 256 | lm loss: 1.881473E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.216 | TFLOPs: 41.02 | 15: iteration 111790/ 125429 | consumed samples: 28618240 | consumed tokens: 58610155520 | elapsed time per iteration (s): 1.04 | learning rate: 2.531E-05 | global batch size: 256 | lm loss: 1.884909E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.010 | TFLOPs: 40.49 | 15: iteration 111800/ 125429 | 
consumed samples: 28620800 | consumed tokens: 58615398400 | elapsed time per iteration (s): 1.06 | learning rate: 2.530E-05 | global batch size: 256 | lm loss: 1.861087E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.239 | TFLOPs: 39.87 | 15: iteration 111810/ 125429 | consumed samples: 28623360 | consumed tokens: 58620641280 | elapsed time per iteration (s): 1.04 | learning rate: 2.529E-05 | global batch size: 256 | lm loss: 1.907330E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.470 | TFLOPs: 40.73 | 15: iteration 111820/ 125429 | consumed samples: 28625920 | consumed tokens: 58625884160 | elapsed time per iteration (s): 1.09 | learning rate: 2.528E-05 | global batch size: 256 | lm loss: 1.914048E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.828 | TFLOPs: 38.97 | 15: iteration 111830/ 125429 | consumed samples: 28628480 | consumed tokens: 58631127040 | elapsed time per iteration (s): 1.05 | learning rate: 2.527E-05 | global batch size: 256 | lm loss: 1.885187E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.475 | TFLOPs: 40.24 | 15: iteration 111840/ 125429 | consumed samples: 28631040 | consumed tokens: 58636369920 | elapsed time per iteration (s): 1.08 | learning rate: 2.527E-05 | global batch size: 256 | lm loss: 1.884357E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.749 | TFLOPs: 39.12 | 15: iteration 111850/ 125429 | consumed samples: 28633600 | consumed tokens: 58641612800 | elapsed time per iteration (s): 1.06 | learning rate: 2.526E-05 | global batch size: 256 | lm loss: 1.856426E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 242.397 | TFLOPs: 40.06 | 15: iteration 111860/ 125429 | consumed samples: 28636160 | consumed tokens: 58646855680 | elapsed time per iteration (s): 1.07 | learning rate: 2.525E-05 | global batch size: 256 | lm loss: 1.898265E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.529 | TFLOPs: 39.42 | 15: iteration 111870/ 125429 | consumed samples: 28638720 | consumed tokens: 58652098560 | elapsed time per iteration (s): 1.08 | learning rate: 2.524E-05 | global batch size: 256 | lm loss: 1.896738E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.772 | TFLOPs: 39.29 | 15: iteration 111880/ 125429 | consumed samples: 28641280 | consumed tokens: 58657341440 | elapsed time per iteration (s): 1.06 | learning rate: 2.524E-05 | global batch size: 256 | lm loss: 1.868572E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.137 | TFLOPs: 40.01 | 15: iteration 111890/ 125429 | consumed samples: 28643840 | consumed tokens: 58662584320 | elapsed time per iteration (s): 1.07 | learning rate: 2.523E-05 | global batch size: 256 | lm loss: 1.906042E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.973 | TFLOPs: 39.49 | 15: iteration 111900/ 125429 | consumed samples: 28646400 | consumed tokens: 58667827200 | elapsed time per iteration (s): 1.04 | learning rate: 2.522E-05 | global batch size: 256 | lm loss: 1.876660E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.928 | TFLOPs: 40.81 | 15: iteration 111910/ 125429 | consumed samples: 28648960 | consumed tokens: 58673070080 | elapsed time per iteration (s): 1.09 | learning rate: 
2.521E-05 | global batch size: 256 | lm loss: 1.899599E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.790 | TFLOPs: 38.64 | 15: iteration 111920/ 125429 | consumed samples: 28651520 | consumed tokens: 58678312960 | elapsed time per iteration (s): 1.07 | learning rate: 2.521E-05 | global batch size: 256 | lm loss: 1.875790E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.794 | TFLOPs: 39.63 | 15: iteration 111930/ 125429 | consumed samples: 28654080 | consumed tokens: 58683555840 | elapsed time per iteration (s): 1.05 | learning rate: 2.520E-05 | global batch size: 256 | lm loss: 1.908121E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.818 | TFLOPs: 40.13 | 15: iteration 111940/ 125429 | consumed samples: 28656640 | consumed tokens: 58688798720 | elapsed time per iteration (s): 1.09 | learning rate: 2.519E-05 | global batch size: 256 | lm loss: 1.915801E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.987 | TFLOPs: 38.83 | 15: iteration 111950/ 125429 | consumed samples: 28659200 | consumed tokens: 58694041600 | elapsed time per iteration (s): 1.07 | learning rate: 2.518E-05 | global batch size: 256 | lm loss: 1.920006E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.432 | TFLOPs: 39.57 | 15: iteration 111960/ 125429 | consumed samples: 28661760 | consumed tokens: 58699284480 | elapsed time per iteration (s): 1.05 | learning rate: 2.518E-05 | global batch size: 256 | lm loss: 1.895975E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.428 | TFLOPs: 40.39 | 15: iteration 111970/ 125429 | 
consumed samples: 28664320 | consumed tokens: 58704527360 | elapsed time per iteration (s): 1.05 | learning rate: 2.517E-05 | global batch size: 256 | lm loss: 1.903053E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.984 | TFLOPs: 40.15 | 15: iteration 111980/ 125429 | consumed samples: 28666880 | consumed tokens: 58709770240 | elapsed time per iteration (s): 1.07 | learning rate: 2.516E-05 | global batch size: 256 | lm loss: 1.861744E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.695 | TFLOPs: 39.45 | 15: iteration 111990/ 125429 | consumed samples: 28669440 | consumed tokens: 58715013120 | elapsed time per iteration (s): 1.04 | learning rate: 2.515E-05 | global batch size: 256 | lm loss: 1.909092E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.206 | TFLOPs: 40.69 | 0: [2022-11-27 05:10:58,205] [INFO] [logging.py:68:log_dist] [Rank 0] step=112000, skipped=0, lr=[2.514464061435642e-05, 2.514464061435642e-05, 2.514464061435642e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 112000/ 125429 | consumed samples: 28672000 | consumed tokens: 58720256000 | elapsed time per iteration (s): 1.07 | learning rate: 2.514E-05 | global batch size: 256 | lm loss: 1.901764E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.831 | TFLOPs: 39.47 | 0: steps: 112000 loss: 1.8822 iter time (s): 1.059 samples/sec: 241.698 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 112000 | lm loss value: 1.855712E+00 | lm loss PPL: 6.396250E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 112000 to 
checkpoints_1b5 0: [2022-11-27 05:10:58,556] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step112000 is begin to save! 0: [2022-11-27 05:10:58,566] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_01-model_00-model_states.pt... 0: [2022-11-27 05:10:58,801] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_01-model_00-model_states.pt. 0: [2022-11-27 05:10:58,802] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_03-model_00-model_states.pt... 0: [2022-11-27 05:10:58,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_03-model_00-model_states.pt. 0: [2022-11-27 05:10:58,901] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_04-model_00-model_states.pt... 0: [2022-11-27 05:10:59,005] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_04-model_00-model_states.pt. 0: [2022-11-27 05:10:59,006] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_05-model_00-model_states.pt... 0: [2022-11-27 05:10:59,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_05-model_00-model_states.pt. 0: [2022-11-27 05:10:59,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_06-model_00-model_states.pt... 0: [2022-11-27 05:10:59,213] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_06-model_00-model_states.pt. 0: [2022-11-27 05:10:59,214] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 05:10:59,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_07-model_00-model_states.pt. 0: [2022-11-27 05:10:59,322] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_08-model_00-model_states.pt... 0: [2022-11-27 05:10:59,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_08-model_00-model_states.pt. 0: [2022-11-27 05:10:59,428] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_09-model_00-model_states.pt... 0: [2022-11-27 05:10:59,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_09-model_00-model_states.pt. 0: [2022-11-27 05:10:59,540] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_10-model_00-model_states.pt... 0: [2022-11-27 05:10:59,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_10-model_00-model_states.pt. 0: [2022-11-27 05:10:59,643] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_11-model_00-model_states.pt... 0: [2022-11-27 05:10:59,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_11-model_00-model_states.pt. 0: [2022-11-27 05:10:59,751] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_12-model_00-model_states.pt... 0: [2022-11-27 05:10:59,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_12-model_00-model_states.pt. 0: [2022-11-27 05:10:59,854] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 05:10:59,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_13-model_00-model_states.pt. 0: [2022-11-27 05:10:59,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_14-model_00-model_states.pt... 0: [2022-11-27 05:11:00,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_14-model_00-model_states.pt. 0: [2022-11-27 05:11:00,065] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_15-model_00-model_states.pt... 0: [2022-11-27 05:11:00,172] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_15-model_00-model_states.pt. 0: [2022-11-27 05:11:00,173] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_16-model_00-model_states.pt... 0: [2022-11-27 05:11:00,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_16-model_00-model_states.pt. 0: [2022-11-27 05:11:00,275] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_17-model_00-model_states.pt... 0: [2022-11-27 05:11:00,385] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_17-model_00-model_states.pt. 0: [2022-11-27 05:11:00,386] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_18-model_00-model_states.pt... 0: [2022-11-27 05:11:00,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_18-model_00-model_states.pt. 0: [2022-11-27 05:11:00,488] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 05:11:00,593] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_19-model_00-model_states.pt. 0: [2022-11-27 05:11:00,593] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_20-model_00-model_states.pt... 0: [2022-11-27 05:11:00,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_20-model_00-model_states.pt. 0: [2022-11-27 05:11:00,699] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_21-model_00-model_states.pt... 0: [2022-11-27 05:11:00,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_21-model_00-model_states.pt. 0: [2022-11-27 05:11:00,802] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_22-model_00-model_states.pt... 0: [2022-11-27 05:11:00,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_22-model_00-model_states.pt. 0: [2022-11-27 05:11:00,903] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_23-model_00-model_states.pt... 0: [2022-11-27 05:11:01,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_23-model_00-model_states.pt. 0: [2022-11-27 05:11:01,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_24-model_00-model_states.pt... 0: [2022-11-27 05:11:01,111] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_24-model_00-model_states.pt. 0: [2022-11-27 05:11:01,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 05:11:01,215] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_25-model_00-model_states.pt. 0: [2022-11-27 05:11:01,216] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_26-model_00-model_states.pt... 0: [2022-11-27 05:11:01,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_26-model_00-model_states.pt. 0: [2022-11-27 05:11:01,319] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_27-model_00-model_states.pt... 0: [2022-11-27 05:11:01,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_27-model_00-model_states.pt. 0: [2022-11-27 05:11:01,424] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_28-model_00-model_states.pt... 0: [2022-11-27 05:11:01,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_28-model_00-model_states.pt. 0: [2022-11-27 05:11:01,531] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_29-model_00-model_states.pt... 0: [2022-11-27 05:11:01,637] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_29-model_00-model_states.pt. 0: [2022-11-27 05:11:01,637] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_30-model_00-model_states.pt... 0: [2022-11-27 05:11:01,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_30-model_00-model_states.pt. 0: [2022-11-27 05:11:01,741] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/layer_32-model_00-model_states.pt... 
0: [2022-11-27 05:11:01,746] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/layer_32-model_00-model_states.pt. 0: [2022-11-27 05:11:01,747] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step112000/mp_rank_00_model_states.pt 0: [2022-11-27 05:11:01,747] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/mp_rank_00_model_states.pt... 0: [2022-11-27 05:11:01,750] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/mp_rank_00_model_states.pt. 0: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
10: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
6: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
3: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
8: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
1: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 
14: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
13: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 
12: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
5: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:11:01,789] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step112000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:11:01,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-27 05:11:01,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 05:11:01,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 2: [2022-11-27 05:11:01,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:11:01,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 05:11:01,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 11: [2022-11-27 05:11:01,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:11:01,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:11:01,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 05:11:01,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 6: [2022-11-27 05:11:01,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:11:01,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-27 05:11:01,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 05:11:01,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 05:11:01,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 6: [2022-11-27 05:11:01,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 9: [2022-11-27 05:11:01,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:11:01,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:11:01,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 05:11:01,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 05:11:01,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 9: [2022-11-27 05:11:01,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 7: [2022-11-27 05:11:01,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-27 05:11:01,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 05:11:01,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 9: [2022-11-27 05:11:01,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:11:01,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 05:11:01,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 8: [2022-11-27 05:11:01,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:11:01,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 05:11:01,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 2: [2022-11-27 05:11:01,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:11:01,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 05:11:01,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 10: [2022-11-27 05:11:01,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-27 05:11:01,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 05:11:01,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 7: [2022-11-27 05:11:01,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:11:01,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 05:11:01,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 9: [2022-11-27 05:11:01,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:11:01,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 5: [2022-11-27 05:11:01,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:11:01,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 5: [2022-11-27 05:11:01,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 05:11:01,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 9: [2022-11-27 05:11:01,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 
9: [2022-11-27 05:11:01,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 05:11:01,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 5: [2022-11-27 05:11:01,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:11:01,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 05:11:01,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 5: [2022-11-27 05:11:01,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:11:01,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 05:11:01,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 5: [2022-11-27 05:11:01,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:11:01,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 05:11:01,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 3: [2022-11-27 05:11:01,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-27 05:11:01,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 05:11:01,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 10: [2022-11-27 05:11:01,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:11:01,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 2: [2022-11-27 05:11:01,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:11:01,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 10: [2022-11-27 05:11:01,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 2: [2022-11-27 05:11:01,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 8: [2022-11-27 05:11:01,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:11:01,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 05:11:01,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 5: [2022-11-27 05:11:01,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
11: [2022-11-27 05:11:01,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 5: [2022-11-27 05:11:01,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 11: [2022-11-27 05:11:01,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 11: [2022-11-27 05:11:01,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:11:01,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 11: [2022-11-27 05:11:01,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 05:11:01,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 0: [2022-11-27 05:11:01,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:11:01,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:11:01,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 15: [2022-11-27 05:11:01,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 0: [2022-11-27 05:11:01,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
15: [2022-11-27 05:11:01,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 0: [2022-11-27 05:11:01,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:11:01,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 05:11:01,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 9: [2022-11-27 05:11:01,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:11:01,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:11:01,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 7: [2022-11-27 05:11:01,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 05:11:01,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 9: [2022-11-27 05:11:01,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 11: [2022-11-27 05:11:01,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:11:01,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
5: [2022-11-27 05:11:01,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:11:01,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 05:11:01,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 7: [2022-11-27 05:11:01,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:11:01,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 05:11:01,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 0: [2022-11-27 05:11:01,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:11:01,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 05:11:01,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 5: [2022-11-27 05:11:01,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:11:01,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 05:11:01,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
7: [2022-11-27 05:11:01,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:11:01,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 05:11:01,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 0: [2022-11-27 05:11:01,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:11:01,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:11:01,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:11:01,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:11:01,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:11:01,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 05:11:01,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 6: [2022-11-27 05:11:01,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 3: [2022-11-27 05:11:01,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-27 05:11:01,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 9: [2022-11-27 05:11:01,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 6: [2022-11-27 05:11:01,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 3: [2022-11-27 05:11:01,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 9: [2022-11-27 05:11:01,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 6: [2022-11-27 05:11:01,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 3: [2022-11-27 05:11:01,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 6: [2022-11-27 05:11:01,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 3: [2022-11-27 05:11:01,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 6: [2022-11-27 05:11:01,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:11:01,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 05:11:01,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
6: [2022-11-27 05:11:01,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:11:01,975] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 05:11:01,975] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 6: [2022-11-27 05:11:01,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:11:01,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 05:11:01,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 6: [2022-11-27 05:11:01,976] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:11:01,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 05:11:01,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 3: [2022-11-27 05:11:01,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:11:01,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:11:01,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-27 05:11:01,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 05:11:01,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 05:11:01,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 05:11:01,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 3: [2022-11-27 05:11:01,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 3: [2022-11-27 05:11:01,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 0: [2022-11-27 05:11:01,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:11:01,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 05:11:01,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 3: [2022-11-27 05:11:01,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:11:01,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 05:11:01,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
9: [2022-11-27 05:11:01,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:11:01,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 05:11:01,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 0: [2022-11-27 05:11:01,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:11:01,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 05:11:01,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 15: [2022-11-27 05:11:01,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 05:11:01,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 15: [2022-11-27 05:11:01,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:11:01,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 05:11:01,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 2: [2022-11-27 05:11:01,982] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-27 05:11:01,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 05:11:01,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 14: [2022-11-27 05:11:01,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:11:01,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:11:01,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:11:01,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:11:01,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:11:01,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:11:01,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-27 05:11:01,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 05:11:01,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 05:11:01,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 05:11:01,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 05:11:01,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 05:11:01,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 05:11:01,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 05:11:01,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 14: [2022-11-27 05:11:01,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 14: [2022-11-27 05:11:01,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 14: [2022-11-27 05:11:01,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 14: [2022-11-27 05:11:01,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
14: [2022-11-27 05:11:01,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 14: [2022-11-27 05:11:01,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 2: [2022-11-27 05:11:01,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:11:01,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 05:11:01,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 2: [2022-11-27 05:11:01,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:11:01,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 05:11:01,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 11: [2022-11-27 05:11:01,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 05:11:01,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 11: [2022-11-27 05:11:01,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:11:01,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
11: [2022-11-27 05:11:01,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:11:01,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 05:11:01,979] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 05:11:01,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 15: [2022-11-27 05:11:01,984] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:11:01,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 11: [2022-11-27 05:11:01,979] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 11: [2022-11-27 05:11:01,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:11:01,984] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 11: [2022-11-27 05:11:01,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 05:11:01,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 2: [2022-11-27 05:11:01,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
15: [2022-11-27 05:11:01,984] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 8: [2022-11-27 05:11:01,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:11:01,986] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:11:01,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 05:11:01,986] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 05:11:01,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 8: [2022-11-27 05:11:01,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 11: [2022-11-27 05:11:01,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:11:01,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:11:01,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 05:11:01,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 7: [2022-11-27 05:11:01,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-27 05:11:01,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 05:11:01,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 0: [2022-11-27 05:11:01,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:11:01,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 05:11:01,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 0: [2022-11-27 05:11:01,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:11:01,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:11:01,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:11:01,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 05:11:01,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 7: [2022-11-27 05:11:01,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-27 05:11:01,994] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 05:11:01,994] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 15: [2022-11-27 05:11:01,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 05:11:01,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:11:01,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 15: [2022-11-27 05:11:01,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 05:11:01,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 15: [2022-11-27 05:11:01,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:11:01,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 05:11:01,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 10: [2022-11-27 05:11:01,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:11:01,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 
10: [2022-11-27 05:11:01,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:11:01,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 05:11:01,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 05:11:01,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 10: [2022-11-27 05:11:01,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 10: [2022-11-27 05:11:01,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 05:11:01,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 10: [2022-11-27 05:11:01,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:11:01,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 05:11:01,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 10: [2022-11-27 05:11:01,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-27 05:11:01,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 05:11:01,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 10: [2022-11-27 05:11:01,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:11:01,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 05:11:01,999] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 14: [2022-11-27 05:11:01,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:11:01,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 05:11:01,985] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 2: [2022-11-27 05:11:02,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:11:02,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 05:11:02,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 12: [2022-11-27 05:11:02,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-27 05:11:02,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:11:02,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:11:02,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:11:02,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:11:02,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:11:02,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:11:02,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-27 05:11:02,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 05:11:02,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 05:11:02,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 05:11:02,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 05:11:02,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 05:11:02,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 12: [2022-11-27 05:11:02,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 05:11:02,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 05:11:02,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 05:11:02,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 12: [2022-11-27 05:11:02,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 12: [2022-11-27 05:11:02,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
12: [2022-11-27 05:11:02,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 12: [2022-11-27 05:11:02,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 12: [2022-11-27 05:11:02,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 12: [2022-11-27 05:11:02,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 8: [2022-11-27 05:11:02,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:11:02,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:11:02,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 05:11:02,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 05:11:02,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 8: [2022-11-27 05:11:02,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 8: [2022-11-27 05:11:02,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:11:02,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 05:11:02,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
13: [2022-11-27 05:11:02,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:11:02,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:11:02,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:11:02,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 05:11:02,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 05:11:02,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 05:11:02,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 13: [2022-11-27 05:11:02,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 13: [2022-11-27 05:11:02,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 13: [2022-11-27 05:11:02,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:11:02,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 05:11:02,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
11: [2022-11-27 05:11:01,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 05:11:01,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 11: [2022-11-27 05:11:01,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:11:01,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 05:11:01,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 3: [2022-11-27 05:11:02,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:11:02,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 05:11:02,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 13: [2022-11-27 05:11:02,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:11:02,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:11:02,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 
13: [2022-11-27 05:11:02,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 05:11:02,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 05:11:02,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:11:02,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 05:11:02,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 13: [2022-11-27 05:11:02,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 13: [2022-11-27 05:11:02,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 05:11:02,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 13: [2022-11-27 05:11:02,036] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 0: [2022-11-27 05:11:02,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 05:11:02,073] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 1: [2022-11-27 05:11:02,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-27 05:11:02,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:11:02,103] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:11:02,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 05:11:02,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 05:11:02,103] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 05:11:02,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 1: [2022-11-27 05:11:02,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 1: [2022-11-27 05:11:02,103] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 1: [2022-11-27 05:11:02,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:11:02,106] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 05:11:02,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 1: [2022-11-27 05:11:02,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-27 05:11:02,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 05:11:02,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 1: [2022-11-27 05:11:02,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:11:02,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:11:02,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:11:02,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 05:11:02,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 05:11:02,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 1: [2022-11-27 05:11:02,107] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 05:11:02,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 1: [2022-11-27 05:11:02,107] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 4: [2022-11-27 05:11:02,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-27 05:11:02,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:11:02,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:11:02,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:11:02,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:11:02,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 05:11:02,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:11:02,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:11:02,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 05:11:02,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 05:11:02,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
4: [2022-11-27 05:11:02,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 05:11:02,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 05:11:02,122] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:11:02,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 05:11:02,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 4: [2022-11-27 05:11:02,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 05:11:02,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 4: [2022-11-27 05:11:02,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 4: [2022-11-27 05:11:02,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 4: [2022-11-27 05:11:02,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 4: [2022-11-27 05:11:02,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 4: [2022-11-27 05:11:02,122] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step112000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 05:11:02,122] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step112000 is ready now! 
0: successfully saved checkpoint at iteration 112000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3585.44 15: iteration 112010/ 125429 | consumed samples: 28674560 | consumed tokens: 58725498880 | elapsed time per iteration (s): 1.42 | learning rate: 2.514E-05 | global batch size: 256 | lm loss: 1.903539E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 179.689 | TFLOPs: 29.69 | 15: iteration 112020/ 125429 | consumed samples: 28677120 | consumed tokens: 58730741760 | elapsed time per iteration (s): 1.04 | learning rate: 2.513E-05 | global batch size: 256 | lm loss: 1.904481E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.277 | TFLOPs: 40.86 | 15: iteration 112030/ 125429 | consumed samples: 28679680 | consumed tokens: 58735984640 | elapsed time per iteration (s): 1.19 | learning rate: 2.512E-05 | global batch size: 256 | lm loss: 1.918201E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.395 | TFLOPs: 35.43 | 15: iteration 112040/ 125429 | consumed samples: 28682240 | consumed tokens: 58741227520 | elapsed time per iteration (s): 1.06 | learning rate: 2.511E-05 | global batch size: 256 | lm loss: 1.891501E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.600 | TFLOPs: 40.09 | 15: iteration 112050/ 125429 | consumed samples: 28684800 | consumed tokens: 58746470400 | elapsed time per iteration (s): 1.06 | learning rate: 2.511E-05 | global batch size: 256 | lm loss: 1.894991E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.496 | TFLOPs: 39.91 | 15: iteration 112060/ 125429 | consumed samples: 28687360 | consumed tokens: 58751713280 | elapsed time per iteration (s): 
1.04 | learning rate: 2.510E-05 | global batch size: 256 | lm loss: 1.922855E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.908 | TFLOPs: 40.80 | 15: iteration 112070/ 125429 | consumed samples: 28689920 | consumed tokens: 58756956160 | elapsed time per iteration (s): 1.05 | learning rate: 2.509E-05 | global batch size: 256 | lm loss: 1.934404E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.734 | TFLOPs: 40.44 | 15: iteration 112080/ 125429 | consumed samples: 28692480 | consumed tokens: 58762199040 | elapsed time per iteration (s): 1.07 | learning rate: 2.508E-05 | global batch size: 256 | lm loss: 1.877794E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.127 | TFLOPs: 39.68 | 15: iteration 112090/ 125429 | consumed samples: 28695040 | consumed tokens: 58767441920 | elapsed time per iteration (s): 1.08 | learning rate: 2.508E-05 | global batch size: 256 | lm loss: 1.899623E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.493 | TFLOPs: 39.08 | 15: iteration 112100/ 125429 | consumed samples: 28697600 | consumed tokens: 58772684800 | elapsed time per iteration (s): 1.04 | learning rate: 2.507E-05 | global batch size: 256 | lm loss: 1.894404E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.403 | TFLOPs: 40.55 | 15: iteration 112110/ 125429 | consumed samples: 28700160 | consumed tokens: 58777927680 | elapsed time per iteration (s): 1.07 | learning rate: 2.506E-05 | global batch size: 256 | lm loss: 1.878144E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.761 | TFLOPs: 39.62 | 15: 
iteration 112120/ 125429 | consumed samples: 28702720 | consumed tokens: 58783170560 | elapsed time per iteration (s): 1.06 | learning rate: 2.505E-05 | global batch size: 256 | lm loss: 1.873766E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.848 | TFLOPs: 39.97 | 15: iteration 112130/ 125429 | consumed samples: 28705280 | consumed tokens: 58788413440 | elapsed time per iteration (s): 1.07 | learning rate: 2.505E-05 | global batch size: 256 | lm loss: 1.906334E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.292 | TFLOPs: 39.71 | 15: iteration 112140/ 125429 | consumed samples: 28707840 | consumed tokens: 58793656320 | elapsed time per iteration (s): 1.07 | learning rate: 2.504E-05 | global batch size: 256 | lm loss: 1.914434E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.542 | TFLOPs: 39.59 | 15: iteration 112150/ 125429 | consumed samples: 28710400 | consumed tokens: 58798899200 | elapsed time per iteration (s): 1.03 | learning rate: 2.503E-05 | global batch size: 256 | lm loss: 1.897659E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.720 | TFLOPs: 40.94 | 15: iteration 112160/ 125429 | consumed samples: 28712960 | consumed tokens: 58804142080 | elapsed time per iteration (s): 1.07 | learning rate: 2.502E-05 | global batch size: 256 | lm loss: 1.915971E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.570 | TFLOPs: 39.59 | 15: iteration 112170/ 125429 | consumed samples: 28715520 | consumed tokens: 58809384960 | elapsed time per iteration (s): 1.03 | learning rate: 2.502E-05 | global batch size: 256 | lm loss: 1.902031E+00 | grad norm: 0.166 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.879 | TFLOPs: 40.96 | 15: iteration 112180/ 125429 | consumed samples: 28718080 | consumed tokens: 58814627840 | elapsed time per iteration (s): 1.03 | learning rate: 2.501E-05 | global batch size: 256 | lm loss: 1.904500E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.150 | TFLOPs: 41.17 | 15: iteration 112190/ 125429 | consumed samples: 28720640 | consumed tokens: 58819870720 | elapsed time per iteration (s): 1.13 | learning rate: 2.500E-05 | global batch size: 256 | lm loss: 1.884853E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.115 | TFLOPs: 37.53 | 15: iteration 112200/ 125429 | consumed samples: 28723200 | consumed tokens: 58825113600 | elapsed time per iteration (s): 1.02 | learning rate: 2.499E-05 | global batch size: 256 | lm loss: 1.921306E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.585 | TFLOPs: 41.41 | 15: iteration 112210/ 125429 | consumed samples: 28725760 | consumed tokens: 58830356480 | elapsed time per iteration (s): 1.04 | learning rate: 2.499E-05 | global batch size: 256 | lm loss: 1.878895E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.202 | TFLOPs: 40.69 | 15: iteration 112220/ 125429 | consumed samples: 28728320 | consumed tokens: 58835599360 | elapsed time per iteration (s): 1.05 | learning rate: 2.498E-05 | global batch size: 256 | lm loss: 1.903882E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.553 | TFLOPs: 40.41 | 15: iteration 112230/ 125429 | consumed samples: 28730880 | consumed tokens: 58840842240 | elapsed time per iteration (s): 1.04 | 
learning rate: 2.497E-05 | global batch size: 256 | lm loss: 1.907296E+00 | grad norm: 0.149 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.357 | TFLOPs: 40.55 | 15: iteration 112240/ 125429 | consumed samples: 28733440 | consumed tokens: 58846085120 | elapsed time per iteration (s): 1.03 | learning rate: 2.496E-05 | global batch size: 256 | lm loss: 1.864816E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.095 | TFLOPs: 41.00 | 15: iteration 112250/ 125429 | consumed samples: 28736000 | consumed tokens: 58851328000 | elapsed time per iteration (s): 1.06 | learning rate: 2.496E-05 | global batch size: 256 | lm loss: 1.883762E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.645 | TFLOPs: 39.93 | 15: iteration 112260/ 125429 | consumed samples: 28738560 | consumed tokens: 58856570880 | elapsed time per iteration (s): 1.03 | learning rate: 2.495E-05 | global batch size: 256 | lm loss: 1.929600E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.539 | TFLOPs: 41.07 | 15: iteration 112270/ 125429 | consumed samples: 28741120 | consumed tokens: 58861813760 | elapsed time per iteration (s): 1.04 | learning rate: 2.494E-05 | global batch size: 256 | lm loss: 1.887925E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.100 | TFLOPs: 40.84 | 15: iteration 112280/ 125429 | consumed samples: 28743680 | consumed tokens: 58867056640 | elapsed time per iteration (s): 1.05 | learning rate: 2.493E-05 | global batch size: 256 | lm loss: 1.908509E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.631 | TFLOPs: 40.43 | 15: iteration 
112290/ 125429 | consumed samples: 28746240 | consumed tokens: 58872299520 | elapsed time per iteration (s): 1.05 | learning rate: 2.493E-05 | global batch size: 256 | lm loss: 1.906664E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.445 | TFLOPs: 40.23 | 15: iteration 112300/ 125429 | consumed samples: 28748800 | consumed tokens: 58877542400 | elapsed time per iteration (s): 1.03 | learning rate: 2.492E-05 | global batch size: 256 | lm loss: 1.889734E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.698 | TFLOPs: 41.10 | 15: iteration 112310/ 125429 | consumed samples: 28751360 | consumed tokens: 58882785280 | elapsed time per iteration (s): 1.03 | learning rate: 2.491E-05 | global batch size: 256 | lm loss: 1.904588E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.769 | TFLOPs: 41.11 | 15: iteration 112320/ 125429 | consumed samples: 28753920 | consumed tokens: 58888028160 | elapsed time per iteration (s): 1.05 | learning rate: 2.490E-05 | global batch size: 256 | lm loss: 1.876990E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.144 | TFLOPs: 40.18 | 15: iteration 112330/ 125429 | consumed samples: 28756480 | consumed tokens: 58893271040 | elapsed time per iteration (s): 1.02 | learning rate: 2.490E-05 | global batch size: 256 | lm loss: 1.883316E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.700 | TFLOPs: 41.43 | 15: iteration 112340/ 125429 | consumed samples: 28759040 | consumed tokens: 58898513920 | elapsed time per iteration (s): 1.04 | learning rate: 2.489E-05 | global batch size: 256 | lm loss: 1.911291E+00 | grad norm: 0.159 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.312 | TFLOPs: 40.87 | 15: iteration 112350/ 125429 | consumed samples: 28761600 | consumed tokens: 58903756800 | elapsed time per iteration (s): 1.06 | learning rate: 2.488E-05 | global batch size: 256 | lm loss: 1.868066E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.972 | TFLOPs: 39.82 | 15: iteration 112360/ 125429 | consumed samples: 28764160 | consumed tokens: 58908999680 | elapsed time per iteration (s): 1.07 | learning rate: 2.487E-05 | global batch size: 256 | lm loss: 1.880073E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.544 | TFLOPs: 39.59 | 15: iteration 112370/ 125429 | consumed samples: 28766720 | consumed tokens: 58914242560 | elapsed time per iteration (s): 1.03 | learning rate: 2.487E-05 | global batch size: 256 | lm loss: 1.902533E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.407 | TFLOPs: 41.05 | 15: iteration 112380/ 125429 | consumed samples: 28769280 | consumed tokens: 58919485440 | elapsed time per iteration (s): 1.03 | learning rate: 2.486E-05 | global batch size: 256 | lm loss: 1.914186E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.696 | TFLOPs: 41.10 | 15: iteration 112390/ 125429 | consumed samples: 28771840 | consumed tokens: 58924728320 | elapsed time per iteration (s): 1.04 | learning rate: 2.485E-05 | global batch size: 256 | lm loss: 1.885447E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.772 | TFLOPs: 40.78 | 15: iteration 112400/ 125429 | consumed samples: 28774400 | consumed tokens: 58929971200 | elapsed time per iteration (s): 1.06 | learning 
rate: 2.485E-05 | global batch size: 256 | lm loss: 1.889097E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.584 | TFLOPs: 40.09 | 15: iteration 112410/ 125429 | consumed samples: 28776960 | consumed tokens: 58935214080 | elapsed time per iteration (s): 1.06 | learning rate: 2.484E-05 | global batch size: 256 | lm loss: 1.885909E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.449 | TFLOPs: 39.90 | 15: iteration 112420/ 125429 | consumed samples: 28779520 | consumed tokens: 58940456960 | elapsed time per iteration (s): 1.03 | learning rate: 2.483E-05 | global batch size: 256 | lm loss: 1.916522E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.389 | TFLOPs: 40.88 | 15: iteration 112430/ 125429 | consumed samples: 28782080 | consumed tokens: 58945699840 | elapsed time per iteration (s): 1.08 | learning rate: 2.482E-05 | global batch size: 256 | lm loss: 1.878292E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.892 | TFLOPs: 39.31 | 15: iteration 112440/ 125429 | consumed samples: 28784640 | consumed tokens: 58950942720 | elapsed time per iteration (s): 1.04 | learning rate: 2.482E-05 | global batch size: 256 | lm loss: 1.872271E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.702 | TFLOPs: 40.77 | 15: iteration 112450/ 125429 | consumed samples: 28787200 | consumed tokens: 58956185600 | elapsed time per iteration (s): 1.06 | learning rate: 2.481E-05 | global batch size: 256 | lm loss: 1.864581E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.950 | TFLOPs: 39.82 | 15: iteration 112460/ 
125429 | consumed samples: 28789760 | consumed tokens: 58961428480 | elapsed time per iteration (s): 1.05 | learning rate: 2.480E-05 | global batch size: 256 | lm loss: 1.893262E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.366 | TFLOPs: 40.22 | 15: iteration 112470/ 125429 | consumed samples: 28792320 | consumed tokens: 58966671360 | elapsed time per iteration (s): 1.04 | learning rate: 2.479E-05 | global batch size: 256 | lm loss: 1.910673E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.567 | TFLOPs: 40.58 | 15: iteration 112480/ 125429 | consumed samples: 28794880 | consumed tokens: 58971914240 | elapsed time per iteration (s): 1.03 | learning rate: 2.479E-05 | global batch size: 256 | lm loss: 1.899399E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.799 | TFLOPs: 41.12 | 15: iteration 112490/ 125429 | consumed samples: 28797440 | consumed tokens: 58977157120 | elapsed time per iteration (s): 1.03 | learning rate: 2.478E-05 | global batch size: 256 | lm loss: 1.859351E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.353 | TFLOPs: 40.88 | 15: iteration 112500/ 125429 | consumed samples: 28800000 | consumed tokens: 58982400000 | elapsed time per iteration (s): 1.04 | learning rate: 2.477E-05 | global batch size: 256 | lm loss: 1.896511E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.259 | TFLOPs: 40.86 | 15: iteration 112510/ 125429 | consumed samples: 28802560 | consumed tokens: 58987642880 | elapsed time per iteration (s): 1.06 | learning rate: 2.476E-05 | global batch size: 256 | lm loss: 1.898120E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 240.413 | TFLOPs: 39.73 | 15: iteration 112520/ 125429 | consumed samples: 28805120 | consumed tokens: 58992885760 | elapsed time per iteration (s): 1.07 | learning rate: 2.476E-05 | global batch size: 256 | lm loss: 1.878508E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.169 | TFLOPs: 39.52 | 15: iteration 112530/ 125429 | consumed samples: 28807680 | consumed tokens: 58998128640 | elapsed time per iteration (s): 1.03 | learning rate: 2.475E-05 | global batch size: 256 | lm loss: 1.907144E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.656 | TFLOPs: 41.09 | 15: iteration 112540/ 125429 | consumed samples: 28810240 | consumed tokens: 59003371520 | elapsed time per iteration (s): 1.04 | learning rate: 2.474E-05 | global batch size: 256 | lm loss: 1.889292E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.005 | TFLOPs: 40.82 | 15: iteration 112550/ 125429 | consumed samples: 28812800 | consumed tokens: 59008614400 | elapsed time per iteration (s): 1.04 | learning rate: 2.474E-05 | global batch size: 256 | lm loss: 1.898503E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.495 | TFLOPs: 40.57 | 15: iteration 112560/ 125429 | consumed samples: 28815360 | consumed tokens: 59013857280 | elapsed time per iteration (s): 1.04 | learning rate: 2.473E-05 | global batch size: 256 | lm loss: 1.879398E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.078 | TFLOPs: 40.83 | 15: iteration 112570/ 125429 | consumed samples: 28817920 | consumed tokens: 59019100160 | elapsed time per iteration (s): 1.03 | learning rate: 
2.472E-05 | global batch size: 256 | lm loss: 1.911619E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.575 | TFLOPs: 40.91 | 15: iteration 112580/ 125429 | consumed samples: 28820480 | consumed tokens: 59024343040 | elapsed time per iteration (s): 1.02 | learning rate: 2.471E-05 | global batch size: 256 | lm loss: 1.876989E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.696 | TFLOPs: 41.43 | 15: iteration 112590/ 125429 | consumed samples: 28823040 | consumed tokens: 59029585920 | elapsed time per iteration (s): 1.02 | learning rate: 2.471E-05 | global batch size: 256 | lm loss: 1.905875E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.352 | TFLOPs: 41.37 | 15: iteration 112600/ 125429 | consumed samples: 28825600 | consumed tokens: 59034828800 | elapsed time per iteration (s): 1.03 | learning rate: 2.470E-05 | global batch size: 256 | lm loss: 1.891192E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.947 | TFLOPs: 41.14 | 15: iteration 112610/ 125429 | consumed samples: 28828160 | consumed tokens: 59040071680 | elapsed time per iteration (s): 1.03 | learning rate: 2.469E-05 | global batch size: 256 | lm loss: 1.882619E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.508 | TFLOPs: 40.90 | 15: iteration 112620/ 125429 | consumed samples: 28830720 | consumed tokens: 59045314560 | elapsed time per iteration (s): 1.03 | learning rate: 2.468E-05 | global batch size: 256 | lm loss: 1.909260E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.196 | TFLOPs: 41.18 | 15: iteration 112630/ 125429 | 
consumed samples: 28833280 | consumed tokens: 59050557440 | elapsed time per iteration (s): 1.02 | learning rate: 2.468E-05 | global batch size: 256 | lm loss: 1.912541E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.235 | TFLOPs: 41.35 | 15: iteration 112640/ 125429 | consumed samples: 28835840 | consumed tokens: 59055800320 | elapsed time per iteration (s): 1.03 | learning rate: 2.467E-05 | global batch size: 256 | lm loss: 1.906580E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.437 | TFLOPs: 40.89 | 15: iteration 112650/ 125429 | consumed samples: 28838400 | consumed tokens: 59061043200 | elapsed time per iteration (s): 1.08 | learning rate: 2.466E-05 | global batch size: 256 | lm loss: 1.891236E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.094 | TFLOPs: 39.18 | 15: iteration 112660/ 125429 | consumed samples: 28840960 | consumed tokens: 59066286080 | elapsed time per iteration (s): 1.04 | learning rate: 2.466E-05 | global batch size: 256 | lm loss: 1.880062E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.881 | TFLOPs: 40.63 | 15: iteration 112670/ 125429 | consumed samples: 28843520 | consumed tokens: 59071528960 | elapsed time per iteration (s): 1.04 | learning rate: 2.465E-05 | global batch size: 256 | lm loss: 1.886917E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.967 | TFLOPs: 40.81 | 15: iteration 112680/ 125429 | consumed samples: 28846080 | consumed tokens: 59076771840 | elapsed time per iteration (s): 1.03 | learning rate: 2.464E-05 | global batch size: 256 | lm loss: 1.891983E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 249.652 | TFLOPs: 41.26 | 15: iteration 112690/ 125429 | consumed samples: 28848640 | consumed tokens: 59082014720 | elapsed time per iteration (s): 1.04 | learning rate: 2.463E-05 | global batch size: 256 | lm loss: 1.895326E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.868 | TFLOPs: 40.63 | 15: iteration 112700/ 125429 | consumed samples: 28851200 | consumed tokens: 59087257600 | elapsed time per iteration (s): 1.07 | learning rate: 2.463E-05 | global batch size: 256 | lm loss: 1.902602E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.290 | TFLOPs: 39.71 | 15: iteration 112710/ 125429 | consumed samples: 28853760 | consumed tokens: 59092500480 | elapsed time per iteration (s): 1.04 | learning rate: 2.462E-05 | global batch size: 256 | lm loss: 1.884842E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.581 | TFLOPs: 40.58 | 15: iteration 112720/ 125429 | consumed samples: 28856320 | consumed tokens: 59097743360 | elapsed time per iteration (s): 1.03 | learning rate: 2.461E-05 | global batch size: 256 | lm loss: 1.900061E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.246 | TFLOPs: 41.02 | 15: iteration 112730/ 125429 | consumed samples: 28858880 | consumed tokens: 59102986240 | elapsed time per iteration (s): 1.02 | learning rate: 2.461E-05 | global batch size: 256 | lm loss: 1.891379E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.857 | TFLOPs: 41.46 | 15: iteration 112740/ 125429 | consumed samples: 28861440 | consumed tokens: 59108229120 | elapsed time per iteration (s): 1.04 | learning rate: 
2.460E-05 | global batch size: 256 | lm loss: 1.909595E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.323 | TFLOPs: 40.54 | 15: iteration 112750/ 125429 | consumed samples: 28864000 | consumed tokens: 59113472000 | elapsed time per iteration (s): 1.03 | learning rate: 2.459E-05 | global batch size: 256 | lm loss: 1.922409E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.707 | TFLOPs: 41.10 | 15: iteration 112760/ 125429 | consumed samples: 28866560 | consumed tokens: 59118714880 | elapsed time per iteration (s): 1.03 | learning rate: 2.458E-05 | global batch size: 256 | lm loss: 1.864404E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.040 | TFLOPs: 40.99 | 15: iteration 112770/ 125429 | consumed samples: 28869120 | consumed tokens: 59123957760 | elapsed time per iteration (s): 1.04 | learning rate: 2.458E-05 | global batch size: 256 | lm loss: 1.894005E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.919 | TFLOPs: 40.64 | 15: iteration 112780/ 125429 | consumed samples: 28871680 | consumed tokens: 59129200640 | elapsed time per iteration (s): 1.06 | learning rate: 2.457E-05 | global batch size: 256 | lm loss: 1.901411E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.923 | TFLOPs: 39.98 | 15: iteration 112790/ 125429 | consumed samples: 28874240 | consumed tokens: 59134443520 | elapsed time per iteration (s): 1.04 | learning rate: 2.456E-05 | global batch size: 256 | lm loss: 1.870793E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.475 | TFLOPs: 40.57 | 15: iteration 112800/ 125429 | 
consumed samples: 28876800 | consumed tokens: 59139686400 | elapsed time per iteration (s): 1.04 | learning rate: 2.456E-05 | global batch size: 256 | lm loss: 1.920393E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.895 | TFLOPs: 40.80 | 15: iteration 112810/ 125429 | consumed samples: 28879360 | consumed tokens: 59144929280 | elapsed time per iteration (s): 1.21 | learning rate: 2.455E-05 | global batch size: 256 | lm loss: 1.892767E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 211.580 | TFLOPs: 34.97 | 15: iteration 112820/ 125429 | consumed samples: 28881920 | consumed tokens: 59150172160 | elapsed time per iteration (s): 1.03 | learning rate: 2.454E-05 | global batch size: 256 | lm loss: 1.899631E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.563 | TFLOPs: 41.08 | 15: iteration 112830/ 125429 | consumed samples: 28884480 | consumed tokens: 59155415040 | elapsed time per iteration (s): 1.03 | learning rate: 2.453E-05 | global batch size: 256 | lm loss: 1.908708E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.820 | TFLOPs: 40.95 | 15: iteration 112840/ 125429 | consumed samples: 28887040 | consumed tokens: 59160657920 | elapsed time per iteration (s): 1.03 | learning rate: 2.453E-05 | global batch size: 256 | lm loss: 1.886714E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.697 | TFLOPs: 41.26 | 15: iteration 112850/ 125429 | consumed samples: 28889600 | consumed tokens: 59165900800 | elapsed time per iteration (s): 1.05 | learning rate: 2.452E-05 | global batch size: 256 | lm loss: 1.896632E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 243.732 | TFLOPs: 40.28 | 15: iteration 112860/ 125429 | consumed samples: 28892160 | consumed tokens: 59171143680 | elapsed time per iteration (s): 1.06 | learning rate: 2.451E-05 | global batch size: 256 | lm loss: 1.897606E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.894 | TFLOPs: 39.81 | 15: iteration 112870/ 125429 | consumed samples: 28894720 | consumed tokens: 59176386560 | elapsed time per iteration (s): 1.03 | learning rate: 2.451E-05 | global batch size: 256 | lm loss: 1.877201E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.821 | TFLOPs: 41.12 | 15: iteration 112880/ 125429 | consumed samples: 28897280 | consumed tokens: 59181629440 | elapsed time per iteration (s): 1.07 | learning rate: 2.450E-05 | global batch size: 256 | lm loss: 1.914454E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.003 | TFLOPs: 39.50 | 15: iteration 112890/ 125429 | consumed samples: 28899840 | consumed tokens: 59186872320 | elapsed time per iteration (s): 1.04 | learning rate: 2.449E-05 | global batch size: 256 | lm loss: 1.895062E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.853 | TFLOPs: 40.79 | 15: iteration 112900/ 125429 | consumed samples: 28902400 | consumed tokens: 59192115200 | elapsed time per iteration (s): 1.04 | learning rate: 2.448E-05 | global batch size: 256 | lm loss: 1.889817E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.468 | TFLOPs: 40.73 | 15: iteration 112910/ 125429 | consumed samples: 28904960 | consumed tokens: 59197358080 | elapsed time per iteration (s): 1.05 | learning rate: 
2.448E-05 | global batch size: 256 | lm loss: 1.885981E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.193 | TFLOPs: 40.35 | 15: iteration 112920/ 125429 | consumed samples: 28907520 | consumed tokens: 59202600960 | elapsed time per iteration (s): 1.08 | learning rate: 2.447E-05 | global batch size: 256 | lm loss: 1.914151E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.238 | TFLOPs: 39.04 | 15: iteration 112930/ 125429 | consumed samples: 28910080 | consumed tokens: 59207843840 | elapsed time per iteration (s): 1.04 | learning rate: 2.446E-05 | global batch size: 256 | lm loss: 1.877720E+00 | grad norm: 0.215 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.454 | TFLOPs: 40.56 | 15: iteration 112940/ 125429 | consumed samples: 28912640 | consumed tokens: 59213086720 | elapsed time per iteration (s): 1.05 | learning rate: 2.446E-05 | global batch size: 256 | lm loss: 1.916812E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.692 | TFLOPs: 40.44 | 15: iteration 112950/ 125429 | consumed samples: 28915200 | consumed tokens: 59218329600 | elapsed time per iteration (s): 1.05 | learning rate: 2.445E-05 | global batch size: 256 | lm loss: 1.886522E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.850 | TFLOPs: 40.13 | 15: iteration 112960/ 125429 | consumed samples: 28917760 | consumed tokens: 59223572480 | elapsed time per iteration (s): 1.03 | learning rate: 2.444E-05 | global batch size: 256 | lm loss: 1.899476E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.997 | TFLOPs: 40.98 | 15: iteration 112970/ 125429 | 
consumed samples: 28920320 | consumed tokens: 59228815360 | elapsed time per iteration (s): 1.05 | learning rate: 2.443E-05 | global batch size: 256 | lm loss: 1.901096E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.397 | TFLOPs: 40.22 | 15: iteration 112980/ 125429 | consumed samples: 28922880 | consumed tokens: 59234058240 | elapsed time per iteration (s): 1.07 | learning rate: 2.443E-05 | global batch size: 256 | lm loss: 1.901361E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.702 | TFLOPs: 39.61 | 15: iteration 112990/ 125429 | consumed samples: 28925440 | consumed tokens: 59239301120 | elapsed time per iteration (s): 1.05 | learning rate: 2.442E-05 | global batch size: 256 | lm loss: 1.888174E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.965 | TFLOPs: 40.48 | 15: iteration 113000/ 125429 | consumed samples: 28928000 | consumed tokens: 59244544000 | elapsed time per iteration (s): 1.04 | learning rate: 2.441E-05 | global batch size: 256 | lm loss: 1.890025E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.001 | TFLOPs: 40.49 | 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 113000 | lm loss value: 1.795574E+00 | lm loss PPL: 6.022932E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 113000 to checkpoints_1b5 0: [2022-11-27 05:28:30,489] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step113000 is begin to save! 
0: [2022-11-27 05:28:30,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_01-model_00-model_states.pt... 0: [2022-11-27 05:28:30,794] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_01-model_00-model_states.pt. 0: [2022-11-27 05:28:30,795] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_03-model_00-model_states.pt... 0: [2022-11-27 05:28:30,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_03-model_00-model_states.pt. 0: [2022-11-27 05:28:30,901] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_04-model_00-model_states.pt... 0: [2022-11-27 05:28:31,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_04-model_00-model_states.pt. 0: [2022-11-27 05:28:31,003] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_05-model_00-model_states.pt... 0: [2022-11-27 05:28:31,101] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_05-model_00-model_states.pt. 0: [2022-11-27 05:28:31,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_06-model_00-model_states.pt... 0: [2022-11-27 05:28:31,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_06-model_00-model_states.pt. 0: [2022-11-27 05:28:31,208] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_07-model_00-model_states.pt... 0: [2022-11-27 05:28:31,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_07-model_00-model_states.pt. 
0: [2022-11-27 05:28:31,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_08-model_00-model_states.pt... 0: [2022-11-27 05:28:31,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_08-model_00-model_states.pt. 0: [2022-11-27 05:28:31,417] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_09-model_00-model_states.pt... 0: [2022-11-27 05:28:31,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_09-model_00-model_states.pt. 0: [2022-11-27 05:28:31,521] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_10-model_00-model_states.pt... 0: [2022-11-27 05:28:31,621] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_10-model_00-model_states.pt. 0: [2022-11-27 05:28:31,621] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_11-model_00-model_states.pt... 0: [2022-11-27 05:28:31,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_11-model_00-model_states.pt. 0: [2022-11-27 05:28:31,727] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_12-model_00-model_states.pt... 0: [2022-11-27 05:28:31,825] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_12-model_00-model_states.pt. 0: [2022-11-27 05:28:31,825] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_13-model_00-model_states.pt... 0: [2022-11-27 05:28:31,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_13-model_00-model_states.pt. 
0: [2022-11-27 05:28:31,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_14-model_00-model_states.pt... 0: [2022-11-27 05:28:32,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_14-model_00-model_states.pt. 0: [2022-11-27 05:28:32,029] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_15-model_00-model_states.pt... 0: [2022-11-27 05:28:32,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_15-model_00-model_states.pt. 0: [2022-11-27 05:28:32,130] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_16-model_00-model_states.pt... 0: [2022-11-27 05:28:32,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_16-model_00-model_states.pt. 0: [2022-11-27 05:28:32,235] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_17-model_00-model_states.pt... 0: [2022-11-27 05:28:32,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_17-model_00-model_states.pt. 0: [2022-11-27 05:28:32,341] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_18-model_00-model_states.pt... 0: [2022-11-27 05:28:32,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_18-model_00-model_states.pt. 0: [2022-11-27 05:28:32,449] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_19-model_00-model_states.pt... 0: [2022-11-27 05:28:32,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_19-model_00-model_states.pt. 
0: [2022-11-27 05:28:32,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_20-model_00-model_states.pt... 0: [2022-11-27 05:28:32,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_20-model_00-model_states.pt. 0: [2022-11-27 05:28:32,666] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_21-model_00-model_states.pt... 0: [2022-11-27 05:28:32,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_21-model_00-model_states.pt. 0: [2022-11-27 05:28:32,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_22-model_00-model_states.pt... 0: [2022-11-27 05:28:32,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_22-model_00-model_states.pt. 0: [2022-11-27 05:28:32,878] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_23-model_00-model_states.pt... 0: [2022-11-27 05:28:32,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_23-model_00-model_states.pt. 0: [2022-11-27 05:28:32,985] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_24-model_00-model_states.pt... 0: [2022-11-27 05:28:33,092] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_24-model_00-model_states.pt. 0: [2022-11-27 05:28:33,092] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_25-model_00-model_states.pt... 0: [2022-11-27 05:28:33,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_25-model_00-model_states.pt. 
0: [2022-11-27 05:28:33,200] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_26-model_00-model_states.pt... 0: [2022-11-27 05:28:33,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_26-model_00-model_states.pt. 0: [2022-11-27 05:28:33,305] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_27-model_00-model_states.pt... 0: [2022-11-27 05:28:33,410] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_27-model_00-model_states.pt. 0: [2022-11-27 05:28:33,410] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_28-model_00-model_states.pt... 0: [2022-11-27 05:28:33,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_28-model_00-model_states.pt. 0: [2022-11-27 05:28:33,514] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_29-model_00-model_states.pt... 0: [2022-11-27 05:28:33,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_29-model_00-model_states.pt. 0: [2022-11-27 05:28:33,622] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_30-model_00-model_states.pt... 0: [2022-11-27 05:28:33,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_30-model_00-model_states.pt. 0: [2022-11-27 05:28:33,728] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/layer_32-model_00-model_states.pt... 0: [2022-11-27 05:28:33,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/layer_32-model_00-model_states.pt. 
0: [2022-11-27 05:28:33,736] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step113000/mp_rank_00_model_states.pt 0: [2022-11-27 05:28:33,736] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/mp_rank_00_model_states.pt... 0: [2022-11-27 05:28:33,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/mp_rank_00_model_states.pt. 0: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
6: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
12: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
0: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
4: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
13: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
8: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
4: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:28:33,778] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step113000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:28:33,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-27 05:28:33,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 05:28:33,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 0: [2022-11-27 05:28:33,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:28:33,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:28:33,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 05:28:33,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 5: [2022-11-27 05:28:33,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:28:33,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 05:28:33,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 1: [2022-11-27 05:28:33,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:28:33,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 05:28:33,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
5: [2022-11-27 05:28:33,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:28:33,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 05:28:33,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 5: [2022-11-27 05:28:33,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:28:33,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 05:28:33,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 1: [2022-11-27 05:28:33,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:28:33,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 05:28:33,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 5: [2022-11-27 05:28:33,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:28:33,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 05:28:33,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
14: [2022-11-27 05:28:33,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:28:33,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 9: [2022-11-27 05:28:33,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:28:33,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 14: [2022-11-27 05:28:33,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:28:33,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 05:28:33,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 9: [2022-11-27 05:28:33,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 05:28:33,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 12: [2022-11-27 05:28:33,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:28:33,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 05:28:33,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
0: [2022-11-27 05:28:33,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:28:33,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 05:28:33,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 0: [2022-11-27 05:28:33,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:28:33,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 05:28:33,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 1: [2022-11-27 05:28:33,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:28:33,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 05:28:33,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 7: [2022-11-27 05:28:33,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:28:33,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:28:33,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-27 05:28:33,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 05:28:33,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 05:28:33,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 7: [2022-11-27 05:28:33,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 7: [2022-11-27 05:28:33,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 05:28:33,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 11: [2022-11-27 05:28:33,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:28:33,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:28:33,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:28:33,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 05:28:33,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 1: [2022-11-27 05:28:33,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-27 05:28:33,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 05:28:33,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 10: [2022-11-27 05:28:33,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:28:33,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 05:28:33,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 1: [2022-11-27 05:28:33,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:28:33,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 05:28:33,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 12: [2022-11-27 05:28:33,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:28:33,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 05:28:33,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 0: [2022-11-27 05:28:33,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
0: [2022-11-27 05:28:33,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 05:28:33,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 9: [2022-11-27 05:28:33,945] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:28:33,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:28:33,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:28:33,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 6: [2022-11-27 05:28:33,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 3: [2022-11-27 05:28:33,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 9: [2022-11-27 05:28:33,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 6: [2022-11-27 05:28:33,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 3: [2022-11-27 05:28:33,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 9: [2022-11-27 05:28:33,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-27 05:28:33,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 05:28:33,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 9: [2022-11-27 05:28:33,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:28:33,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 05:28:33,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 2: [2022-11-27 05:28:33,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:28:33,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:28:33,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 05:28:33,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 05:28:33,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 2: [2022-11-27 05:28:33,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 3: [2022-11-27 05:28:33,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-27 05:28:33,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 05:28:33,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 7: [2022-11-27 05:28:33,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:28:33,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:28:33,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 05:28:33,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 3: [2022-11-27 05:28:33,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:28:33,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 7: [2022-11-27 05:28:33,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 3: [2022-11-27 05:28:33,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 7: [2022-11-27 05:28:33,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:28:33,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
7: [2022-11-27 05:28:33,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 05:28:33,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 15: [2022-11-27 05:28:33,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:28:33,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:28:33,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 05:28:33,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 9: [2022-11-27 05:28:33,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:28:33,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 05:28:33,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 10: [2022-11-27 05:28:33,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:28:33,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 05:28:33,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
12: [2022-11-27 05:28:33,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:28:33,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 05:28:33,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 12: [2022-11-27 05:28:33,958] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:28:33,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 05:28:33,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 3: [2022-11-27 05:28:33,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:28:33,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 05:28:33,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 3: [2022-11-27 05:28:33,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:28:33,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 05:28:33,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
5: [2022-11-27 05:28:33,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:28:33,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 05:28:33,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 11: [2022-11-27 05:28:33,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:28:33,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 05:28:33,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 05:28:33,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 05:28:33,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 11: [2022-11-27 05:28:33,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 11: [2022-11-27 05:28:33,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 11: [2022-11-27 05:28:33,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
15: [2022-11-27 05:28:33,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 05:28:33,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 15: [2022-11-27 05:28:33,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:28:33,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:28:33,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 05:28:33,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 05:28:33,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 15: [2022-11-27 05:28:33,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 15: [2022-11-27 05:28:33,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:28:33,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-27 05:28:33,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 05:28:33,957] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 05:28:33,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 15: [2022-11-27 05:28:33,957] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 0: [2022-11-27 05:28:33,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:28:33,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 05:28:33,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 5: [2022-11-27 05:28:33,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:28:33,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 05:28:33,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 5: [2022-11-27 05:28:33,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-27 05:28:33,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 05:28:33,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 9: [2022-11-27 05:28:33,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:28:33,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 05:28:33,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 15: [2022-11-27 05:28:33,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:28:33,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:28:33,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:28:33,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 05:28:33,966] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 05:28:33,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 2: [2022-11-27 05:28:33,966] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
13: [2022-11-27 05:28:33,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:28:33,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 05:28:33,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 13: [2022-11-27 05:28:33,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:28:33,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 05:28:33,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 13: [2022-11-27 05:28:33,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:28:33,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 05:28:33,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 3: [2022-11-27 05:28:33,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:28:33,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 05:28:33,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
7: [2022-11-27 05:28:33,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:28:33,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:28:33,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:28:33,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 05:28:33,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 13: [2022-11-27 05:28:33,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 05:28:33,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 05:28:33,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 13: [2022-11-27 05:28:33,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 6: [2022-11-27 05:28:33,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:28:33,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-27 05:28:33,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 05:28:33,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 6: [2022-11-27 05:28:33,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 05:28:33,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 14: [2022-11-27 05:28:33,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:28:33,948] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 05:28:33,948] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 14: [2022-11-27 05:28:33,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:28:33,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 05:28:33,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 14: [2022-11-27 05:28:33,955] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-27 05:28:33,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 05:28:33,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 14: [2022-11-27 05:28:33,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:28:33,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 05:28:33,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 14: [2022-11-27 05:28:33,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:28:33,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 05:28:33,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 2: [2022-11-27 05:28:33,970] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:28:33,970] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 05:28:33,970] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 10: [2022-11-27 05:28:33,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-27 05:28:33,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:28:33,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 05:28:33,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 05:28:33,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 10: [2022-11-27 05:28:33,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 1: [2022-11-27 05:28:33,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:28:33,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 05:28:33,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 9: [2022-11-27 05:28:33,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:28:33,974] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 05:28:33,974] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 12: [2022-11-27 05:28:33,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
7: [2022-11-27 05:28:33,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:28:33,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 05:28:33,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 7: [2022-11-27 05:28:33,976] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 05:28:33,976] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 10: [2022-11-27 05:28:33,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:28:33,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 05:28:33,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 10: [2022-11-27 05:28:33,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:28:33,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 05:28:33,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 10: [2022-11-27 05:28:33,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-27 05:28:33,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 05:28:33,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 0: [2022-11-27 05:28:33,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:28:33,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 05:28:33,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 13: [2022-11-27 05:28:33,977] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:28:33,977] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 05:28:33,977] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 13: [2022-11-27 05:28:33,981] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:28:33,981] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 05:28:33,981] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
15: [2022-11-27 05:28:33,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 05:28:33,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 15: [2022-11-27 05:28:33,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:28:33,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 05:28:33,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 15: [2022-11-27 05:28:33,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:28:33,973] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 05:28:33,973] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 5: [2022-11-27 05:28:33,985] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:28:33,985] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 05:28:33,986] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 12: [2022-11-27 05:28:33,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-27 05:28:33,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 05:28:33,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 10: [2022-11-27 05:28:33,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:28:33,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 05:28:33,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 2: [2022-11-27 05:28:33,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:28:33,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 05:28:33,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 2: [2022-11-27 05:28:33,992] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:28:33,992] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 05:28:33,992] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 2: [2022-11-27 05:28:33,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-27 05:28:33,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 05:28:33,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 1: [2022-11-27 05:28:33,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:28:33,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 05:28:33,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 11: [2022-11-27 05:28:33,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 05:28:33,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 11: [2022-11-27 05:28:33,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:28:33,967] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 05:28:33,967] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 11: [2022-11-27 05:28:33,991] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-27 05:28:33,991] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 05:28:33,991] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 0: [2022-11-27 05:28:34,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:28:34,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:28:34,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:28:34,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 05:28:34,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 3: [2022-11-27 05:28:34,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 05:28:34,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 05:28:34,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 3: [2022-11-27 05:28:34,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 0: [2022-11-27 05:28:34,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-27 05:28:34,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 05:28:34,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 11: [2022-11-27 05:28:34,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:28:34,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:28:34,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 05:28:34,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 4: [2022-11-27 05:28:33,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:28:33,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 05:28:33,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 4: [2022-11-27 05:28:33,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:28:33,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 05:28:33,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
4: [2022-11-27 05:28:33,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:28:33,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 05:28:33,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 4: [2022-11-27 05:28:33,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:28:33,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 05:28:33,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 4: [2022-11-27 05:28:33,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:28:33,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 05:28:33,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 4: [2022-11-27 05:28:33,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:28:33,955] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 05:28:33,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
4: [2022-11-27 05:28:33,969] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:28:33,969] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 05:28:33,969] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 4: [2022-11-27 05:28:33,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:28:33,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 05:28:33,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 6: [2022-11-27 05:28:34,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:28:34,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 05:28:34,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 6: [2022-11-27 05:28:34,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:28:34,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 05:28:34,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
6: [2022-11-27 05:28:34,013] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:28:34,013] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 05:28:34,013] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 6: [2022-11-27 05:28:34,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:28:34,014] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 05:28:34,014] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 13: [2022-11-27 05:28:34,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:28:34,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 05:28:34,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 14: [2022-11-27 05:28:34,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:28:34,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 05:28:34,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
11: [2022-11-27 05:28:34,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 05:28:34,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 11: [2022-11-27 05:28:34,004] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:28:34,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 05:28:34,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 8: [2022-11-27 05:28:34,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:28:34,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:28:34,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:28:34,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:28:34,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:28:34,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:28:34,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 
8: [2022-11-27 05:28:34,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:28:34,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 05:28:34,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 05:28:34,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 05:28:34,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 05:28:34,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 05:28:34,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 05:28:34,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 05:28:34,070] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 05:28:34,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 8: [2022-11-27 05:28:34,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
8: [2022-11-27 05:28:34,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 8: [2022-11-27 05:28:34,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 8: [2022-11-27 05:28:34,070] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 8: [2022-11-27 05:28:34,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 8: [2022-11-27 05:28:34,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 8: [2022-11-27 05:28:34,071] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 0: [2022-11-27 05:28:34,114] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step113000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 05:28:34,114] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step113000 is ready now! 
0: successfully saved checkpoint at iteration 113000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3681.44 15: iteration 113010/ 125429 | consumed samples: 28930560 | consumed tokens: 59249786880 | elapsed time per iteration (s): 1.49 | learning rate: 2.441E-05 | global batch size: 256 | lm loss: 1.910690E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 172.038 | TFLOPs: 28.43 | 15: iteration 113020/ 125429 | consumed samples: 28933120 | consumed tokens: 59255029760 | elapsed time per iteration (s): 1.05 | learning rate: 2.440E-05 | global batch size: 256 | lm loss: 1.911449E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.784 | TFLOPs: 40.29 | 15: iteration 113030/ 125429 | consumed samples: 28935680 | consumed tokens: 59260272640 | elapsed time per iteration (s): 1.05 | learning rate: 2.439E-05 | global batch size: 256 | lm loss: 1.881184E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.884 | TFLOPs: 40.30 | 15: iteration 113040/ 125429 | consumed samples: 28938240 | consumed tokens: 59265515520 | elapsed time per iteration (s): 1.05 | learning rate: 2.438E-05 | global batch size: 256 | lm loss: 1.900972E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.377 | TFLOPs: 40.22 | 15: iteration 113050/ 125429 | consumed samples: 28940800 | consumed tokens: 59270758400 | elapsed time per iteration (s): 1.06 | learning rate: 2.438E-05 | global batch size: 256 | lm loss: 1.908176E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.171 | TFLOPs: 39.86 | 15: iteration 113060/ 125429 | consumed samples: 28943360 | consumed tokens: 59276001280 | elapsed time per iteration (s): 
1.03 | learning rate: 2.437E-05 | global batch size: 256 | lm loss: 1.925823E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.254 | TFLOPs: 41.03 | 15: iteration 113070/ 125429 | consumed samples: 28945920 | consumed tokens: 59281244160 | elapsed time per iteration (s): 1.06 | learning rate: 2.436E-05 | global batch size: 256 | lm loss: 1.860311E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.996 | TFLOPs: 39.99 | 15: iteration 113080/ 125429 | consumed samples: 28948480 | consumed tokens: 59286487040 | elapsed time per iteration (s): 1.22 | learning rate: 2.436E-05 | global batch size: 256 | lm loss: 1.894505E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 210.300 | TFLOPs: 34.75 | 15: iteration 113090/ 125429 | consumed samples: 28951040 | consumed tokens: 59291729920 | elapsed time per iteration (s): 1.02 | learning rate: 2.435E-05 | global batch size: 256 | lm loss: 1.901171E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.853 | TFLOPs: 41.29 | 15: iteration 113100/ 125429 | consumed samples: 28953600 | consumed tokens: 59296972800 | elapsed time per iteration (s): 1.05 | learning rate: 2.434E-05 | global batch size: 256 | lm loss: 1.898517E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.711 | TFLOPs: 40.44 | 15: iteration 113110/ 125429 | consumed samples: 28956160 | consumed tokens: 59302215680 | elapsed time per iteration (s): 1.04 | learning rate: 2.434E-05 | global batch size: 256 | lm loss: 1.885869E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.452 | TFLOPs: 40.56 | 15: 
iteration 113120/ 125429 | consumed samples: 28958720 | consumed tokens: 59307458560 | elapsed time per iteration (s): 1.37 | learning rate: 2.433E-05 | global batch size: 256 | lm loss: 1.922182E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 186.508 | TFLOPs: 30.82 | 15: iteration 113130/ 125429 | consumed samples: 28961280 | consumed tokens: 59312701440 | elapsed time per iteration (s): 1.06 | learning rate: 2.432E-05 | global batch size: 256 | lm loss: 1.910592E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.428 | TFLOPs: 39.90 | 15: iteration 113140/ 125429 | consumed samples: 28963840 | consumed tokens: 59317944320 | elapsed time per iteration (s): 1.07 | learning rate: 2.432E-05 | global batch size: 256 | lm loss: 1.873304E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.154 | TFLOPs: 39.69 | 15: iteration 113150/ 125429 | consumed samples: 28966400 | consumed tokens: 59323187200 | elapsed time per iteration (s): 1.05 | learning rate: 2.431E-05 | global batch size: 256 | lm loss: 1.888722E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.778 | TFLOPs: 40.29 | 15: iteration 113160/ 125429 | consumed samples: 28968960 | consumed tokens: 59328430080 | elapsed time per iteration (s): 1.03 | learning rate: 2.430E-05 | global batch size: 256 | lm loss: 1.897184E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.489 | TFLOPs: 41.23 | 15: iteration 113170/ 125429 | consumed samples: 28971520 | consumed tokens: 59333672960 | elapsed time per iteration (s): 1.20 | learning rate: 2.429E-05 | global batch size: 256 | lm loss: 1.864908E+00 | grad norm: 0.163 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 213.302 | TFLOPs: 35.25 | 15: iteration 113180/ 125429 | consumed samples: 28974080 | consumed tokens: 59338915840 | elapsed time per iteration (s): 1.04 | learning rate: 2.429E-05 | global batch size: 256 | lm loss: 1.909668E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.773 | TFLOPs: 40.62 | 15: iteration 113190/ 125429 | consumed samples: 28976640 | consumed tokens: 59344158720 | elapsed time per iteration (s): 1.04 | learning rate: 2.428E-05 | global batch size: 256 | lm loss: 1.904902E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.080 | TFLOPs: 40.50 | 15: iteration 113200/ 125429 | consumed samples: 28979200 | consumed tokens: 59349401600 | elapsed time per iteration (s): 1.04 | learning rate: 2.427E-05 | global batch size: 256 | lm loss: 1.921501E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.385 | TFLOPs: 40.55 | 15: iteration 113210/ 125429 | consumed samples: 28981760 | consumed tokens: 59354644480 | elapsed time per iteration (s): 1.04 | learning rate: 2.427E-05 | global batch size: 256 | lm loss: 1.896163E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.812 | TFLOPs: 40.79 | 15: iteration 113220/ 125429 | consumed samples: 28984320 | consumed tokens: 59359887360 | elapsed time per iteration (s): 1.03 | learning rate: 2.426E-05 | global batch size: 256 | lm loss: 1.849000E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.323 | TFLOPs: 41.20 | 15: iteration 113230/ 125429 | consumed samples: 28986880 | consumed tokens: 59365130240 | elapsed time per iteration (s): 1.04 | 
learning rate: 2.425E-05 | global batch size: 256 | lm loss: 1.888540E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.083 | TFLOPs: 40.67 | 15: iteration 113240/ 125429 | consumed samples: 28989440 | consumed tokens: 59370373120 | elapsed time per iteration (s): 1.02 | learning rate: 2.425E-05 | global batch size: 256 | lm loss: 1.892583E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.831 | TFLOPs: 41.45 | 15: iteration 113250/ 125429 | consumed samples: 28992000 | consumed tokens: 59375616000 | elapsed time per iteration (s): 1.04 | learning rate: 2.424E-05 | global batch size: 256 | lm loss: 1.883091E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.167 | TFLOPs: 40.52 | 15: iteration 113260/ 125429 | consumed samples: 28994560 | consumed tokens: 59380858880 | elapsed time per iteration (s): 1.07 | learning rate: 2.423E-05 | global batch size: 256 | lm loss: 1.892533E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.804 | TFLOPs: 39.46 | 15: iteration 113270/ 125429 | consumed samples: 28997120 | consumed tokens: 59386101760 | elapsed time per iteration (s): 1.04 | learning rate: 2.422E-05 | global batch size: 256 | lm loss: 1.893289E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.886 | TFLOPs: 40.80 | 15: iteration 113280/ 125429 | consumed samples: 28999680 | consumed tokens: 59391344640 | elapsed time per iteration (s): 1.07 | learning rate: 2.422E-05 | global batch size: 256 | lm loss: 1.914820E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.451 | TFLOPs: 39.41 | 15: iteration 
113290/ 125429 | consumed samples: 29002240 | consumed tokens: 59396587520 | elapsed time per iteration (s): 1.02 | learning rate: 2.421E-05 | global batch size: 256 | lm loss: 1.892092E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.635 | TFLOPs: 41.42 | 15: iteration 113300/ 125429 | consumed samples: 29004800 | consumed tokens: 59401830400 | elapsed time per iteration (s): 1.06 | learning rate: 2.420E-05 | global batch size: 256 | lm loss: 1.875251E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.851 | TFLOPs: 39.80 | 15: iteration 113310/ 125429 | consumed samples: 29007360 | consumed tokens: 59407073280 | elapsed time per iteration (s): 1.05 | learning rate: 2.420E-05 | global batch size: 256 | lm loss: 1.887197E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.349 | TFLOPs: 40.38 | 15: iteration 113320/ 125429 | consumed samples: 29009920 | consumed tokens: 59412316160 | elapsed time per iteration (s): 1.03 | learning rate: 2.419E-05 | global batch size: 256 | lm loss: 1.874062E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.029 | TFLOPs: 40.99 | 15: iteration 113330/ 125429 | consumed samples: 29012480 | consumed tokens: 59417559040 | elapsed time per iteration (s): 1.06 | learning rate: 2.418E-05 | global batch size: 256 | lm loss: 1.893647E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.485 | TFLOPs: 40.07 | 15: iteration 113340/ 125429 | consumed samples: 29015040 | consumed tokens: 59422801920 | elapsed time per iteration (s): 1.05 | learning rate: 2.418E-05 | global batch size: 256 | lm loss: 1.887391E+00 | grad norm: 0.157 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.657 | TFLOPs: 40.27 | 15: iteration 113350/ 125429 | consumed samples: 29017600 | consumed tokens: 59428044800 | elapsed time per iteration (s): 1.08 | learning rate: 2.417E-05 | global batch size: 256 | lm loss: 1.869357E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.094 | TFLOPs: 39.02 | 15: iteration 113360/ 125429 | consumed samples: 29020160 | consumed tokens: 59433287680 | elapsed time per iteration (s): 1.08 | learning rate: 2.416E-05 | global batch size: 256 | lm loss: 1.906459E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.689 | TFLOPs: 39.28 | 15: iteration 113370/ 125429 | consumed samples: 29022720 | consumed tokens: 59438530560 | elapsed time per iteration (s): 1.04 | learning rate: 2.416E-05 | global batch size: 256 | lm loss: 1.871333E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.538 | TFLOPs: 40.58 | 15: iteration 113380/ 125429 | consumed samples: 29025280 | consumed tokens: 59443773440 | elapsed time per iteration (s): 1.04 | learning rate: 2.415E-05 | global batch size: 256 | lm loss: 1.876731E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.363 | TFLOPs: 40.55 | 15: iteration 113390/ 125429 | consumed samples: 29027840 | consumed tokens: 59449016320 | elapsed time per iteration (s): 1.03 | learning rate: 2.414E-05 | global batch size: 256 | lm loss: 1.880745E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.332 | TFLOPs: 41.04 | 15: iteration 113400/ 125429 | consumed samples: 29030400 | consumed tokens: 59454259200 | elapsed time per iteration (s): 1.07 | learning 
rate: 2.414E-05 | global batch size: 256 | lm loss: 1.901007E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.978 | TFLOPs: 39.66 | 15: iteration 113410/ 125429 | consumed samples: 29032960 | consumed tokens: 59459502080 | elapsed time per iteration (s): 1.04 | learning rate: 2.413E-05 | global batch size: 256 | lm loss: 1.903591E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.173 | TFLOPs: 40.52 | 15: iteration 113420/ 125429 | consumed samples: 29035520 | consumed tokens: 59464744960 | elapsed time per iteration (s): 1.04 | learning rate: 2.412E-05 | global batch size: 256 | lm loss: 1.878365E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.786 | TFLOPs: 40.78 | 15: iteration 113430/ 125429 | consumed samples: 29038080 | consumed tokens: 59469987840 | elapsed time per iteration (s): 1.04 | learning rate: 2.412E-05 | global batch size: 256 | lm loss: 1.896568E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.680 | TFLOPs: 40.60 | 15: iteration 113440/ 125429 | consumed samples: 29040640 | consumed tokens: 59475230720 | elapsed time per iteration (s): 1.19 | learning rate: 2.411E-05 | global batch size: 256 | lm loss: 1.914550E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.673 | TFLOPs: 35.48 | 15: iteration 113450/ 125429 | consumed samples: 29043200 | consumed tokens: 59480473600 | elapsed time per iteration (s): 1.03 | learning rate: 2.410E-05 | global batch size: 256 | lm loss: 1.874943E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.984 | TFLOPs: 41.15 | 15: iteration 113460/ 
125429 | consumed samples: 29045760 | consumed tokens: 59485716480 | elapsed time per iteration (s): 1.04 | learning rate: 2.409E-05 | global batch size: 256 | lm loss: 1.901827E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.589 | TFLOPs: 40.75 | 15: iteration 113470/ 125429 | consumed samples: 29048320 | consumed tokens: 59490959360 | elapsed time per iteration (s): 1.30 | learning rate: 2.409E-05 | global batch size: 256 | lm loss: 1.876774E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 196.537 | TFLOPs: 32.48 | 15: iteration 113480/ 125429 | consumed samples: 29050880 | consumed tokens: 59496202240 | elapsed time per iteration (s): 1.04 | learning rate: 2.408E-05 | global batch size: 256 | lm loss: 1.903759E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.342 | TFLOPs: 40.88 | 15: iteration 113490/ 125429 | consumed samples: 29053440 | consumed tokens: 59501445120 | elapsed time per iteration (s): 1.05 | learning rate: 2.407E-05 | global batch size: 256 | lm loss: 1.866408E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.275 | TFLOPs: 40.20 | 15: iteration 113500/ 125429 | consumed samples: 29056000 | consumed tokens: 59506688000 | elapsed time per iteration (s): 1.04 | learning rate: 2.407E-05 | global batch size: 256 | lm loss: 1.877831E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.282 | TFLOPs: 40.87 | 15: iteration 113510/ 125429 | consumed samples: 29058560 | consumed tokens: 59511930880 | elapsed time per iteration (s): 1.07 | learning rate: 2.406E-05 | global batch size: 256 | lm loss: 1.895391E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 239.555 | TFLOPs: 39.59 | 15: iteration 113520/ 125429 | consumed samples: 29061120 | consumed tokens: 59517173760 | elapsed time per iteration (s): 1.03 | learning rate: 2.405E-05 | global batch size: 256 | lm loss: 1.909419E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.717 | TFLOPs: 41.10 | 15: iteration 113530/ 125429 | consumed samples: 29063680 | consumed tokens: 59522416640 | elapsed time per iteration (s): 1.03 | learning rate: 2.405E-05 | global batch size: 256 | lm loss: 1.899236E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.161 | TFLOPs: 41.01 | 15: iteration 113540/ 125429 | consumed samples: 29066240 | consumed tokens: 59527659520 | elapsed time per iteration (s): 1.11 | learning rate: 2.404E-05 | global batch size: 256 | lm loss: 1.900218E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.229 | TFLOPs: 38.05 | 15: iteration 113550/ 125429 | consumed samples: 29068800 | consumed tokens: 59532902400 | elapsed time per iteration (s): 1.03 | learning rate: 2.403E-05 | global batch size: 256 | lm loss: 1.904040E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.432 | TFLOPs: 41.22 | 15: iteration 113560/ 125429 | consumed samples: 29071360 | consumed tokens: 59538145280 | elapsed time per iteration (s): 1.07 | learning rate: 2.403E-05 | global batch size: 256 | lm loss: 1.886956E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.045 | TFLOPs: 39.67 | 15: iteration 113570/ 125429 | consumed samples: 29073920 | consumed tokens: 59543388160 | elapsed time per iteration (s): 1.05 | learning rate: 
2.402E-05 | global batch size: 256 | lm loss: 1.910086E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.680 | TFLOPs: 40.10 | 15: iteration 113580/ 125429 | consumed samples: 29076480 | consumed tokens: 59548631040 | elapsed time per iteration (s): 1.03 | learning rate: 2.401E-05 | global batch size: 256 | lm loss: 1.933326E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.231 | TFLOPs: 41.02 | 15: iteration 113590/ 125429 | consumed samples: 29079040 | consumed tokens: 59553873920 | elapsed time per iteration (s): 1.06 | learning rate: 2.401E-05 | global batch size: 256 | lm loss: 1.869003E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.538 | TFLOPs: 40.08 | 15: iteration 113600/ 125429 | consumed samples: 29081600 | consumed tokens: 59559116800 | elapsed time per iteration (s): 1.06 | learning rate: 2.400E-05 | global batch size: 256 | lm loss: 1.892885E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.093 | TFLOPs: 40.01 | 15: iteration 113610/ 125429 | consumed samples: 29084160 | consumed tokens: 59564359680 | elapsed time per iteration (s): 1.04 | learning rate: 2.399E-05 | global batch size: 256 | lm loss: 1.890274E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.105 | TFLOPs: 40.84 | 15: iteration 113620/ 125429 | consumed samples: 29086720 | consumed tokens: 59569602560 | elapsed time per iteration (s): 1.04 | learning rate: 2.399E-05 | global batch size: 256 | lm loss: 1.857200E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.760 | TFLOPs: 40.61 | 15: iteration 113630/ 125429 | 
consumed samples: 29089280 | consumed tokens: 59574845440 | elapsed time per iteration (s): 1.04 | learning rate: 2.398E-05 | global batch size: 256 | lm loss: 1.914210E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.759 | TFLOPs: 40.78 | 15: iteration 113640/ 125429 | consumed samples: 29091840 | consumed tokens: 59580088320 | elapsed time per iteration (s): 1.07 | learning rate: 2.397E-05 | global batch size: 256 | lm loss: 1.906689E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.255 | TFLOPs: 39.70 | 15: iteration 113650/ 125429 | consumed samples: 29094400 | consumed tokens: 59585331200 | elapsed time per iteration (s): 1.11 | learning rate: 2.397E-05 | global batch size: 256 | lm loss: 1.906053E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.830 | TFLOPs: 38.15 | 15: iteration 113660/ 125429 | consumed samples: 29096960 | consumed tokens: 59590574080 | elapsed time per iteration (s): 1.05 | learning rate: 2.396E-05 | global batch size: 256 | lm loss: 1.897021E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.427 | TFLOPs: 40.23 | 15: iteration 113670/ 125429 | consumed samples: 29099520 | consumed tokens: 59595816960 | elapsed time per iteration (s): 1.02 | learning rate: 2.395E-05 | global batch size: 256 | lm loss: 1.919014E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.347 | TFLOPs: 41.37 | 15: iteration 113680/ 125429 | consumed samples: 29102080 | consumed tokens: 59601059840 | elapsed time per iteration (s): 1.05 | learning rate: 2.395E-05 | global batch size: 256 | lm loss: 1.906074E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 243.038 | TFLOPs: 40.16 | 15: iteration 113690/ 125429 | consumed samples: 29104640 | consumed tokens: 59606302720 | elapsed time per iteration (s): 1.09 | learning rate: 2.394E-05 | global batch size: 256 | lm loss: 1.900259E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.872 | TFLOPs: 38.98 | 15: iteration 113700/ 125429 | consumed samples: 29107200 | consumed tokens: 59611545600 | elapsed time per iteration (s): 1.06 | learning rate: 2.393E-05 | global batch size: 256 | lm loss: 1.906974E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.671 | TFLOPs: 39.94 | 15: iteration 113710/ 125429 | consumed samples: 29109760 | consumed tokens: 59616788480 | elapsed time per iteration (s): 1.10 | learning rate: 2.393E-05 | global batch size: 256 | lm loss: 1.888031E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.502 | TFLOPs: 38.42 | 15: iteration 113720/ 125429 | consumed samples: 29112320 | consumed tokens: 59622031360 | elapsed time per iteration (s): 1.04 | learning rate: 2.392E-05 | global batch size: 256 | lm loss: 1.867572E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.160 | TFLOPs: 40.51 | 15: iteration 113730/ 125429 | consumed samples: 29114880 | consumed tokens: 59627274240 | elapsed time per iteration (s): 1.05 | learning rate: 2.391E-05 | global batch size: 256 | lm loss: 1.888700E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.445 | TFLOPs: 40.40 | 15: iteration 113740/ 125429 | consumed samples: 29117440 | consumed tokens: 59632517120 | elapsed time per iteration (s): 1.19 | learning rate: 
2.391E-05 | global batch size: 256 | lm loss: 1.890861E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.298 | TFLOPs: 35.41 | 15: iteration 113750/ 125429 | consumed samples: 29120000 | consumed tokens: 59637760000 | elapsed time per iteration (s): 1.08 | learning rate: 2.390E-05 | global batch size: 256 | lm loss: 1.865807E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.302 | TFLOPs: 39.05 | 15: iteration 113760/ 125429 | consumed samples: 29122560 | consumed tokens: 59643002880 | elapsed time per iteration (s): 1.19 | learning rate: 2.389E-05 | global batch size: 256 | lm loss: 1.885661E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.853 | TFLOPs: 35.67 | 15: iteration 113770/ 125429 | consumed samples: 29125120 | consumed tokens: 59648245760 | elapsed time per iteration (s): 1.06 | learning rate: 2.389E-05 | global batch size: 256 | lm loss: 1.891513E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.163 | TFLOPs: 40.02 | 15: iteration 113780/ 125429 | consumed samples: 29127680 | consumed tokens: 59653488640 | elapsed time per iteration (s): 1.04 | learning rate: 2.388E-05 | global batch size: 256 | lm loss: 1.897453E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.762 | TFLOPs: 40.61 | 15: iteration 113790/ 125429 | consumed samples: 29130240 | consumed tokens: 59658731520 | elapsed time per iteration (s): 1.03 | learning rate: 2.387E-05 | global batch size: 256 | lm loss: 1.907537E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.494 | TFLOPs: 41.07 | 15: iteration 113800/ 125429 | 
consumed samples: 29132800 | consumed tokens: 59663974400 | elapsed time per iteration (s): 1.05 | learning rate: 2.387E-05 | global batch size: 256 | lm loss: 1.880416E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.678 | TFLOPs: 40.10 | 15: iteration 113810/ 125429 | consumed samples: 29135360 | consumed tokens: 59669217280 | elapsed time per iteration (s): 1.05 | learning rate: 2.386E-05 | global batch size: 256 | lm loss: 1.880048E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.827 | TFLOPs: 40.29 | 15: iteration 113820/ 125429 | consumed samples: 29137920 | consumed tokens: 59674460160 | elapsed time per iteration (s): 1.05 | learning rate: 2.385E-05 | global batch size: 256 | lm loss: 1.873048E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.632 | TFLOPs: 40.43 | 15: iteration 113830/ 125429 | consumed samples: 29140480 | consumed tokens: 59679703040 | elapsed time per iteration (s): 1.02 | learning rate: 2.385E-05 | global batch size: 256 | lm loss: 1.878305E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.501 | TFLOPs: 41.40 | 15: iteration 113840/ 125429 | consumed samples: 29143040 | consumed tokens: 59684945920 | elapsed time per iteration (s): 1.03 | learning rate: 2.384E-05 | global batch size: 256 | lm loss: 1.895762E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.437 | TFLOPs: 40.89 | 15: iteration 113850/ 125429 | consumed samples: 29145600 | consumed tokens: 59690188800 | elapsed time per iteration (s): 1.05 | learning rate: 2.383E-05 | global batch size: 256 | lm loss: 1.885118E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 243.502 | TFLOPs: 40.24 | 15: iteration 113860/ 125429 | consumed samples: 29148160 | consumed tokens: 59695431680 | elapsed time per iteration (s): 1.03 | learning rate: 2.383E-05 | global batch size: 256 | lm loss: 1.887052E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.069 | TFLOPs: 41.00 | 15: iteration 113870/ 125429 | consumed samples: 29150720 | consumed tokens: 59700674560 | elapsed time per iteration (s): 1.07 | learning rate: 2.382E-05 | global batch size: 256 | lm loss: 1.880177E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.610 | TFLOPs: 39.60 | 15: iteration 113880/ 125429 | consumed samples: 29153280 | consumed tokens: 59705917440 | elapsed time per iteration (s): 1.03 | learning rate: 2.381E-05 | global batch size: 256 | lm loss: 1.880090E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.714 | TFLOPs: 41.27 | 15: iteration 113890/ 125429 | consumed samples: 29155840 | consumed tokens: 59711160320 | elapsed time per iteration (s): 1.04 | learning rate: 2.381E-05 | global batch size: 256 | lm loss: 1.904785E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.368 | TFLOPs: 40.71 | 15: iteration 113900/ 125429 | consumed samples: 29158400 | consumed tokens: 59716403200 | elapsed time per iteration (s): 1.04 | learning rate: 2.380E-05 | global batch size: 256 | lm loss: 1.900307E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.183 | TFLOPs: 40.68 | 15: iteration 113910/ 125429 | consumed samples: 29160960 | consumed tokens: 59721646080 | elapsed time per iteration (s): 1.03 | learning rate: 
2.379E-05 | global batch size: 256 | lm loss: 1.904954E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.947 | TFLOPs: 40.98 | 15: iteration 113920/ 125429 | consumed samples: 29163520 | consumed tokens: 59726888960 | elapsed time per iteration (s): 1.04 | learning rate: 2.379E-05 | global batch size: 256 | lm loss: 1.900247E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.599 | TFLOPs: 40.59 | 15: iteration 113930/ 125429 | consumed samples: 29166080 | consumed tokens: 59732131840 | elapsed time per iteration (s): 1.04 | learning rate: 2.378E-05 | global batch size: 256 | lm loss: 1.879013E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.712 | TFLOPs: 40.77 | 15: iteration 113940/ 125429 | consumed samples: 29168640 | consumed tokens: 59737374720 | elapsed time per iteration (s): 1.04 | learning rate: 2.378E-05 | global batch size: 256 | lm loss: 1.919134E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.019 | TFLOPs: 40.66 | 15: iteration 113950/ 125429 | consumed samples: 29171200 | consumed tokens: 59742617600 | elapsed time per iteration (s): 1.02 | learning rate: 2.377E-05 | global batch size: 256 | lm loss: 1.913573E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.548 | TFLOPs: 41.40 | 15: iteration 113960/ 125429 | consumed samples: 29173760 | consumed tokens: 59747860480 | elapsed time per iteration (s): 1.05 | learning rate: 2.376E-05 | global batch size: 256 | lm loss: 1.872638E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.260 | TFLOPs: 40.20 | 15: iteration 113970/ 125429 | 
consumed samples: 29176320 | consumed tokens: 59753103360 | elapsed time per iteration (s): 1.05 | learning rate: 2.376E-05 | global batch size: 256 | lm loss: 1.900250E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.645 | TFLOPs: 40.26 | 15: iteration 113980/ 125429 | consumed samples: 29178880 | consumed tokens: 59758346240 | elapsed time per iteration (s): 1.05 | learning rate: 2.375E-05 | global batch size: 256 | lm loss: 1.854513E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.286 | TFLOPs: 40.20 | 15: iteration 113990/ 125429 | consumed samples: 29181440 | consumed tokens: 59763589120 | elapsed time per iteration (s): 1.08 | learning rate: 2.374E-05 | global batch size: 256 | lm loss: 1.917156E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.066 | TFLOPs: 39.34 | 0: [2022-11-27 05:46:16,030] [INFO] [logging.py:68:log_dist] [Rank 0] step=114000, skipped=0, lr=[2.373627147593246e-05, 2.373627147593246e-05, 2.373627147593246e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 114000/ 125429 | consumed samples: 29184000 | consumed tokens: 59768832000 | elapsed time per iteration (s): 1.04 | learning rate: 2.374E-05 | global batch size: 256 | lm loss: 1.908148E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.780 | TFLOPs: 40.78 | 0: steps: 114000 loss: 1.9132 iter time (s): 1.052 samples/sec: 243.253 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 114000 | lm loss value: 1.876659E+00 | lm loss PPL: 6.531644E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 114000 to 
checkpoints_1b5 0: [2022-11-27 05:46:16,383] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step114000 is begin to save! 0: [2022-11-27 05:46:16,390] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_01-model_00-model_states.pt... 0: [2022-11-27 05:46:16,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_01-model_00-model_states.pt. 0: [2022-11-27 05:46:16,673] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_03-model_00-model_states.pt... 0: [2022-11-27 05:46:16,779] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_03-model_00-model_states.pt. 0: [2022-11-27 05:46:16,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_04-model_00-model_states.pt... 0: [2022-11-27 05:46:16,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_04-model_00-model_states.pt. 0: [2022-11-27 05:46:16,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_05-model_00-model_states.pt... 0: [2022-11-27 05:46:17,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_05-model_00-model_states.pt. 0: [2022-11-27 05:46:17,007] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_06-model_00-model_states.pt... 0: [2022-11-27 05:46:17,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_06-model_00-model_states.pt. 0: [2022-11-27 05:46:17,113] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 05:46:17,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_07-model_00-model_states.pt. 0: [2022-11-27 05:46:17,219] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_08-model_00-model_states.pt... 0: [2022-11-27 05:46:17,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_08-model_00-model_states.pt. 0: [2022-11-27 05:46:17,326] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_09-model_00-model_states.pt... 0: [2022-11-27 05:46:17,430] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_09-model_00-model_states.pt. 0: [2022-11-27 05:46:17,430] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_10-model_00-model_states.pt... 0: [2022-11-27 05:46:17,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_10-model_00-model_states.pt. 0: [2022-11-27 05:46:17,539] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_11-model_00-model_states.pt... 0: [2022-11-27 05:46:17,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_11-model_00-model_states.pt. 0: [2022-11-27 05:46:17,644] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_12-model_00-model_states.pt... 0: [2022-11-27 05:46:17,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_12-model_00-model_states.pt. 0: [2022-11-27 05:46:17,752] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 05:46:17,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_13-model_00-model_states.pt. 0: [2022-11-27 05:46:17,861] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_14-model_00-model_states.pt... 0: [2022-11-27 05:46:17,966] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_14-model_00-model_states.pt. 0: [2022-11-27 05:46:17,967] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_15-model_00-model_states.pt... 0: [2022-11-27 05:46:18,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_15-model_00-model_states.pt. 0: [2022-11-27 05:46:18,074] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_16-model_00-model_states.pt... 0: [2022-11-27 05:46:18,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_16-model_00-model_states.pt. 0: [2022-11-27 05:46:18,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_17-model_00-model_states.pt... 0: [2022-11-27 05:46:18,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_17-model_00-model_states.pt. 0: [2022-11-27 05:46:18,284] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_18-model_00-model_states.pt... 0: [2022-11-27 05:46:18,389] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_18-model_00-model_states.pt. 0: [2022-11-27 05:46:18,389] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 05:46:18,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_19-model_00-model_states.pt. 0: [2022-11-27 05:46:18,493] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_20-model_00-model_states.pt... 0: [2022-11-27 05:46:18,596] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_20-model_00-model_states.pt. 0: [2022-11-27 05:46:18,597] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_21-model_00-model_states.pt... 0: [2022-11-27 05:46:18,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_21-model_00-model_states.pt. 0: [2022-11-27 05:46:18,700] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_22-model_00-model_states.pt... 0: [2022-11-27 05:46:18,802] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_22-model_00-model_states.pt. 0: [2022-11-27 05:46:18,802] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_23-model_00-model_states.pt... 0: [2022-11-27 05:46:18,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_23-model_00-model_states.pt. 0: [2022-11-27 05:46:18,914] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_24-model_00-model_states.pt... 0: [2022-11-27 05:46:19,014] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_24-model_00-model_states.pt. 0: [2022-11-27 05:46:19,015] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 05:46:19,116] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_25-model_00-model_states.pt. 0: [2022-11-27 05:46:19,116] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_26-model_00-model_states.pt... 0: [2022-11-27 05:46:19,222] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_26-model_00-model_states.pt. 0: [2022-11-27 05:46:19,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_27-model_00-model_states.pt... 0: [2022-11-27 05:46:19,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_27-model_00-model_states.pt. 0: [2022-11-27 05:46:19,325] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_28-model_00-model_states.pt... 0: [2022-11-27 05:46:19,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_28-model_00-model_states.pt. 0: [2022-11-27 05:46:19,432] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_29-model_00-model_states.pt... 0: [2022-11-27 05:46:19,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_29-model_00-model_states.pt. 0: [2022-11-27 05:46:19,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_30-model_00-model_states.pt... 0: [2022-11-27 05:46:19,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_30-model_00-model_states.pt. 0: [2022-11-27 05:46:19,638] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/layer_32-model_00-model_states.pt... 
0: [2022-11-27 05:46:19,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/layer_32-model_00-model_states.pt. 0: [2022-11-27 05:46:19,644] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step114000/mp_rank_00_model_states.pt 0: [2022-11-27 05:46:19,644] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/mp_rank_00_model_states.pt... 0: [2022-11-27 05:46:19,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/mp_rank_00_model_states.pt. 0: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 
12: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
10: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 
15: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 
14: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
7: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
6: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
8: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
10: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 
6: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
11: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 5: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 15: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 9: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 4: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 8: [2022-11-27 05:46:19,685] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step114000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 0: [2022-11-27 05:46:19,842] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
11: [2022-11-27 05:46:19,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:46:19,843] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 05:46:19,843] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 1: [2022-11-27 05:46:19,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:46:19,844] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 05:46:19,844] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 15: [2022-11-27 05:46:19,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:46:19,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:46:19,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:46:19,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 05:46:19,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 6: [2022-11-27 05:46:19,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-27 05:46:19,850] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 05:46:19,850] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 7: [2022-11-27 05:46:19,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:46:19,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 05:46:19,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 4: [2022-11-27 05:46:19,849] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:46:19,849] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 11: [2022-11-27 05:46:19,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:46:19,849] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 4: [2022-11-27 05:46:19,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:46:19,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 0: [2022-11-27 05:46:19,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 
4: [2022-11-27 05:46:19,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 11: [2022-11-27 05:46:19,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 0: [2022-11-27 05:46:19,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 05:46:19,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 11: [2022-11-27 05:46:19,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 1: [2022-11-27 05:46:19,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:46:19,855] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:46:19,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 05:46:19,855] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 05:46:19,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 1: [2022-11-27 05:46:19,855] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 11: [2022-11-27 05:46:19,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
1: [2022-11-27 05:46:19,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:46:19,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 05:46:19,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 4: [2022-11-27 05:46:19,858] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:46:19,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 05:46:19,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 13: [2022-11-27 05:46:19,841] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 05:46:19,841] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 13: [2022-11-27 05:46:19,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:46:19,853] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 05:46:19,853] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 1: [2022-11-27 05:46:19,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 
1: [2022-11-27 05:46:19,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 05:46:19,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 13: [2022-11-27 05:46:19,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:46:19,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:46:19,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 05:46:19,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 7: [2022-11-27 05:46:19,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:46:19,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 05:46:19,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 10: [2022-11-27 05:46:19,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:46:19,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-27 05:46:19,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 05:46:19,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 05:46:19,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 10: [2022-11-27 05:46:19,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 14: [2022-11-27 05:46:19,857] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:46:19,857] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 05:46:19,857] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 6: [2022-11-27 05:46:19,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:46:19,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 05:46:19,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 2: [2022-11-27 05:46:19,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:46:19,861] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-27 05:46:19,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 05:46:19,861] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 05:46:19,861] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 2: [2022-11-27 05:46:19,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 13: [2022-11-27 05:46:19,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 10: [2022-11-27 05:46:19,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:46:19,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 05:46:19,862] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 10: [2022-11-27 05:46:19,862] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:46:19,862] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 05:46:19,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 10: [2022-11-27 05:46:19,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 
10: [2022-11-27 05:46:19,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 05:46:19,863] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 2: [2022-11-27 05:46:19,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:46:19,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 05:46:19,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 12: [2022-11-27 05:46:19,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:46:19,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 05:46:19,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 7: [2022-11-27 05:46:19,866] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:46:19,866] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 05:46:19,866] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
11: [2022-11-27 05:46:19,858] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 05:46:19,858] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 11: [2022-11-27 05:46:19,863] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:46:19,863] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 05:46:19,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 11: [2022-11-27 05:46:19,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:46:19,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 05:46:19,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 11: [2022-11-27 05:46:19,867] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:46:19,867] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 05:46:19,867] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 0: [2022-11-27 05:46:19,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
0: [2022-11-27 05:46:19,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 05:46:19,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 6: [2022-11-27 05:46:19,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:46:19,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 05:46:19,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 1: [2022-11-27 05:46:19,868] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:46:19,868] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 05:46:19,868] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 6: [2022-11-27 05:46:19,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:46:19,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 05:46:19,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 7: [2022-11-27 05:46:19,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-27 05:46:19,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:46:19,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 05:46:19,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 7: [2022-11-27 05:46:19,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 05:46:19,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 12: [2022-11-27 05:46:19,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:46:19,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 05:46:19,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 6: [2022-11-27 05:46:19,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:46:19,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 05:46:19,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 6: [2022-11-27 05:46:19,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-27 05:46:19,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 05:46:19,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 12: [2022-11-27 05:46:19,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:46:19,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 05:46:19,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 12: [2022-11-27 05:46:19,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:46:19,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 05:46:19,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 12: [2022-11-27 05:46:19,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:46:19,870] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 05:46:19,870] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 10: [2022-11-27 05:46:19,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-27 05:46:19,871] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 05:46:19,871] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 7: [2022-11-27 05:46:19,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:46:19,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 05:46:19,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 1: [2022-11-27 05:46:19,872] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:46:19,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:46:19,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:46:19,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:46:19,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:46:19,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:46:19,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-27 05:46:19,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 05:46:19,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 05:46:19,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 05:46:19,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 05:46:19,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 05:46:19,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 05:46:19,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 3: [2022-11-27 05:46:19,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 3: [2022-11-27 05:46:19,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 3: [2022-11-27 05:46:19,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 3: [2022-11-27 05:46:19,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 3: [2022-11-27 05:46:19,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 13: [2022-11-27 05:46:19,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
13: [2022-11-27 05:46:19,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:46:19,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 05:46:19,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 0: [2022-11-27 05:46:19,877] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:46:19,877] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 05:46:19,877] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 12: [2022-11-27 05:46:19,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:46:19,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 05:46:19,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 05:46:19,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 12: [2022-11-27 05:46:19,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 05:46:19,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
11: [2022-11-27 05:46:19,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 1: [2022-11-27 05:46:19,872] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 05:46:19,872] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 15: [2022-11-27 05:46:19,845] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 05:46:19,845] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 15: [2022-11-27 05:46:19,850] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:46:19,851] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 05:46:19,851] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 15: [2022-11-27 05:46:19,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:46:19,859] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 05:46:19,859] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 15: [2022-11-27 05:46:19,860] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 
15: [2022-11-27 05:46:19,860] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 05:46:19,860] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 15: [2022-11-27 05:46:19,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:46:19,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 05:46:19,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 15: [2022-11-27 05:46:19,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:46:19,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 05:46:19,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 15: [2022-11-27 05:46:19,876] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 05:46:19,876] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 05:46:19,876] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 2: [2022-11-27 05:46:19,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 
2: [2022-11-27 05:46:19,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 05:46:19,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 2: [2022-11-27 05:46:19,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:46:19,880] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 05:46:19,880] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 11: [2022-11-27 05:46:19,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 05:46:19,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 11: [2022-11-27 05:46:19,879] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 05:46:19,879] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 05:46:19,879] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 12: [2022-11-27 05:46:19,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-27 05:46:19,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 05:46:19,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 2: [2022-11-27 05:46:19,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:46:19,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:46:19,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 05:46:19,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 05:46:19,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 2: [2022-11-27 05:46:19,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 15: [2022-11-27 05:46:19,883] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:46:19,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:46:19,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 05:46:19,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
6: [2022-11-27 05:46:19,886] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 05:46:19,886] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 05:46:19,886] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 2: [2022-11-27 05:46:19,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 05:46:19,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 05:46:19,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 3: [2022-11-27 05:46:19,889] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:46:19,889] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 05:46:19,889] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 0: [2022-11-27 05:46:19,891] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:46:19,891] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 05:46:19,891] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
15: [2022-11-27 05:46:19,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 05:46:19,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 14: [2022-11-27 05:46:19,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:46:19,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:46:19,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 05:46:19,875] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 05:46:19,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 14: [2022-11-27 05:46:19,875] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 14: [2022-11-27 05:46:19,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:46:19,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 05:46:19,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 14: [2022-11-27 05:46:19,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-27 05:46:19,882] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 05:46:19,882] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 14: [2022-11-27 05:46:19,882] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:46:19,883] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 05:46:19,883] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 14: [2022-11-27 05:46:19,898] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:46:19,898] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 05:46:19,898] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 14: [2022-11-27 05:46:19,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 05:46:19,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 05:46:19,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 6: [2022-11-27 05:46:19,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-27 05:46:19,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 05:46:19,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 13: [2022-11-27 05:46:19,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:46:19,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 05:46:19,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 13: [2022-11-27 05:46:19,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:46:19,869] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 05:46:19,869] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 13: [2022-11-27 05:46:19,873] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 05:46:19,873] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 05:46:19,873] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 1: [2022-11-27 05:46:19,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 
1: [2022-11-27 05:46:19,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 05:46:19,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 3: [2022-11-27 05:46:19,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 05:46:19,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 05:46:19,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 10: [2022-11-27 05:46:19,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:46:19,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 05:46:19,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 7: [2022-11-27 05:46:19,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 05:46:19,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 05:46:19,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 13: [2022-11-27 05:46:19,897] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-27 05:46:19,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 05:46:19,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 0: [2022-11-27 05:46:19,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 05:46:19,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 05:46:19,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 10: [2022-11-27 05:46:19,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 05:46:19,921] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 05:46:19,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 4: [2022-11-27 05:46:19,878] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:46:19,878] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 05:46:19,878] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 4: [2022-11-27 05:46:19,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 
4: [2022-11-27 05:46:19,884] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:46:19,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 05:46:19,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 4: [2022-11-27 05:46:19,884] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 05:46:19,884] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 4: [2022-11-27 05:46:19,887] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:46:19,887] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 05:46:19,887] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 4: [2022-11-27 05:46:19,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 05:46:19,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 05:46:19,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 8: [2022-11-27 05:46:19,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-27 05:46:19,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:46:19,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:46:19,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:46:19,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:46:19,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:46:19,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-27 05:46:19,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 05:46:19,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 05:46:19,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 05:46:19,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 05:46:19,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 05:46:19,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 05:46:19,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 05:46:19,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 8: [2022-11-27 05:46:19,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 8: [2022-11-27 05:46:19,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 8: [2022-11-27 05:46:19,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 8: [2022-11-27 05:46:19,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
8: [2022-11-27 05:46:19,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 8: [2022-11-27 05:46:19,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 8: [2022-11-27 05:46:19,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 05:46:19,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 05:46:19,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 5: [2022-11-27 05:46:19,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:46:19,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:46:19,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:46:19,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:46:19,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:46:19,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:46:19,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-27 05:46:19,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 05:46:19,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 05:46:19,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 05:46:19,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 05:46:19,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 05:46:19,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 05:46:19,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 05:46:19,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 05:46:19,952] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 05:46:19,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 5: [2022-11-27 05:46:19,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
5: [2022-11-27 05:46:19,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 5: [2022-11-27 05:46:19,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 5: [2022-11-27 05:46:19,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 5: [2022-11-27 05:46:19,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 5: [2022-11-27 05:46:19,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 5: [2022-11-27 05:46:19,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 0: [2022-11-27 05:46:20,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 05:46:20,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 9: [2022-11-27 05:46:20,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:46:20,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:46:20,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 05:46:20,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:46:20,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
9: [2022-11-27 05:46:20,142] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:46:20,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 05:46:20,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 05:46:20,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 9: [2022-11-27 05:46:20,142] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 05:46:20,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 9: [2022-11-27 05:46:20,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 9: [2022-11-27 05:46:20,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:46:20,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:46:20,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 05:46:20,143] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-27 05:46:20,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 05:46:20,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 05:46:20,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 05:46:20,143] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step114000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 05:46:20,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 9: [2022-11-27 05:46:20,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 9: [2022-11-27 05:46:20,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 9: [2022-11-27 05:46:20,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step114000 is ready now! 
0: successfully saved checkpoint at iteration 114000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3768.91 15: iteration 114010/ 125429 | consumed samples: 29186560 | consumed tokens: 59774074880 | elapsed time per iteration (s): 1.44 | learning rate: 2.373E-05 | global batch size: 256 | lm loss: 1.890662E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 178.015 | TFLOPs: 29.42 | 15: iteration 114020/ 125429 | consumed samples: 29189120 | consumed tokens: 59779317760 | elapsed time per iteration (s): 1.03 | learning rate: 2.372E-05 | global batch size: 256 | lm loss: 1.884752E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.124 | TFLOPs: 41.00 | 15: iteration 114030/ 125429 | consumed samples: 29191680 | consumed tokens: 59784560640 | elapsed time per iteration (s): 1.03 | learning rate: 2.372E-05 | global batch size: 256 | lm loss: 1.906417E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.446 | TFLOPs: 41.22 | 15: iteration 114040/ 125429 | consumed samples: 29194240 | consumed tokens: 59789803520 | elapsed time per iteration (s): 1.02 | learning rate: 2.371E-05 | global batch size: 256 | lm loss: 1.889414E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.178 | TFLOPs: 41.51 | 15: iteration 114050/ 125429 | consumed samples: 29196800 | consumed tokens: 59795046400 | elapsed time per iteration (s): 1.19 | learning rate: 2.370E-05 | global batch size: 256 | lm loss: 1.904996E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.874 | TFLOPs: 35.67 | 15: iteration 114060/ 125429 | consumed samples: 29199360 | consumed tokens: 59800289280 | elapsed time per iteration (s): 
1.05 | learning rate: 2.370E-05 | global batch size: 256 | lm loss: 1.881615E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.852 | TFLOPs: 40.30 | 15: iteration 114070/ 125429 | consumed samples: 29201920 | consumed tokens: 59805532160 | elapsed time per iteration (s): 1.04 | learning rate: 2.369E-05 | global batch size: 256 | lm loss: 1.891177E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.705 | TFLOPs: 40.77 | 15: iteration 114080/ 125429 | consumed samples: 29204480 | consumed tokens: 59810775040 | elapsed time per iteration (s): 1.03 | learning rate: 2.368E-05 | global batch size: 256 | lm loss: 1.879999E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.494 | TFLOPs: 41.07 | 15: iteration 114090/ 125429 | consumed samples: 29207040 | consumed tokens: 59816017920 | elapsed time per iteration (s): 1.05 | learning rate: 2.368E-05 | global batch size: 256 | lm loss: 1.855616E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.837 | TFLOPs: 40.30 | 15: iteration 114100/ 125429 | consumed samples: 29209600 | consumed tokens: 59821260800 | elapsed time per iteration (s): 1.06 | learning rate: 2.367E-05 | global batch size: 256 | lm loss: 1.913644E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.165 | TFLOPs: 39.85 | 15: iteration 114110/ 125429 | consumed samples: 29212160 | consumed tokens: 59826503680 | elapsed time per iteration (s): 1.04 | learning rate: 2.367E-05 | global batch size: 256 | lm loss: 1.904034E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.892 | TFLOPs: 40.64 | 15: 
iteration 114120/ 125429 | consumed samples: 29214720 | consumed tokens: 59831746560 | elapsed time per iteration (s): 1.03 | learning rate: 2.366E-05 | global batch size: 256 | lm loss: 1.912327E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.156 | TFLOPs: 41.17 | 15: iteration 114130/ 125429 | consumed samples: 29217280 | consumed tokens: 59836989440 | elapsed time per iteration (s): 1.05 | learning rate: 2.365E-05 | global batch size: 256 | lm loss: 1.888460E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.790 | TFLOPs: 40.29 | 15: iteration 114140/ 125429 | consumed samples: 29219840 | consumed tokens: 59842232320 | elapsed time per iteration (s): 1.06 | learning rate: 2.365E-05 | global batch size: 256 | lm loss: 1.902588E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.913 | TFLOPs: 39.98 | 15: iteration 114150/ 125429 | consumed samples: 29222400 | consumed tokens: 59847475200 | elapsed time per iteration (s): 1.05 | learning rate: 2.364E-05 | global batch size: 256 | lm loss: 1.893568E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.828 | TFLOPs: 40.46 | 15: iteration 114160/ 125429 | consumed samples: 29224960 | consumed tokens: 59852718080 | elapsed time per iteration (s): 1.19 | learning rate: 2.363E-05 | global batch size: 256 | lm loss: 1.905440E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.309 | TFLOPs: 35.58 | 15: iteration 114170/ 125429 | consumed samples: 29227520 | consumed tokens: 59857960960 | elapsed time per iteration (s): 1.19 | learning rate: 2.363E-05 | global batch size: 256 | lm loss: 1.880923E+00 | grad norm: 0.161 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.630 | TFLOPs: 35.63 | 15: iteration 114180/ 125429 | consumed samples: 29230080 | consumed tokens: 59863203840 | elapsed time per iteration (s): 1.03 | learning rate: 2.362E-05 | global batch size: 256 | lm loss: 1.905778E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.539 | TFLOPs: 40.91 | 15: iteration 114190/ 125429 | consumed samples: 29232640 | consumed tokens: 59868446720 | elapsed time per iteration (s): 1.04 | learning rate: 2.361E-05 | global batch size: 256 | lm loss: 1.915307E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.232 | TFLOPs: 40.53 | 15: iteration 114200/ 125429 | consumed samples: 29235200 | consumed tokens: 59873689600 | elapsed time per iteration (s): 1.03 | learning rate: 2.361E-05 | global batch size: 256 | lm loss: 1.888248E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.660 | TFLOPs: 40.93 | 15: iteration 114210/ 125429 | consumed samples: 29237760 | consumed tokens: 59878932480 | elapsed time per iteration (s): 1.02 | learning rate: 2.360E-05 | global batch size: 256 | lm loss: 1.879179E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.946 | TFLOPs: 41.47 | 15: iteration 114220/ 125429 | consumed samples: 29240320 | consumed tokens: 59884175360 | elapsed time per iteration (s): 1.06 | learning rate: 2.359E-05 | global batch size: 256 | lm loss: 1.888980E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.752 | TFLOPs: 39.79 | 15: iteration 114230/ 125429 | consumed samples: 29242880 | consumed tokens: 59889418240 | elapsed time per iteration (s): 1.05 | 
learning rate: 2.359E-05 | global batch size: 256 | lm loss: 1.914083E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.748 | TFLOPs: 40.12 | 15: iteration 114240/ 125429 | consumed samples: 29245440 | consumed tokens: 59894661120 | elapsed time per iteration (s): 1.03 | learning rate: 2.358E-05 | global batch size: 256 | lm loss: 1.906408E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.424 | TFLOPs: 41.05 | 15: iteration 114250/ 125429 | consumed samples: 29248000 | consumed tokens: 59899904000 | elapsed time per iteration (s): 1.03 | learning rate: 2.358E-05 | global batch size: 256 | lm loss: 1.877203E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.702 | TFLOPs: 41.10 | 15: iteration 114260/ 125429 | consumed samples: 29250560 | consumed tokens: 59905146880 | elapsed time per iteration (s): 1.05 | learning rate: 2.357E-05 | global batch size: 256 | lm loss: 1.867047E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.947 | TFLOPs: 40.31 | 15: iteration 114270/ 125429 | consumed samples: 29253120 | consumed tokens: 59910389760 | elapsed time per iteration (s): 1.08 | learning rate: 2.356E-05 | global batch size: 256 | lm loss: 1.916869E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.995 | TFLOPs: 39.17 | 15: iteration 114280/ 125429 | consumed samples: 29255680 | consumed tokens: 59915632640 | elapsed time per iteration (s): 1.08 | learning rate: 2.356E-05 | global batch size: 256 | lm loss: 1.888397E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.416 | TFLOPs: 39.23 | 15: iteration 
114290/ 125429 | consumed samples: 29258240 | consumed tokens: 59920875520 | elapsed time per iteration (s): 1.02 | learning rate: 2.355E-05 | global batch size: 256 | lm loss: 1.892789E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.500 | TFLOPs: 41.56 | 15: iteration 114300/ 125429 | consumed samples: 29260800 | consumed tokens: 59926118400 | elapsed time per iteration (s): 1.03 | learning rate: 2.354E-05 | global batch size: 256 | lm loss: 1.904687E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.568 | TFLOPs: 41.08 | 15: iteration 114310/ 125429 | consumed samples: 29263360 | consumed tokens: 59931361280 | elapsed time per iteration (s): 1.05 | learning rate: 2.354E-05 | global batch size: 256 | lm loss: 1.905670E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.560 | TFLOPs: 40.25 | 15: iteration 114320/ 125429 | consumed samples: 29265920 | consumed tokens: 59936604160 | elapsed time per iteration (s): 1.03 | learning rate: 2.353E-05 | global batch size: 256 | lm loss: 1.871704E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.692 | TFLOPs: 41.26 | 15: iteration 114330/ 125429 | consumed samples: 29268480 | consumed tokens: 59941847040 | elapsed time per iteration (s): 1.04 | learning rate: 2.353E-05 | global batch size: 256 | lm loss: 1.889571E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.135 | TFLOPs: 40.51 | 15: iteration 114340/ 125429 | consumed samples: 29271040 | consumed tokens: 59947089920 | elapsed time per iteration (s): 1.09 | learning rate: 2.352E-05 | global batch size: 256 | lm loss: 1.936996E+00 | grad norm: 0.164 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.656 | TFLOPs: 38.94 | 15: iteration 114350/ 125429 | consumed samples: 29273600 | consumed tokens: 59952332800 | elapsed time per iteration (s): 1.03 | learning rate: 2.351E-05 | global batch size: 256 | lm loss: 1.877966E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.380 | TFLOPs: 41.05 | 15: iteration 114360/ 125429 | consumed samples: 29276160 | consumed tokens: 59957575680 | elapsed time per iteration (s): 1.07 | learning rate: 2.351E-05 | global batch size: 256 | lm loss: 1.923674E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.930 | TFLOPs: 39.65 | 15: iteration 114370/ 125429 | consumed samples: 29278720 | consumed tokens: 59962818560 | elapsed time per iteration (s): 1.05 | learning rate: 2.350E-05 | global batch size: 256 | lm loss: 1.887092E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.765 | TFLOPs: 40.12 | 15: iteration 114380/ 125429 | consumed samples: 29281280 | consumed tokens: 59968061440 | elapsed time per iteration (s): 1.12 | learning rate: 2.349E-05 | global batch size: 256 | lm loss: 1.874713E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.990 | TFLOPs: 37.68 | 15: iteration 114390/ 125429 | consumed samples: 29283840 | consumed tokens: 59973304320 | elapsed time per iteration (s): 1.07 | learning rate: 2.349E-05 | global batch size: 256 | lm loss: 1.888329E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.833 | TFLOPs: 39.47 | 15: iteration 114400/ 125429 | consumed samples: 29286400 | consumed tokens: 59978547200 | elapsed time per iteration (s): 1.03 | learning 
rate: 2.348E-05 | global batch size: 256 | lm loss: 1.878313E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.097 | TFLOPs: 41.17 | 15: iteration 114410/ 125429 | consumed samples: 29288960 | consumed tokens: 59983790080 | elapsed time per iteration (s): 1.08 | learning rate: 2.347E-05 | global batch size: 256 | lm loss: 1.908295E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.962 | TFLOPs: 39.16 | 15: iteration 114420/ 125429 | consumed samples: 29291520 | consumed tokens: 59989032960 | elapsed time per iteration (s): 1.06 | learning rate: 2.347E-05 | global batch size: 256 | lm loss: 1.893347E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.410 | TFLOPs: 40.06 | 15: iteration 114430/ 125429 | consumed samples: 29294080 | consumed tokens: 59994275840 | elapsed time per iteration (s): 1.02 | learning rate: 2.346E-05 | global batch size: 256 | lm loss: 1.872593E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.520 | TFLOPs: 41.40 | 15: iteration 114440/ 125429 | consumed samples: 29296640 | consumed tokens: 59999518720 | elapsed time per iteration (s): 1.06 | learning rate: 2.346E-05 | global batch size: 256 | lm loss: 1.880522E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.443 | TFLOPs: 39.74 | 15: iteration 114450/ 125429 | consumed samples: 29299200 | consumed tokens: 60004761600 | elapsed time per iteration (s): 1.04 | learning rate: 2.345E-05 | global batch size: 256 | lm loss: 1.898078E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.267 | TFLOPs: 40.53 | 15: iteration 114460/ 
125429 | consumed samples: 29301760 | consumed tokens: 60010004480 | elapsed time per iteration (s): 1.04 | learning rate: 2.344E-05 | global batch size: 256 | lm loss: 1.879941E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.133 | TFLOPs: 40.84 | 15: iteration 114470/ 125429 | consumed samples: 29304320 | consumed tokens: 60015247360 | elapsed time per iteration (s): 1.07 | learning rate: 2.344E-05 | global batch size: 256 | lm loss: 1.893882E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.282 | TFLOPs: 39.71 | 15: iteration 114480/ 125429 | consumed samples: 29306880 | consumed tokens: 60020490240 | elapsed time per iteration (s): 1.09 | learning rate: 2.343E-05 | global batch size: 256 | lm loss: 1.888303E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.206 | TFLOPs: 38.87 | 15: iteration 114490/ 125429 | consumed samples: 29309440 | consumed tokens: 60025733120 | elapsed time per iteration (s): 1.06 | learning rate: 2.342E-05 | global batch size: 256 | lm loss: 1.913814E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.613 | TFLOPs: 39.93 | 15: iteration 114500/ 125429 | consumed samples: 29312000 | consumed tokens: 60030976000 | elapsed time per iteration (s): 1.03 | learning rate: 2.342E-05 | global batch size: 256 | lm loss: 1.882417E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.740 | TFLOPs: 41.11 | 15: iteration 114510/ 125429 | consumed samples: 29314560 | consumed tokens: 60036218880 | elapsed time per iteration (s): 1.05 | learning rate: 2.341E-05 | global batch size: 256 | lm loss: 1.866475E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 244.823 | TFLOPs: 40.46 | 15: iteration 114520/ 125429 | consumed samples: 29317120 | consumed tokens: 60041461760 | elapsed time per iteration (s): 1.04 | learning rate: 2.341E-05 | global batch size: 256 | lm loss: 1.915770E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.196 | TFLOPs: 40.69 | 15: iteration 114530/ 125429 | consumed samples: 29319680 | consumed tokens: 60046704640 | elapsed time per iteration (s): 1.03 | learning rate: 2.340E-05 | global batch size: 256 | lm loss: 1.903184E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.802 | TFLOPs: 40.95 | 15: iteration 114540/ 125429 | consumed samples: 29322240 | consumed tokens: 60051947520 | elapsed time per iteration (s): 1.06 | learning rate: 2.339E-05 | global batch size: 256 | lm loss: 1.885201E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.485 | TFLOPs: 39.74 | 15: iteration 114550/ 125429 | consumed samples: 29324800 | consumed tokens: 60057190400 | elapsed time per iteration (s): 1.05 | learning rate: 2.339E-05 | global batch size: 256 | lm loss: 1.933978E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.913 | TFLOPs: 40.47 | 15: iteration 114560/ 125429 | consumed samples: 29327360 | consumed tokens: 60062433280 | elapsed time per iteration (s): 1.06 | learning rate: 2.338E-05 | global batch size: 256 | lm loss: 1.867510E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.236 | TFLOPs: 39.87 | 15: iteration 114570/ 125429 | consumed samples: 29329920 | consumed tokens: 60067676160 | elapsed time per iteration (s): 1.03 | learning rate: 
2.338E-05 | global batch size: 256 | lm loss: 1.917708E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.873 | TFLOPs: 41.13 | 15: iteration 114580/ 125429 | consumed samples: 29332480 | consumed tokens: 60072919040 | elapsed time per iteration (s): 1.03 | learning rate: 2.337E-05 | global batch size: 256 | lm loss: 1.915606E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.306 | TFLOPs: 41.03 | 15: iteration 114590/ 125429 | consumed samples: 29335040 | consumed tokens: 60078161920 | elapsed time per iteration (s): 1.04 | learning rate: 2.336E-05 | global batch size: 256 | lm loss: 1.886585E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.410 | TFLOPs: 40.56 | 15: iteration 114600/ 125429 | consumed samples: 29337600 | consumed tokens: 60083404800 | elapsed time per iteration (s): 1.05 | learning rate: 2.336E-05 | global batch size: 256 | lm loss: 1.921119E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.329 | TFLOPs: 40.21 | 15: iteration 114610/ 125429 | consumed samples: 29340160 | consumed tokens: 60088647680 | elapsed time per iteration (s): 1.03 | learning rate: 2.335E-05 | global batch size: 256 | lm loss: 1.897107E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.005 | TFLOPs: 41.15 | 15: iteration 114620/ 125429 | consumed samples: 29342720 | consumed tokens: 60093890560 | elapsed time per iteration (s): 1.03 | learning rate: 2.334E-05 | global batch size: 256 | lm loss: 1.903257E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.408 | TFLOPs: 40.89 | 15: iteration 114630/ 125429 | 
consumed samples: 29345280 | consumed tokens: 60099133440 | elapsed time per iteration (s): 1.04 | learning rate: 2.334E-05 | global batch size: 256 | lm loss: 1.900934E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.449 | TFLOPs: 40.56 | 15: iteration 114640/ 125429 | consumed samples: 29347840 | consumed tokens: 60104376320 | elapsed time per iteration (s): 1.05 | learning rate: 2.333E-05 | global batch size: 256 | lm loss: 1.868691E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.020 | TFLOPs: 40.33 | 15: iteration 114650/ 125429 | consumed samples: 29350400 | consumed tokens: 60109619200 | elapsed time per iteration (s): 1.05 | learning rate: 2.333E-05 | global batch size: 256 | lm loss: 1.908461E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.014 | TFLOPs: 40.16 | 15: iteration 114660/ 125429 | consumed samples: 29352960 | consumed tokens: 60114862080 | elapsed time per iteration (s): 1.04 | learning rate: 2.332E-05 | global batch size: 256 | lm loss: 1.878350E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.828 | TFLOPs: 40.62 | 15: iteration 114670/ 125429 | consumed samples: 29355520 | consumed tokens: 60120104960 | elapsed time per iteration (s): 1.07 | learning rate: 2.331E-05 | global batch size: 256 | lm loss: 1.872273E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.515 | TFLOPs: 39.58 | 15: iteration 114680/ 125429 | consumed samples: 29358080 | consumed tokens: 60125347840 | elapsed time per iteration (s): 1.09 | learning rate: 2.331E-05 | global batch size: 256 | lm loss: 1.941628E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 235.746 | TFLOPs: 38.96 | 15: iteration 114690/ 125429 | consumed samples: 29360640 | consumed tokens: 60130590720 | elapsed time per iteration (s): 1.09 | learning rate: 2.330E-05 | global batch size: 256 | lm loss: 1.892279E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.658 | TFLOPs: 38.78 | 15: iteration 114700/ 125429 | consumed samples: 29363200 | consumed tokens: 60135833600 | elapsed time per iteration (s): 1.07 | learning rate: 2.330E-05 | global batch size: 256 | lm loss: 1.912557E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.632 | TFLOPs: 39.44 | 15: iteration 114710/ 125429 | consumed samples: 29365760 | consumed tokens: 60141076480 | elapsed time per iteration (s): 1.04 | learning rate: 2.329E-05 | global batch size: 256 | lm loss: 1.908503E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.304 | TFLOPs: 40.87 | 15: iteration 114720/ 125429 | consumed samples: 29368320 | consumed tokens: 60146319360 | elapsed time per iteration (s): 1.07 | learning rate: 2.328E-05 | global batch size: 256 | lm loss: 1.897303E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.036 | TFLOPs: 39.67 | 15: iteration 114730/ 125429 | consumed samples: 29370880 | consumed tokens: 60151562240 | elapsed time per iteration (s): 1.05 | learning rate: 2.328E-05 | global batch size: 256 | lm loss: 1.914314E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.307 | TFLOPs: 40.21 | 15: iteration 114740/ 125429 | consumed samples: 29373440 | consumed tokens: 60156805120 | elapsed time per iteration (s): 1.04 | learning rate: 
2.327E-05 | global batch size: 256 | lm loss: 1.890096E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.134 | TFLOPs: 40.84 | 15: iteration 114750/ 125429 | consumed samples: 29376000 | consumed tokens: 60162048000 | elapsed time per iteration (s): 1.05 | learning rate: 2.326E-05 | global batch size: 256 | lm loss: 1.900555E+00 | grad norm: 0.150 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.213 | TFLOPs: 40.36 | 15: iteration 114760/ 125429 | consumed samples: 29378560 | consumed tokens: 60167290880 | elapsed time per iteration (s): 1.10 | learning rate: 2.326E-05 | global batch size: 256 | lm loss: 1.871054E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.243 | TFLOPs: 38.38 | 15: iteration 114770/ 125429 | consumed samples: 29381120 | consumed tokens: 60172533760 | elapsed time per iteration (s): 1.08 | learning rate: 2.325E-05 | global batch size: 256 | lm loss: 1.880090E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.246 | TFLOPs: 39.21 | 15: iteration 114780/ 125429 | consumed samples: 29383680 | consumed tokens: 60177776640 | elapsed time per iteration (s): 1.07 | learning rate: 2.325E-05 | global batch size: 256 | lm loss: 1.892025E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.324 | TFLOPs: 39.72 | 15: iteration 114790/ 125429 | consumed samples: 29386240 | consumed tokens: 60183019520 | elapsed time per iteration (s): 1.09 | learning rate: 2.324E-05 | global batch size: 256 | lm loss: 1.892580E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.402 | TFLOPs: 38.74 | 15: iteration 114800/ 125429 | 
consumed samples: 29388800 | consumed tokens: 60188262400 | elapsed time per iteration (s): 1.06 | learning rate: 2.323E-05 | global batch size: 256 | lm loss: 1.917103E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.840 | TFLOPs: 39.80 | 15: iteration 114810/ 125429 | consumed samples: 29391360 | consumed tokens: 60193505280 | elapsed time per iteration (s): 1.03 | learning rate: 2.323E-05 | global batch size: 256 | lm loss: 1.912861E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.819 | TFLOPs: 41.12 | 15: iteration 114820/ 125429 | consumed samples: 29393920 | consumed tokens: 60198748160 | elapsed time per iteration (s): 1.09 | learning rate: 2.322E-05 | global batch size: 256 | lm loss: 1.886220E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.803 | TFLOPs: 38.97 | 15: iteration 114830/ 125429 | consumed samples: 29396480 | consumed tokens: 60203991040 | elapsed time per iteration (s): 1.04 | learning rate: 2.322E-05 | global batch size: 256 | lm loss: 1.899284E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.128 | TFLOPs: 40.51 | 15: iteration 114840/ 125429 | consumed samples: 29399040 | consumed tokens: 60209233920 | elapsed time per iteration (s): 1.04 | learning rate: 2.321E-05 | global batch size: 256 | lm loss: 1.893160E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.659 | TFLOPs: 40.60 | 15: iteration 114850/ 125429 | consumed samples: 29401600 | consumed tokens: 60214476800 | elapsed time per iteration (s): 1.03 | learning rate: 2.320E-05 | global batch size: 256 | lm loss: 1.902168E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 248.574 | TFLOPs: 41.08 | 15: iteration 114860/ 125429 | consumed samples: 29404160 | consumed tokens: 60219719680 | elapsed time per iteration (s): 1.05 | learning rate: 2.320E-05 | global batch size: 256 | lm loss: 1.881125E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.443 | TFLOPs: 40.40 | 15: iteration 114870/ 125429 | consumed samples: 29406720 | consumed tokens: 60224962560 | elapsed time per iteration (s): 1.03 | learning rate: 2.319E-05 | global batch size: 256 | lm loss: 1.892513E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.814 | TFLOPs: 40.95 | 15: iteration 114880/ 125429 | consumed samples: 29409280 | consumed tokens: 60230205440 | elapsed time per iteration (s): 1.05 | learning rate: 2.319E-05 | global batch size: 256 | lm loss: 1.900738E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.524 | TFLOPs: 40.41 | 15: iteration 114890/ 125429 | consumed samples: 29411840 | consumed tokens: 60235448320 | elapsed time per iteration (s): 1.16 | learning rate: 2.318E-05 | global batch size: 256 | lm loss: 1.895583E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 221.310 | TFLOPs: 36.57 | 15: iteration 114900/ 125429 | consumed samples: 29414400 | consumed tokens: 60240691200 | elapsed time per iteration (s): 1.04 | learning rate: 2.317E-05 | global batch size: 256 | lm loss: 1.890101E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.542 | TFLOPs: 40.58 | 15: iteration 114910/ 125429 | consumed samples: 29416960 | consumed tokens: 60245934080 | elapsed time per iteration (s): 1.06 | learning rate: 
2.317E-05 | global batch size: 256 | lm loss: 1.884898E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.288 | TFLOPs: 40.04 | 15: iteration 114920/ 125429 | consumed samples: 29419520 | consumed tokens: 60251176960 | elapsed time per iteration (s): 1.05 | learning rate: 2.316E-05 | global batch size: 256 | lm loss: 1.901348E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.937 | TFLOPs: 40.48 | 15: iteration 114930/ 125429 | consumed samples: 29422080 | consumed tokens: 60256419840 | elapsed time per iteration (s): 1.10 | learning rate: 2.316E-05 | global batch size: 256 | lm loss: 1.905861E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.494 | TFLOPs: 38.42 | 15: iteration 114940/ 125429 | consumed samples: 29424640 | consumed tokens: 60261662720 | elapsed time per iteration (s): 1.09 | learning rate: 2.315E-05 | global batch size: 256 | lm loss: 1.890713E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.254 | TFLOPs: 38.71 | 15: iteration 114950/ 125429 | consumed samples: 29427200 | consumed tokens: 60266905600 | elapsed time per iteration (s): 1.02 | learning rate: 2.314E-05 | global batch size: 256 | lm loss: 1.880472E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.968 | TFLOPs: 41.31 | 15: iteration 114960/ 125429 | consumed samples: 29429760 | consumed tokens: 60272148480 | elapsed time per iteration (s): 1.05 | learning rate: 2.314E-05 | global batch size: 256 | lm loss: 1.902592E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.553 | TFLOPs: 40.41 | 15: iteration 114970/ 125429 | 
consumed samples: 29432320 | consumed tokens: 60277391360 | elapsed time per iteration (s): 1.03 | learning rate: 2.313E-05 | global batch size: 256 | lm loss: 1.906428E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.081 | TFLOPs: 41.00 | 15: iteration 114980/ 125429 | consumed samples: 29434880 | consumed tokens: 60282634240 | elapsed time per iteration (s): 1.05 | learning rate: 2.313E-05 | global batch size: 256 | lm loss: 1.877322E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.296 | TFLOPs: 40.37 | 15: iteration 114990/ 125429 | consumed samples: 29437440 | consumed tokens: 60287877120 | elapsed time per iteration (s): 1.10 | learning rate: 2.312E-05 | global batch size: 256 | lm loss: 1.863110E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.207 | TFLOPs: 38.54 | 15: iteration 115000/ 125429 | consumed samples: 29440000 | consumed tokens: 60293120000 | elapsed time per iteration (s): 1.07 | learning rate: 2.311E-05 | global batch size: 256 | lm loss: 1.882727E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.357 | TFLOPs: 39.56 | 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 115000 | lm loss value: 1.981556E+00 | lm loss PPL: 7.254024E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 115000 to checkpoints_1b5 0: [2022-11-27 06:03:56,872] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step115000 is begin to save! 
0: [2022-11-27 06:03:56,882] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_01-model_00-model_states.pt... 0: [2022-11-27 06:03:57,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_01-model_00-model_states.pt. 0: [2022-11-27 06:03:57,141] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_03-model_00-model_states.pt... 0: [2022-11-27 06:03:57,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_03-model_00-model_states.pt. 0: [2022-11-27 06:03:57,249] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_04-model_00-model_states.pt... 0: [2022-11-27 06:03:57,359] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_04-model_00-model_states.pt. 0: [2022-11-27 06:03:57,360] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_05-model_00-model_states.pt... 0: [2022-11-27 06:03:57,472] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_05-model_00-model_states.pt. 0: [2022-11-27 06:03:57,472] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_06-model_00-model_states.pt... 0: [2022-11-27 06:03:57,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_06-model_00-model_states.pt. 0: [2022-11-27 06:03:57,586] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_07-model_00-model_states.pt... 0: [2022-11-27 06:03:57,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_07-model_00-model_states.pt. 
0: [2022-11-27 06:03:57,697] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_08-model_00-model_states.pt... 0: [2022-11-27 06:03:57,808] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_08-model_00-model_states.pt. 0: [2022-11-27 06:03:57,808] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_09-model_00-model_states.pt... 0: [2022-11-27 06:03:57,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_09-model_00-model_states.pt. 0: [2022-11-27 06:03:57,918] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_10-model_00-model_states.pt... 0: [2022-11-27 06:03:58,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_10-model_00-model_states.pt. 0: [2022-11-27 06:03:58,024] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_11-model_00-model_states.pt... 0: [2022-11-27 06:03:58,133] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_11-model_00-model_states.pt. 0: [2022-11-27 06:03:58,133] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_12-model_00-model_states.pt... 0: [2022-11-27 06:03:58,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_12-model_00-model_states.pt. 0: [2022-11-27 06:03:58,242] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_13-model_00-model_states.pt... 0: [2022-11-27 06:03:58,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_13-model_00-model_states.pt. 
0: [2022-11-27 06:03:58,355] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_14-model_00-model_states.pt... 0: [2022-11-27 06:03:58,469] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_14-model_00-model_states.pt. 0: [2022-11-27 06:03:58,470] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_15-model_00-model_states.pt... 0: [2022-11-27 06:03:58,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_15-model_00-model_states.pt. 0: [2022-11-27 06:03:58,586] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_16-model_00-model_states.pt... 0: [2022-11-27 06:03:58,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_16-model_00-model_states.pt. 0: [2022-11-27 06:03:58,699] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_17-model_00-model_states.pt... 0: [2022-11-27 06:03:58,815] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_17-model_00-model_states.pt. 0: [2022-11-27 06:03:58,816] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_18-model_00-model_states.pt... 0: [2022-11-27 06:03:58,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_18-model_00-model_states.pt. 0: [2022-11-27 06:03:58,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_19-model_00-model_states.pt... 0: [2022-11-27 06:03:59,049] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_19-model_00-model_states.pt. 
0: [2022-11-27 06:03:59,049] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_20-model_00-model_states.pt... 0: [2022-11-27 06:03:59,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_20-model_00-model_states.pt. 0: [2022-11-27 06:03:59,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_21-model_00-model_states.pt... 0: [2022-11-27 06:03:59,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_21-model_00-model_states.pt. 0: [2022-11-27 06:03:59,275] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_22-model_00-model_states.pt... 0: [2022-11-27 06:03:59,386] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_22-model_00-model_states.pt. 0: [2022-11-27 06:03:59,386] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_23-model_00-model_states.pt... 0: [2022-11-27 06:03:59,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_23-model_00-model_states.pt. 0: [2022-11-27 06:03:59,496] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_24-model_00-model_states.pt... 0: [2022-11-27 06:03:59,607] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_24-model_00-model_states.pt. 0: [2022-11-27 06:03:59,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_25-model_00-model_states.pt... 0: [2022-11-27 06:03:59,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_25-model_00-model_states.pt. 
0: [2022-11-27 06:03:59,717] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_26-model_00-model_states.pt... 0: [2022-11-27 06:03:59,829] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_26-model_00-model_states.pt. 0: [2022-11-27 06:03:59,829] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_27-model_00-model_states.pt... 0: [2022-11-27 06:03:59,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_27-model_00-model_states.pt. 0: [2022-11-27 06:03:59,935] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_28-model_00-model_states.pt... 0: [2022-11-27 06:04:00,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_28-model_00-model_states.pt. 0: [2022-11-27 06:04:00,040] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_29-model_00-model_states.pt... 0: [2022-11-27 06:04:00,151] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_29-model_00-model_states.pt. 0: [2022-11-27 06:04:00,152] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_30-model_00-model_states.pt... 0: [2022-11-27 06:04:00,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_30-model_00-model_states.pt. 0: [2022-11-27 06:04:00,263] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/layer_32-model_00-model_states.pt... 0: [2022-11-27 06:04:00,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/layer_32-model_00-model_states.pt. 
0: [2022-11-27 06:04:00,268] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step115000/mp_rank_00_model_states.pt 0: [2022-11-27 06:04:00,268] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/mp_rank_00_model_states.pt... 0: [2022-11-27 06:04:00,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/mp_rank_00_model_states.pt. 0: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
10: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
15: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 
4: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 
12: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 
10: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
14: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
13: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 
11: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 
9: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 
7: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:04:00,316] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step115000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:04:00,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
0: [2022-11-27 06:04:00,477] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:04:00,477] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 06:04:00,477] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 8: [2022-11-27 06:04:00,483] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:04:00,483] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 06:04:00,484] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 0: [2022-11-27 06:04:00,484] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:04:00,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 8: [2022-11-27 06:04:00,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:04:00,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 8: [2022-11-27 06:04:00,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 06:04:00,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
8: [2022-11-27 06:04:00,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:04:00,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 06:04:00,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 11: [2022-11-27 06:04:00,486] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:04:00,486] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 06:04:00,486] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 9: [2022-11-27 06:04:00,487] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:04:00,487] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 06:04:00,487] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 0: [2022-11-27 06:04:00,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:04:00,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
0: [2022-11-27 06:04:00,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 5: [2022-11-27 06:04:00,488] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 0: [2022-11-27 06:04:00,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 5: [2022-11-27 06:04:00,488] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 13: [2022-11-27 06:04:00,488] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:04:00,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 06:04:00,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 0: [2022-11-27 06:04:00,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:04:00,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 06:04:00,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 5: [2022-11-27 06:04:00,491] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-27 06:04:00,491] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 06:04:00,491] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 3: [2022-11-27 06:04:00,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:04:00,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 06:04:00,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 1: [2022-11-27 06:04:00,492] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:04:00,492] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 06:04:00,492] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 9: [2022-11-27 06:04:00,493] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:04:00,493] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 06:04:00,493] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 10: [2022-11-27 06:04:00,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-27 06:04:00,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 06:04:00,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 11: [2022-11-27 06:04:00,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:04:00,495] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:04:00,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 5: [2022-11-27 06:04:00,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:04:00,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 5: [2022-11-27 06:04:00,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 06:04:00,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 9: [2022-11-27 06:04:00,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:04:00,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 06:04:00,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
13: [2022-11-27 06:04:00,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:04:00,497] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 06:04:00,497] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 2: [2022-11-27 06:04:00,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:04:00,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:04:00,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 06:04:00,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 06:04:00,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 2: [2022-11-27 06:04:00,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 3: [2022-11-27 06:04:00,498] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:04:00,498] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 06:04:00,498] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
5: [2022-11-27 06:04:00,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:04:00,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 06:04:00,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 3: [2022-11-27 06:04:00,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:04:00,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 06:04:00,499] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 7: [2022-11-27 06:04:00,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:04:00,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:04:00,499] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
7: [2022-11-27 06:04:00,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 8: [2022-11-27 06:04:00,499] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 13: [2022-11-27 06:04:00,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 8: [2022-11-27 06:04:00,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 7: [2022-11-27 06:04:00,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 13: [2022-11-27 06:04:00,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 7: [2022-11-27 06:04:00,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:04:00,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 06:04:00,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 3: [2022-11-27 06:04:00,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:04:00,500] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 06:04:00,500] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
10: [2022-11-27 06:04:00,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:04:00,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 06:04:00,501] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 7: [2022-11-27 06:04:00,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:04:00,501] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:04:00,501] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 06:04:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 9: [2022-11-27 06:04:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 06:04:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 0: [2022-11-27 06:04:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:04:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-27 06:04:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 06:04:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 06:04:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 0: [2022-11-27 06:04:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 2: [2022-11-27 06:04:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:04:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 06:04:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 5: [2022-11-27 06:04:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:04:00,502] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:04:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 06:04:00,502] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 06:04:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
5: [2022-11-27 06:04:00,502] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 2: [2022-11-27 06:04:00,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:04:00,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 10: [2022-11-27 06:04:00,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:04:00,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 10: [2022-11-27 06:04:00,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 06:04:00,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 10: [2022-11-27 06:04:00,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:04:00,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 06:04:00,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 8: [2022-11-27 06:04:00,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:04:00,503] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
8: [2022-11-27 06:04:00,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 06:04:00,503] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 06:04:00,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 8: [2022-11-27 06:04:00,503] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 1: [2022-11-27 06:04:00,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:04:00,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 8: [2022-11-27 06:04:00,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:04:00,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 8: [2022-11-27 06:04:00,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 9: [2022-11-27 06:04:00,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
9: [2022-11-27 06:04:00,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 11: [2022-11-27 06:04:00,495] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 06:04:00,495] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 9: [2022-11-27 06:04:00,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 11: [2022-11-27 06:04:00,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:04:00,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 8: [2022-11-27 06:04:00,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 11: [2022-11-27 06:04:00,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 11: [2022-11-27 06:04:00,496] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:04:00,496] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 14: [2022-11-27 06:04:00,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:04:00,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
11: [2022-11-27 06:04:00,496] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 11: [2022-11-27 06:04:00,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:04:00,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:04:00,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 11: [2022-11-27 06:04:00,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 14: [2022-11-27 06:04:00,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 11: [2022-11-27 06:04:00,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 14: [2022-11-27 06:04:00,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 06:04:00,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 14: [2022-11-27 06:04:00,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 14: [2022-11-27 06:04:00,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 1: [2022-11-27 06:04:00,505] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-27 06:04:00,505] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 06:04:00,505] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 1: [2022-11-27 06:04:00,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:04:00,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:04:00,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 13: [2022-11-27 06:04:00,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 1: [2022-11-27 06:04:00,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 13: [2022-11-27 06:04:00,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 10: [2022-11-27 06:04:00,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:04:00,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 06:04:00,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 14: [2022-11-27 06:04:00,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
14: [2022-11-27 06:04:00,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:04:00,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:04:00,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 06:04:00,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 06:04:00,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 06:04:00,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 14: [2022-11-27 06:04:00,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 14: [2022-11-27 06:04:00,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 1: [2022-11-27 06:04:00,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:04:00,508] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 06:04:00,508] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 2: [2022-11-27 06:04:00,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
2: [2022-11-27 06:04:00,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 06:04:00,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 1: [2022-11-27 06:04:00,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:04:00,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 06:04:00,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 5: [2022-11-27 06:04:00,509] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:04:00,509] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 06:04:00,509] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 9: [2022-11-27 06:04:00,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:04:00,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 06:04:00,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 14: [2022-11-27 06:04:00,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-27 06:04:00,511] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 06:04:00,511] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 6: [2022-11-27 06:04:00,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:04:00,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 06:04:00,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 6: [2022-11-27 06:04:00,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:04:00,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:04:00,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:04:00,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 06:04:00,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 06:04:00,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 06:04:00,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
6: [2022-11-27 06:04:00,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 6: [2022-11-27 06:04:00,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 6: [2022-11-27 06:04:00,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:04:00,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 06:04:00,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 2: [2022-11-27 06:04:00,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:04:00,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 06:04:00,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 6: [2022-11-27 06:04:00,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:04:00,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 06:04:00,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 6: [2022-11-27 06:04:00,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 
6: [2022-11-27 06:04:00,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 06:04:00,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 13: [2022-11-27 06:04:00,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:04:00,510] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 06:04:00,510] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 13: [2022-11-27 06:04:00,512] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:04:00,512] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 06:04:00,512] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 13: [2022-11-27 06:04:00,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:04:00,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 06:04:00,513] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 10: [2022-11-27 06:04:00,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 
10: [2022-11-27 06:04:00,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 06:04:00,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 7: [2022-11-27 06:04:00,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:04:00,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:04:00,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:04:00,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 06:04:00,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 06:04:00,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 06:04:00,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:04:00,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 7: [2022-11-27 06:04:00,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 7: [2022-11-27 06:04:00,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
7: [2022-11-27 06:04:00,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 06:04:00,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 11: [2022-11-27 06:04:00,515] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:04:00,515] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 06:04:00,515] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 11: [2022-11-27 06:04:00,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:04:00,516] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 06:04:00,516] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 3: [2022-11-27 06:04:00,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:04:00,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 06:04:00,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 1: [2022-11-27 06:04:00,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-27 06:04:00,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:04:00,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 06:04:00,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 06:04:00,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 1: [2022-11-27 06:04:00,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 10: [2022-11-27 06:04:00,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:04:00,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:04:00,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 06:04:00,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 06:04:00,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 10: [2022-11-27 06:04:00,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 7: [2022-11-27 06:04:00,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-27 06:04:00,521] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 06:04:00,521] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 8: [2022-11-27 06:04:00,519] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:04:00,519] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 06:04:00,519] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 0: [2022-11-27 06:04:00,523] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:04:00,523] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 06:04:00,523] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 6: [2022-11-27 06:04:00,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:04:00,527] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 06:04:00,527] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 3: [2022-11-27 06:04:00,527] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-27 06:04:00,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 06:04:00,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 3: [2022-11-27 06:04:00,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:04:00,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 06:04:00,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 4: [2022-11-27 06:04:00,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:04:00,489] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 06:04:00,489] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 4: [2022-11-27 06:04:00,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:04:00,490] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 06:04:00,490] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 4: [2022-11-27 06:04:00,494] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-27 06:04:00,494] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 06:04:00,494] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 4: [2022-11-27 06:04:00,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:04:00,506] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 06:04:00,506] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 4: [2022-11-27 06:04:00,507] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:04:00,507] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 06:04:00,507] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 4: [2022-11-27 06:04:00,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:04:00,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 06:04:00,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 13: [2022-11-27 06:04:00,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
13: [2022-11-27 06:04:00,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 06:04:00,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 4: [2022-11-27 06:04:00,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:04:00,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 06:04:00,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 4: [2022-11-27 06:04:00,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:04:00,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 06:04:00,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 11: [2022-11-27 06:04:00,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:04:00,534] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:04:00,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 06:04:00,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 
9: [2022-11-27 06:04:00,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:04:00,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 06:04:00,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 11: [2022-11-27 06:04:00,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 06:04:00,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 0: [2022-11-27 06:04:00,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 06:04:00,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 14: [2022-11-27 06:04:00,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:04:00,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 06:04:00,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 5: [2022-11-27 06:04:00,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-27 06:04:00,591] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 06:04:00,591] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 2: [2022-11-27 06:04:00,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:04:00,598] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 06:04:00,598] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 2: [2022-11-27 06:04:00,598] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:04:00,599] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 06:04:00,599] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now! 15: [2022-11-27 06:04:00,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:04:00,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:04:00,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:04:00,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-27 06:04:00,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:04:00,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:04:00,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:04:00,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:04:00,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 06:04:00,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 06:04:00,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 06:04:00,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 06:04:00,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 06:04:00,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 06:04:00,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: 
[2022-11-27 06:04:00,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt
15: [2022-11-27 06:04:00,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now!
15: [2022-11-27 06:04:00,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now!
15: [2022-11-27 06:04:00,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now!
15: [2022-11-27 06:04:00,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now!
15: [2022-11-27 06:04:00,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now!
15: [2022-11-27 06:04:00,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now!
15: [2022-11-27 06:04:00,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now!
15: [2022-11-27 06:04:00,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now!
12: [2022-11-27 06:04:00,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt.
12: [2022-11-27 06:04:00,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt.
12: [2022-11-27 06:04:00,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt.
12: [2022-11-27 06:04:00,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt.
12: [2022-11-27 06:04:00,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt
12: [2022-11-27 06:04:00,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt
12: [2022-11-27 06:04:00,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt
12: [2022-11-27 06:04:00,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt
12: [2022-11-27 06:04:00,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt.
12: [2022-11-27 06:04:00,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt.
12: [2022-11-27 06:04:00,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now!
12: [2022-11-27 06:04:00,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now!
12: [2022-11-27 06:04:00,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now!
12: [2022-11-27 06:04:00,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now!
12: [2022-11-27 06:04:00,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt.
12: [2022-11-27 06:04:00,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt.
12: [2022-11-27 06:04:00,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt
12: [2022-11-27 06:04:00,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt
12: [2022-11-27 06:04:00,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt
12: [2022-11-27 06:04:00,711] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step115000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt
12: [2022-11-27 06:04:00,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now!
12: [2022-11-27 06:04:00,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now!
12: [2022-11-27 06:04:00,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now!
12: [2022-11-27 06:04:00,711] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step115000 is ready now!
0: successfully saved checkpoint at iteration 115000 to checkpoints_1b5
15: time (ms) | save-checkpoint: 3901.00
15: iteration 115010/ 125429 | consumed samples: 29442560 | consumed tokens: 60298362880 | elapsed time per iteration (s): 1.47 | learning rate: 2.311E-05 | global batch size: 256 | lm loss: 1.889919E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 174.434 | TFLOPs: 28.83 |
15: iteration 115020/ 125429 | consumed samples: 29445120 | consumed tokens: 60303605760 | elapsed time per iteration (s): 1.04 | learning rate: 2.310E-05 | global batch size: 256 | lm loss: 1.862373E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.163 | TFLOPs: 40.52 |
15: iteration 115030/ 125429 | consumed samples: 29447680 | consumed tokens: 60308848640 | elapsed time per iteration (s): 1.04 | learning rate: 2.310E-05 | global batch size: 256 | lm loss: 1.870070E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.135 | TFLOPs: 40.68 |
15: iteration 115040/ 125429 | consumed samples: 29450240 | consumed tokens: 60314091520 | elapsed time per iteration (s): 1.03 | learning rate: 2.309E-05 | global batch size: 256 | lm loss: 1.905089E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.346 | TFLOPs: 41.21 |
15: iteration 115050/ 125429 | consumed samples: 29452800 | consumed tokens: 60319334400 | elapsed time per iteration (s): 1.05 | learning rate: 2.309E-05 | global batch size: 256 | lm loss: 1.912028E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.768 | TFLOPs: 40.45 |
15: iteration 115060/ 125429 | consumed samples: 29455360 | consumed tokens: 60324577280 | elapsed time per iteration (s): 1.05 | learning rate: 2.308E-05 | global batch size: 256 | lm loss: 1.901980E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.989 | TFLOPs: 40.16 |
15: iteration 115070/ 125429 | consumed samples: 29457920 | consumed tokens: 60329820160 | elapsed time per iteration (s): 1.05 | learning rate: 2.307E-05 | global batch size: 256 | lm loss: 1.879718E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.932 | TFLOPs: 40.15 |
15: iteration 115080/ 125429 | consumed samples: 29460480 | consumed tokens: 60335063040 | elapsed time per iteration (s): 1.03 | learning rate: 2.307E-05 | global batch size: 256 | lm loss: 1.877642E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.134 | TFLOPs: 41.01 |
15: iteration 115090/ 125429 | consumed samples: 29463040 | consumed tokens: 60340305920 | elapsed time per iteration (s): 1.05 | learning rate: 2.306E-05 | global batch size: 256 | lm loss: 1.911624E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.397 | TFLOPs: 40.39 |
15: iteration 115100/ 125429 | consumed samples: 29465600 | consumed tokens: 60345548800 | elapsed time per iteration (s): 1.02 | learning rate: 2.306E-05 | global batch size: 256 | lm loss: 1.879692E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.932 | TFLOPs: 41.30 |
15: iteration 115110/ 125429 | consumed samples: 29468160 | consumed tokens: 60350791680 | elapsed time per iteration (s): 1.03 | learning rate: 2.305E-05 | global batch size: 256 | lm loss: 1.899516E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.077 | TFLOPs: 41.16 |
15: iteration 115120/ 125429 | consumed samples: 29470720 | consumed tokens: 60356034560 | elapsed time per iteration (s): 1.08 | learning rate: 2.304E-05 | global batch size: 256 | lm loss: 1.895006E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.697 | TFLOPs: 39.28 |
15: iteration 115130/ 125429 | consumed samples: 29473280 | consumed tokens: 60361277440 | elapsed time per iteration (s): 1.04 | learning rate: 2.304E-05 | global batch size: 256 | lm loss: 1.889823E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.559 | TFLOPs: 40.58 |
15: iteration 115140/ 125429 | consumed samples: 29475840 | consumed tokens: 60366520320 | elapsed time per iteration (s): 1.03 | learning rate: 2.303E-05 | global batch size: 256 | lm loss: 1.870785E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.280 | TFLOPs: 41.20 |
15: iteration 115150/ 125429 | consumed samples: 29478400 | consumed tokens: 60371763200 | elapsed time per iteration (s): 1.03 | learning rate: 2.303E-05 | global batch size: 256 | lm loss: 1.894758E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.957 | TFLOPs: 40.98 |
15: iteration 115160/ 125429 | consumed samples: 29480960 | consumed tokens: 60377006080 | elapsed time per iteration (s): 1.05 | learning rate: 2.302E-05 | global batch size: 256 | lm loss: 1.878424E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.233 | TFLOPs: 40.36 |
15: iteration 115170/ 125429 | consumed samples: 29483520 | consumed tokens: 60382248960 | elapsed time per iteration (s): 1.07 | learning rate: 2.301E-05 | global batch size: 256 | lm loss: 1.879133E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.215 | TFLOPs: 39.70 |
15: iteration 115180/ 125429 | consumed samples: 29486080 | consumed tokens: 60387491840 | elapsed time per iteration (s): 1.03 | learning rate: 2.301E-05 | global batch size: 256 | lm loss: 1.921967E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.979 | TFLOPs: 40.98 |
15: iteration 115190/ 125429 | consumed samples: 29488640 | consumed tokens: 60392734720 | elapsed time per iteration (s): 1.09 | learning rate: 2.300E-05 | global batch size: 256 | lm loss: 1.916263E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.933 | TFLOPs: 38.66 |
15: iteration 115200/ 125429 | consumed samples: 29491200 | consumed tokens: 60397977600 | elapsed time per iteration (s): 1.06 | learning rate: 2.300E-05 | global batch size: 256 | lm loss: 1.894492E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.727 | TFLOPs: 39.95 |
15: iteration 115210/ 125429 | consumed samples: 29493760 | consumed tokens: 60403220480 | elapsed time per iteration (s): 1.07 | learning rate: 2.299E-05 | global batch size: 256 | lm loss: 1.910383E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.243 | TFLOPs: 39.54 |
15: iteration 115220/ 125429 | consumed samples: 29496320 | consumed tokens: 60408463360 | elapsed time per iteration (s): 1.04 | learning rate: 2.299E-05 | global batch size: 256 | lm loss: 1.904800E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.957 | TFLOPs: 40.65 |
15: iteration 115230/ 125429 | consumed samples: 29498880 | consumed tokens: 60413706240 | elapsed time per iteration (s): 1.07 | learning rate: 2.298E-05 | global batch size: 256 | lm loss: 1.909517E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.275 | TFLOPs: 39.71 |
15: iteration 115240/ 125429 | consumed samples: 29501440 | consumed tokens: 60418949120 | elapsed time per iteration (s): 1.04 | learning rate: 2.297E-05 | global batch size: 256 | lm loss: 1.909595E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.081 | TFLOPs: 40.83 |
15: iteration 115250/ 125429 | consumed samples: 29504000 | consumed tokens: 60424192000 | elapsed time per iteration (s): 1.06 | learning rate: 2.297E-05 | global batch size: 256 | lm loss: 1.903018E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.582 | TFLOPs: 39.92 |
15: iteration 115260/ 125429 | consumed samples: 29506560 | consumed tokens: 60429434880 | elapsed time per iteration (s): 1.04 | learning rate: 2.296E-05 | global batch size: 256 | lm loss: 1.879936E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.743 | TFLOPs: 40.61 |
15: iteration 115270/ 125429 | consumed samples: 29509120 | consumed tokens: 60434677760 | elapsed time per iteration (s): 1.06 | learning rate: 2.296E-05 | global batch size: 256 | lm loss: 1.887056E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.267 | TFLOPs: 40.04 |
15: iteration 115280/ 125429 | consumed samples: 29511680 | consumed tokens: 60439920640 | elapsed time per iteration (s): 1.05 | learning rate: 2.295E-05 | global batch size: 256 | lm loss: 1.894676E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.610 | TFLOPs: 40.42 |
15: iteration 115290/ 125429 | consumed samples: 29514240 | consumed tokens: 60445163520 | elapsed time per iteration (s): 1.05 | learning rate: 2.294E-05 | global batch size: 256 | lm loss: 1.895528E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.886 | TFLOPs: 40.14 |
15: iteration 115300/ 125429 | consumed samples: 29516800 | consumed tokens: 60450406400 | elapsed time per iteration (s): 1.04 | learning rate: 2.294E-05 | global batch size: 256 | lm loss: 1.909184E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.995 | TFLOPs: 40.65 |
15: iteration 115310/ 125429 | consumed samples: 29519360 | consumed tokens: 60455649280 | elapsed time per iteration (s): 1.03 | learning rate: 2.293E-05 | global batch size: 256 | lm loss: 1.898191E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.405 | TFLOPs: 40.89 |
15: iteration 115320/ 125429 | consumed samples: 29521920 | consumed tokens: 60460892160 | elapsed time per iteration (s): 1.04 | learning rate: 2.293E-05 | global batch size: 256 | lm loss: 1.870224E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.109 | TFLOPs: 40.67 |
15: iteration 115330/ 125429 | consumed samples: 29524480 | consumed tokens: 60466135040 | elapsed time per iteration (s): 1.04 | learning rate: 2.292E-05 | global batch size: 256 | lm loss: 1.893085E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.858 | TFLOPs: 40.80 |
15: iteration 115340/ 125429 | consumed samples: 29527040 | consumed tokens: 60471377920 | elapsed time per iteration (s): 1.09 | learning rate: 2.292E-05 | global batch size: 256 | lm loss: 1.892285E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.707 | TFLOPs: 38.95 |
15: iteration 115350/ 125429 | consumed samples: 29529600 | consumed tokens: 60476620800 | elapsed time per iteration (s): 1.06 | learning rate: 2.291E-05 | global batch size: 256 | lm loss: 1.889486E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.339 | TFLOPs: 39.88 |
15: iteration 115360/ 125429 | consumed samples: 29532160 | consumed tokens: 60481863680 | elapsed time per iteration (s): 1.09 | learning rate: 2.290E-05 | global batch size: 256 | lm loss: 1.907747E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.118 | TFLOPs: 38.86 |
15: iteration 115370/ 125429 | consumed samples: 29534720 | consumed tokens: 60487106560 | elapsed time per iteration (s): 1.03 | learning rate: 2.290E-05 | global batch size: 256 | lm loss: 1.873390E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.474 | TFLOPs: 41.06 |
15: iteration 115380/ 125429 | consumed samples: 29537280 | consumed tokens: 60492349440 | elapsed time per iteration (s): 1.04 | learning rate: 2.289E-05 | global batch size: 256 | lm loss: 1.881960E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.047 | TFLOPs: 40.66 |
15: iteration 115390/ 125429 | consumed samples: 29539840 | consumed tokens: 60497592320 | elapsed time per iteration (s): 1.04 | learning rate: 2.289E-05 | global batch size: 256 | lm loss: 1.911513E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.758 | TFLOPs: 40.61 |
15: iteration 115400/ 125429 | consumed samples: 29542400 | consumed tokens: 60502835200 | elapsed time per iteration (s): 1.04 | learning rate: 2.288E-05 | global batch size: 256 | lm loss: 1.890688E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.230 | TFLOPs: 40.53 |
15: iteration 115410/ 125429 | consumed samples: 29544960 | consumed tokens: 60508078080 | elapsed time per iteration (s): 1.08 | learning rate: 2.288E-05 | global batch size: 256 | lm loss: 1.902723E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.627 | TFLOPs: 39.27 |
15: iteration 115420/ 125429 | consumed samples: 29547520 | consumed tokens: 60513320960 | elapsed time per iteration (s): 1.07 | learning rate: 2.287E-05 | global batch size: 256 | lm loss: 1.874030E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.630 | TFLOPs: 39.44 |
15: iteration 115430/ 125429 | consumed samples: 29550080 | consumed tokens: 60518563840 | elapsed time per iteration (s): 1.05 | learning rate: 2.286E-05 | global batch size: 256 | lm loss: 1.895909E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.222 | TFLOPs: 40.19 |
15: iteration 115440/ 125429 | consumed samples: 29552640 | consumed tokens: 60523806720 | elapsed time per iteration (s): 1.05 | learning rate: 2.286E-05 | global batch size: 256 | lm loss: 1.875668E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.098 | TFLOPs: 40.17 |
15: iteration 115450/ 125429 | consumed samples: 29555200 | consumed tokens: 60529049600 | elapsed time per iteration (s): 1.05 | learning rate: 2.285E-05 | global batch size: 256 | lm loss: 1.885677E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.028 | TFLOPs: 40.33 |
15: iteration 115460/ 125429 | consumed samples: 29557760 | consumed tokens: 60534292480 | elapsed time per iteration (s): 1.06 | learning rate: 2.285E-05 | global batch size: 256 | lm loss: 1.876050E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.727 | TFLOPs: 39.78 |
15: iteration 115470/ 125429 | consumed samples: 29560320 | consumed tokens: 60539535360 | elapsed time per iteration (s): 1.04 | learning rate: 2.284E-05 | global batch size: 256 | lm loss: 1.898476E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.869 | TFLOPs: 40.80 |
15: iteration 115480/ 125429 | consumed samples: 29562880 | consumed tokens: 60544778240 | elapsed time per iteration (s): 1.04 | learning rate: 2.284E-05 | global batch size: 256 | lm loss: 1.893183E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.064 | TFLOPs: 40.83 |
15: iteration 115490/ 125429 | consumed samples: 29565440 | consumed tokens: 60550021120 | elapsed time per iteration (s): 1.02 | learning rate: 2.283E-05 | global batch size: 256 | lm loss: 1.897302E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.601 | TFLOPs: 41.41 |
15: iteration 115500/ 125429 | consumed samples: 29568000 | consumed tokens: 60555264000 | elapsed time per iteration (s): 1.11 | learning rate: 2.282E-05 | global batch size: 256 | lm loss: 1.911716E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.737 | TFLOPs: 38.13 |
15: iteration 115510/ 125429 | consumed samples: 29570560 | consumed tokens: 60560506880 | elapsed time per iteration (s): 1.06 | learning rate: 2.282E-05 | global batch size: 256 | lm loss: 1.901457E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.864 | TFLOPs: 39.80 |
15: iteration 115520/ 125429 | consumed samples: 29573120 | consumed tokens: 60565749760 | elapsed time per iteration (s): 1.04 | learning rate: 2.281E-05 | global batch size: 256 | lm loss: 1.898085E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.532 | TFLOPs: 40.58 |
15: iteration 115530/ 125429 | consumed samples: 29575680 | consumed tokens: 60570992640 | elapsed time per iteration (s): 1.04 | learning rate: 2.281E-05 | global batch size: 256 | lm loss: 1.875464E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.463 | TFLOPs: 40.56 |
15: iteration 115540/ 125429 | consumed samples: 29578240 | consumed tokens: 60576235520 | elapsed time per iteration (s): 1.06 | learning rate: 2.280E-05 | global batch size: 256 | lm loss: 1.878657E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.507 | TFLOPs: 40.08 |
15: iteration 115550/ 125429 | consumed samples: 29580800 | consumed tokens: 60581478400 | elapsed time per iteration (s): 1.10 | learning rate: 2.280E-05 | global batch size: 256 | lm loss: 1.868678E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.881 | TFLOPs: 38.49 |
15: iteration 115560/ 125429 | consumed samples: 29583360 | consumed tokens: 60586721280 | elapsed time per iteration (s): 1.02 | learning rate: 2.279E-05 | global batch size: 256 | lm loss: 1.880796E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.800 | TFLOPs: 41.45 |
15: iteration 115570/ 125429 | consumed samples: 29585920 | consumed tokens: 60591964160 | elapsed time per iteration (s): 1.08 | learning rate: 2.279E-05 | global batch size: 256 | lm loss: 1.902621E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.970 | TFLOPs: 39.16 |
15: iteration 115580/ 125429 | consumed samples: 29588480 | consumed tokens: 60597207040 | elapsed time per iteration (s): 1.04 | learning rate: 2.278E-05 | global batch size: 256 | lm loss: 1.877934E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.292 | TFLOPs: 40.54 |
15: iteration 115590/ 125429 | consumed samples: 29591040 | consumed tokens: 60602449920 | elapsed time per iteration (s): 1.09 | learning rate: 2.277E-05 | global batch size: 256 | lm loss: 1.901040E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.490 | TFLOPs: 38.75 |
15: iteration 115600/ 125429 | consumed samples: 29593600 | consumed tokens: 60607692800 | elapsed time per iteration (s): 1.05 | learning rate: 2.277E-05 | global batch size: 256 | lm loss: 1.902648E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.937 | TFLOPs: 40.31 |
15: iteration 115610/ 125429 | consumed samples: 29596160 | consumed tokens: 60612935680 | elapsed time per iteration (s): 1.02 | learning rate: 2.276E-05 | global batch size: 256 | lm loss: 1.904162E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.870 | TFLOPs: 41.46 |
15: iteration 115620/ 125429 | consumed samples: 29598720 | consumed tokens: 60618178560 | elapsed time per iteration (s): 1.03 | learning rate: 2.276E-05 | global batch size: 256 | lm loss: 1.885856E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.623 | TFLOPs: 40.92 |
15: iteration 115630/ 125429 | consumed samples: 29601280 | consumed tokens: 60623421440 | elapsed time per iteration (s): 1.02 | learning rate: 2.275E-05 | global batch size: 256 | lm loss: 1.904515E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.727 | TFLOPs: 41.43 |
15: iteration 115640/ 125429 | consumed samples: 29603840 | consumed tokens: 60628664320 | elapsed time per iteration (s): 1.07 | learning rate: 2.275E-05 | global batch size: 256 | lm loss: 1.898483E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.754 | TFLOPs: 39.62 |
15: iteration 115650/ 125429 | consumed samples: 29606400 | consumed tokens: 60633907200 | elapsed time per iteration (s): 1.07 | learning rate: 2.274E-05 | global batch size: 256 | lm loss: 1.884981E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.544 | TFLOPs: 39.59 |
15: iteration 115660/ 125429 | consumed samples: 29608960 | consumed tokens: 60639150080 | elapsed time per iteration (s): 1.03 | learning rate: 2.273E-05 | global batch size: 256 | lm loss: 1.893764E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.433 | TFLOPs: 40.89 |
15: iteration 115670/ 125429 | consumed samples: 29611520 | consumed tokens: 60644392960 | elapsed time per iteration (s): 1.07 | learning rate: 2.273E-05 | global batch size: 256 | lm loss: 1.873519E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.338 | TFLOPs: 39.55 |
15: iteration 115680/ 125429 | consumed samples: 29614080 | consumed tokens: 60649635840 | elapsed time per iteration (s): 1.20 | learning rate: 2.272E-05 | global batch size: 256 | lm loss: 1.899961E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 212.891 | TFLOPs: 35.18 |
15: iteration 115690/ 125429 | consumed samples: 29616640 | consumed tokens: 60654878720 | elapsed time per iteration (s): 1.04 | learning rate: 2.272E-05 | global batch size: 256 | lm loss: 1.917104E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.471 | TFLOPs: 40.57 |
15: iteration 115700/ 125429 | consumed samples: 29619200 | consumed tokens: 60660121600 | elapsed time per iteration (s): 1.02 | learning rate: 2.271E-05 | global batch size: 256 | lm loss: 1.858949E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.857 | TFLOPs: 41.29 |
15: iteration 115710/ 125429 | consumed samples: 29621760 | consumed tokens: 60665364480 | elapsed time per iteration (s): 1.04 | learning rate: 2.271E-05 | global batch size: 256 | lm loss: 1.888561E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.330 | TFLOPs: 40.71 |
15: iteration 115720/ 125429 | consumed samples: 29624320 | consumed tokens: 60670607360 | elapsed time per iteration (s): 1.08 | learning rate: 2.270E-05 | global batch size: 256 | lm loss: 1.904251E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.446 | TFLOPs: 39.24 |
15: iteration 115730/ 125429 | consumed samples: 29626880 | consumed tokens: 60675850240 | elapsed time per iteration (s): 1.06 | learning rate: 2.270E-05 | global batch size: 256 | lm loss: 1.879508E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.976 | TFLOPs: 39.82 |
15: iteration 115740/ 125429 | consumed samples: 29629440 | consumed tokens: 60681093120 | elapsed time per iteration (s): 1.05 | learning rate: 2.269E-05 | global batch size: 256 | lm loss: 1.906152E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.878 | TFLOPs: 40.47 |
15: iteration 115750/ 125429 | consumed samples: 29632000 | consumed tokens: 60686336000 | elapsed time per iteration (s): 1.03 | learning rate: 2.268E-05 | global batch size: 256 | lm loss: 1.883109E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.311 | TFLOPs: 41.04 |
15: iteration 115760/ 125429 | consumed samples: 29634560 | consumed tokens: 60691578880 | elapsed time per iteration (s): 1.03 | learning rate: 2.268E-05 | global batch size: 256 | lm loss: 1.897448E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.371 | TFLOPs: 40.88 |
15: iteration 115770/ 125429 | consumed samples: 29637120 | consumed tokens: 60696821760 | elapsed time per iteration (s): 1.04 | learning rate: 2.267E-05 | global batch size: 256 | lm loss: 1.909170E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.294 | TFLOPs: 40.70 |
15: iteration 115780/ 125429 | consumed samples: 29639680 | consumed tokens: 60702064640 | elapsed time per iteration (s): 1.08 | learning rate: 2.267E-05 | global batch size: 256 | lm loss: 1.870178E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.431 | TFLOPs: 39.24 |
15: iteration 115790/ 125429 | consumed samples: 29642240 | consumed tokens: 60707307520 | elapsed time per iteration (s): 1.04 | learning rate: 2.266E-05 | global batch size: 256 | lm loss: 1.886506E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.318 | TFLOPs: 40.54 |
15: iteration 115800/ 125429 | consumed samples: 29644800 | consumed tokens: 60712550400 | elapsed time per iteration (s): 1.02 | learning rate: 2.266E-05 | global batch size: 256 | lm loss: 1.893974E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.616 | TFLOPs: 41.58 |
15: iteration 115810/ 125429 | consumed samples: 29647360 | consumed tokens: 60717793280 | elapsed time per iteration (s): 1.03 | learning rate: 2.265E-05 | global batch size: 256 | lm loss: 1.883247E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.561 | TFLOPs: 40.91 |
15: iteration 115820/ 125429 | consumed samples: 29649920 | consumed tokens: 60723036160 | elapsed time per iteration (s): 1.06 | learning rate: 2.265E-05 | global batch size: 256 | lm loss: 1.888916E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.578 | TFLOPs: 40.09 |
15: iteration 115830/ 125429 | consumed samples: 29652480 | consumed tokens: 60728279040 | elapsed time per iteration (s): 1.05 | learning rate: 2.264E-05 | global batch size: 256 | lm loss: 1.888164E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.705 | TFLOPs: 40.27 |
15: iteration 115840/ 125429 | consumed samples: 29655040 | consumed tokens: 60733521920 | elapsed time per iteration (s): 1.05 | learning rate: 2.264E-05 | global batch size: 256 | lm loss: 1.901174E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.653 | TFLOPs: 40.43 |
15: iteration 115850/ 125429 | consumed samples: 29657600 | consumed tokens: 60738764800 | elapsed time per iteration (s): 1.04 | learning rate: 2.263E-05 | global batch size: 256 | lm loss: 1.876117E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.611 | TFLOPs: 40.75 |
15: iteration 115860/ 125429 | consumed samples: 29660160 | consumed tokens: 60744007680 | elapsed time per iteration (s): 1.03 | learning rate: 2.262E-05 | global batch size: 256 | lm loss: 1.869456E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.684 | TFLOPs: 41.26 |
15: iteration 115870/ 125429 | consumed samples: 29662720 | consumed tokens: 60749250560 | elapsed time per iteration (s): 1.05 | learning rate: 2.262E-05 | global batch size: 256 | lm loss: 1.898225E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.997 | TFLOPs: 40.32 |
15: iteration 115880/ 125429 | consumed samples: 29665280 | consumed tokens: 60754493440 | elapsed time per iteration (s): 1.02 | learning rate: 2.261E-05 | global batch size: 256 | lm loss: 1.893248E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.122 | TFLOPs: 41.33 |
15: iteration 115890/ 125429 | consumed samples: 29667840 | consumed tokens: 60759736320 | elapsed time per iteration (s): 1.03 | learning rate: 2.261E-05 | global batch size: 256 | lm loss: 1.898240E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.535 | TFLOPs: 41.07 |
15: iteration 115900/ 125429 | consumed samples: 29670400 | consumed tokens: 60764979200 | elapsed time per iteration (s): 1.03 | learning rate: 2.260E-05 | global batch size: 256 | lm loss: 1.895335E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.690 | TFLOPs: 41.26 |
15: iteration 115910/ 125429 | consumed samples: 29672960 | consumed tokens: 60770222080 | elapsed time per iteration (s): 1.06 | learning rate: 2.260E-05 | global batch size: 256 | lm loss: 1.913110E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.844 | TFLOPs: 39.97 |
15: iteration 115920/ 125429 | consumed samples: 29675520 | consumed tokens: 60775464960 | elapsed time per iteration (s): 1.06 | learning rate: 2.259E-05 | global batch size: 256 | lm loss: 1.865521E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.946 | TFLOPs: 39.82 |
15: iteration 115930/ 125429 | consumed samples: 29678080 | consumed tokens: 60780707840 | elapsed time per iteration (s): 1.04 | learning rate: 2.259E-05 | global batch size: 256 | lm loss: 1.897312E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.304 | TFLOPs: 40.87 |
15: iteration 115940/ 125429 | consumed samples: 29680640 | consumed tokens: 60785950720 | elapsed time per iteration (s): 1.07 | learning rate: 2.258E-05 | global batch size: 256 | lm loss: 1.909404E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.615 | TFLOPs: 39.43 |
15: iteration 115950/ 125429 | consumed samples: 29683200 | consumed tokens: 60791193600 | elapsed time per iteration (s): 1.02 | learning rate: 2.258E-05 | global batch size: 256 | lm loss: 1.880430E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.540 | TFLOPs: 41.57 |
15: iteration 115960/ 125429 | consumed samples: 29685760 | consumed tokens: 60796436480 | elapsed time per iteration (s): 1.08 | learning rate: 2.257E-05 | global batch size: 256 | lm loss: 1.884153E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.009 | TFLOPs: 39.33 |
15: iteration 115970/ 125429 | consumed samples: 29688320 | consumed tokens: 60801679360 | elapsed time per iteration (s): 1.07 | learning rate: 2.256E-05 | global batch size: 256 | lm loss: 1.894491E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.350 | TFLOPs: 39.39 |
15: iteration 115980/ 125429 | consumed samples: 29690880 | consumed tokens: 60806922240 | elapsed time per iteration (s): 1.09 | learning rate: 2.256E-05 | global batch size: 256 | lm loss: 1.867945E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.651 | TFLOPs: 38.94 |
15: iteration 115990/ 125429 | consumed samples: 29693440 | consumed tokens: 60812165120 | elapsed time per iteration (s): 1.06 | learning rate: 2.255E-05 | global batch size: 256 | lm loss: 1.875552E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.315 | TFLOPs: 40.04 |
0: [2022-11-27 06:21:31,900] [INFO] [logging.py:68:log_dist] [Rank 0] step=116000, skipped=0, lr=[2.2548717338730183e-05, 2.2548717338730183e-05, 2.2548717338730183e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
15: iteration 116000/ 125429 | consumed samples: 29696000 | consumed tokens: 60817408000 | elapsed time per iteration (s): 1.05 | learning rate: 2.255E-05 | global batch size: 256 | lm loss: 1.889672E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.563 | TFLOPs: 40.25 |
0: steps: 116000 loss: 1.9378 iter time (s): 1.051 samples/sec: 243.564
15: --------------------------------------------------------------------------------------------
15: valid loss at iteration 116000 | lm loss value: 1.898830E+00 | lm loss PPL: 6.678075E+00 |
15: --------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 116000 to 
checkpoints_1b5 0: [2022-11-27 06:21:32,280] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step116000 is begin to save! 0: [2022-11-27 06:21:32,291] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_01-model_00-model_states.pt... 0: [2022-11-27 06:21:32,522] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_01-model_00-model_states.pt. 0: [2022-11-27 06:21:32,522] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_03-model_00-model_states.pt... 0: [2022-11-27 06:21:32,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_03-model_00-model_states.pt. 0: [2022-11-27 06:21:32,623] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_04-model_00-model_states.pt... 0: [2022-11-27 06:21:32,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_04-model_00-model_states.pt. 0: [2022-11-27 06:21:32,723] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_05-model_00-model_states.pt... 0: [2022-11-27 06:21:32,830] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_05-model_00-model_states.pt. 0: [2022-11-27 06:21:32,830] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_06-model_00-model_states.pt... 0: [2022-11-27 06:21:32,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_06-model_00-model_states.pt. 0: [2022-11-27 06:21:32,932] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 06:21:33,037] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_07-model_00-model_states.pt. 0: [2022-11-27 06:21:33,037] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_08-model_00-model_states.pt... 0: [2022-11-27 06:21:33,137] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_08-model_00-model_states.pt. 0: [2022-11-27 06:21:33,138] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_09-model_00-model_states.pt... 0: [2022-11-27 06:21:33,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_09-model_00-model_states.pt. 0: [2022-11-27 06:21:33,242] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_10-model_00-model_states.pt... 0: [2022-11-27 06:21:33,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_10-model_00-model_states.pt. 0: [2022-11-27 06:21:33,344] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_11-model_00-model_states.pt... 0: [2022-11-27 06:21:33,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_11-model_00-model_states.pt. 0: [2022-11-27 06:21:33,449] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_12-model_00-model_states.pt... 0: [2022-11-27 06:21:33,552] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_12-model_00-model_states.pt. 0: [2022-11-27 06:21:33,552] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 06:21:33,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_13-model_00-model_states.pt. 0: [2022-11-27 06:21:33,658] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_14-model_00-model_states.pt... 0: [2022-11-27 06:21:33,758] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_14-model_00-model_states.pt. 0: [2022-11-27 06:21:33,759] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_15-model_00-model_states.pt... 0: [2022-11-27 06:21:33,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_15-model_00-model_states.pt. 0: [2022-11-27 06:21:33,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_16-model_00-model_states.pt... 0: [2022-11-27 06:21:33,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_16-model_00-model_states.pt. 0: [2022-11-27 06:21:33,965] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_17-model_00-model_states.pt... 0: [2022-11-27 06:21:34,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_17-model_00-model_states.pt. 0: [2022-11-27 06:21:34,077] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_18-model_00-model_states.pt... 0: [2022-11-27 06:21:34,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_18-model_00-model_states.pt. 0: [2022-11-27 06:21:34,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 06:21:34,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_19-model_00-model_states.pt. 0: [2022-11-27 06:21:34,283] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_20-model_00-model_states.pt... 0: [2022-11-27 06:21:34,380] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_20-model_00-model_states.pt. 0: [2022-11-27 06:21:34,381] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_21-model_00-model_states.pt... 0: [2022-11-27 06:21:34,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_21-model_00-model_states.pt. 0: [2022-11-27 06:21:34,489] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_22-model_00-model_states.pt... 0: [2022-11-27 06:21:34,591] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_22-model_00-model_states.pt. 0: [2022-11-27 06:21:34,591] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_23-model_00-model_states.pt... 0: [2022-11-27 06:21:34,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_23-model_00-model_states.pt. 0: [2022-11-27 06:21:34,696] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_24-model_00-model_states.pt... 0: [2022-11-27 06:21:34,797] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_24-model_00-model_states.pt. 0: [2022-11-27 06:21:34,797] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 06:21:34,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_25-model_00-model_states.pt. 0: [2022-11-27 06:21:34,900] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_26-model_00-model_states.pt... 0: [2022-11-27 06:21:35,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_26-model_00-model_states.pt. 0: [2022-11-27 06:21:35,003] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_27-model_00-model_states.pt... 0: [2022-11-27 06:21:35,106] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_27-model_00-model_states.pt. 0: [2022-11-27 06:21:35,107] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_28-model_00-model_states.pt... 0: [2022-11-27 06:21:35,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_28-model_00-model_states.pt. 0: [2022-11-27 06:21:35,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_29-model_00-model_states.pt... 0: [2022-11-27 06:21:35,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_29-model_00-model_states.pt. 0: [2022-11-27 06:21:35,313] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_30-model_00-model_states.pt... 0: [2022-11-27 06:21:35,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_30-model_00-model_states.pt. 0: [2022-11-27 06:21:35,412] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/layer_32-model_00-model_states.pt... 
0: [2022-11-27 06:21:35,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/layer_32-model_00-model_states.pt. 0: [2022-11-27 06:21:35,418] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step116000/mp_rank_00_model_states.pt 0: [2022-11-27 06:21:35,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/mp_rank_00_model_states.pt... 0: [2022-11-27 06:21:35,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/mp_rank_00_model_states.pt. 0: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 
12: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 
0: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
10: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 
13: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 
3: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
1: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
11: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:21:35,463] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step116000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:21:35,618] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 
8: [2022-11-27 06:21:35,618] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 06:21:35,618] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 12: [2022-11-27 06:21:35,619] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:21:35,619] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 06:21:35,619] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 7: [2022-11-27 06:21:35,620] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:21:35,620] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 06:21:35,620] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 3: [2022-11-27 06:21:35,622] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:21:35,622] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 06:21:35,622] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 5: [2022-11-27 06:21:35,623] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-27 06:21:35,623] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 06:21:35,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 0: [2022-11-27 06:21:35,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:21:35,624] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 06:21:35,624] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 5: [2022-11-27 06:21:35,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:21:35,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 06:21:35,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 2: [2022-11-27 06:21:35,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:21:35,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 06:21:35,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 7: [2022-11-27 06:21:35,627] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-27 06:21:35,627] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 06:21:35,627] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 8: [2022-11-27 06:21:35,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:21:35,626] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 06:21:35,626] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 3: [2022-11-27 06:21:35,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:21:35,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:21:35,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 14: [2022-11-27 06:21:35,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 3: [2022-11-27 06:21:35,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 14: [2022-11-27 06:21:35,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 0: [2022-11-27 06:21:35,628] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-27 06:21:35,628] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 06:21:35,628] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 5: [2022-11-27 06:21:35,629] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:21:35,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 06:21:35,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 8: [2022-11-27 06:21:35,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:21:35,630] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 06:21:35,630] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 9: [2022-11-27 06:21:35,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:21:35,631] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 06:21:35,631] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 13: [2022-11-27 06:21:35,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-27 06:21:35,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:21:35,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:21:35,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:21:35,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 06:21:35,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 3: [2022-11-27 06:21:35,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 13: [2022-11-27 06:21:35,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 3: [2022-11-27 06:21:35,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 9: [2022-11-27 06:21:35,632] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:21:35,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 13: [2022-11-27 06:21:35,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 13: [2022-11-27 06:21:35,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
9: [2022-11-27 06:21:35,632] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 06:21:35,632] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 12: [2022-11-27 06:21:35,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:21:35,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 06:21:35,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 5: [2022-11-27 06:21:35,635] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:21:35,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 06:21:35,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 2: [2022-11-27 06:21:35,636] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:21:35,636] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 06:21:35,636] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 7: [2022-11-27 06:21:35,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-27 06:21:35,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 06:21:35,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 7: [2022-11-27 06:21:35,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:21:35,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 06:21:35,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 12: [2022-11-27 06:21:35,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:21:35,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 06:21:35,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 8: [2022-11-27 06:21:35,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:21:35,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 06:21:35,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 9: [2022-11-27 06:21:35,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
7: [2022-11-27 06:21:35,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:21:35,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 06:21:35,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 12: [2022-11-27 06:21:35,639] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:21:35,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 06:21:35,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 12: [2022-11-27 06:21:35,640] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:21:35,640] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 06:21:35,640] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 0: [2022-11-27 06:21:35,641] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:21:35,641] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 06:21:35,641] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
14: [2022-11-27 06:21:35,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:21:35,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 06:21:35,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 7: [2022-11-27 06:21:35,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:21:35,643] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 06:21:35,643] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 8: [2022-11-27 06:21:35,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:21:35,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:21:35,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 5: [2022-11-27 06:21:35,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 8: [2022-11-27 06:21:35,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 5: [2022-11-27 06:21:35,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
13: [2022-11-27 06:21:35,645] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:21:35,639] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 3: [2022-11-27 06:21:35,642] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:21:35,639] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 13: [2022-11-27 06:21:35,645] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 06:21:35,645] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 3: [2022-11-27 06:21:35,642] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 9: [2022-11-27 06:21:35,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:21:35,642] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 9: [2022-11-27 06:21:35,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 06:21:35,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 9: [2022-11-27 06:21:35,644] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-27 06:21:35,644] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 06:21:35,644] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 14: [2022-11-27 06:21:35,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:21:35,647] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 06:21:35,647] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 7: [2022-11-27 06:21:35,648] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:21:35,648] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 06:21:35,648] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 8: [2022-11-27 06:21:35,649] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:21:35,649] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 06:21:35,649] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 0: [2022-11-27 06:21:35,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 
0: [2022-11-27 06:21:35,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 06:21:35,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 5: [2022-11-27 06:21:35,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:21:35,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 06:21:35,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 8: [2022-11-27 06:21:35,652] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:21:35,652] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 06:21:35,652] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 0: [2022-11-27 06:21:35,653] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:21:35,653] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 06:21:35,653] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 3: [2022-11-27 06:21:35,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-27 06:21:35,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 06:21:35,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 9: [2022-11-27 06:21:35,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:21:35,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 2: [2022-11-27 06:21:35,656] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:21:35,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 2: [2022-11-27 06:21:35,656] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 06:21:35,656] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 3: [2022-11-27 06:21:35,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:21:35,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 06:21:35,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 2: [2022-11-27 06:21:35,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-27 06:21:35,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 3: [2022-11-27 06:21:35,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:21:35,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 12: [2022-11-27 06:21:35,657] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:21:35,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 12: [2022-11-27 06:21:35,657] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 3: [2022-11-27 06:21:35,658] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 12: [2022-11-27 06:21:35,657] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 12: [2022-11-27 06:21:35,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:21:35,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 5: [2022-11-27 06:21:35,659] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:21:35,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
5: [2022-11-27 06:21:35,659] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 06:21:35,659] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 14: [2022-11-27 06:21:35,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:21:35,661] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 06:21:35,661] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 9: [2022-11-27 06:21:35,665] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:21:35,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 06:21:35,666] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 0: [2022-11-27 06:21:35,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:21:35,666] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 06:21:35,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 13: [2022-11-27 06:21:35,654] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-27 06:21:35,654] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 06:21:35,654] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 2: [2022-11-27 06:21:35,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:21:35,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 06:21:35,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 2: [2022-11-27 06:21:35,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:21:35,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 06:21:35,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 4: [2022-11-27 06:21:35,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:21:35,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:21:35,662] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:21:35,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 
13: [2022-11-27 06:21:35,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 06:21:35,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:21:35,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 13: [2022-11-27 06:21:35,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 06:21:35,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 4: [2022-11-27 06:21:35,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 06:21:35,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 06:21:35,662] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 06:21:35,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 4: [2022-11-27 06:21:35,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 4: [2022-11-27 06:21:35,663] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 4: [2022-11-27 06:21:35,667] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-27 06:21:35,667] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 06:21:35,667] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 4: [2022-11-27 06:21:35,670] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:21:35,670] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 06:21:35,670] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 14: [2022-11-27 06:21:35,669] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:21:35,669] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 06:21:35,669] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 7: [2022-11-27 06:21:35,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:21:35,678] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
7: [2022-11-27 06:21:35,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 14: [2022-11-27 06:21:35,678] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 7: [2022-11-27 06:21:35,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 14: [2022-11-27 06:21:35,678] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 12: [2022-11-27 06:21:35,679] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:21:35,679] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 06:21:35,679] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 14: [2022-11-27 06:21:35,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:21:35,686] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 06:21:35,686] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 2: [2022-11-27 06:21:35,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-27 06:21:35,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 06:21:35,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 13: [2022-11-27 06:21:35,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:21:35,692] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 06:21:35,692] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 9: [2022-11-27 06:21:35,696] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:21:35,696] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 06:21:35,696] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 8: [2022-11-27 06:21:35,697] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:21:35,698] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 06:21:35,698] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 4: [2022-11-27 06:21:35,688] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-27 06:21:35,688] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 06:21:35,688] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 4: [2022-11-27 06:21:35,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:21:35,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 06:21:35,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 14: [2022-11-27 06:21:35,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:21:35,699] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 06:21:35,699] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 0: [2022-11-27 06:21:35,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:21:35,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 06:21:35,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:21:35,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
4: [2022-11-27 06:21:35,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:21:35,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 06:21:35,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 3: [2022-11-27 06:21:35,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:21:35,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 06:21:35,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 11: [2022-11-27 06:21:35,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:21:35,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:21:35,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:21:35,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:21:35,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:21:35,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 
11: [2022-11-27 06:21:35,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:21:35,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 06:21:35,749] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:21:35,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 06:21:35,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 06:21:35,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 06:21:35,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 06:21:35,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 06:21:35,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 06:21:35,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 11: [2022-11-27 06:21:35,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
11: [2022-11-27 06:21:35,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 11: [2022-11-27 06:21:35,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 11: [2022-11-27 06:21:35,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 11: [2022-11-27 06:21:35,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 11: [2022-11-27 06:21:35,749] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 06:21:35,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 11: [2022-11-27 06:21:35,749] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 0: [2022-11-27 06:21:35,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 06:21:35,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 5: [2022-11-27 06:21:35,776] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:21:35,777] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 06:21:35,777] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 6: [2022-11-27 06:21:35,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-27 06:21:35,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:21:35,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:21:35,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:21:35,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:21:35,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:21:35,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:21:35,804] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-27 06:21:35,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 06:21:35,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 06:21:35,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 06:21:35,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 06:21:35,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 06:21:35,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 06:21:35,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 06:21:35,805] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 06:21:35,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 6: [2022-11-27 06:21:35,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 6: [2022-11-27 06:21:35,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 6: [2022-11-27 06:21:35,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
6: [2022-11-27 06:21:35,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 6: [2022-11-27 06:21:35,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 6: [2022-11-27 06:21:35,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 6: [2022-11-27 06:21:35,805] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 2: [2022-11-27 06:21:35,818] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:21:35,819] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 06:21:35,819] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 15: [2022-11-27 06:21:35,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:21:35,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:21:35,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:21:35,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-27 06:21:35,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 06:21:35,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 06:21:35,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 06:21:35,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 06:21:35,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 15: [2022-11-27 06:21:35,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 15: [2022-11-27 06:21:35,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 15: [2022-11-27 06:21:35,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 15: [2022-11-27 06:21:35,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:21:35,864] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 06:21:35,864] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 15: [2022-11-27 06:21:35,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-27 06:21:35,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 06:21:35,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 15: [2022-11-27 06:21:35,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:21:35,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 06:21:35,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 15: [2022-11-27 06:21:35,865] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:21:35,865] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 06:21:35,865] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 1: [2022-11-27 06:21:35,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:21:35,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:21:35,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 06:21:35,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-27 06:21:35,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 1: [2022-11-27 06:21:35,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 06:21:35,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 06:21:35,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 1: [2022-11-27 06:21:35,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 1: [2022-11-27 06:21:35,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:21:35,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 06:21:35,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 1: [2022-11-27 06:21:35,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:21:35,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:21:35,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:21:35,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-27 06:21:35,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 06:21:35,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 06:21:35,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 1: [2022-11-27 06:21:35,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 06:21:35,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 06:21:35,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 1: [2022-11-27 06:21:35,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 1: [2022-11-27 06:21:35,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 10: [2022-11-27 06:21:35,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:21:35,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:21:35,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 06:21:35,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-27 06:21:35,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:21:35,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 06:21:35,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:21:35,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:21:35,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:21:35,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 10: [2022-11-27 06:21:35,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 06:21:35,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
10: [2022-11-27 06:21:35,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 06:21:35,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 06:21:35,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 06:21:35,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 06:21:35,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 10: [2022-11-27 06:21:35,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 10: [2022-11-27 06:21:35,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 10: [2022-11-27 06:21:35,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 10: [2022-11-27 06:21:35,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 10: [2022-11-27 06:21:35,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:21:35,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step116000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 06:21:35,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step116000 is ready now! 
0: successfully saved checkpoint at iteration 116000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3684.33 15: iteration 116010/ 125429 | consumed samples: 29698560 | consumed tokens: 60822650880 | elapsed time per iteration (s): 1.47 | learning rate: 2.254E-05 | global batch size: 256 | lm loss: 1.873238E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 173.666 | TFLOPs: 28.70 | 15: iteration 116020/ 125429 | consumed samples: 29701120 | consumed tokens: 60827893760 | elapsed time per iteration (s): 1.05 | learning rate: 2.254E-05 | global batch size: 256 | lm loss: 1.868369E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.958 | TFLOPs: 40.32 | 15: iteration 116030/ 125429 | consumed samples: 29703680 | consumed tokens: 60833136640 | elapsed time per iteration (s): 1.04 | learning rate: 2.253E-05 | global batch size: 256 | lm loss: 1.880779E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.754 | TFLOPs: 40.61 | 15: iteration 116040/ 125429 | consumed samples: 29706240 | consumed tokens: 60838379520 | elapsed time per iteration (s): 1.03 | learning rate: 2.253E-05 | global batch size: 256 | lm loss: 1.894997E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.477 | TFLOPs: 41.06 | 15: iteration 116050/ 125429 | consumed samples: 29708800 | consumed tokens: 60843622400 | elapsed time per iteration (s): 1.04 | learning rate: 2.252E-05 | global batch size: 256 | lm loss: 1.901637E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.365 | TFLOPs: 40.55 | 15: iteration 116060/ 125429 | consumed samples: 29711360 | consumed tokens: 60848865280 | elapsed time per iteration (s): 
1.04 | learning rate: 2.252E-05 | global batch size: 256 | lm loss: 1.914177E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.806 | TFLOPs: 40.62 | 15: iteration 116070/ 125429 | consumed samples: 29713920 | consumed tokens: 60854108160 | elapsed time per iteration (s): 1.04 | learning rate: 2.251E-05 | global batch size: 256 | lm loss: 1.897445E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.084 | TFLOPs: 40.50 | 15: iteration 116080/ 125429 | consumed samples: 29716480 | consumed tokens: 60859351040 | elapsed time per iteration (s): 1.03 | learning rate: 2.251E-05 | global batch size: 256 | lm loss: 1.903429E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.225 | TFLOPs: 41.19 | 15: iteration 116090/ 125429 | consumed samples: 29719040 | consumed tokens: 60864593920 | elapsed time per iteration (s): 1.08 | learning rate: 2.250E-05 | global batch size: 256 | lm loss: 1.879504E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.919 | TFLOPs: 39.15 | 15: iteration 116100/ 125429 | consumed samples: 29721600 | consumed tokens: 60869836800 | elapsed time per iteration (s): 1.06 | learning rate: 2.250E-05 | global batch size: 256 | lm loss: 1.910269E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.686 | TFLOPs: 39.94 | 15: iteration 116110/ 125429 | consumed samples: 29724160 | consumed tokens: 60875079680 | elapsed time per iteration (s): 1.05 | learning rate: 2.249E-05 | global batch size: 256 | lm loss: 1.879373E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.925 | TFLOPs: 40.15 | 15: 
iteration 116120/ 125429 | consumed samples: 29726720 | consumed tokens: 60880322560 | elapsed time per iteration (s): 1.05 | learning rate: 2.248E-05 | global batch size: 256 | lm loss: 1.889111E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.979 | TFLOPs: 40.32 | 15: iteration 116130/ 125429 | consumed samples: 29729280 | consumed tokens: 60885565440 | elapsed time per iteration (s): 1.06 | learning rate: 2.248E-05 | global batch size: 256 | lm loss: 1.883246E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.131 | TFLOPs: 40.01 | 15: iteration 116140/ 125429 | consumed samples: 29731840 | consumed tokens: 60890808320 | elapsed time per iteration (s): 1.03 | learning rate: 2.247E-05 | global batch size: 256 | lm loss: 1.921548E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.152 | TFLOPs: 41.17 | 15: iteration 116150/ 125429 | consumed samples: 29734400 | consumed tokens: 60896051200 | elapsed time per iteration (s): 1.03 | learning rate: 2.247E-05 | global batch size: 256 | lm loss: 1.889256E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.823 | TFLOPs: 40.95 | 15: iteration 116160/ 125429 | consumed samples: 29736960 | consumed tokens: 60901294080 | elapsed time per iteration (s): 1.04 | learning rate: 2.246E-05 | global batch size: 256 | lm loss: 1.906976E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.562 | TFLOPs: 40.58 | 15: iteration 116170/ 125429 | consumed samples: 29739520 | consumed tokens: 60906536960 | elapsed time per iteration (s): 1.06 | learning rate: 2.246E-05 | global batch size: 256 | lm loss: 1.889067E+00 | grad norm: 0.165 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.600 | TFLOPs: 40.09 | 15: iteration 116180/ 125429 | consumed samples: 29742080 | consumed tokens: 60911779840 | elapsed time per iteration (s): 1.02 | learning rate: 2.245E-05 | global batch size: 256 | lm loss: 1.877725E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.833 | TFLOPs: 41.29 | 15: iteration 116190/ 125429 | consumed samples: 29744640 | consumed tokens: 60917022720 | elapsed time per iteration (s): 1.11 | learning rate: 2.245E-05 | global batch size: 256 | lm loss: 1.910711E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.563 | TFLOPs: 38.27 | 15: iteration 116200/ 125429 | consumed samples: 29747200 | consumed tokens: 60922265600 | elapsed time per iteration (s): 1.10 | learning rate: 2.244E-05 | global batch size: 256 | lm loss: 1.868513E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.757 | TFLOPs: 38.46 | 15: iteration 116210/ 125429 | consumed samples: 29749760 | consumed tokens: 60927508480 | elapsed time per iteration (s): 1.07 | learning rate: 2.244E-05 | global batch size: 256 | lm loss: 1.881547E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.638 | TFLOPs: 39.44 | 15: iteration 116220/ 125429 | consumed samples: 29752320 | consumed tokens: 60932751360 | elapsed time per iteration (s): 1.03 | learning rate: 2.243E-05 | global batch size: 256 | lm loss: 1.885794E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.993 | TFLOPs: 40.98 | 15: iteration 116230/ 125429 | consumed samples: 29754880 | consumed tokens: 60937994240 | elapsed time per iteration (s): 1.06 | 
learning rate: 2.243E-05 | global batch size: 256 | lm loss: 1.870372E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.400 | TFLOPs: 40.06 | 15: iteration 116240/ 125429 | consumed samples: 29757440 | consumed tokens: 60943237120 | elapsed time per iteration (s): 1.02 | learning rate: 2.242E-05 | global batch size: 256 | lm loss: 1.875812E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.392 | TFLOPs: 41.54 | 15: iteration 116250/ 125429 | consumed samples: 29760000 | consumed tokens: 60948480000 | elapsed time per iteration (s): 1.04 | learning rate: 2.242E-05 | global batch size: 256 | lm loss: 1.920387E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.844 | TFLOPs: 40.63 | 15: iteration 116260/ 125429 | consumed samples: 29762560 | consumed tokens: 60953722880 | elapsed time per iteration (s): 1.06 | learning rate: 2.241E-05 | global batch size: 256 | lm loss: 1.898726E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.364 | TFLOPs: 40.05 | 15: iteration 116270/ 125429 | consumed samples: 29765120 | consumed tokens: 60958965760 | elapsed time per iteration (s): 1.06 | learning rate: 2.241E-05 | global batch size: 256 | lm loss: 1.894357E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.739 | TFLOPs: 39.78 | 15: iteration 116280/ 125429 | consumed samples: 29767680 | consumed tokens: 60964208640 | elapsed time per iteration (s): 1.08 | learning rate: 2.240E-05 | global batch size: 256 | lm loss: 1.905065E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.030 | TFLOPs: 39.17 | 15: iteration 
116290/ 125429 | consumed samples: 29770240 | consumed tokens: 60969451520 | elapsed time per iteration (s): 1.05 | learning rate: 2.240E-05 | global batch size: 256 | lm loss: 1.883685E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.380 | TFLOPs: 40.22 | 15: iteration 116300/ 125429 | consumed samples: 29772800 | consumed tokens: 60974694400 | elapsed time per iteration (s): 1.05 | learning rate: 2.239E-05 | global batch size: 256 | lm loss: 1.868579E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.267 | TFLOPs: 40.37 | 15: iteration 116310/ 125429 | consumed samples: 29775360 | consumed tokens: 60979937280 | elapsed time per iteration (s): 1.04 | learning rate: 2.238E-05 | global batch size: 256 | lm loss: 1.896593E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.693 | TFLOPs: 40.60 | 15: iteration 116320/ 125429 | consumed samples: 29777920 | consumed tokens: 60985180160 | elapsed time per iteration (s): 1.03 | learning rate: 2.238E-05 | global batch size: 256 | lm loss: 1.880323E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.352 | TFLOPs: 40.88 | 15: iteration 116330/ 125429 | consumed samples: 29780480 | consumed tokens: 60990423040 | elapsed time per iteration (s): 1.05 | learning rate: 2.237E-05 | global batch size: 256 | lm loss: 1.864314E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.074 | TFLOPs: 40.34 | 15: iteration 116340/ 125429 | consumed samples: 29783040 | consumed tokens: 60995665920 | elapsed time per iteration (s): 1.18 | learning rate: 2.237E-05 | global batch size: 256 | lm loss: 1.924821E+00 | grad norm: 0.165 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.813 | TFLOPs: 35.83 | 15: iteration 116350/ 125429 | consumed samples: 29785600 | consumed tokens: 61000908800 | elapsed time per iteration (s): 1.04 | learning rate: 2.236E-05 | global batch size: 256 | lm loss: 1.901688E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.326 | TFLOPs: 40.54 | 15: iteration 116360/ 125429 | consumed samples: 29788160 | consumed tokens: 61006151680 | elapsed time per iteration (s): 1.02 | learning rate: 2.236E-05 | global batch size: 256 | lm loss: 1.917192E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.920 | TFLOPs: 41.47 | 15: iteration 116370/ 125429 | consumed samples: 29790720 | consumed tokens: 61011394560 | elapsed time per iteration (s): 1.03 | learning rate: 2.235E-05 | global batch size: 256 | lm loss: 1.872492E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.413 | TFLOPs: 40.89 | 15: iteration 116380/ 125429 | consumed samples: 29793280 | consumed tokens: 61016637440 | elapsed time per iteration (s): 1.02 | learning rate: 2.235E-05 | global batch size: 256 | lm loss: 1.885754E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.550 | TFLOPs: 41.41 | 15: iteration 116390/ 125429 | consumed samples: 29795840 | consumed tokens: 61021880320 | elapsed time per iteration (s): 1.04 | learning rate: 2.234E-05 | global batch size: 256 | lm loss: 1.897607E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.197 | TFLOPs: 40.52 | 15: iteration 116400/ 125429 | consumed samples: 29798400 | consumed tokens: 61027123200 | elapsed time per iteration (s): 1.05 | learning 
rate: 2.234E-05 | global batch size: 256 | lm loss: 1.873776E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.658 | TFLOPs: 40.43 | 15: iteration 116410/ 125429 | consumed samples: 29800960 | consumed tokens: 61032366080 | elapsed time per iteration (s): 1.05 | learning rate: 2.233E-05 | global batch size: 256 | lm loss: 1.870433E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.820 | TFLOPs: 40.46 | 15: iteration 116420/ 125429 | consumed samples: 29803520 | consumed tokens: 61037608960 | elapsed time per iteration (s): 1.05 | learning rate: 2.233E-05 | global batch size: 256 | lm loss: 1.882506E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.989 | TFLOPs: 40.16 | 15: iteration 116430/ 125429 | consumed samples: 29806080 | consumed tokens: 61042851840 | elapsed time per iteration (s): 1.06 | learning rate: 2.232E-05 | global batch size: 256 | lm loss: 1.881083E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.135 | TFLOPs: 40.01 | 15: iteration 116440/ 125429 | consumed samples: 29808640 | consumed tokens: 61048094720 | elapsed time per iteration (s): 1.05 | learning rate: 2.232E-05 | global batch size: 256 | lm loss: 1.874688E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.676 | TFLOPs: 40.43 | 15: iteration 116450/ 125429 | consumed samples: 29811200 | consumed tokens: 61053337600 | elapsed time per iteration (s): 1.04 | learning rate: 2.231E-05 | global batch size: 256 | lm loss: 1.889029E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.162 | TFLOPs: 40.85 | 15: iteration 116460/ 
125429 | consumed samples: 29813760 | consumed tokens: 61058580480 | elapsed time per iteration (s): 1.04 | learning rate: 2.231E-05 | global batch size: 256 | lm loss: 1.874445E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.081 | TFLOPs: 40.83 | 15: iteration 116470/ 125429 | consumed samples: 29816320 | consumed tokens: 61063823360 | elapsed time per iteration (s): 1.03 | learning rate: 2.230E-05 | global batch size: 256 | lm loss: 1.899404E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.381 | TFLOPs: 40.88 | 15: iteration 116480/ 125429 | consumed samples: 29818880 | consumed tokens: 61069066240 | elapsed time per iteration (s): 1.08 | learning rate: 2.230E-05 | global batch size: 256 | lm loss: 1.884541E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.928 | TFLOPs: 39.15 | 15: iteration 116490/ 125429 | consumed samples: 29821440 | consumed tokens: 61074309120 | elapsed time per iteration (s): 1.21 | learning rate: 2.229E-05 | global batch size: 256 | lm loss: 1.873153E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 212.014 | TFLOPs: 35.04 | 15: iteration 116500/ 125429 | consumed samples: 29824000 | consumed tokens: 61079552000 | elapsed time per iteration (s): 1.03 | learning rate: 2.229E-05 | global batch size: 256 | lm loss: 1.916720E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.385 | TFLOPs: 40.88 | 15: iteration 116510/ 125429 | consumed samples: 29826560 | consumed tokens: 61084794880 | elapsed time per iteration (s): 1.03 | learning rate: 2.228E-05 | global batch size: 256 | lm loss: 1.897987E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 249.462 | TFLOPs: 41.23 | 15: iteration 116520/ 125429 | consumed samples: 29829120 | consumed tokens: 61090037760 | elapsed time per iteration (s): 2.85 | learning rate: 2.228E-05 | global batch size: 256 | lm loss: 1.882109E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 89.908 | TFLOPs: 14.86 | 15: iteration 116530/ 125429 | consumed samples: 29831680 | consumed tokens: 61095280640 | elapsed time per iteration (s): 1.03 | learning rate: 2.227E-05 | global batch size: 256 | lm loss: 1.902124E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.481 | TFLOPs: 41.06 | 15: iteration 116540/ 125429 | consumed samples: 29834240 | consumed tokens: 61100523520 | elapsed time per iteration (s): 1.04 | learning rate: 2.227E-05 | global batch size: 256 | lm loss: 1.894715E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.151 | TFLOPs: 40.84 | 15: iteration 116550/ 125429 | consumed samples: 29836800 | consumed tokens: 61105766400 | elapsed time per iteration (s): 1.03 | learning rate: 2.226E-05 | global batch size: 256 | lm loss: 1.887910E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.339 | TFLOPs: 41.04 | 15: iteration 116560/ 125429 | consumed samples: 29839360 | consumed tokens: 61111009280 | elapsed time per iteration (s): 1.02 | learning rate: 2.226E-05 | global batch size: 256 | lm loss: 1.895825E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.538 | TFLOPs: 41.40 | 15: iteration 116570/ 125429 | consumed samples: 29841920 | consumed tokens: 61116252160 | elapsed time per iteration (s): 1.04 | learning rate: 
2.225E-05 | global batch size: 256 | lm loss: 1.900768E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.297 | TFLOPs: 40.54 | 15: iteration 116580/ 125429 | consumed samples: 29844480 | consumed tokens: 61121495040 | elapsed time per iteration (s): 1.05 | learning rate: 2.225E-05 | global batch size: 256 | lm loss: 1.896534E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.178 | TFLOPs: 40.35 | 15: iteration 116590/ 125429 | consumed samples: 29847040 | consumed tokens: 61126737920 | elapsed time per iteration (s): 1.05 | learning rate: 2.224E-05 | global batch size: 256 | lm loss: 1.888449E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.713 | TFLOPs: 40.44 | 15: iteration 116600/ 125429 | consumed samples: 29849600 | consumed tokens: 61131980800 | elapsed time per iteration (s): 1.05 | learning rate: 2.224E-05 | global batch size: 256 | lm loss: 1.908084E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.899 | TFLOPs: 40.47 | 15: iteration 116610/ 125429 | consumed samples: 29852160 | consumed tokens: 61137223680 | elapsed time per iteration (s): 1.04 | learning rate: 2.223E-05 | global batch size: 256 | lm loss: 1.879473E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.317 | TFLOPs: 40.87 | 15: iteration 116620/ 125429 | consumed samples: 29854720 | consumed tokens: 61142466560 | elapsed time per iteration (s): 1.06 | learning rate: 2.223E-05 | global batch size: 256 | lm loss: 1.894311E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.596 | TFLOPs: 40.09 | 15: iteration 116630/ 125429 | 
consumed samples: 29857280 | consumed tokens: 61147709440 | elapsed time per iteration (s): 1.03 | learning rate: 2.222E-05 | global batch size: 256 | lm loss: 1.894996E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.577 | TFLOPs: 40.91 | 15: iteration 116640/ 125429 | consumed samples: 29859840 | consumed tokens: 61152952320 | elapsed time per iteration (s): 1.04 | learning rate: 2.222E-05 | global batch size: 256 | lm loss: 1.907900E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.331 | TFLOPs: 40.87 | 15: iteration 116650/ 125429 | consumed samples: 29862400 | consumed tokens: 61158195200 | elapsed time per iteration (s): 1.06 | learning rate: 2.221E-05 | global batch size: 256 | lm loss: 1.895404E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.095 | TFLOPs: 40.01 | 15: iteration 116660/ 125429 | consumed samples: 29864960 | consumed tokens: 61163438080 | elapsed time per iteration (s): 1.08 | learning rate: 2.221E-05 | global batch size: 256 | lm loss: 1.874540E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.191 | TFLOPs: 39.20 | 15: iteration 116670/ 125429 | consumed samples: 29867520 | consumed tokens: 61168680960 | elapsed time per iteration (s): 1.03 | learning rate: 2.220E-05 | global batch size: 256 | lm loss: 1.864625E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.474 | TFLOPs: 41.23 | 15: iteration 116680/ 125429 | consumed samples: 29870080 | consumed tokens: 61173923840 | elapsed time per iteration (s): 1.03 | learning rate: 2.220E-05 | global batch size: 256 | lm loss: 1.919658E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 249.385 | TFLOPs: 41.21 | 15: iteration 116690/ 125429 | consumed samples: 29872640 | consumed tokens: 61179166720 | elapsed time per iteration (s): 1.05 | learning rate: 2.219E-05 | global batch size: 256 | lm loss: 1.907721E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.167 | TFLOPs: 40.19 | 15: iteration 116700/ 125429 | consumed samples: 29875200 | consumed tokens: 61184409600 | elapsed time per iteration (s): 1.04 | learning rate: 2.219E-05 | global batch size: 256 | lm loss: 1.886209E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.026 | TFLOPs: 40.82 | 15: iteration 116710/ 125429 | consumed samples: 29877760 | consumed tokens: 61189652480 | elapsed time per iteration (s): 1.05 | learning rate: 2.218E-05 | global batch size: 256 | lm loss: 1.882210E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.518 | TFLOPs: 40.41 | 15: iteration 116720/ 125429 | consumed samples: 29880320 | consumed tokens: 61194895360 | elapsed time per iteration (s): 1.02 | learning rate: 2.218E-05 | global batch size: 256 | lm loss: 1.911885E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.538 | TFLOPs: 41.40 | 15: iteration 116730/ 125429 | consumed samples: 29882880 | consumed tokens: 61200138240 | elapsed time per iteration (s): 1.04 | learning rate: 2.217E-05 | global batch size: 256 | lm loss: 1.882808E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.136 | TFLOPs: 40.51 | 15: iteration 116740/ 125429 | consumed samples: 29885440 | consumed tokens: 61205381120 | elapsed time per iteration (s): 1.06 | learning rate: 
2.217E-05 | global batch size: 256 | lm loss: 1.906222E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.422 | TFLOPs: 39.90 | 15: iteration 116750/ 125429 | consumed samples: 29888000 | consumed tokens: 61210624000 | elapsed time per iteration (s): 1.20 | learning rate: 2.216E-05 | global batch size: 256 | lm loss: 1.878996E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 212.727 | TFLOPs: 35.15 | 15: iteration 116760/ 125429 | consumed samples: 29890560 | consumed tokens: 61215866880 | elapsed time per iteration (s): 1.05 | learning rate: 2.216E-05 | global batch size: 256 | lm loss: 1.877259E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.949 | TFLOPs: 40.31 | 15: iteration 116770/ 125429 | consumed samples: 29893120 | consumed tokens: 61221109760 | elapsed time per iteration (s): 1.04 | learning rate: 2.215E-05 | global batch size: 256 | lm loss: 1.925934E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.954 | TFLOPs: 40.81 | 15: iteration 116780/ 125429 | consumed samples: 29895680 | consumed tokens: 61226352640 | elapsed time per iteration (s): 1.02 | learning rate: 2.215E-05 | global batch size: 256 | lm loss: 1.881709E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.304 | TFLOPs: 41.36 | 15: iteration 116790/ 125429 | consumed samples: 29898240 | consumed tokens: 61231595520 | elapsed time per iteration (s): 1.04 | learning rate: 2.214E-05 | global batch size: 256 | lm loss: 1.903034E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.137 | TFLOPs: 40.68 | 15: iteration 116800/ 125429 | 
consumed samples: 29900800 | consumed tokens: 61236838400 | elapsed time per iteration (s): 1.03 | learning rate: 2.214E-05 | global batch size: 256 | lm loss: 1.896135E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.242 | TFLOPs: 41.02 | 15: iteration 116810/ 125429 | consumed samples: 29903360 | consumed tokens: 61242081280 | elapsed time per iteration (s): 1.03 | learning rate: 2.213E-05 | global batch size: 256 | lm loss: 1.905910E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.321 | TFLOPs: 41.20 | 15: iteration 116820/ 125429 | consumed samples: 29905920 | consumed tokens: 61247324160 | elapsed time per iteration (s): 1.04 | learning rate: 2.213E-05 | global batch size: 256 | lm loss: 1.887532E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.853 | TFLOPs: 40.63 | 15: iteration 116830/ 125429 | consumed samples: 29908480 | consumed tokens: 61252567040 | elapsed time per iteration (s): 1.04 | learning rate: 2.212E-05 | global batch size: 256 | lm loss: 1.885155E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.952 | TFLOPs: 40.81 | 15: iteration 116840/ 125429 | consumed samples: 29911040 | consumed tokens: 61257809920 | elapsed time per iteration (s): 1.19 | learning rate: 2.212E-05 | global batch size: 256 | lm loss: 1.883858E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.977 | TFLOPs: 35.53 | 15: iteration 116850/ 125429 | consumed samples: 29913600 | consumed tokens: 61263052800 | elapsed time per iteration (s): 2.70 | learning rate: 2.211E-05 | global batch size: 256 | lm loss: 1.930358E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 94.706 | TFLOPs: 15.65 | 15: iteration 116860/ 125429 | consumed samples: 29916160 | consumed tokens: 61268295680 | elapsed time per iteration (s): 1.03 | learning rate: 2.211E-05 | global batch size: 256 | lm loss: 1.902700E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.354 | TFLOPs: 40.88 | 15: iteration 116870/ 125429 | consumed samples: 29918720 | consumed tokens: 61273538560 | elapsed time per iteration (s): 1.07 | learning rate: 2.210E-05 | global batch size: 256 | lm loss: 1.905295E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.028 | TFLOPs: 39.67 | 15: iteration 116880/ 125429 | consumed samples: 29921280 | consumed tokens: 61278781440 | elapsed time per iteration (s): 1.04 | learning rate: 2.210E-05 | global batch size: 256 | lm loss: 1.888980E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.961 | TFLOPs: 40.65 | 15: iteration 116890/ 125429 | consumed samples: 29923840 | consumed tokens: 61284024320 | elapsed time per iteration (s): 1.07 | learning rate: 2.209E-05 | global batch size: 256 | lm loss: 1.905298E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.004 | TFLOPs: 39.66 | 15: iteration 116900/ 125429 | consumed samples: 29926400 | consumed tokens: 61289267200 | elapsed time per iteration (s): 1.04 | learning rate: 2.209E-05 | global batch size: 256 | lm loss: 1.900428E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.411 | TFLOPs: 40.56 | 15: iteration 116910/ 125429 | consumed samples: 29928960 | consumed tokens: 61294510080 | elapsed time per iteration (s): 1.03 | learning rate: 
2.208E-05 | global batch size: 256 | lm loss: 1.862619E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.858 | TFLOPs: 41.13 | 15: iteration 116920/ 125429 | consumed samples: 29931520 | consumed tokens: 61299752960 | elapsed time per iteration (s): 1.03 | learning rate: 2.208E-05 | global batch size: 256 | lm loss: 1.914233E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.133 | TFLOPs: 41.17 | 15: iteration 116930/ 125429 | consumed samples: 29934080 | consumed tokens: 61304995840 | elapsed time per iteration (s): 1.06 | learning rate: 2.207E-05 | global batch size: 256 | lm loss: 1.885665E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.086 | TFLOPs: 39.84 | 15: iteration 116940/ 125429 | consumed samples: 29936640 | consumed tokens: 61310238720 | elapsed time per iteration (s): 1.04 | learning rate: 2.207E-05 | global batch size: 256 | lm loss: 1.857742E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.226 | TFLOPs: 40.69 | 15: iteration 116950/ 125429 | consumed samples: 29939200 | consumed tokens: 61315481600 | elapsed time per iteration (s): 1.04 | learning rate: 2.206E-05 | global batch size: 256 | lm loss: 1.894923E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.007 | TFLOPs: 40.65 | 15: iteration 116960/ 125429 | consumed samples: 29941760 | consumed tokens: 61320724480 | elapsed time per iteration (s): 1.04 | learning rate: 2.206E-05 | global batch size: 256 | lm loss: 1.856412E+00 | grad norm: 0.151 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.158 | TFLOPs: 40.68 | 15: iteration 116970/ 125429 | 
consumed samples: 29944320 | consumed tokens: 61325967360 | elapsed time per iteration (s): 1.06 | learning rate: 2.205E-05 | global batch size: 256 | lm loss: 1.885639E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.479 | TFLOPs: 40.07 | 15: iteration 116980/ 125429 | consumed samples: 29946880 | consumed tokens: 61331210240 | elapsed time per iteration (s): 1.02 | learning rate: 2.205E-05 | global batch size: 256 | lm loss: 1.895693E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.236 | TFLOPs: 41.35 | 15: iteration 116990/ 125429 | consumed samples: 29949440 | consumed tokens: 61336453120 | elapsed time per iteration (s): 1.03 | learning rate: 2.204E-05 | global batch size: 256 | lm loss: 1.899170E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.013 | TFLOPs: 40.99 | 15: iteration 117000/ 125429 | consumed samples: 29952000 | consumed tokens: 61341696000 | elapsed time per iteration (s): 1.04 | learning rate: 2.204E-05 | global batch size: 256 | lm loss: 1.915466E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.700 | TFLOPs: 40.60 | 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 117000 | lm loss value: 1.864023E+00 | lm loss PPL: 6.449630E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 117000 to checkpoints_1b5 0: [2022-11-27 06:39:41,519] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step117000 is begin to save! 
0: [2022-11-27 06:39:41,529] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_01-model_00-model_states.pt... 0: [2022-11-27 06:39:41,772] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_01-model_00-model_states.pt. 0: [2022-11-27 06:39:41,772] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_03-model_00-model_states.pt... 0: [2022-11-27 06:39:41,874] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_03-model_00-model_states.pt. 0: [2022-11-27 06:39:41,874] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_04-model_00-model_states.pt... 0: [2022-11-27 06:39:41,979] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_04-model_00-model_states.pt. 0: [2022-11-27 06:39:41,979] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_05-model_00-model_states.pt... 0: [2022-11-27 06:39:42,078] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_05-model_00-model_states.pt. 0: [2022-11-27 06:39:42,078] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_06-model_00-model_states.pt... 0: [2022-11-27 06:39:42,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_06-model_00-model_states.pt. 0: [2022-11-27 06:39:42,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_07-model_00-model_states.pt... 0: [2022-11-27 06:39:42,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_07-model_00-model_states.pt. 
0: [2022-11-27 06:39:42,281] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_08-model_00-model_states.pt... 0: [2022-11-27 06:39:42,381] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_08-model_00-model_states.pt. 0: [2022-11-27 06:39:42,382] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_09-model_00-model_states.pt... 0: [2022-11-27 06:39:42,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_09-model_00-model_states.pt. 0: [2022-11-27 06:39:42,483] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_10-model_00-model_states.pt... 0: [2022-11-27 06:39:42,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_10-model_00-model_states.pt. 0: [2022-11-27 06:39:42,587] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_11-model_00-model_states.pt... 0: [2022-11-27 06:39:42,693] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_11-model_00-model_states.pt. 0: [2022-11-27 06:39:42,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_12-model_00-model_states.pt... 0: [2022-11-27 06:39:42,800] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_12-model_00-model_states.pt. 0: [2022-11-27 06:39:42,800] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_13-model_00-model_states.pt... 0: [2022-11-27 06:39:42,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_13-model_00-model_states.pt. 
0: [2022-11-27 06:39:42,907] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_14-model_00-model_states.pt... 0: [2022-11-27 06:39:43,015] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_14-model_00-model_states.pt. 0: [2022-11-27 06:39:43,016] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_15-model_00-model_states.pt... 0: [2022-11-27 06:39:43,123] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_15-model_00-model_states.pt. 0: [2022-11-27 06:39:43,124] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_16-model_00-model_states.pt... 0: [2022-11-27 06:39:43,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_16-model_00-model_states.pt. 0: [2022-11-27 06:39:43,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_17-model_00-model_states.pt... 0: [2022-11-27 06:39:43,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_17-model_00-model_states.pt. 0: [2022-11-27 06:39:43,345] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_18-model_00-model_states.pt... 0: [2022-11-27 06:39:43,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_18-model_00-model_states.pt. 0: [2022-11-27 06:39:43,448] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_19-model_00-model_states.pt... 0: [2022-11-27 06:39:43,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_19-model_00-model_states.pt. 
0: [2022-11-27 06:39:43,555] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_20-model_00-model_states.pt... 0: [2022-11-27 06:39:43,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_20-model_00-model_states.pt. 0: [2022-11-27 06:39:43,665] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_21-model_00-model_states.pt... 0: [2022-11-27 06:39:43,770] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_21-model_00-model_states.pt. 0: [2022-11-27 06:39:43,770] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_22-model_00-model_states.pt... 0: [2022-11-27 06:39:43,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_22-model_00-model_states.pt. 0: [2022-11-27 06:39:43,876] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_23-model_00-model_states.pt... 0: [2022-11-27 06:39:43,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_23-model_00-model_states.pt. 0: [2022-11-27 06:39:43,978] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_24-model_00-model_states.pt... 0: [2022-11-27 06:39:44,081] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_24-model_00-model_states.pt. 0: [2022-11-27 06:39:44,082] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_25-model_00-model_states.pt... 0: [2022-11-27 06:39:44,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_25-model_00-model_states.pt. 
0: [2022-11-27 06:39:44,188] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_26-model_00-model_states.pt... 0: [2022-11-27 06:39:44,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_26-model_00-model_states.pt. 0: [2022-11-27 06:39:44,294] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_27-model_00-model_states.pt... 0: [2022-11-27 06:39:44,402] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_27-model_00-model_states.pt. 0: [2022-11-27 06:39:44,403] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_28-model_00-model_states.pt... 0: [2022-11-27 06:39:44,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_28-model_00-model_states.pt. 0: [2022-11-27 06:39:44,507] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_29-model_00-model_states.pt... 0: [2022-11-27 06:39:44,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_29-model_00-model_states.pt. 0: [2022-11-27 06:39:44,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_30-model_00-model_states.pt... 0: [2022-11-27 06:39:44,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_30-model_00-model_states.pt. 0: [2022-11-27 06:39:44,722] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/layer_32-model_00-model_states.pt... 0: [2022-11-27 06:39:44,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/layer_32-model_00-model_states.pt. 
0: [2022-11-27 06:39:44,725] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step117000/mp_rank_00_model_states.pt 0: [2022-11-27 06:39:44,725] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/mp_rank_00_model_states.pt... 0: [2022-11-27 06:39:44,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/mp_rank_00_model_states.pt. 0: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 
12: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 
15: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 
14: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 
7: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
13: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 
3: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
11: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 
10: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 
2: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:39:44,767] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step117000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:39:44,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
5: [2022-11-27 06:39:44,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:39:44,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:39:44,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 06:39:44,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 6: [2022-11-27 06:39:44,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:39:44,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 06:39:44,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 11: [2022-11-27 06:39:44,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:39:44,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:39:44,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 06:39:44,929] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
5: [2022-11-27 06:39:44,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 06:39:44,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 14: [2022-11-27 06:39:44,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:39:44,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 06:39:44,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 1: [2022-11-27 06:39:44,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:39:44,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:39:44,934] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 06:39:44,934] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 12: [2022-11-27 06:39:44,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:39:44,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 06:39:44,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
0: [2022-11-27 06:39:44,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:39:44,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 06:39:44,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 7: [2022-11-27 06:39:44,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:39:44,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 06:39:44,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 9: [2022-11-27 06:39:44,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:39:44,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 06:39:44,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 6: [2022-11-27 06:39:44,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:39:44,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 06:39:44,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
10: [2022-11-27 06:39:44,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:39:44,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 06:39:44,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 11: [2022-11-27 06:39:44,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 06:39:44,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 11: [2022-11-27 06:39:44,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:39:44,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 06:39:44,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 11: [2022-11-27 06:39:44,936] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:39:44,936] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 06:39:44,936] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 11: [2022-11-27 06:39:44,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
0: [2022-11-27 06:39:44,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:39:44,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 06:39:44,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 15: [2022-11-27 06:39:44,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:39:44,939] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:39:44,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 06:39:44,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 9: [2022-11-27 06:39:44,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:39:44,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 06:39:44,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 1: [2022-11-27 06:39:44,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 06:39:44,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
1: [2022-11-27 06:39:44,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:39:44,933] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 06:39:44,933] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 10: [2022-11-27 06:39:44,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:39:44,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:39:44,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:39:44,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:39:44,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 06:39:44,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 3: [2022-11-27 06:39:44,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 10: [2022-11-27 06:39:44,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 06:39:44,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
3: [2022-11-27 06:39:44,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 06:39:44,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 3: [2022-11-27 06:39:44,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 12: [2022-11-27 06:39:44,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:39:44,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 06:39:44,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 1: [2022-11-27 06:39:44,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:39:44,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 06:39:44,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 5: [2022-11-27 06:39:44,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:39:44,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 06:39:44,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
4: [2022-11-27 06:39:44,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:39:44,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 06:39:44,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 4: [2022-11-27 06:39:44,941] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:39:44,941] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 06:39:44,941] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 0: [2022-11-27 06:39:44,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:39:44,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 06:39:44,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 6: [2022-11-27 06:39:44,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:39:44,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 06:39:44,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
4: [2022-11-27 06:39:44,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:39:44,944] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 06:39:44,944] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 0: [2022-11-27 06:39:44,946] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:39:44,946] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 06:39:44,946] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 14: [2022-11-27 06:39:44,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:39:44,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:39:44,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 06:39:44,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 06:39:44,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 14: [2022-11-27 06:39:44,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
14: [2022-11-27 06:39:44,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:39:44,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 06:39:44,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 0: [2022-11-27 06:39:44,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:39:44,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:39:44,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 0: [2022-11-27 06:39:44,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 4: [2022-11-27 06:39:44,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 0: [2022-11-27 06:39:44,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 12: [2022-11-27 06:39:44,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:39:44,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 06:39:44,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
9: [2022-11-27 06:39:44,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:39:44,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 06:39:44,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 9: [2022-11-27 06:39:44,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:39:44,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:39:44,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 06:39:44,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 06:39:44,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 9: [2022-11-27 06:39:44,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 10: [2022-11-27 06:39:44,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:39:44,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 06:39:44,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
13: [2022-11-27 06:39:44,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:39:44,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:39:44,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 06:39:44,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 12: [2022-11-27 06:39:44,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:39:44,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 06:39:44,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 6: [2022-11-27 06:39:44,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:39:44,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 06:39:44,954] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 0: [2022-11-27 06:39:44,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-27 06:39:44,954] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 06:39:44,955] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 1: [2022-11-27 06:39:44,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:39:44,958] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 06:39:44,958] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 15: [2022-11-27 06:39:44,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 06:39:44,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 15: [2022-11-27 06:39:44,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:39:44,947] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:39:44,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 06:39:44,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
15: [2022-11-27 06:39:44,947] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 06:39:44,947] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 15: [2022-11-27 06:39:44,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:39:44,950] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 06:39:44,950] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 15: [2022-11-27 06:39:44,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:39:44,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 06:39:44,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 15: [2022-11-27 06:39:44,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:39:44,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 06:39:44,951] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 1: [2022-11-27 06:39:44,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-27 06:39:44,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 06:39:44,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 1: [2022-11-27 06:39:44,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:39:44,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 06:39:44,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 1: [2022-11-27 06:39:44,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:39:44,959] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 06:39:44,959] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 15: [2022-11-27 06:39:44,959] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:39:44,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 06:39:44,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 7: [2022-11-27 06:39:44,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 
7: [2022-11-27 06:39:44,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:39:44,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:39:44,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 06:39:44,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 7: [2022-11-27 06:39:44,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:39:44,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:39:44,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 06:39:44,960] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:39:44,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 06:39:44,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
7: [2022-11-27 06:39:44,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 06:39:44,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 06:39:44,960] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 06:39:44,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 7: [2022-11-27 06:39:44,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 7: [2022-11-27 06:39:44,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 7: [2022-11-27 06:39:44,960] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 10: [2022-11-27 06:39:44,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:39:44,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 06:39:44,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 10: [2022-11-27 06:39:44,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-27 06:39:44,961] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 06:39:44,961] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 9: [2022-11-27 06:39:44,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:39:44,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:39:44,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 06:39:44,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 06:39:44,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 9: [2022-11-27 06:39:44,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 7: [2022-11-27 06:39:44,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:39:44,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 06:39:44,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 10: [2022-11-27 06:39:44,962] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-27 06:39:44,962] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 06:39:44,962] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 12: [2022-11-27 06:39:44,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:39:44,964] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:39:44,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 06:39:44,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 06:39:44,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 12: [2022-11-27 06:39:44,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 11: [2022-11-27 06:39:44,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 06:39:44,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 11: [2022-11-27 06:39:44,942] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
11: [2022-11-27 06:39:44,942] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 06:39:44,942] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 6: [2022-11-27 06:39:44,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:39:44,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 06:39:44,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 3: [2022-11-27 06:39:44,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:39:44,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:39:44,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 06:39:44,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 06:39:44,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 3: [2022-11-27 06:39:44,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
13: [2022-11-27 06:39:44,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 06:39:44,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 13: [2022-11-27 06:39:44,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:39:44,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 06:39:44,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 13: [2022-11-27 06:39:44,944] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:39:44,945] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 06:39:44,945] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 13: [2022-11-27 06:39:44,949] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:39:44,949] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 06:39:44,949] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 13: [2022-11-27 06:39:44,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-27 06:39:44,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 06:39:44,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 13: [2022-11-27 06:39:44,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:39:44,956] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 06:39:44,956] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 13: [2022-11-27 06:39:44,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:39:44,965] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 06:39:44,965] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 5: [2022-11-27 06:39:44,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:39:44,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 06:39:44,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 5: [2022-11-27 06:39:44,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-27 06:39:44,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 06:39:44,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:39:44,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 5: [2022-11-27 06:39:44,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 06:39:44,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 10: [2022-11-27 06:39:44,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:39:44,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 06:39:44,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 10: [2022-11-27 06:39:44,972] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:39:44,972] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 06:39:44,972] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 15: [2022-11-27 06:39:44,983] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
4: [2022-11-27 06:39:44,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:39:44,951] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 06:39:44,952] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 4: [2022-11-27 06:39:44,953] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:39:44,953] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 06:39:44,953] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 4: [2022-11-27 06:39:44,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:39:44,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 06:39:44,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 4: [2022-11-27 06:39:44,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:39:44,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 06:39:44,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
6: [2022-11-27 06:39:44,988] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:39:44,988] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 06:39:44,988] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 6: [2022-11-27 06:39:44,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:39:44,989] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 06:39:44,989] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 11: [2022-11-27 06:39:44,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:39:44,968] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 06:39:44,968] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 11: [2022-11-27 06:39:44,971] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:39:44,971] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 06:39:44,971] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
11: [2022-11-27 06:39:44,980] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:39:44,980] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 06:39:44,980] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 14: [2022-11-27 06:39:44,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:39:44,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 06:39:44,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 14: [2022-11-27 06:39:44,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:39:44,963] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 06:39:44,963] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 14: [2022-11-27 06:39:44,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:39:44,964] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 06:39:44,964] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
14: [2022-11-27 06:39:44,978] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:39:44,978] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 06:39:44,978] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 5: [2022-11-27 06:39:44,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:39:44,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 06:39:44,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 5: [2022-11-27 06:39:44,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:39:44,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 06:39:44,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 5: [2022-11-27 06:39:44,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:39:44,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 06:39:44,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
0: [2022-11-27 06:39:44,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 06:39:44,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 15: [2022-11-27 06:39:44,983] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 06:39:44,983] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 8: [2022-11-27 06:39:45,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:39:45,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 06:39:45,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 8: [2022-11-27 06:39:45,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:39:45,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:39:45,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:39:45,007] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:39:45,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-27 06:39:45,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 06:39:45,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 06:39:45,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 06:39:45,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 8: [2022-11-27 06:39:45,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 8: [2022-11-27 06:39:45,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 06:39:45,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 06:39:45,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 8: [2022-11-27 06:39:45,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 8: [2022-11-27 06:39:45,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 8: [2022-11-27 06:39:45,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:39:45,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-27 06:39:45,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 06:39:45,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 06:39:45,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 8: [2022-11-27 06:39:45,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 3: [2022-11-27 06:39:45,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:39:45,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 06:39:45,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 3: [2022-11-27 06:39:45,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:39:45,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 06:39:45,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 3: [2022-11-27 06:39:45,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-27 06:39:45,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 06:39:45,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 3: [2022-11-27 06:39:45,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:39:45,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 06:39:45,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 13: [2022-11-27 06:39:45,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:39:45,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 06:39:45,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 2: [2022-11-27 06:39:45,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:39:45,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:39:45,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:39:45,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-27 06:39:45,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:39:45,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:39:45,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:39:45,149] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:39:45,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 06:39:45,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 06:39:45,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 06:39:45,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 06:39:45,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 06:39:45,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 06:39:45,149] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 06:39:45,149] 
[INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step117000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 06:39:45,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 2: [2022-11-27 06:39:45,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 2: [2022-11-27 06:39:45,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 2: [2022-11-27 06:39:45,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 2: [2022-11-27 06:39:45,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 2: [2022-11-27 06:39:45,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 2: [2022-11-27 06:39:45,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 2: [2022-11-27 06:39:45,149] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step117000 is ready now! 
0: successfully saved checkpoint at iteration 117000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3674.50 15: iteration 117010/ 125429 | consumed samples: 29954560 | consumed tokens: 61346938880 | elapsed time per iteration (s): 1.44 | learning rate: 2.203E-05 | global batch size: 256 | lm loss: 1.881540E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.359 | TFLOPs: 29.31 | 15: iteration 117020/ 125429 | consumed samples: 29957120 | consumed tokens: 61352181760 | elapsed time per iteration (s): 1.03 | learning rate: 2.203E-05 | global batch size: 256 | lm loss: 1.912219E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.879 | TFLOPs: 40.96 | 15: iteration 117030/ 125429 | consumed samples: 29959680 | consumed tokens: 61357424640 | elapsed time per iteration (s): 1.04 | learning rate: 2.202E-05 | global batch size: 256 | lm loss: 1.917500E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.256 | TFLOPs: 40.86 | 15: iteration 117040/ 125429 | consumed samples: 29962240 | consumed tokens: 61362667520 | elapsed time per iteration (s): 1.09 | learning rate: 2.202E-05 | global batch size: 256 | lm loss: 1.892784E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.778 | TFLOPs: 38.80 | 15: iteration 117050/ 125429 | consumed samples: 29964800 | consumed tokens: 61367910400 | elapsed time per iteration (s): 1.02 | learning rate: 2.201E-05 | global batch size: 256 | lm loss: 1.860925E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.259 | TFLOPs: 41.36 | 15: iteration 117060/ 125429 | consumed samples: 29967360 | consumed tokens: 61373153280 | elapsed time per iteration (s): 
1.05 | learning rate: 2.201E-05 | global batch size: 256 | lm loss: 1.883432E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.909 | TFLOPs: 40.14 | 15: iteration 117070/ 125429 | consumed samples: 29969920 | consumed tokens: 61378396160 | elapsed time per iteration (s): 1.04 | learning rate: 2.201E-05 | global batch size: 256 | lm loss: 1.896128E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.035 | TFLOPs: 40.82 | 15: iteration 117080/ 125429 | consumed samples: 29972480 | consumed tokens: 61383639040 | elapsed time per iteration (s): 1.04 | learning rate: 2.200E-05 | global batch size: 256 | lm loss: 1.871492E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.109 | TFLOPs: 40.67 | 15: iteration 117090/ 125429 | consumed samples: 29975040 | consumed tokens: 61388881920 | elapsed time per iteration (s): 1.18 | learning rate: 2.200E-05 | global batch size: 256 | lm loss: 1.883811E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.063 | TFLOPs: 35.87 | 15: iteration 117100/ 125429 | consumed samples: 29977600 | consumed tokens: 61394124800 | elapsed time per iteration (s): 1.06 | learning rate: 2.199E-05 | global batch size: 256 | lm loss: 1.900389E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.102 | TFLOPs: 39.84 | 15: iteration 117110/ 125429 | consumed samples: 29980160 | consumed tokens: 61399367680 | elapsed time per iteration (s): 1.04 | learning rate: 2.199E-05 | global batch size: 256 | lm loss: 1.893437E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.378 | TFLOPs: 40.55 | 15: 
iteration 117120/ 125429 | consumed samples: 29982720 | consumed tokens: 61404610560 | elapsed time per iteration (s): 1.06 | learning rate: 2.198E-05 | global batch size: 256 | lm loss: 1.895662E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.109 | TFLOPs: 40.01 | 15: iteration 117130/ 125429 | consumed samples: 29985280 | consumed tokens: 61409853440 | elapsed time per iteration (s): 1.05 | learning rate: 2.198E-05 | global batch size: 256 | lm loss: 1.863819E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.418 | TFLOPs: 40.39 | 15: iteration 117140/ 125429 | consumed samples: 29987840 | consumed tokens: 61415096320 | elapsed time per iteration (s): 1.05 | learning rate: 2.197E-05 | global batch size: 256 | lm loss: 1.893703E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.045 | TFLOPs: 40.33 | 15: iteration 117150/ 125429 | consumed samples: 29990400 | consumed tokens: 61420339200 | elapsed time per iteration (s): 1.08 | learning rate: 2.197E-05 | global batch size: 256 | lm loss: 1.887971E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.287 | TFLOPs: 39.05 | 15: iteration 117160/ 125429 | consumed samples: 29992960 | consumed tokens: 61425582080 | elapsed time per iteration (s): 1.03 | learning rate: 2.196E-05 | global batch size: 256 | lm loss: 1.895437E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.619 | TFLOPs: 40.92 | 15: iteration 117170/ 125429 | consumed samples: 29995520 | consumed tokens: 61430824960 | elapsed time per iteration (s): 1.04 | learning rate: 2.196E-05 | global batch size: 256 | lm loss: 1.899537E+00 | grad norm: 0.163 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.187 | TFLOPs: 40.52 | 15: iteration 117180/ 125429 | consumed samples: 29998080 | consumed tokens: 61436067840 | elapsed time per iteration (s): 1.02 | learning rate: 2.195E-05 | global batch size: 256 | lm loss: 1.914056E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.365 | TFLOPs: 41.37 | 15: iteration 117190/ 125429 | consumed samples: 30000640 | consumed tokens: 61441310720 | elapsed time per iteration (s): 1.06 | learning rate: 2.195E-05 | global batch size: 256 | lm loss: 1.889229E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.125 | TFLOPs: 40.01 | 15: iteration 117200/ 125429 | consumed samples: 30003200 | consumed tokens: 61446553600 | elapsed time per iteration (s): 1.04 | learning rate: 2.194E-05 | global batch size: 256 | lm loss: 1.904528E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.083 | TFLOPs: 40.67 | 15: iteration 117210/ 125429 | consumed samples: 30005760 | consumed tokens: 61451796480 | elapsed time per iteration (s): 1.02 | learning rate: 2.194E-05 | global batch size: 256 | lm loss: 1.890841E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.805 | TFLOPs: 41.45 | 15: iteration 117220/ 125429 | consumed samples: 30008320 | consumed tokens: 61457039360 | elapsed time per iteration (s): 1.04 | learning rate: 2.193E-05 | global batch size: 256 | lm loss: 1.875710E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.161 | TFLOPs: 40.85 | 15: iteration 117230/ 125429 | consumed samples: 30010880 | consumed tokens: 61462282240 | elapsed time per iteration (s): 1.04 | 
learning rate: 2.193E-05 | global batch size: 256 | lm loss: 1.877674E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.036 | TFLOPs: 40.66 | 15: iteration 117240/ 125429 | consumed samples: 30013440 | consumed tokens: 61467525120 | elapsed time per iteration (s): 1.05 | learning rate: 2.192E-05 | global batch size: 256 | lm loss: 1.908508E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.991 | TFLOPs: 40.16 | 15: iteration 117250/ 125429 | consumed samples: 30016000 | consumed tokens: 61472768000 | elapsed time per iteration (s): 1.03 | learning rate: 2.192E-05 | global batch size: 256 | lm loss: 1.897206E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.506 | TFLOPs: 40.90 | 15: iteration 117260/ 125429 | consumed samples: 30018560 | consumed tokens: 61478010880 | elapsed time per iteration (s): 1.08 | learning rate: 2.192E-05 | global batch size: 256 | lm loss: 1.886272E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.725 | TFLOPs: 39.12 | 15: iteration 117270/ 125429 | consumed samples: 30021120 | consumed tokens: 61483253760 | elapsed time per iteration (s): 1.04 | learning rate: 2.191E-05 | global batch size: 256 | lm loss: 1.874267E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.129 | TFLOPs: 40.84 | 15: iteration 117280/ 125429 | consumed samples: 30023680 | consumed tokens: 61488496640 | elapsed time per iteration (s): 1.02 | learning rate: 2.191E-05 | global batch size: 256 | lm loss: 1.924084E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.869 | TFLOPs: 41.46 | 15: iteration 
117290/ 125429 | consumed samples: 30026240 | consumed tokens: 61493739520 | elapsed time per iteration (s): 1.06 | learning rate: 2.190E-05 | global batch size: 256 | lm loss: 1.898898E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.295 | TFLOPs: 39.88 | 15: iteration 117300/ 125429 | consumed samples: 30028800 | consumed tokens: 61498982400 | elapsed time per iteration (s): 1.05 | learning rate: 2.190E-05 | global batch size: 256 | lm loss: 1.892368E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.476 | TFLOPs: 40.40 | 15: iteration 117310/ 125429 | consumed samples: 30031360 | consumed tokens: 61504225280 | elapsed time per iteration (s): 1.03 | learning rate: 2.189E-05 | global batch size: 256 | lm loss: 1.898637E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.462 | TFLOPs: 40.89 | 15: iteration 117320/ 125429 | consumed samples: 30033920 | consumed tokens: 61509468160 | elapsed time per iteration (s): 1.03 | learning rate: 2.189E-05 | global batch size: 256 | lm loss: 1.920214E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.446 | TFLOPs: 41.06 | 15: iteration 117330/ 125429 | consumed samples: 30036480 | consumed tokens: 61514711040 | elapsed time per iteration (s): 1.02 | learning rate: 2.188E-05 | global batch size: 256 | lm loss: 1.879865E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.197 | TFLOPs: 41.51 | 15: iteration 117340/ 125429 | consumed samples: 30039040 | consumed tokens: 61519953920 | elapsed time per iteration (s): 1.05 | learning rate: 2.188E-05 | global batch size: 256 | lm loss: 1.894170E+00 | grad norm: 0.161 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.019 | TFLOPs: 40.33 | 15: iteration 117350/ 125429 | consumed samples: 30041600 | consumed tokens: 61525196800 | elapsed time per iteration (s): 1.03 | learning rate: 2.187E-05 | global batch size: 256 | lm loss: 1.915847E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.772 | TFLOPs: 41.11 | 15: iteration 117360/ 125429 | consumed samples: 30044160 | consumed tokens: 61530439680 | elapsed time per iteration (s): 1.02 | learning rate: 2.187E-05 | global batch size: 256 | lm loss: 1.892858E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.939 | TFLOPs: 41.63 | 15: iteration 117370/ 125429 | consumed samples: 30046720 | consumed tokens: 61535682560 | elapsed time per iteration (s): 1.02 | learning rate: 2.186E-05 | global batch size: 256 | lm loss: 1.899116E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.145 | TFLOPs: 41.34 | 15: iteration 117380/ 125429 | consumed samples: 30049280 | consumed tokens: 61540925440 | elapsed time per iteration (s): 1.03 | learning rate: 2.186E-05 | global batch size: 256 | lm loss: 1.901193E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.423 | TFLOPs: 41.22 | 15: iteration 117390/ 125429 | consumed samples: 30051840 | consumed tokens: 61546168320 | elapsed time per iteration (s): 1.02 | learning rate: 2.186E-05 | global batch size: 256 | lm loss: 1.888094E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.893 | TFLOPs: 41.63 | 15: iteration 117400/ 125429 | consumed samples: 30054400 | consumed tokens: 61551411200 | elapsed time per iteration (s): 1.19 | learning 
rate: 2.185E-05 | global batch size: 256 | lm loss: 1.906012E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.526 | TFLOPs: 35.45 | 15: iteration 117410/ 125429 | consumed samples: 30056960 | consumed tokens: 61556654080 | elapsed time per iteration (s): 1.04 | learning rate: 2.185E-05 | global batch size: 256 | lm loss: 1.880581E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.959 | TFLOPs: 40.65 | 15: iteration 117420/ 125429 | consumed samples: 30059520 | consumed tokens: 61561896960 | elapsed time per iteration (s): 1.19 | learning rate: 2.184E-05 | global batch size: 256 | lm loss: 1.895457E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.297 | TFLOPs: 35.41 | 15: iteration 117430/ 125429 | consumed samples: 30062080 | consumed tokens: 61567139840 | elapsed time per iteration (s): 1.09 | learning rate: 2.184E-05 | global batch size: 256 | lm loss: 1.868900E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.808 | TFLOPs: 38.80 | 15: iteration 117440/ 125429 | consumed samples: 30064640 | consumed tokens: 61572382720 | elapsed time per iteration (s): 1.06 | learning rate: 2.183E-05 | global batch size: 256 | lm loss: 1.895201E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.972 | TFLOPs: 39.99 | 15: iteration 117450/ 125429 | consumed samples: 30067200 | consumed tokens: 61577625600 | elapsed time per iteration (s): 1.04 | learning rate: 2.183E-05 | global batch size: 256 | lm loss: 1.884567E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.572 | TFLOPs: 40.75 | 15: iteration 117460/ 
125429 | consumed samples: 30069760 | consumed tokens: 61582868480 | elapsed time per iteration (s): 1.08 | learning rate: 2.182E-05 | global batch size: 256 | lm loss: 1.910708E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.034 | TFLOPs: 39.34 | 15: iteration 117470/ 125429 | consumed samples: 30072320 | consumed tokens: 61588111360 | elapsed time per iteration (s): 1.03 | learning rate: 2.182E-05 | global batch size: 256 | lm loss: 1.892568E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.543 | TFLOPs: 40.91 | 15: iteration 117480/ 125429 | consumed samples: 30074880 | consumed tokens: 61593354240 | elapsed time per iteration (s): 1.05 | learning rate: 2.181E-05 | global batch size: 256 | lm loss: 1.871475E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.867 | TFLOPs: 40.30 | 15: iteration 117490/ 125429 | consumed samples: 30077440 | consumed tokens: 61598597120 | elapsed time per iteration (s): 1.03 | learning rate: 2.181E-05 | global batch size: 256 | lm loss: 1.886308E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.519 | TFLOPs: 41.24 | 15: iteration 117500/ 125429 | consumed samples: 30080000 | consumed tokens: 61603840000 | elapsed time per iteration (s): 1.08 | learning rate: 2.180E-05 | global batch size: 256 | lm loss: 1.888565E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.149 | TFLOPs: 39.19 | 15: iteration 117510/ 125429 | consumed samples: 30082560 | consumed tokens: 61609082880 | elapsed time per iteration (s): 1.09 | learning rate: 2.180E-05 | global batch size: 256 | lm loss: 1.912280E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 235.812 | TFLOPs: 38.97 | 15: iteration 117520/ 125429 | consumed samples: 30085120 | consumed tokens: 61614325760 | elapsed time per iteration (s): 1.06 | learning rate: 2.180E-05 | global batch size: 256 | lm loss: 1.895080E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.503 | TFLOPs: 39.91 | 15: iteration 117530/ 125429 | consumed samples: 30087680 | consumed tokens: 61619568640 | elapsed time per iteration (s): 1.04 | learning rate: 2.179E-05 | global batch size: 256 | lm loss: 1.875969E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.277 | TFLOPs: 40.86 | 15: iteration 117540/ 125429 | consumed samples: 30090240 | consumed tokens: 61624811520 | elapsed time per iteration (s): 1.05 | learning rate: 2.179E-05 | global batch size: 256 | lm loss: 1.892810E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.395 | TFLOPs: 40.39 | 15: iteration 117550/ 125429 | consumed samples: 30092800 | consumed tokens: 61630054400 | elapsed time per iteration (s): 1.04 | learning rate: 2.178E-05 | global batch size: 256 | lm loss: 1.915430E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.953 | TFLOPs: 40.81 | 15: iteration 117560/ 125429 | consumed samples: 30095360 | consumed tokens: 61635297280 | elapsed time per iteration (s): 1.04 | learning rate: 2.178E-05 | global batch size: 256 | lm loss: 1.904809E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.374 | TFLOPs: 40.55 | 15: iteration 117570/ 125429 | consumed samples: 30097920 | consumed tokens: 61640540160 | elapsed time per iteration (s): 1.05 | learning rate: 
2.177E-05 | global batch size: 256 | lm loss: 1.917858E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.446 | TFLOPs: 40.23 | 15: iteration 117580/ 125429 | consumed samples: 30100480 | consumed tokens: 61645783040 | elapsed time per iteration (s): 1.02 | learning rate: 2.177E-05 | global batch size: 256 | lm loss: 1.909913E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.128 | TFLOPs: 41.34 | 15: iteration 117590/ 125429 | consumed samples: 30103040 | consumed tokens: 61651025920 | elapsed time per iteration (s): 1.09 | learning rate: 2.176E-05 | global batch size: 256 | lm loss: 1.878978E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.163 | TFLOPs: 38.70 | 15: iteration 117600/ 125429 | consumed samples: 30105600 | consumed tokens: 61656268800 | elapsed time per iteration (s): 1.04 | learning rate: 2.176E-05 | global batch size: 256 | lm loss: 1.888091E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.257 | TFLOPs: 40.53 | 15: iteration 117610/ 125429 | consumed samples: 30108160 | consumed tokens: 61661511680 | elapsed time per iteration (s): 1.07 | learning rate: 2.176E-05 | global batch size: 256 | lm loss: 1.884798E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.239 | TFLOPs: 39.70 | 15: iteration 117620/ 125429 | consumed samples: 30110720 | consumed tokens: 61666754560 | elapsed time per iteration (s): 1.05 | learning rate: 2.175E-05 | global batch size: 256 | lm loss: 1.895879E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.189 | TFLOPs: 40.19 | 15: iteration 117630/ 125429 | 
consumed samples: 30113280 | consumed tokens: 61671997440 | elapsed time per iteration (s): 1.05 | learning rate: 2.175E-05 | global batch size: 256 | lm loss: 1.910820E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.378 | TFLOPs: 40.22 | 15: iteration 117640/ 125429 | consumed samples: 30115840 | consumed tokens: 61677240320 | elapsed time per iteration (s): 1.04 | learning rate: 2.174E-05 | global batch size: 256 | lm loss: 1.920863E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.495 | TFLOPs: 40.57 | 15: iteration 117650/ 125429 | consumed samples: 30118400 | consumed tokens: 61682483200 | elapsed time per iteration (s): 1.03 | learning rate: 2.174E-05 | global batch size: 256 | lm loss: 1.851891E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.399 | TFLOPs: 40.88 | 15: iteration 117660/ 125429 | consumed samples: 30120960 | consumed tokens: 61687726080 | elapsed time per iteration (s): 1.03 | learning rate: 2.173E-05 | global batch size: 256 | lm loss: 1.912930E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.136 | TFLOPs: 41.17 | 15: iteration 117670/ 125429 | consumed samples: 30123520 | consumed tokens: 61692968960 | elapsed time per iteration (s): 1.06 | learning rate: 2.173E-05 | global batch size: 256 | lm loss: 1.929231E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.529 | TFLOPs: 40.08 | 15: iteration 117680/ 125429 | consumed samples: 30126080 | consumed tokens: 61698211840 | elapsed time per iteration (s): 1.03 | learning rate: 2.172E-05 | global batch size: 256 | lm loss: 1.918462E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 249.719 | TFLOPs: 41.27 | 15: iteration 117690/ 125429 | consumed samples: 30128640 | consumed tokens: 61703454720 | elapsed time per iteration (s): 1.04 | learning rate: 2.172E-05 | global batch size: 256 | lm loss: 1.872536E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.202 | TFLOPs: 40.85 | 15: iteration 117700/ 125429 | consumed samples: 30131200 | consumed tokens: 61708697600 | elapsed time per iteration (s): 1.04 | learning rate: 2.172E-05 | global batch size: 256 | lm loss: 1.897176E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.858 | TFLOPs: 40.63 | 15: iteration 117710/ 125429 | consumed samples: 30133760 | consumed tokens: 61713940480 | elapsed time per iteration (s): 1.19 | learning rate: 2.171E-05 | global batch size: 256 | lm loss: 1.890094E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.426 | TFLOPs: 35.60 | 15: iteration 117720/ 125429 | consumed samples: 30136320 | consumed tokens: 61719183360 | elapsed time per iteration (s): 1.02 | learning rate: 2.171E-05 | global batch size: 256 | lm loss: 1.916111E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.685 | TFLOPs: 41.43 | 15: iteration 117730/ 125429 | consumed samples: 30138880 | consumed tokens: 61724426240 | elapsed time per iteration (s): 1.04 | learning rate: 2.170E-05 | global batch size: 256 | lm loss: 1.903333E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.771 | TFLOPs: 40.78 | 15: iteration 117740/ 125429 | consumed samples: 30141440 | consumed tokens: 61729669120 | elapsed time per iteration (s): 1.03 | learning rate: 
2.170E-05 | global batch size: 256 | lm loss: 1.877429E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.031 | TFLOPs: 41.15 | 15: iteration 117750/ 125429 | consumed samples: 30144000 | consumed tokens: 61734912000 | elapsed time per iteration (s): 1.02 | learning rate: 2.169E-05 | global batch size: 256 | lm loss: 1.888097E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.759 | TFLOPs: 41.27 | 15: iteration 117760/ 125429 | consumed samples: 30146560 | consumed tokens: 61740154880 | elapsed time per iteration (s): 1.03 | learning rate: 2.169E-05 | global batch size: 256 | lm loss: 1.888739E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.189 | TFLOPs: 41.18 | 15: iteration 117770/ 125429 | consumed samples: 30149120 | consumed tokens: 61745397760 | elapsed time per iteration (s): 1.04 | learning rate: 2.168E-05 | global batch size: 256 | lm loss: 1.899703E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.590 | TFLOPs: 40.59 | 15: iteration 117780/ 125429 | consumed samples: 30151680 | consumed tokens: 61750640640 | elapsed time per iteration (s): 1.05 | learning rate: 2.168E-05 | global batch size: 256 | lm loss: 1.884282E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.646 | TFLOPs: 40.43 | 15: iteration 117790/ 125429 | consumed samples: 30154240 | consumed tokens: 61755883520 | elapsed time per iteration (s): 1.07 | learning rate: 2.168E-05 | global batch size: 256 | lm loss: 1.879278E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.257 | TFLOPs: 39.54 | 15: iteration 117800/ 125429 | 
consumed samples: 30156800 | consumed tokens: 61761126400 | elapsed time per iteration (s): 1.19 | learning rate: 2.167E-05 | global batch size: 256 | lm loss: 1.903564E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.778 | TFLOPs: 35.66 | 15: iteration 117810/ 125429 | consumed samples: 30159360 | consumed tokens: 61766369280 | elapsed time per iteration (s): 1.05 | learning rate: 2.167E-05 | global batch size: 256 | lm loss: 1.878377E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.690 | TFLOPs: 40.44 | 15: iteration 117820/ 125429 | consumed samples: 30161920 | consumed tokens: 61771612160 | elapsed time per iteration (s): 1.03 | learning rate: 2.166E-05 | global batch size: 256 | lm loss: 1.886865E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.770 | TFLOPs: 40.95 | 15: iteration 117830/ 125429 | consumed samples: 30164480 | consumed tokens: 61776855040 | elapsed time per iteration (s): 1.20 | learning rate: 2.166E-05 | global batch size: 256 | lm loss: 1.886839E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.049 | TFLOPs: 35.37 | 15: iteration 117840/ 125429 | consumed samples: 30167040 | consumed tokens: 61782097920 | elapsed time per iteration (s): 1.04 | learning rate: 2.165E-05 | global batch size: 256 | lm loss: 1.901998E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.060 | TFLOPs: 40.66 | 15: iteration 117850/ 125429 | consumed samples: 30169600 | consumed tokens: 61787340800 | elapsed time per iteration (s): 1.03 | learning rate: 2.165E-05 | global batch size: 256 | lm loss: 1.860307E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 248.633 | TFLOPs: 41.09 | 15: iteration 117860/ 125429 | consumed samples: 30172160 | consumed tokens: 61792583680 | elapsed time per iteration (s): 1.07 | learning rate: 2.165E-05 | global batch size: 256 | lm loss: 1.878100E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.682 | TFLOPs: 39.61 | 15: iteration 117870/ 125429 | consumed samples: 30174720 | consumed tokens: 61797826560 | elapsed time per iteration (s): 1.05 | learning rate: 2.164E-05 | global batch size: 256 | lm loss: 1.874714E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.074 | TFLOPs: 40.17 | 15: iteration 117880/ 125429 | consumed samples: 30177280 | consumed tokens: 61803069440 | elapsed time per iteration (s): 1.02 | learning rate: 2.164E-05 | global batch size: 256 | lm loss: 1.893916E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.481 | TFLOPs: 41.39 | 15: iteration 117890/ 125429 | consumed samples: 30179840 | consumed tokens: 61808312320 | elapsed time per iteration (s): 1.03 | learning rate: 2.163E-05 | global batch size: 256 | lm loss: 1.885138E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.569 | TFLOPs: 40.91 | 15: iteration 117900/ 125429 | consumed samples: 30182400 | consumed tokens: 61813555200 | elapsed time per iteration (s): 1.07 | learning rate: 2.163E-05 | global batch size: 256 | lm loss: 1.893955E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.619 | TFLOPs: 39.60 | 15: iteration 117910/ 125429 | consumed samples: 30184960 | consumed tokens: 61818798080 | elapsed time per iteration (s): 1.04 | learning rate: 
2.162E-05 | global batch size: 256 | lm loss: 1.886437E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.926 | TFLOPs: 40.64 | 15: iteration 117920/ 125429 | consumed samples: 30187520 | consumed tokens: 61824040960 | elapsed time per iteration (s): 1.05 | learning rate: 2.162E-05 | global batch size: 256 | lm loss: 1.906227E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.809 | TFLOPs: 40.13 | 15: iteration 117930/ 125429 | consumed samples: 30190080 | consumed tokens: 61829283840 | elapsed time per iteration (s): 1.07 | learning rate: 2.161E-05 | global batch size: 256 | lm loss: 1.863733E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.321 | TFLOPs: 39.71 | 15: iteration 117940/ 125429 | consumed samples: 30192640 | consumed tokens: 61834526720 | elapsed time per iteration (s): 1.09 | learning rate: 2.161E-05 | global batch size: 256 | lm loss: 1.893646E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.830 | TFLOPs: 38.64 | 15: iteration 117950/ 125429 | consumed samples: 30195200 | consumed tokens: 61839769600 | elapsed time per iteration (s): 1.07 | learning rate: 2.161E-05 | global batch size: 256 | lm loss: 1.872574E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.781 | TFLOPs: 39.63 | 15: iteration 117960/ 125429 | consumed samples: 30197760 | consumed tokens: 61845012480 | elapsed time per iteration (s): 1.03 | learning rate: 2.160E-05 | global batch size: 256 | lm loss: 1.896102E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.820 | TFLOPs: 41.12 | 15: iteration 117970/ 125429 | 
consumed samples: 30200320 | consumed tokens: 61850255360 | elapsed time per iteration (s): 1.04 | learning rate: 2.160E-05 | global batch size: 256 | lm loss: 1.926354E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.307 | TFLOPs: 40.87 | 15: iteration 117980/ 125429 | consumed samples: 30202880 | consumed tokens: 61855498240 | elapsed time per iteration (s): 1.05 | learning rate: 2.159E-05 | global batch size: 256 | lm loss: 1.918867E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.352 | TFLOPs: 40.38 | 15: iteration 117990/ 125429 | consumed samples: 30205440 | consumed tokens: 61860741120 | elapsed time per iteration (s): 1.05 | learning rate: 2.159E-05 | global batch size: 256 | lm loss: 1.894417E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.517 | TFLOPs: 40.24 | 0: [2022-11-27 06:57:19,077] [INFO] [logging.py:68:log_dist] [Rank 0] step=118000, skipped=0, lr=[2.1585018064030134e-05, 2.1585018064030134e-05, 2.1585018064030134e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 118000/ 125429 | consumed samples: 30208000 | consumed tokens: 61865984000 | elapsed time per iteration (s): 1.06 | learning rate: 2.159E-05 | global batch size: 256 | lm loss: 1.885960E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.166 | TFLOPs: 39.85 | 0: steps: 118000 loss: 1.8242 iter time (s): 1.067 samples/sec: 239.918 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 118000 | lm loss value: 1.851613E+00 | lm loss PPL: 6.370086E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 118000 to 
checkpoints_1b5 0: [2022-11-27 06:57:19,444] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step118000 is begin to save! 0: [2022-11-27 06:57:19,451] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_01-model_00-model_states.pt... 0: [2022-11-27 06:57:19,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_01-model_00-model_states.pt. 0: [2022-11-27 06:57:19,711] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_03-model_00-model_states.pt... 0: [2022-11-27 06:57:19,823] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_03-model_00-model_states.pt. 0: [2022-11-27 06:57:19,823] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_04-model_00-model_states.pt... 0: [2022-11-27 06:57:19,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_04-model_00-model_states.pt. 0: [2022-11-27 06:57:19,931] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_05-model_00-model_states.pt... 0: [2022-11-27 06:57:20,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_05-model_00-model_states.pt. 0: [2022-11-27 06:57:20,042] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_06-model_00-model_states.pt... 0: [2022-11-27 06:57:20,148] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_06-model_00-model_states.pt. 0: [2022-11-27 06:57:20,148] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 06:57:20,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_07-model_00-model_states.pt. 0: [2022-11-27 06:57:20,253] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_08-model_00-model_states.pt... 0: [2022-11-27 06:57:20,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_08-model_00-model_states.pt. 0: [2022-11-27 06:57:20,358] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_09-model_00-model_states.pt... 0: [2022-11-27 06:57:20,464] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_09-model_00-model_states.pt. 0: [2022-11-27 06:57:20,465] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_10-model_00-model_states.pt... 0: [2022-11-27 06:57:20,569] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_10-model_00-model_states.pt. 0: [2022-11-27 06:57:20,570] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_11-model_00-model_states.pt... 0: [2022-11-27 06:57:20,674] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_11-model_00-model_states.pt. 0: [2022-11-27 06:57:20,674] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_12-model_00-model_states.pt... 0: [2022-11-27 06:57:20,780] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_12-model_00-model_states.pt. 0: [2022-11-27 06:57:20,780] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 06:57:20,885] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_13-model_00-model_states.pt. 0: [2022-11-27 06:57:20,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_14-model_00-model_states.pt... 0: [2022-11-27 06:57:20,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_14-model_00-model_states.pt. 0: [2022-11-27 06:57:20,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_15-model_00-model_states.pt... 0: [2022-11-27 06:57:21,095] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_15-model_00-model_states.pt. 0: [2022-11-27 06:57:21,096] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_16-model_00-model_states.pt... 0: [2022-11-27 06:57:21,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_16-model_00-model_states.pt. 0: [2022-11-27 06:57:21,202] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_17-model_00-model_states.pt... 0: [2022-11-27 06:57:21,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_17-model_00-model_states.pt. 0: [2022-11-27 06:57:21,306] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_18-model_00-model_states.pt... 0: [2022-11-27 06:57:21,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_18-model_00-model_states.pt. 0: [2022-11-27 06:57:21,416] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 06:57:21,525] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_19-model_00-model_states.pt. 0: [2022-11-27 06:57:21,525] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_20-model_00-model_states.pt... 0: [2022-11-27 06:57:21,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_20-model_00-model_states.pt. 0: [2022-11-27 06:57:21,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_21-model_00-model_states.pt... 0: [2022-11-27 06:57:21,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_21-model_00-model_states.pt. 0: [2022-11-27 06:57:21,744] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_22-model_00-model_states.pt... 0: [2022-11-27 06:57:21,843] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_22-model_00-model_states.pt. 0: [2022-11-27 06:57:21,843] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_23-model_00-model_states.pt... 0: [2022-11-27 06:57:21,948] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_23-model_00-model_states.pt. 0: [2022-11-27 06:57:21,948] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_24-model_00-model_states.pt... 0: [2022-11-27 06:57:22,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_24-model_00-model_states.pt. 0: [2022-11-27 06:57:22,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 06:57:22,160] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_25-model_00-model_states.pt. 0: [2022-11-27 06:57:22,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_26-model_00-model_states.pt... 0: [2022-11-27 06:57:22,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_26-model_00-model_states.pt. 0: [2022-11-27 06:57:22,265] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_27-model_00-model_states.pt... 0: [2022-11-27 06:57:22,368] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_27-model_00-model_states.pt. 0: [2022-11-27 06:57:22,369] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_28-model_00-model_states.pt... 0: [2022-11-27 06:57:22,473] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_28-model_00-model_states.pt. 0: [2022-11-27 06:57:22,473] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_29-model_00-model_states.pt... 0: [2022-11-27 06:57:22,577] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_29-model_00-model_states.pt. 0: [2022-11-27 06:57:22,577] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_30-model_00-model_states.pt... 0: [2022-11-27 06:57:22,682] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_30-model_00-model_states.pt. 0: [2022-11-27 06:57:22,682] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/layer_32-model_00-model_states.pt... 
0: [2022-11-27 06:57:22,685] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/layer_32-model_00-model_states.pt. 0: [2022-11-27 06:57:22,687] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step118000/mp_rank_00_model_states.pt 0: [2022-11-27 06:57:22,687] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/mp_rank_00_model_states.pt... 0: [2022-11-27 06:57:22,691] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/mp_rank_00_model_states.pt. 0: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
12: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
1: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 
10: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
9: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
4: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 
6: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
11: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 0: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
5: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 
4: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 
0: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 13: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 06:57:22,729] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step118000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
0: [2022-11-27 06:57:22,890] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:57:22,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:57:22,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:57:22,896] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:57:22,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:57:22,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:57:22,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 06:57:22,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 7: [2022-11-27 06:57:22,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:57:22,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 06:57:22,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 3: [2022-11-27 06:57:22,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-27 06:57:22,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 06:57:22,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 3: [2022-11-27 06:57:22,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:57:22,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 06:57:22,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 12: [2022-11-27 06:57:22,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:57:22,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 06:57:22,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 10: [2022-11-27 06:57:22,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:57:22,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 06:57:22,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 10: [2022-11-27 06:57:22,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 
10: [2022-11-27 06:57:22,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 06:57:22,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 14: [2022-11-27 06:57:22,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:57:22,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 06:57:22,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 8: [2022-11-27 06:57:22,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 06:57:22,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 8: [2022-11-27 06:57:22,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:57:22,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 06:57:22,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 6: [2022-11-27 06:57:22,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-27 06:57:22,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 06:57:22,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 1: [2022-11-27 06:57:22,894] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 06:57:22,894] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 1: [2022-11-27 06:57:22,901] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:57:22,901] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 06:57:22,901] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 1: [2022-11-27 06:57:22,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:57:22,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:57:22,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:57:22,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:57:22,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 
1: [2022-11-27 06:57:22,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 06:57:22,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 06:57:22,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 06:57:22,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 1: [2022-11-27 06:57:22,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 15: [2022-11-27 06:57:22,897] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 7: [2022-11-27 06:57:22,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:57:22,897] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 7: [2022-11-27 06:57:22,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 06:57:22,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 7: [2022-11-27 06:57:22,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:57:22,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-27 06:57:22,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 15: [2022-11-27 06:57:22,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:57:22,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:57:22,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 7: [2022-11-27 06:57:22,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 15: [2022-11-27 06:57:22,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 06:57:22,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 7: [2022-11-27 06:57:22,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 15: [2022-11-27 06:57:22,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 7: [2022-11-27 06:57:22,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 0: [2022-11-27 06:57:22,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-27 06:57:22,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 06:57:22,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 0: [2022-11-27 06:57:22,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:57:22,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 06:57:22,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 0: [2022-11-27 06:57:22,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:57:22,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 15: [2022-11-27 06:57:22,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:57:22,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 15: [2022-11-27 06:57:22,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 06:57:22,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 12: [2022-11-27 06:57:22,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
7: [2022-11-27 06:57:22,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:57:22,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 12: [2022-11-27 06:57:22,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 06:57:22,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 7: [2022-11-27 06:57:22,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 8: [2022-11-27 06:57:22,908] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:57:22,908] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 06:57:22,908] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 0: [2022-11-27 06:57:22,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:57:22,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-27 06:57:22,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 0: [2022-11-27 06:57:22,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 06:57:22,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 12: [2022-11-27 06:57:22,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 12: [2022-11-27 06:57:22,911] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:57:22,911] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 06:57:22,911] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 2: [2022-11-27 06:57:22,904] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:57:22,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 
13: [2022-11-27 06:57:22,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 06:57:22,904] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 2: [2022-11-27 06:57:22,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 3: [2022-11-27 06:57:22,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 13: [2022-11-27 06:57:22,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 13: [2022-11-27 06:57:22,904] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 2: [2022-11-27 06:57:22,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 3: [2022-11-27 06:57:22,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 13: [2022-11-27 06:57:22,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:57:22,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:57:22,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
13: [2022-11-27 06:57:22,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 2: [2022-11-27 06:57:22,905] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 3: [2022-11-27 06:57:22,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 13: [2022-11-27 06:57:22,907] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:57:22,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 3: [2022-11-27 06:57:22,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 13: [2022-11-27 06:57:22,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 2: [2022-11-27 06:57:22,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:57:22,907] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 2: [2022-11-27 06:57:22,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 13: [2022-11-27 06:57:22,907] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 2: [2022-11-27 06:57:22,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
2: [2022-11-27 06:57:22,909] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:57:22,909] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 06:57:22,909] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 2: [2022-11-27 06:57:22,910] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:57:22,910] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 06:57:22,910] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 8: [2022-11-27 06:57:22,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:57:22,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 06:57:22,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 12: [2022-11-27 06:57:22,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:57:22,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 06:57:22,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
4: [2022-11-27 06:57:22,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:57:22,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:57:22,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 06:57:22,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 06:57:22,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 4: [2022-11-27 06:57:22,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 4: [2022-11-27 06:57:22,895] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:57:22,895] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 06:57:22,895] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 4: [2022-11-27 06:57:22,900] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:57:22,900] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 06:57:22,900] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
4: [2022-11-27 06:57:22,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:57:22,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:57:22,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 06:57:22,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 06:57:22,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 4: [2022-11-27 06:57:22,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 0: [2022-11-27 06:57:22,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:57:22,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:57:22,914] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 0: [2022-11-27 06:57:22,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
4: [2022-11-27 06:57:22,914] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 0: [2022-11-27 06:57:22,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 06:57:22,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 06:57:22,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 4: [2022-11-27 06:57:22,914] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 0: [2022-11-27 06:57:22,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 0: [2022-11-27 06:57:22,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 0: [2022-11-27 06:57:22,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 4: [2022-11-27 06:57:22,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 06:57:22,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 06:57:22,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 6: [2022-11-27 06:57:22,915] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-27 06:57:22,915] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 06:57:22,915] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 6: [2022-11-27 06:57:22,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:57:22,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 06:57:22,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 12: [2022-11-27 06:57:22,917] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:57:22,917] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 06:57:22,917] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 3: [2022-11-27 06:57:22,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:57:22,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 06:57:22,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 6: [2022-11-27 06:57:22,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 
6: [2022-11-27 06:57:22,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 06:57:22,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 6: [2022-11-27 06:57:22,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:57:22,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 06:57:22,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 2: [2022-11-27 06:57:22,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:57:22,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:57:22,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 06:57:22,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 06:57:22,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 2: [2022-11-27 06:57:22,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 5: [2022-11-27 06:57:22,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-27 06:57:22,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:57:22,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 06:57:22,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 06:57:22,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 5: [2022-11-27 06:57:22,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 15: [2022-11-27 06:57:22,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:57:22,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 06:57:22,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 3: [2022-11-27 06:57:22,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 06:57:22,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 06:57:22,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 15: [2022-11-27 06:57:22,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 
15: [2022-11-27 06:57:22,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 06:57:22,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 15: [2022-11-27 06:57:22,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 06:57:22,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 06:57:22,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 10: [2022-11-27 06:57:22,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:57:22,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:57:22,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 12: [2022-11-27 06:57:22,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
10: [2022-11-27 06:57:22,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 06:57:22,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 06:57:22,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 06:57:22,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 10: [2022-11-27 06:57:22,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 10: [2022-11-27 06:57:22,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 12: [2022-11-27 06:57:22,924] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 06:57:22,924] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 10: [2022-11-27 06:57:22,925] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:57:22,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 06:57:22,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 10: [2022-11-27 06:57:22,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-27 06:57:22,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 06:57:22,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 10: [2022-11-27 06:57:22,926] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 06:57:22,926] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 06:57:22,926] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 5: [2022-11-27 06:57:22,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:57:22,927] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:57:22,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 06:57:22,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 6: [2022-11-27 06:57:22,927] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 06:57:22,927] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 3: [2022-11-27 06:57:22,928] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-27 06:57:22,928] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 06:57:22,928] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 8: [2022-11-27 06:57:22,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:57:22,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:57:22,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 06:57:22,925] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 06:57:22,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 8: [2022-11-27 06:57:22,925] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 6: [2022-11-27 06:57:22,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 06:57:22,929] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 06:57:22,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 12: [2022-11-27 06:57:22,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-27 06:57:22,930] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 06:57:22,930] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 11: [2022-11-27 06:57:22,896] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 06:57:22,896] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 11: [2022-11-27 06:57:22,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:57:22,899] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 06:57:22,899] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 11: [2022-11-27 06:57:22,902] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:57:22,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:57:22,902] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 06:57:22,902] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 11: [2022-11-27 06:57:22,903] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-27 06:57:22,903] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 06:57:22,903] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 5: [2022-11-27 06:57:22,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 11: [2022-11-27 06:57:22,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:57:22,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 06:57:22,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 5: [2022-11-27 06:57:22,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 11: [2022-11-27 06:57:22,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:57:22,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 06:57:22,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 11: [2022-11-27 06:57:22,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-27 06:57:22,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 06:57:22,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 11: [2022-11-27 06:57:22,930] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 06:57:22,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 06:57:22,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 7: [2022-11-27 06:57:22,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:57:22,931] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 06:57:22,931] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 7: [2022-11-27 06:57:22,931] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:57:22,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 06:57:22,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 6: [2022-11-27 06:57:22,932] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 
6: [2022-11-27 06:57:22,932] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 06:57:22,932] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 7: [2022-11-27 06:57:22,935] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 06:57:22,935] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 06:57:22,935] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 5: [2022-11-27 06:57:22,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:57:22,937] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:57:22,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 06:57:22,937] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 06:57:22,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 5: [2022-11-27 06:57:22,937] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 15: [2022-11-27 06:57:22,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 
15: [2022-11-27 06:57:22,938] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 06:57:22,938] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 2: [2022-11-27 06:57:22,940] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 06:57:22,940] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 06:57:22,940] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 14: [2022-11-27 06:57:22,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:57:22,912] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 06:57:22,912] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 14: [2022-11-27 06:57:22,913] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:57:22,913] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 06:57:22,913] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 14: [2022-11-27 06:57:22,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 
14: [2022-11-27 06:57:22,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 06:57:22,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 14: [2022-11-27 06:57:22,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:57:22,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 06:57:22,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 14: [2022-11-27 06:57:22,918] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:57:22,918] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 06:57:22,918] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 14: [2022-11-27 06:57:22,919] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 06:57:22,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 06:57:22,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 14: [2022-11-27 06:57:22,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-27 06:57:22,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 06:57:22,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 1: [2022-11-27 06:57:22,905] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 1: [2022-11-27 06:57:22,906] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:57:22,906] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 06:57:22,906] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 1: [2022-11-27 06:57:22,920] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:57:22,920] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 06:57:22,920] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 1: [2022-11-27 06:57:22,943] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 06:57:22,943] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 06:57:22,943] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
13: [2022-11-27 06:57:22,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:57:22,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 06:57:22,938] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:57:22,922] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:57:22,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 8: [2022-11-27 06:57:22,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 13: [2022-11-27 06:57:22,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 8: [2022-11-27 06:57:22,939] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 13: [2022-11-27 06:57:22,922] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 8: [2022-11-27 06:57:22,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 13: [2022-11-27 06:57:22,922] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 8: [2022-11-27 06:57:22,939] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
13: [2022-11-27 06:57:22,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:57:22,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 06:57:22,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 13: [2022-11-27 06:57:22,923] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 06:57:22,923] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 06:57:22,923] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 5: [2022-11-27 06:57:22,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:57:22,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 06:57:22,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 5: [2022-11-27 06:57:22,987] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 06:57:22,987] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 06:57:22,987] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
0: [2022-11-27 06:57:23,072] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 06:57:23,072] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 9: [2022-11-27 06:57:23,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:57:23,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:57:23,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:57:23,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:57:23,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:57:23,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:57:23,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 06:57:23,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-27 06:57:23,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 06:57:23,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 06:57:23,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 06:57:23,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 9: [2022-11-27 06:57:23,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 9: [2022-11-27 06:57:23,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 06:57:23,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 06:57:23,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 06:57:23,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 9: [2022-11-27 06:57:23,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 9: [2022-11-27 06:57:23,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 9: [2022-11-27 06:57:23,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 9: [2022-11-27 06:57:23,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 
9: [2022-11-27 06:57:23,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 06:57:23,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step118000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 06:57:23,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step118000 is ready now! 0: successfully saved checkpoint at iteration 118000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3770.55 15: iteration 118010/ 125429 | consumed samples: 30210560 | consumed tokens: 61871226880 | elapsed time per iteration (s): 1.42 | learning rate: 2.158E-05 | global batch size: 256 | lm loss: 1.892235E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 180.326 | TFLOPs: 29.80 | 15: iteration 118020/ 125429 | consumed samples: 30213120 | consumed tokens: 61876469760 | elapsed time per iteration (s): 1.02 | learning rate: 2.158E-05 | global batch size: 256 | lm loss: 1.897099E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.018 | TFLOPs: 41.32 | 15: iteration 118030/ 125429 | consumed samples: 30215680 | consumed tokens: 61881712640 | elapsed time per iteration (s): 1.03 | learning rate: 2.157E-05 | global batch size: 256 | lm loss: 1.903441E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.135 | TFLOPs: 41.01 | 15: iteration 118040/ 125429 | consumed samples: 30218240 | consumed tokens: 61886955520 | elapsed time per iteration (s): 1.05 | learning rate: 2.157E-05 | global batch size: 256 | lm loss: 1.889584E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.954 | TFLOPs: 40.48 
| 15: iteration 118050/ 125429 | consumed samples: 30220800 | consumed tokens: 61892198400 | elapsed time per iteration (s): 1.04 | learning rate: 2.156E-05 | global batch size: 256 | lm loss: 1.901279E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.092 | TFLOPs: 40.83 | 15: iteration 118060/ 125429 | consumed samples: 30223360 | consumed tokens: 61897441280 | elapsed time per iteration (s): 1.05 | learning rate: 2.156E-05 | global batch size: 256 | lm loss: 1.881707E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.776 | TFLOPs: 40.12 | 15: iteration 118070/ 125429 | consumed samples: 30225920 | consumed tokens: 61902684160 | elapsed time per iteration (s): 1.02 | learning rate: 2.156E-05 | global batch size: 256 | lm loss: 1.861490E+00 | grad norm: 0.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.315 | TFLOPs: 41.37 | 15: iteration 118080/ 125429 | consumed samples: 30228480 | consumed tokens: 61907927040 | elapsed time per iteration (s): 1.04 | learning rate: 2.155E-05 | global batch size: 256 | lm loss: 1.901114E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.439 | TFLOPs: 40.56 | 15: iteration 118090/ 125429 | consumed samples: 30231040 | consumed tokens: 61913169920 | elapsed time per iteration (s): 1.03 | learning rate: 2.155E-05 | global batch size: 256 | lm loss: 1.908196E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.695 | TFLOPs: 40.93 | 15: iteration 118100/ 125429 | consumed samples: 30233600 | consumed tokens: 61918412800 | elapsed time per iteration (s): 1.10 | learning rate: 2.154E-05 | global batch size: 256 | lm loss: 1.872137E+00 | grad norm: 0.161 | num zeros: 
0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.512 | TFLOPs: 38.59 | 15: iteration 118110/ 125429 | consumed samples: 30236160 | consumed tokens: 61923655680 | elapsed time per iteration (s): 1.08 | learning rate: 2.154E-05 | global batch size: 256 | lm loss: 1.879592E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.483 | TFLOPs: 39.08 | 15: iteration 118120/ 125429 | consumed samples: 30238720 | consumed tokens: 61928898560 | elapsed time per iteration (s): 1.04 | learning rate: 2.153E-05 | global batch size: 256 | lm loss: 1.878338E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.091 | TFLOPs: 40.83 | 15: iteration 118130/ 125429 | consumed samples: 30241280 | consumed tokens: 61934141440 | elapsed time per iteration (s): 1.03 | learning rate: 2.153E-05 | global batch size: 256 | lm loss: 1.879030E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.069 | TFLOPs: 41.00 | 15: iteration 118140/ 125429 | consumed samples: 30243840 | consumed tokens: 61939384320 | elapsed time per iteration (s): 1.04 | learning rate: 2.153E-05 | global batch size: 256 | lm loss: 1.913533E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.940 | TFLOPs: 40.81 | 15: iteration 118150/ 125429 | consumed samples: 30246400 | consumed tokens: 61944627200 | elapsed time per iteration (s): 1.02 | learning rate: 2.152E-05 | global batch size: 256 | lm loss: 1.908269E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.611 | TFLOPs: 41.58 | 15: iteration 118160/ 125429 | consumed samples: 30248960 | consumed tokens: 61949870080 | elapsed time per iteration (s): 
1.02 | learning rate: 2.152E-05 | global batch size: 256 | lm loss: 1.902615E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.555 | TFLOPs: 41.41 | 15: iteration 118170/ 125429 | consumed samples: 30251520 | consumed tokens: 61955112960 | elapsed time per iteration (s): 1.03 | learning rate: 2.151E-05 | global batch size: 256 | lm loss: 1.881034E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.547 | TFLOPs: 40.91 | 15: iteration 118180/ 125429 | consumed samples: 30254080 | consumed tokens: 61960355840 | elapsed time per iteration (s): 1.04 | learning rate: 2.151E-05 | global batch size: 256 | lm loss: 1.892938E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.601 | TFLOPs: 40.59 | 15: iteration 118190/ 125429 | consumed samples: 30256640 | consumed tokens: 61965598720 | elapsed time per iteration (s): 1.03 | learning rate: 2.151E-05 | global batch size: 256 | lm loss: 1.886129E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.706 | TFLOPs: 41.10 | 15: iteration 118200/ 125429 | consumed samples: 30259200 | consumed tokens: 61970841600 | elapsed time per iteration (s): 1.04 | learning rate: 2.150E-05 | global batch size: 256 | lm loss: 1.905299E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.365 | TFLOPs: 40.71 | 15: iteration 118210/ 125429 | consumed samples: 30261760 | consumed tokens: 61976084480 | elapsed time per iteration (s): 1.03 | learning rate: 2.150E-05 | global batch size: 256 | lm loss: 1.873099E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.330 | TFLOPs: 41.04 | 15: 
iteration 118220/ 125429 | consumed samples: 30264320 | consumed tokens: 61981327360 | elapsed time per iteration (s): 1.11 | learning rate: 2.149E-05 | global batch size: 256 | lm loss: 1.879498E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.121 | TFLOPs: 38.03 | 15: iteration 118230/ 125429 | consumed samples: 30266880 | consumed tokens: 61986570240 | elapsed time per iteration (s): 1.03 | learning rate: 2.149E-05 | global batch size: 256 | lm loss: 1.855612E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.458 | TFLOPs: 41.06 | 15: iteration 118240/ 125429 | consumed samples: 30269440 | consumed tokens: 61991813120 | elapsed time per iteration (s): 1.05 | learning rate: 2.148E-05 | global batch size: 256 | lm loss: 1.873467E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.752 | TFLOPs: 40.28 | 15: iteration 118250/ 125429 | consumed samples: 30272000 | consumed tokens: 61997056000 | elapsed time per iteration (s): 1.04 | learning rate: 2.148E-05 | global batch size: 256 | lm loss: 1.872685E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.353 | TFLOPs: 40.55 | 15: iteration 118260/ 125429 | consumed samples: 30274560 | consumed tokens: 62002298880 | elapsed time per iteration (s): 1.04 | learning rate: 2.148E-05 | global batch size: 256 | lm loss: 1.888882E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.965 | TFLOPs: 40.65 | 15: iteration 118270/ 125429 | consumed samples: 30277120 | consumed tokens: 62007541760 | elapsed time per iteration (s): 1.02 | learning rate: 2.147E-05 | global batch size: 256 | lm loss: 1.917354E+00 | grad norm: 0.167 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.417 | TFLOPs: 41.38 | 15: iteration 118280/ 125429 | consumed samples: 30279680 | consumed tokens: 62012784640 | elapsed time per iteration (s): 1.03 | learning rate: 2.147E-05 | global batch size: 256 | lm loss: 1.912526E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.453 | TFLOPs: 40.89 | 15: iteration 118290/ 125429 | consumed samples: 30282240 | consumed tokens: 62018027520 | elapsed time per iteration (s): 1.02 | learning rate: 2.146E-05 | global batch size: 256 | lm loss: 1.880479E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.167 | TFLOPs: 41.51 | 15: iteration 118300/ 125429 | consumed samples: 30284800 | consumed tokens: 62023270400 | elapsed time per iteration (s): 1.04 | learning rate: 2.146E-05 | global batch size: 256 | lm loss: 1.904180E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.877 | TFLOPs: 40.63 | 15: iteration 118310/ 125429 | consumed samples: 30287360 | consumed tokens: 62028513280 | elapsed time per iteration (s): 1.04 | learning rate: 2.146E-05 | global batch size: 256 | lm loss: 1.882673E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.903 | TFLOPs: 40.80 | 15: iteration 118320/ 125429 | consumed samples: 30289920 | consumed tokens: 62033756160 | elapsed time per iteration (s): 1.03 | learning rate: 2.145E-05 | global batch size: 256 | lm loss: 1.878466E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.483 | TFLOPs: 40.90 | 15: iteration 118330/ 125429 | consumed samples: 30292480 | consumed tokens: 62038999040 | elapsed time per iteration (s): 1.04 | 
learning rate: 2.145E-05 | global batch size: 256 | lm loss: 1.892886E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.903 | TFLOPs: 40.64 | 15: iteration 118340/ 125429 | consumed samples: 30295040 | consumed tokens: 62044241920 | elapsed time per iteration (s): 1.02 | learning rate: 2.144E-05 | global batch size: 256 | lm loss: 1.862209E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.821 | TFLOPs: 41.28 | 15: iteration 118350/ 125429 | consumed samples: 30297600 | consumed tokens: 62049484800 | elapsed time per iteration (s): 1.04 | learning rate: 2.144E-05 | global batch size: 256 | lm loss: 1.879152E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.539 | TFLOPs: 40.74 | 15: iteration 118360/ 125429 | consumed samples: 30300160 | consumed tokens: 62054727680 | elapsed time per iteration (s): 1.05 | learning rate: 2.144E-05 | global batch size: 256 | lm loss: 1.885136E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.638 | TFLOPs: 40.43 | 15: iteration 118370/ 125429 | consumed samples: 30302720 | consumed tokens: 62059970560 | elapsed time per iteration (s): 1.03 | learning rate: 2.143E-05 | global batch size: 256 | lm loss: 1.886636E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.545 | TFLOPs: 41.07 | 15: iteration 118380/ 125429 | consumed samples: 30305280 | consumed tokens: 62065213440 | elapsed time per iteration (s): 1.03 | learning rate: 2.143E-05 | global batch size: 256 | lm loss: 1.874510E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.105 | TFLOPs: 41.00 | 15: iteration 
118390/ 125429 | consumed samples: 30307840 | consumed tokens: 62070456320 | elapsed time per iteration (s): 1.09 | learning rate: 2.142E-05 | global batch size: 256 | lm loss: 1.859034E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.967 | TFLOPs: 38.83 | 15: iteration 118400/ 125429 | consumed samples: 30310400 | consumed tokens: 62075699200 | elapsed time per iteration (s): 1.04 | learning rate: 2.142E-05 | global batch size: 256 | lm loss: 1.909817E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.652 | TFLOPs: 40.60 | 15: iteration 118410/ 125429 | consumed samples: 30312960 | consumed tokens: 62080942080 | elapsed time per iteration (s): 1.08 | learning rate: 2.142E-05 | global batch size: 256 | lm loss: 1.873392E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.653 | TFLOPs: 39.27 | 15: iteration 118420/ 125429 | consumed samples: 30315520 | consumed tokens: 62086184960 | elapsed time per iteration (s): 1.12 | learning rate: 2.141E-05 | global batch size: 256 | lm loss: 1.884263E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.361 | TFLOPs: 37.74 | 15: iteration 118430/ 125429 | consumed samples: 30318080 | consumed tokens: 62091427840 | elapsed time per iteration (s): 1.02 | learning rate: 2.141E-05 | global batch size: 256 | lm loss: 1.909970E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.039 | TFLOPs: 41.32 | 15: iteration 118440/ 125429 | consumed samples: 30320640 | consumed tokens: 62096670720 | elapsed time per iteration (s): 1.04 | learning rate: 2.140E-05 | global batch size: 256 | lm loss: 1.898674E+00 | grad norm: 0.160 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.794 | TFLOPs: 40.62 | 15: iteration 118450/ 125429 | consumed samples: 30323200 | consumed tokens: 62101913600 | elapsed time per iteration (s): 1.07 | learning rate: 2.140E-05 | global batch size: 256 | lm loss: 1.857860E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.045 | TFLOPs: 39.67 | 15: iteration 118460/ 125429 | consumed samples: 30325760 | consumed tokens: 62107156480 | elapsed time per iteration (s): 1.03 | learning rate: 2.140E-05 | global batch size: 256 | lm loss: 1.897429E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.212 | TFLOPs: 41.18 | 15: iteration 118470/ 125429 | consumed samples: 30328320 | consumed tokens: 62112399360 | elapsed time per iteration (s): 1.05 | learning rate: 2.139E-05 | global batch size: 256 | lm loss: 1.881032E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.453 | TFLOPs: 40.23 | 15: iteration 118480/ 125429 | consumed samples: 30330880 | consumed tokens: 62117642240 | elapsed time per iteration (s): 1.02 | learning rate: 2.139E-05 | global batch size: 256 | lm loss: 1.907307E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.678 | TFLOPs: 41.43 | 15: iteration 118490/ 125429 | consumed samples: 30333440 | consumed tokens: 62122885120 | elapsed time per iteration (s): 1.03 | learning rate: 2.138E-05 | global batch size: 256 | lm loss: 1.887002E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.972 | TFLOPs: 41.14 | 15: iteration 118500/ 125429 | consumed samples: 30336000 | consumed tokens: 62128128000 | elapsed time per iteration (s): 1.04 | learning 
rate: 2.138E-05 | global batch size: 256 | lm loss: 1.896513E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.264 | TFLOPs: 40.86 | 15: iteration 118510/ 125429 | consumed samples: 30338560 | consumed tokens: 62133370880 | elapsed time per iteration (s): 4.06 | learning rate: 2.138E-05 | global batch size: 256 | lm loss: 1.875049E+00 | grad norm: 0.152 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 63.074 | TFLOPs: 10.42 | 15: iteration 118520/ 125429 | consumed samples: 30341120 | consumed tokens: 62138613760 | elapsed time per iteration (s): 3.10 | learning rate: 2.137E-05 | global batch size: 256 | lm loss: 1.893460E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 82.519 | TFLOPs: 13.64 | 15: iteration 118530/ 125429 | consumed samples: 30343680 | consumed tokens: 62143856640 | elapsed time per iteration (s): 1.04 | learning rate: 2.137E-05 | global batch size: 256 | lm loss: 1.866386E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.267 | TFLOPs: 40.70 | 15: iteration 118540/ 125429 | consumed samples: 30346240 | consumed tokens: 62149099520 | elapsed time per iteration (s): 1.04 | learning rate: 2.136E-05 | global batch size: 256 | lm loss: 1.918917E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.342 | TFLOPs: 40.88 | 15: iteration 118550/ 125429 | consumed samples: 30348800 | consumed tokens: 62154342400 | elapsed time per iteration (s): 1.03 | learning rate: 2.136E-05 | global batch size: 256 | lm loss: 1.883387E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.869 | TFLOPs: 41.13 | 15: iteration 118560/ 125429 
| consumed samples: 30351360 | consumed tokens: 62159585280 | elapsed time per iteration (s): 1.04 | learning rate: 2.136E-05 | global batch size: 256 | lm loss: 1.896454E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.545 | TFLOPs: 40.58 | 15: iteration 118570/ 125429 | consumed samples: 30353920 | consumed tokens: 62164828160 | elapsed time per iteration (s): 1.09 | learning rate: 2.135E-05 | global batch size: 256 | lm loss: 1.875346E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.839 | TFLOPs: 38.97 | 15: iteration 118580/ 125429 | consumed samples: 30356480 | consumed tokens: 62170071040 | elapsed time per iteration (s): 1.05 | learning rate: 2.135E-05 | global batch size: 256 | lm loss: 1.879965E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.827 | TFLOPs: 40.13 | 15: iteration 118590/ 125429 | consumed samples: 30359040 | consumed tokens: 62175313920 | elapsed time per iteration (s): 1.02 | learning rate: 2.134E-05 | global batch size: 256 | lm loss: 1.903732E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.127 | TFLOPs: 41.34 | 15: iteration 118600/ 125429 | consumed samples: 30361600 | consumed tokens: 62180556800 | elapsed time per iteration (s): 1.06 | learning rate: 2.134E-05 | global batch size: 256 | lm loss: 1.886769E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.706 | TFLOPs: 39.78 | 15: iteration 118610/ 125429 | consumed samples: 30364160 | consumed tokens: 62185799680 | elapsed time per iteration (s): 1.06 | learning rate: 2.134E-05 | global batch size: 256 | lm loss: 1.899526E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 241.721 | TFLOPs: 39.95 | 15: iteration 118620/ 125429 | consumed samples: 30366720 | consumed tokens: 62191042560 | elapsed time per iteration (s): 1.04 | learning rate: 2.133E-05 | global batch size: 256 | lm loss: 1.884480E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.757 | TFLOPs: 40.78 | 15: iteration 118630/ 125429 | consumed samples: 30369280 | consumed tokens: 62196285440 | elapsed time per iteration (s): 1.05 | learning rate: 2.133E-05 | global batch size: 256 | lm loss: 1.905280E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.855 | TFLOPs: 40.13 | 15: iteration 118640/ 125429 | consumed samples: 30371840 | consumed tokens: 62201528320 | elapsed time per iteration (s): 1.04 | learning rate: 2.132E-05 | global batch size: 256 | lm loss: 1.893872E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.138 | TFLOPs: 40.84 | 15: iteration 118650/ 125429 | consumed samples: 30374400 | consumed tokens: 62206771200 | elapsed time per iteration (s): 1.05 | learning rate: 2.132E-05 | global batch size: 256 | lm loss: 1.884840E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.829 | TFLOPs: 40.29 | 15: iteration 118660/ 125429 | consumed samples: 30376960 | consumed tokens: 62212014080 | elapsed time per iteration (s): 1.08 | learning rate: 2.132E-05 | global batch size: 256 | lm loss: 1.872754E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.180 | TFLOPs: 39.03 | 15: iteration 118670/ 125429 | consumed samples: 30379520 | consumed tokens: 62217256960 | elapsed time per iteration (s): 1.04 | learning rate: 
2.131E-05 | global batch size: 256 | lm loss: 1.882477E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.841 | TFLOPs: 40.79 | 15: iteration 118680/ 125429 | consumed samples: 30382080 | consumed tokens: 62222499840 | elapsed time per iteration (s): 1.02 | learning rate: 2.131E-05 | global batch size: 256 | lm loss: 1.885702E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.624 | TFLOPs: 41.42 | 15: iteration 118690/ 125429 | consumed samples: 30384640 | consumed tokens: 62227742720 | elapsed time per iteration (s): 1.06 | learning rate: 2.130E-05 | global batch size: 256 | lm loss: 1.897202E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.561 | TFLOPs: 39.92 | 15: iteration 118700/ 125429 | consumed samples: 30387200 | consumed tokens: 62232985600 | elapsed time per iteration (s): 1.04 | learning rate: 2.130E-05 | global batch size: 256 | lm loss: 1.903946E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.450 | TFLOPs: 40.56 | 15: iteration 118710/ 125429 | consumed samples: 30389760 | consumed tokens: 62238228480 | elapsed time per iteration (s): 1.03 | learning rate: 2.130E-05 | global batch size: 256 | lm loss: 1.902082E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.622 | TFLOPs: 41.25 | 15: iteration 118720/ 125429 | consumed samples: 30392320 | consumed tokens: 62243471360 | elapsed time per iteration (s): 1.04 | learning rate: 2.129E-05 | global batch size: 256 | lm loss: 1.878841E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.922 | TFLOPs: 40.64 | 15: iteration 118730/ 125429 | 
consumed samples: 30394880 | consumed tokens: 62248714240 | elapsed time per iteration (s): 1.04 | learning rate: 2.129E-05 | global batch size: 256 | lm loss: 1.875951E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.153 | TFLOPs: 40.84 | 15: iteration 118740/ 125429 | consumed samples: 30397440 | consumed tokens: 62253957120 | elapsed time per iteration (s): 1.04 | learning rate: 2.129E-05 | global batch size: 256 | lm loss: 1.898575E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.645 | TFLOPs: 40.59 | 15: iteration 118750/ 125429 | consumed samples: 30400000 | consumed tokens: 62259200000 | elapsed time per iteration (s): 1.04 | learning rate: 2.128E-05 | global batch size: 256 | lm loss: 1.863766E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.156 | TFLOPs: 40.84 | 15: iteration 118760/ 125429 | consumed samples: 30402560 | consumed tokens: 62264442880 | elapsed time per iteration (s): 1.03 | learning rate: 2.128E-05 | global batch size: 256 | lm loss: 1.883166E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.359 | TFLOPs: 41.04 | 15: iteration 118770/ 125429 | consumed samples: 30405120 | consumed tokens: 62269685760 | elapsed time per iteration (s): 1.05 | learning rate: 2.127E-05 | global batch size: 256 | lm loss: 1.867477E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.435 | TFLOPs: 40.23 | 15: iteration 118780/ 125429 | consumed samples: 30407680 | consumed tokens: 62274928640 | elapsed time per iteration (s): 1.03 | learning rate: 2.127E-05 | global batch size: 256 | lm loss: 1.877449E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 248.417 | TFLOPs: 41.05 | 15: iteration 118790/ 125429 | consumed samples: 30410240 | consumed tokens: 62280171520 | elapsed time per iteration (s): 1.03 | learning rate: 2.127E-05 | global batch size: 256 | lm loss: 1.916958E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.805 | TFLOPs: 41.12 | 15: iteration 118800/ 125429 | consumed samples: 30412800 | consumed tokens: 62285414400 | elapsed time per iteration (s): 1.10 | learning rate: 2.126E-05 | global batch size: 256 | lm loss: 1.885173E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.141 | TFLOPs: 38.36 | 15: iteration 118810/ 125429 | consumed samples: 30415360 | consumed tokens: 62290657280 | elapsed time per iteration (s): 1.05 | learning rate: 2.126E-05 | global batch size: 256 | lm loss: 1.896015E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.367 | TFLOPs: 40.38 | 15: iteration 118820/ 125429 | consumed samples: 30417920 | consumed tokens: 62295900160 | elapsed time per iteration (s): 1.04 | learning rate: 2.126E-05 | global batch size: 256 | lm loss: 1.898600E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.213 | TFLOPs: 40.69 | 15: iteration 118830/ 125429 | consumed samples: 30420480 | consumed tokens: 62301143040 | elapsed time per iteration (s): 1.03 | learning rate: 2.125E-05 | global batch size: 256 | lm loss: 1.909628E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.034 | TFLOPs: 41.15 | 15: iteration 118840/ 125429 | consumed samples: 30423040 | consumed tokens: 62306385920 | elapsed time per iteration (s): 1.02 | learning rate: 
2.125E-05 | global batch size: 256 | lm loss: 1.890718E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.788 | TFLOPs: 41.28 | 15: iteration 118850/ 125429 | consumed samples: 30425600 | consumed tokens: 62311628800 | elapsed time per iteration (s): 1.02 | learning rate: 2.124E-05 | global batch size: 256 | lm loss: 1.901217E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.001 | TFLOPs: 41.31 | 15: iteration 118860/ 125429 | consumed samples: 30428160 | consumed tokens: 62316871680 | elapsed time per iteration (s): 1.02 | learning rate: 2.124E-05 | global batch size: 256 | lm loss: 1.886349E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.594 | TFLOPs: 41.58 | 15: iteration 118870/ 125429 | consumed samples: 30430720 | consumed tokens: 62322114560 | elapsed time per iteration (s): 1.03 | learning rate: 2.124E-05 | global batch size: 256 | lm loss: 1.911826E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.579 | TFLOPs: 41.24 | 15: iteration 118880/ 125429 | consumed samples: 30433280 | consumed tokens: 62327357440 | elapsed time per iteration (s): 1.02 | learning rate: 2.123E-05 | global batch size: 256 | lm loss: 1.884482E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.400 | TFLOPs: 41.55 | 15: iteration 118890/ 125429 | consumed samples: 30435840 | consumed tokens: 62332600320 | elapsed time per iteration (s): 1.03 | learning rate: 2.123E-05 | global batch size: 256 | lm loss: 1.902830E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.343 | TFLOPs: 41.21 | 15: iteration 118900/ 125429 | 
consumed samples: 30438400 | consumed tokens: 62337843200 | elapsed time per iteration (s): 1.07 | learning rate: 2.123E-05 | global batch size: 256 | lm loss: 1.907168E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.932 | TFLOPs: 39.49 | 15: iteration 118910/ 125429 | consumed samples: 30440960 | consumed tokens: 62343086080 | elapsed time per iteration (s): 1.04 | learning rate: 2.122E-05 | global batch size: 256 | lm loss: 1.875391E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.972 | TFLOPs: 40.65 | 15: iteration 118920/ 125429 | consumed samples: 30443520 | consumed tokens: 62348328960 | elapsed time per iteration (s): 1.04 | learning rate: 2.122E-05 | global batch size: 256 | lm loss: 1.900895E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.083 | TFLOPs: 40.67 | 15: iteration 118930/ 125429 | consumed samples: 30446080 | consumed tokens: 62353571840 | elapsed time per iteration (s): 1.04 | learning rate: 2.121E-05 | global batch size: 256 | lm loss: 1.886972E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.953 | TFLOPs: 40.81 | 15: iteration 118940/ 125429 | consumed samples: 30448640 | consumed tokens: 62358814720 | elapsed time per iteration (s): 1.02 | learning rate: 2.121E-05 | global batch size: 256 | lm loss: 1.880801E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.312 | TFLOPs: 41.53 | 15: iteration 118950/ 125429 | consumed samples: 30451200 | consumed tokens: 62364057600 | elapsed time per iteration (s): 1.03 | learning rate: 2.121E-05 | global batch size: 256 | lm loss: 1.876896E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 248.976 | TFLOPs: 41.15 | 15: iteration 118960/ 125429 | consumed samples: 30453760 | consumed tokens: 62369300480 | elapsed time per iteration (s): 1.03 | learning rate: 2.120E-05 | global batch size: 256 | lm loss: 1.888295E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.272 | TFLOPs: 41.19 | 15: iteration 118970/ 125429 | consumed samples: 30456320 | consumed tokens: 62374543360 | elapsed time per iteration (s): 1.05 | learning rate: 2.120E-05 | global batch size: 256 | lm loss: 1.884873E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.889 | TFLOPs: 40.47 | 15: iteration 118980/ 125429 | consumed samples: 30458880 | consumed tokens: 62379786240 | elapsed time per iteration (s): 1.03 | learning rate: 2.120E-05 | global batch size: 256 | lm loss: 1.867676E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.820 | TFLOPs: 40.95 | 15: iteration 118990/ 125429 | consumed samples: 30461440 | consumed tokens: 62385029120 | elapsed time per iteration (s): 1.03 | learning rate: 2.119E-05 | global batch size: 256 | lm loss: 1.881359E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.946 | TFLOPs: 40.97 | 15: iteration 119000/ 125429 | consumed samples: 30464000 | consumed tokens: 62390272000 | elapsed time per iteration (s): 1.02 | learning rate: 2.119E-05 | global batch size: 256 | lm loss: 1.879476E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.860 | TFLOPs: 41.29 | 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 119000 | lm loss 
value: 1.823208E+00 | lm loss PPL: 6.191692E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 119000 to checkpoints_1b5 0: [2022-11-27 07:15:35,454] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step119000 is begin to save! 0: [2022-11-27 07:15:35,460] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_01-model_00-model_states.pt... 0: [2022-11-27 07:15:35,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_01-model_00-model_states.pt. 0: [2022-11-27 07:15:35,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_03-model_00-model_states.pt... 0: [2022-11-27 07:15:35,836] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_03-model_00-model_states.pt. 0: [2022-11-27 07:15:35,837] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_04-model_00-model_states.pt... 0: [2022-11-27 07:15:35,951] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_04-model_00-model_states.pt. 0: [2022-11-27 07:15:35,951] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_05-model_00-model_states.pt... 0: [2022-11-27 07:15:36,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_05-model_00-model_states.pt. 0: [2022-11-27 07:15:36,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_06-model_00-model_states.pt... 0: [2022-11-27 07:15:36,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_06-model_00-model_states.pt. 
0: [2022-11-27 07:15:36,182] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_07-model_00-model_states.pt... 0: [2022-11-27 07:15:36,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_07-model_00-model_states.pt. 0: [2022-11-27 07:15:36,294] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_08-model_00-model_states.pt... 0: [2022-11-27 07:15:36,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_08-model_00-model_states.pt. 0: [2022-11-27 07:15:36,408] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_09-model_00-model_states.pt... 0: [2022-11-27 07:15:36,516] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_09-model_00-model_states.pt. 0: [2022-11-27 07:15:36,517] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_10-model_00-model_states.pt... 0: [2022-11-27 07:15:36,626] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_10-model_00-model_states.pt. 0: [2022-11-27 07:15:36,626] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_11-model_00-model_states.pt... 0: [2022-11-27 07:15:36,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_11-model_00-model_states.pt. 0: [2022-11-27 07:15:36,737] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_12-model_00-model_states.pt... 0: [2022-11-27 07:15:36,846] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_12-model_00-model_states.pt. 
0: [2022-11-27 07:15:36,847] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_13-model_00-model_states.pt... 0: [2022-11-27 07:15:36,954] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_13-model_00-model_states.pt. 0: [2022-11-27 07:15:36,955] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_14-model_00-model_states.pt... 0: [2022-11-27 07:15:37,064] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_14-model_00-model_states.pt. 0: [2022-11-27 07:15:37,065] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_15-model_00-model_states.pt... 0: [2022-11-27 07:15:37,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_15-model_00-model_states.pt. 0: [2022-11-27 07:15:37,174] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_16-model_00-model_states.pt... 0: [2022-11-27 07:15:37,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_16-model_00-model_states.pt. 0: [2022-11-27 07:15:37,284] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_17-model_00-model_states.pt... 0: [2022-11-27 07:15:37,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_17-model_00-model_states.pt. 0: [2022-11-27 07:15:37,392] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_18-model_00-model_states.pt... 0: [2022-11-27 07:15:37,500] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_18-model_00-model_states.pt. 
0: [2022-11-27 07:15:37,500] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_19-model_00-model_states.pt... 0: [2022-11-27 07:15:37,606] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_19-model_00-model_states.pt. 0: [2022-11-27 07:15:37,607] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_20-model_00-model_states.pt... 0: [2022-11-27 07:15:37,714] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_20-model_00-model_states.pt. 0: [2022-11-27 07:15:37,714] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_21-model_00-model_states.pt... 0: [2022-11-27 07:15:37,821] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_21-model_00-model_states.pt. 0: [2022-11-27 07:15:37,821] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_22-model_00-model_states.pt... 0: [2022-11-27 07:15:37,929] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_22-model_00-model_states.pt. 0: [2022-11-27 07:15:37,929] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_23-model_00-model_states.pt... 0: [2022-11-27 07:15:38,034] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_23-model_00-model_states.pt. 0: [2022-11-27 07:15:38,034] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_24-model_00-model_states.pt... 0: [2022-11-27 07:15:38,141] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_24-model_00-model_states.pt. 
0: [2022-11-27 07:15:38,142] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_25-model_00-model_states.pt... 0: [2022-11-27 07:15:38,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_25-model_00-model_states.pt. 0: [2022-11-27 07:15:38,249] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_26-model_00-model_states.pt... 0: [2022-11-27 07:15:38,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_26-model_00-model_states.pt. 0: [2022-11-27 07:15:38,354] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_27-model_00-model_states.pt... 0: [2022-11-27 07:15:38,461] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_27-model_00-model_states.pt. 0: [2022-11-27 07:15:38,461] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_28-model_00-model_states.pt... 0: [2022-11-27 07:15:38,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_28-model_00-model_states.pt. 0: [2022-11-27 07:15:38,566] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_29-model_00-model_states.pt... 0: [2022-11-27 07:15:38,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_29-model_00-model_states.pt. 0: [2022-11-27 07:15:38,673] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_30-model_00-model_states.pt... 0: [2022-11-27 07:15:38,778] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_30-model_00-model_states.pt. 
0: [2022-11-27 07:15:38,779] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/layer_32-model_00-model_states.pt... 0: [2022-11-27 07:15:38,783] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/layer_32-model_00-model_states.pt. 0: [2022-11-27 07:15:38,785] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step119000/mp_rank_00_model_states.pt 0: [2022-11-27 07:15:38,785] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/mp_rank_00_model_states.pt... 0: [2022-11-27 07:15:38,789] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/mp_rank_00_model_states.pt. 0: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
4: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 
12: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 
1: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 
5: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 
9: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 
6: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
11: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
15: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 
13: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
11: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:15:38,828] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step119000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 
0: [2022-11-27 07:15:38,993] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:15:38,993] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 07:15:38,993] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 0: [2022-11-27 07:15:38,995] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:15:38,995] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 07:15:38,995] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 8: [2022-11-27 07:15:38,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:15:38,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 07:15:38,996] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 0: [2022-11-27 07:15:38,996] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:15:38,996] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 07:15:38,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
8: [2022-11-27 07:15:38,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:15:38,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 07:15:38,997] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 11: [2022-11-27 07:15:38,997] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:15:38,997] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 07:15:38,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 2: [2022-11-27 07:15:39,001] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:15:39,001] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 07:15:39,001] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 0: [2022-11-27 07:15:39,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:15:39,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 07:15:39,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
7: [2022-11-27 07:15:39,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:15:39,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 07:15:39,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 0: [2022-11-27 07:15:39,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:15:39,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 07:15:39,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 7: [2022-11-27 07:15:39,002] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:15:39,002] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 07:15:39,002] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 8: [2022-11-27 07:15:39,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:15:39,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 07:15:39,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
2: [2022-11-27 07:15:39,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:15:39,003] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 07:15:39,003] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 2: [2022-11-27 07:15:39,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:15:39,004] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 07:15:39,004] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 11: [2022-11-27 07:15:39,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:15:39,006] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 07:15:39,006] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 11: [2022-11-27 07:15:39,006] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:15:39,007] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 07:15:39,007] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
9: [2022-11-27 07:15:39,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:15:39,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 0: [2022-11-27 07:15:39,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:15:39,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 0: [2022-11-27 07:15:39,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 07:15:39,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 9: [2022-11-27 07:15:39,009] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:15:39,009] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 07:15:39,009] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 4: [2022-11-27 07:15:38,998] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:15:38,998] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 07:15:38,998] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
9: [2022-11-27 07:15:39,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:15:39,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 07:15:39,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 9: [2022-11-27 07:15:39,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:15:39,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 07:15:39,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 8: [2022-11-27 07:15:39,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:15:39,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 07:15:39,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 6: [2022-11-27 07:15:39,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:15:39,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 07:15:39,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
6: [2022-11-27 07:15:39,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:15:39,017] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 07:15:39,017] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 6: [2022-11-27 07:15:39,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:15:39,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 07:15:39,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 6: [2022-11-27 07:15:39,018] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:15:39,018] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 07:15:39,018] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 11: [2022-11-27 07:15:39,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:15:39,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:15:39,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-27 07:15:39,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 07:15:39,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 2: [2022-11-27 07:15:39,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:15:39,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:15:39,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 07:15:39,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 07:15:39,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 2: [2022-11-27 07:15:39,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 3: [2022-11-27 07:15:39,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:15:39,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 07:15:39,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:15:39,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
3: [2022-11-27 07:15:39,020] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:15:39,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 07:15:39,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 3: [2022-11-27 07:15:39,020] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 07:15:39,020] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 4: [2022-11-27 07:15:39,011] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:15:39,011] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 07:15:39,011] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 4: [2022-11-27 07:15:39,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:15:39,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 07:15:39,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 4: [2022-11-27 07:15:39,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
4: [2022-11-27 07:15:39,012] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 07:15:39,012] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 9: [2022-11-27 07:15:39,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:15:39,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 07:15:39,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 6: [2022-11-27 07:15:39,022] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:15:39,022] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 07:15:39,022] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 14: [2022-11-27 07:15:39,008] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:15:39,008] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 07:15:39,008] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 14: [2022-11-27 07:15:39,019] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-27 07:15:39,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 07:15:39,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 8: [2022-11-27 07:15:39,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:15:39,021] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 07:15:39,021] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 8: [2022-11-27 07:15:39,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:15:39,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:15:39,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 07:15:39,024] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 07:15:39,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 8: [2022-11-27 07:15:39,024] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 14: [2022-11-27 07:15:39,027] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-27 07:15:39,027] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 07:15:39,027] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 3: [2022-11-27 07:15:39,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:15:39,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:15:39,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 07:15:39,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 07:15:39,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 3: [2022-11-27 07:15:39,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 8: [2022-11-27 07:15:39,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:15:39,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 07:15:39,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 7: [2022-11-27 07:15:39,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-27 07:15:39,028] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 07:15:39,028] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 14: [2022-11-27 07:15:39,029] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:15:39,029] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 07:15:39,029] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 2: [2022-11-27 07:15:39,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:15:39,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 07:15:39,030] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 6: [2022-11-27 07:15:39,030] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:15:39,030] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 07:15:39,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 2: [2022-11-27 07:15:39,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 
2: [2022-11-27 07:15:39,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 3: [2022-11-27 07:15:39,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:15:39,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 3: [2022-11-27 07:15:39,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 07:15:39,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 6: [2022-11-27 07:15:39,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:15:39,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:15:39,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 3: [2022-11-27 07:15:39,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:15:39,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
2: [2022-11-27 07:15:39,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 3: [2022-11-27 07:15:39,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 2: [2022-11-27 07:15:39,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 3: [2022-11-27 07:15:39,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 3: [2022-11-27 07:15:39,031] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:15:39,031] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 07:15:39,031] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 14: [2022-11-27 07:15:39,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:15:39,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 7: [2022-11-27 07:15:39,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:15:39,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 14: [2022-11-27 07:15:39,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 
7: [2022-11-27 07:15:39,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:15:39,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 14: [2022-11-27 07:15:39,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 7: [2022-11-27 07:15:39,032] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 07:15:39,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 7: [2022-11-27 07:15:39,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 14: [2022-11-27 07:15:39,032] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 14: [2022-11-27 07:15:39,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:15:39,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 07:15:39,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 14: [2022-11-27 07:15:39,036] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-27 07:15:39,036] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 07:15:39,037] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 9: [2022-11-27 07:15:39,038] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:15:39,038] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 07:15:39,038] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 9: [2022-11-27 07:15:39,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:15:39,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 07:15:39,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 9: [2022-11-27 07:15:39,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:15:39,039] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 07:15:39,039] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
11: [2022-11-27 07:15:39,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 07:15:39,019] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 07:15:39,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 11: [2022-11-27 07:15:39,019] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 11: [2022-11-27 07:15:39,033] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:15:39,033] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 07:15:39,033] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 11: [2022-11-27 07:15:39,042] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:15:39,042] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 07:15:39,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 11: [2022-11-27 07:15:39,043] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-27 07:15:39,043] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 07:15:39,043] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 0: [2022-11-27 07:15:39,055] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:15:39,056] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 07:15:39,056] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 0: [2022-11-27 07:15:39,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:15:39,059] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:15:39,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 07:15:39,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 7: [2022-11-27 07:15:39,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:15:39,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 07:15:39,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
7: [2022-11-27 07:15:39,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:15:39,060] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 07:15:39,060] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 12: [2022-11-27 07:15:39,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:15:39,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:15:39,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:15:39,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:15:39,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:15:39,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:15:39,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:15:39,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-27 07:15:39,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 07:15:39,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 07:15:39,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 07:15:39,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 12: [2022-11-27 07:15:39,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 12: [2022-11-27 07:15:39,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 07:15:39,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 07:15:39,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 07:15:39,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 07:15:39,077] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 07:15:39,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 12: [2022-11-27 07:15:39,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
12: [2022-11-27 07:15:39,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 12: [2022-11-27 07:15:39,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 12: [2022-11-27 07:15:39,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 12: [2022-11-27 07:15:39,077] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 4: [2022-11-27 07:15:39,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:15:39,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 07:15:39,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 4: [2022-11-27 07:15:39,058] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:15:39,058] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 07:15:39,058] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 4: [2022-11-27 07:15:39,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:15:39,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 07:15:39,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 
4: [2022-11-27 07:15:39,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 4: [2022-11-27 07:15:39,078] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 07:15:39,078] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 13: [2022-11-27 07:15:39,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:15:39,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:15:39,066] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:15:39,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 07:15:39,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 07:15:39,066] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 07:15:39,066] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 13: [2022-11-27 07:15:39,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 13: [2022-11-27 07:15:39,067] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
13: [2022-11-27 07:15:39,074] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:15:39,074] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 07:15:39,074] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 13: [2022-11-27 07:15:39,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:15:39,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 07:15:39,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 13: [2022-11-27 07:15:39,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:15:39,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:15:39,086] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 
13: [2022-11-27 07:15:39,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 07:15:39,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 07:15:39,086] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 07:15:39,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 13: [2022-11-27 07:15:39,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 13: [2022-11-27 07:15:39,086] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 10: [2022-11-27 07:15:39,152] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:15:39,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 07:15:39,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 10: [2022-11-27 07:15:39,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:15:39,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 07:15:39,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
10: [2022-11-27 07:15:39,153] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:15:39,153] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 07:15:39,153] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 10: [2022-11-27 07:15:39,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:15:39,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:15:39,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:15:39,159] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:15:39,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 07:15:39,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 07:15:39,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 07:15:39,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 10: [2022-11-27 07:15:39,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
10: [2022-11-27 07:15:39,159] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 10: [2022-11-27 07:15:39,159] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 07:15:39,160] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 0: [2022-11-27 07:15:39,174] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 07:15:39,174] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 10: [2022-11-27 07:15:39,175] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:15:39,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 07:15:39,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 15: [2022-11-27 07:15:39,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:15:39,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:15:39,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:15:39,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-27 07:15:39,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:15:39,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:15:39,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:15:39,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:15:39,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 07:15:39,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 07:15:39,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 07:15:39,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 07:15:39,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 07:15:39,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 07:15:39,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: 
[2022-11-27 07:15:39,213] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 07:15:39,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 15: [2022-11-27 07:15:39,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 15: [2022-11-27 07:15:39,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 15: [2022-11-27 07:15:39,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 15: [2022-11-27 07:15:39,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 15: [2022-11-27 07:15:39,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 15: [2022-11-27 07:15:39,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 15: [2022-11-27 07:15:39,213] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 1: [2022-11-27 07:15:39,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:15:39,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:15:39,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
1: [2022-11-27 07:15:39,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 07:15:39,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:15:39,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 07:15:39,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 07:15:39,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 1: [2022-11-27 07:15:39,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 07:15:39,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:15:39,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 1: [2022-11-27 07:15:39,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 1: [2022-11-27 07:15:39,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 1: [2022-11-27 07:15:39,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 07:15:39,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
1: [2022-11-27 07:15:39,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:15:39,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 07:15:39,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 1: [2022-11-27 07:15:39,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:15:39,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 07:15:39,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 1: [2022-11-27 07:15:39,261] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:15:39,261] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 07:15:39,261] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 5: [2022-11-27 07:15:39,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:15:39,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 
5: [2022-11-27 07:15:39,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 07:15:39,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 07:15:39,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 5: [2022-11-27 07:15:39,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 5: [2022-11-27 07:15:39,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:15:39,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:15:39,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:15:39,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:15:39,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:15:39,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 07:15:39,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 07:15:39,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
5: [2022-11-27 07:15:39,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:15:39,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 07:15:39,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 07:15:39,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 07:15:39,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 5: [2022-11-27 07:15:39,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 5: [2022-11-27 07:15:39,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 5: [2022-11-27 07:15:39,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 5: [2022-11-27 07:15:39,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step119000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 07:15:39,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step119000 is ready now! 
0: successfully saved checkpoint at iteration 119000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3841.19 15: iteration 119010/ 125429 | consumed samples: 30466560 | consumed tokens: 62395514880 | elapsed time per iteration (s): 1.46 | learning rate: 2.118E-05 | global batch size: 256 | lm loss: 1.882604E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 175.183 | TFLOPs: 28.95 | 15: iteration 119020/ 125429 | consumed samples: 30469120 | consumed tokens: 62400757760 | elapsed time per iteration (s): 1.07 | learning rate: 2.118E-05 | global batch size: 256 | lm loss: 1.896758E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.527 | TFLOPs: 39.42 | 15: iteration 119030/ 125429 | consumed samples: 30471680 | consumed tokens: 62406000640 | elapsed time per iteration (s): 1.05 | learning rate: 2.118E-05 | global batch size: 256 | lm loss: 1.907685E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.211 | TFLOPs: 40.36 | 15: iteration 119040/ 125429 | consumed samples: 30474240 | consumed tokens: 62411243520 | elapsed time per iteration (s): 1.05 | learning rate: 2.117E-05 | global batch size: 256 | lm loss: 1.889001E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.034 | TFLOPs: 40.33 | 15: iteration 119050/ 125429 | consumed samples: 30476800 | consumed tokens: 62416486400 | elapsed time per iteration (s): 1.03 | learning rate: 2.117E-05 | global batch size: 256 | lm loss: 1.885517E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.571 | TFLOPs: 41.08 | 15: iteration 119060/ 125429 | consumed samples: 30479360 | consumed tokens: 62421729280 | elapsed time per iteration (s): 
1.06 | learning rate: 2.117E-05 | global batch size: 256 | lm loss: 1.896782E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.775 | TFLOPs: 39.79 | 15: iteration 119070/ 125429 | consumed samples: 30481920 | consumed tokens: 62426972160 | elapsed time per iteration (s): 1.04 | learning rate: 2.116E-05 | global batch size: 256 | lm loss: 1.873628E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.256 | TFLOPs: 40.86 | 15: iteration 119080/ 125429 | consumed samples: 30484480 | consumed tokens: 62432215040 | elapsed time per iteration (s): 1.07 | learning rate: 2.116E-05 | global batch size: 256 | lm loss: 1.893167E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.521 | TFLOPs: 39.58 | 15: iteration 119090/ 125429 | consumed samples: 30487040 | consumed tokens: 62437457920 | elapsed time per iteration (s): 1.07 | learning rate: 2.115E-05 | global batch size: 256 | lm loss: 1.879849E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.632 | TFLOPs: 39.44 | 15: iteration 119100/ 125429 | consumed samples: 30489600 | consumed tokens: 62442700800 | elapsed time per iteration (s): 1.07 | learning rate: 2.115E-05 | global batch size: 256 | lm loss: 1.902238E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.107 | TFLOPs: 39.68 | 15: iteration 119110/ 125429 | consumed samples: 30492160 | consumed tokens: 62447943680 | elapsed time per iteration (s): 1.06 | learning rate: 2.115E-05 | global batch size: 256 | lm loss: 1.898051E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.081 | TFLOPs: 40.01 | 15: 
iteration 119120/ 125429 | consumed samples: 30494720 | consumed tokens: 62453186560 | elapsed time per iteration (s): 1.06 | learning rate: 2.114E-05 | global batch size: 256 | lm loss: 1.914393E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.508 | TFLOPs: 39.91 | 15: iteration 119130/ 125429 | consumed samples: 30497280 | consumed tokens: 62458429440 | elapsed time per iteration (s): 1.05 | learning rate: 2.114E-05 | global batch size: 256 | lm loss: 1.911976E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.505 | TFLOPs: 40.24 | 15: iteration 119140/ 125429 | consumed samples: 30499840 | consumed tokens: 62463672320 | elapsed time per iteration (s): 1.03 | learning rate: 2.114E-05 | global batch size: 256 | lm loss: 1.875280E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.088 | TFLOPs: 41.00 | 15: iteration 119150/ 125429 | consumed samples: 30502400 | consumed tokens: 62468915200 | elapsed time per iteration (s): 1.05 | learning rate: 2.113E-05 | global batch size: 256 | lm loss: 1.873324E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.343 | TFLOPs: 40.21 | 15: iteration 119160/ 125429 | consumed samples: 30504960 | consumed tokens: 62474158080 | elapsed time per iteration (s): 1.05 | learning rate: 2.113E-05 | global batch size: 256 | lm loss: 1.875270E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.862 | TFLOPs: 40.47 | 15: iteration 119170/ 125429 | consumed samples: 30507520 | consumed tokens: 62479400960 | elapsed time per iteration (s): 1.05 | learning rate: 2.113E-05 | global batch size: 256 | lm loss: 1.908304E+00 | grad norm: 0.174 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.971 | TFLOPs: 40.32 | 15: iteration 119180/ 125429 | consumed samples: 30510080 | consumed tokens: 62484643840 | elapsed time per iteration (s): 1.06 | learning rate: 2.112E-05 | global batch size: 256 | lm loss: 1.878432E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.819 | TFLOPs: 39.80 | 15: iteration 119190/ 125429 | consumed samples: 30512640 | consumed tokens: 62489886720 | elapsed time per iteration (s): 1.05 | learning rate: 2.112E-05 | global batch size: 256 | lm loss: 1.883370E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.673 | TFLOPs: 40.27 | 15: iteration 119200/ 125429 | consumed samples: 30515200 | consumed tokens: 62495129600 | elapsed time per iteration (s): 1.06 | learning rate: 2.112E-05 | global batch size: 256 | lm loss: 1.881902E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.690 | TFLOPs: 39.78 | 15: iteration 119210/ 125429 | consumed samples: 30517760 | consumed tokens: 62500372480 | elapsed time per iteration (s): 1.04 | learning rate: 2.111E-05 | global batch size: 256 | lm loss: 1.891450E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.163 | TFLOPs: 40.52 | 15: iteration 119220/ 125429 | consumed samples: 30520320 | consumed tokens: 62505615360 | elapsed time per iteration (s): 1.05 | learning rate: 2.111E-05 | global batch size: 256 | lm loss: 1.883806E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.552 | TFLOPs: 40.41 | 15: iteration 119230/ 125429 | consumed samples: 30522880 | consumed tokens: 62510858240 | elapsed time per iteration (s): 1.07 | 
learning rate: 2.110E-05 | global batch size: 256 | lm loss: 1.878056E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.564 | TFLOPs: 39.59 | 15: iteration 119240/ 125429 | consumed samples: 30525440 | consumed tokens: 62516101120 | elapsed time per iteration (s): 1.03 | learning rate: 2.110E-05 | global batch size: 256 | lm loss: 1.908171E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.534 | TFLOPs: 41.07 | 15: iteration 119250/ 125429 | consumed samples: 30528000 | consumed tokens: 62521344000 | elapsed time per iteration (s): 1.06 | learning rate: 2.110E-05 | global batch size: 256 | lm loss: 1.876646E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.114 | TFLOPs: 39.85 | 15: iteration 119260/ 125429 | consumed samples: 30530560 | consumed tokens: 62526586880 | elapsed time per iteration (s): 1.03 | learning rate: 2.109E-05 | global batch size: 256 | lm loss: 1.880765E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.140 | TFLOPs: 41.17 | 15: iteration 119270/ 125429 | consumed samples: 30533120 | consumed tokens: 62531829760 | elapsed time per iteration (s): 1.05 | learning rate: 2.109E-05 | global batch size: 256 | lm loss: 1.884174E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.404 | TFLOPs: 40.39 | 15: iteration 119280/ 125429 | consumed samples: 30535680 | consumed tokens: 62537072640 | elapsed time per iteration (s): 1.04 | learning rate: 2.109E-05 | global batch size: 256 | lm loss: 1.879928E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.967 | TFLOPs: 40.81 | 15: iteration 
119290/ 125429 | consumed samples: 30538240 | consumed tokens: 62542315520 | elapsed time per iteration (s): 1.05 | learning rate: 2.108E-05 | global batch size: 256 | lm loss: 1.922291E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.088 | TFLOPs: 40.17 | 15: iteration 119300/ 125429 | consumed samples: 30540800 | consumed tokens: 62547558400 | elapsed time per iteration (s): 1.19 | learning rate: 2.108E-05 | global batch size: 256 | lm loss: 1.902588E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.588 | TFLOPs: 35.46 | 15: iteration 119310/ 125429 | consumed samples: 30543360 | consumed tokens: 62552801280 | elapsed time per iteration (s): 1.03 | learning rate: 2.108E-05 | global batch size: 256 | lm loss: 1.888309E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.438 | TFLOPs: 41.06 | 15: iteration 119320/ 125429 | consumed samples: 30545920 | consumed tokens: 62558044160 | elapsed time per iteration (s): 1.04 | learning rate: 2.107E-05 | global batch size: 256 | lm loss: 1.867383E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.289 | TFLOPs: 40.54 | 15: iteration 119330/ 125429 | consumed samples: 30548480 | consumed tokens: 62563287040 | elapsed time per iteration (s): 1.03 | learning rate: 2.107E-05 | global batch size: 256 | lm loss: 1.880216E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.706 | TFLOPs: 41.10 | 15: iteration 119340/ 125429 | consumed samples: 30551040 | consumed tokens: 62568529920 | elapsed time per iteration (s): 1.20 | learning rate: 2.107E-05 | global batch size: 256 | lm loss: 1.888904E+00 | grad norm: 0.158 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 212.556 | TFLOPs: 35.13 | 15: iteration 119350/ 125429 | consumed samples: 30553600 | consumed tokens: 62573772800 | elapsed time per iteration (s): 1.33 | learning rate: 2.106E-05 | global batch size: 256 | lm loss: 1.900286E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 192.686 | TFLOPs: 31.84 | 15: iteration 119360/ 125429 | consumed samples: 30556160 | consumed tokens: 62579015680 | elapsed time per iteration (s): 1.06 | learning rate: 2.106E-05 | global batch size: 256 | lm loss: 1.886826E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.083 | TFLOPs: 39.84 | 15: iteration 119370/ 125429 | consumed samples: 30558720 | consumed tokens: 62584258560 | elapsed time per iteration (s): 1.03 | learning rate: 2.106E-05 | global batch size: 256 | lm loss: 1.891426E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.482 | TFLOPs: 41.06 | 15: iteration 119380/ 125429 | consumed samples: 30561280 | consumed tokens: 62589501440 | elapsed time per iteration (s): 1.08 | learning rate: 2.105E-05 | global batch size: 256 | lm loss: 1.877777E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.758 | TFLOPs: 39.13 | 15: iteration 119390/ 125429 | consumed samples: 30563840 | consumed tokens: 62594744320 | elapsed time per iteration (s): 1.06 | learning rate: 2.105E-05 | global batch size: 256 | lm loss: 1.895334E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.947 | TFLOPs: 39.82 | 15: iteration 119400/ 125429 | consumed samples: 30566400 | consumed tokens: 62599987200 | elapsed time per iteration (s): 1.05 | learning 
rate: 2.104E-05 | global batch size: 256 | lm loss: 1.854439E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.684 | TFLOPs: 40.11 | 15: iteration 119410/ 125429 | consumed samples: 30568960 | consumed tokens: 62605230080 | elapsed time per iteration (s): 1.05 | learning rate: 2.104E-05 | global batch size: 256 | lm loss: 1.914548E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.170 | TFLOPs: 40.35 | 15: iteration 119420/ 125429 | consumed samples: 30571520 | consumed tokens: 62610472960 | elapsed time per iteration (s): 1.02 | learning rate: 2.104E-05 | global batch size: 256 | lm loss: 1.904319E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.849 | TFLOPs: 41.62 | 15: iteration 119430/ 125429 | consumed samples: 30574080 | consumed tokens: 62615715840 | elapsed time per iteration (s): 1.02 | learning rate: 2.103E-05 | global batch size: 256 | lm loss: 1.904051E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.103 | TFLOPs: 41.33 | 15: iteration 119440/ 125429 | consumed samples: 30576640 | consumed tokens: 62620958720 | elapsed time per iteration (s): 1.05 | learning rate: 2.103E-05 | global batch size: 256 | lm loss: 1.862466E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.664 | TFLOPs: 40.27 | 15: iteration 119450/ 125429 | consumed samples: 30579200 | consumed tokens: 62626201600 | elapsed time per iteration (s): 1.06 | learning rate: 2.103E-05 | global batch size: 256 | lm loss: 1.876438E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.197 | TFLOPs: 39.86 | 15: iteration 119460/ 
125429 | consumed samples: 30581760 | consumed tokens: 62631444480 | elapsed time per iteration (s): 1.02 | learning rate: 2.102E-05 | global batch size: 256 | lm loss: 1.894799E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.716 | TFLOPs: 41.43 | 15: iteration 119470/ 125429 | consumed samples: 30584320 | consumed tokens: 62636687360 | elapsed time per iteration (s): 1.08 | learning rate: 2.102E-05 | global batch size: 256 | lm loss: 1.884286E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.698 | TFLOPs: 39.12 | 15: iteration 119480/ 125429 | consumed samples: 30586880 | consumed tokens: 62641930240 | elapsed time per iteration (s): 1.05 | learning rate: 2.102E-05 | global batch size: 256 | lm loss: 1.901687E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.720 | TFLOPs: 40.11 | 15: iteration 119490/ 125429 | consumed samples: 30589440 | consumed tokens: 62647173120 | elapsed time per iteration (s): 1.05 | learning rate: 2.101E-05 | global batch size: 256 | lm loss: 1.884527E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.629 | TFLOPs: 40.26 | 15: iteration 119500/ 125429 | consumed samples: 30592000 | consumed tokens: 62652416000 | elapsed time per iteration (s): 1.03 | learning rate: 2.101E-05 | global batch size: 256 | lm loss: 1.877350E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.816 | TFLOPs: 40.95 | 15: iteration 119510/ 125429 | consumed samples: 30594560 | consumed tokens: 62657658880 | elapsed time per iteration (s): 1.07 | learning rate: 2.101E-05 | global batch size: 256 | lm loss: 1.905564E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 239.022 | TFLOPs: 39.50 | 15: iteration 119520/ 125429 | consumed samples: 30597120 | consumed tokens: 62662901760 | elapsed time per iteration (s): 1.04 | learning rate: 2.100E-05 | global batch size: 256 | lm loss: 1.896184E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.301 | TFLOPs: 40.70 | 15: iteration 119530/ 125429 | consumed samples: 30599680 | consumed tokens: 62668144640 | elapsed time per iteration (s): 1.03 | learning rate: 2.100E-05 | global batch size: 256 | lm loss: 1.886695E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.877 | TFLOPs: 40.96 | 15: iteration 119540/ 125429 | consumed samples: 30602240 | consumed tokens: 62673387520 | elapsed time per iteration (s): 1.04 | learning rate: 2.100E-05 | global batch size: 256 | lm loss: 1.901581E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.119 | TFLOPs: 40.84 | 15: iteration 119550/ 125429 | consumed samples: 30604800 | consumed tokens: 62678630400 | elapsed time per iteration (s): 1.07 | learning rate: 2.099E-05 | global batch size: 256 | lm loss: 1.888134E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.744 | TFLOPs: 39.62 | 15: iteration 119560/ 125429 | consumed samples: 30607360 | consumed tokens: 62683873280 | elapsed time per iteration (s): 1.04 | learning rate: 2.099E-05 | global batch size: 256 | lm loss: 1.883236E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.978 | TFLOPs: 40.81 | 15: iteration 119570/ 125429 | consumed samples: 30609920 | consumed tokens: 62689116160 | elapsed time per iteration (s): 1.03 | learning rate: 
2.099E-05 | global batch size: 256 | lm loss: 1.878155E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.498 | TFLOPs: 41.23 | 15: iteration 119580/ 125429 | consumed samples: 30612480 | consumed tokens: 62694359040 | elapsed time per iteration (s): 1.06 | learning rate: 2.098E-05 | global batch size: 256 | lm loss: 1.870295E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.348 | TFLOPs: 39.88 | 15: iteration 119590/ 125429 | consumed samples: 30615040 | consumed tokens: 62699601920 | elapsed time per iteration (s): 1.07 | learning rate: 2.098E-05 | global batch size: 256 | lm loss: 1.904412E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.309 | TFLOPs: 39.55 | 15: iteration 119600/ 125429 | consumed samples: 30617600 | consumed tokens: 62704844800 | elapsed time per iteration (s): 1.06 | learning rate: 2.098E-05 | global batch size: 256 | lm loss: 1.878661E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.322 | TFLOPs: 39.88 | 15: iteration 119610/ 125429 | consumed samples: 30620160 | consumed tokens: 62710087680 | elapsed time per iteration (s): 1.05 | learning rate: 2.097E-05 | global batch size: 256 | lm loss: 1.896880E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.921 | TFLOPs: 40.48 | 15: iteration 119620/ 125429 | consumed samples: 30622720 | consumed tokens: 62715330560 | elapsed time per iteration (s): 1.04 | learning rate: 2.097E-05 | global batch size: 256 | lm loss: 1.912138E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.082 | TFLOPs: 40.83 | 15: iteration 119630/ 125429 | 
consumed samples: 30625280 | consumed tokens: 62720573440 | elapsed time per iteration (s): 1.11 | learning rate: 2.097E-05 | global batch size: 256 | lm loss: 1.892960E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.585 | TFLOPs: 38.27 | 15: iteration 119640/ 125429 | consumed samples: 30627840 | consumed tokens: 62725816320 | elapsed time per iteration (s): 1.04 | learning rate: 2.096E-05 | global batch size: 256 | lm loss: 1.898229E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.702 | TFLOPs: 40.60 | 15: iteration 119650/ 125429 | consumed samples: 30630400 | consumed tokens: 62731059200 | elapsed time per iteration (s): 1.04 | learning rate: 2.096E-05 | global batch size: 256 | lm loss: 1.905180E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.685 | TFLOPs: 40.77 | 15: iteration 119660/ 125429 | consumed samples: 30632960 | consumed tokens: 62736302080 | elapsed time per iteration (s): 1.08 | learning rate: 2.096E-05 | global batch size: 256 | lm loss: 1.898610E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.082 | TFLOPs: 39.34 | 15: iteration 119670/ 125429 | consumed samples: 30635520 | consumed tokens: 62741544960 | elapsed time per iteration (s): 1.05 | learning rate: 2.095E-05 | global batch size: 256 | lm loss: 1.876169E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.302 | TFLOPs: 40.37 | 15: iteration 119680/ 125429 | consumed samples: 30638080 | consumed tokens: 62746787840 | elapsed time per iteration (s): 1.04 | learning rate: 2.095E-05 | global batch size: 256 | lm loss: 1.893623E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 245.971 | TFLOPs: 40.65 | 15: iteration 119690/ 125429 | consumed samples: 30640640 | consumed tokens: 62752030720 | elapsed time per iteration (s): 1.04 | learning rate: 2.095E-05 | global batch size: 256 | lm loss: 1.909625E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.519 | TFLOPs: 40.57 | 15: iteration 119700/ 125429 | consumed samples: 30643200 | consumed tokens: 62757273600 | elapsed time per iteration (s): 1.06 | learning rate: 2.094E-05 | global batch size: 256 | lm loss: 1.915642E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.848 | TFLOPs: 39.80 | 15: iteration 119710/ 125429 | consumed samples: 30645760 | consumed tokens: 62762516480 | elapsed time per iteration (s): 1.06 | learning rate: 2.094E-05 | global batch size: 256 | lm loss: 1.879740E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.223 | TFLOPs: 40.03 | 15: iteration 119720/ 125429 | consumed samples: 30648320 | consumed tokens: 62767759360 | elapsed time per iteration (s): 1.03 | learning rate: 2.094E-05 | global batch size: 256 | lm loss: 1.907281E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.607 | TFLOPs: 40.92 | 15: iteration 119730/ 125429 | consumed samples: 30650880 | consumed tokens: 62773002240 | elapsed time per iteration (s): 1.04 | learning rate: 2.093E-05 | global batch size: 256 | lm loss: 1.888392E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.491 | TFLOPs: 40.73 | 15: iteration 119740/ 125429 | consumed samples: 30653440 | consumed tokens: 62778245120 | elapsed time per iteration (s): 1.06 | learning rate: 
2.093E-05 | global batch size: 256 | lm loss: 1.905417E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.526 | TFLOPs: 40.08 | 15: iteration 119750/ 125429 | consumed samples: 30656000 | consumed tokens: 62783488000 | elapsed time per iteration (s): 1.03 | learning rate: 2.093E-05 | global batch size: 256 | lm loss: 1.896495E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.046 | TFLOPs: 41.16 | 15: iteration 119760/ 125429 | consumed samples: 30658560 | consumed tokens: 62788730880 | elapsed time per iteration (s): 1.04 | learning rate: 2.092E-05 | global batch size: 256 | lm loss: 1.886560E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.152 | TFLOPs: 40.51 | 15: iteration 119770/ 125429 | consumed samples: 30661120 | consumed tokens: 62793973760 | elapsed time per iteration (s): 1.03 | learning rate: 2.092E-05 | global batch size: 256 | lm loss: 1.881005E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.009 | TFLOPs: 41.15 | 15: iteration 119780/ 125429 | consumed samples: 30663680 | consumed tokens: 62799216640 | elapsed time per iteration (s): 1.03 | learning rate: 2.092E-05 | global batch size: 256 | lm loss: 1.878805E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.459 | TFLOPs: 41.06 | 15: iteration 119790/ 125429 | consumed samples: 30666240 | consumed tokens: 62804459520 | elapsed time per iteration (s): 1.04 | learning rate: 2.091E-05 | global batch size: 256 | lm loss: 1.901050E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.769 | TFLOPs: 40.78 | 15: iteration 119800/ 125429 | 
consumed samples: 30668800 | consumed tokens: 62809702400 | elapsed time per iteration (s): 1.09 | learning rate: 2.091E-05 | global batch size: 256 | lm loss: 1.866417E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.492 | TFLOPs: 38.92 | 15: iteration 119810/ 125429 | consumed samples: 30671360 | consumed tokens: 62814945280 | elapsed time per iteration (s): 1.06 | learning rate: 2.091E-05 | global batch size: 256 | lm loss: 1.884506E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.389 | TFLOPs: 39.89 | 15: iteration 119820/ 125429 | consumed samples: 30673920 | consumed tokens: 62820188160 | elapsed time per iteration (s): 1.08 | learning rate: 2.090E-05 | global batch size: 256 | lm loss: 1.906947E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.471 | TFLOPs: 39.24 | 15: iteration 119830/ 125429 | consumed samples: 30676480 | consumed tokens: 62825431040 | elapsed time per iteration (s): 1.08 | learning rate: 2.090E-05 | global batch size: 256 | lm loss: 1.901121E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.208 | TFLOPs: 39.20 | 15: iteration 119840/ 125429 | consumed samples: 30679040 | consumed tokens: 62830673920 | elapsed time per iteration (s): 1.04 | learning rate: 2.090E-05 | global batch size: 256 | lm loss: 1.861677E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.637 | TFLOPs: 40.76 | 15: iteration 119850/ 125429 | consumed samples: 30681600 | consumed tokens: 62835916800 | elapsed time per iteration (s): 1.06 | learning rate: 2.090E-05 | global batch size: 256 | lm loss: 1.887636E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 241.987 | TFLOPs: 39.99 | 15: iteration 119860/ 125429 | consumed samples: 30684160 | consumed tokens: 62841159680 | elapsed time per iteration (s): 1.03 | learning rate: 2.089E-05 | global batch size: 256 | lm loss: 1.891825E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.683 | TFLOPs: 40.93 | 15: iteration 119870/ 125429 | consumed samples: 30686720 | consumed tokens: 62846402560 | elapsed time per iteration (s): 1.04 | learning rate: 2.089E-05 | global batch size: 256 | lm loss: 1.885185E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.285 | TFLOPs: 40.54 | 15: iteration 119880/ 125429 | consumed samples: 30689280 | consumed tokens: 62851645440 | elapsed time per iteration (s): 1.14 | learning rate: 2.089E-05 | global batch size: 256 | lm loss: 1.898489E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 224.008 | TFLOPs: 37.02 | 15: iteration 119890/ 125429 | consumed samples: 30691840 | consumed tokens: 62856888320 | elapsed time per iteration (s): 1.04 | learning rate: 2.088E-05 | global batch size: 256 | lm loss: 1.852682E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.510 | TFLOPs: 40.57 | 15: iteration 119900/ 125429 | consumed samples: 30694400 | consumed tokens: 62862131200 | elapsed time per iteration (s): 1.05 | learning rate: 2.088E-05 | global batch size: 256 | lm loss: 1.888799E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.809 | TFLOPs: 40.13 | 15: iteration 119910/ 125429 | consumed samples: 30696960 | consumed tokens: 62867374080 | elapsed time per iteration (s): 1.05 | learning rate: 
2.088E-05 | global batch size: 256 | lm loss: 1.857513E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.095 | TFLOPs: 40.34 | 15: iteration 119920/ 125429 | consumed samples: 30699520 | consumed tokens: 62872616960 | elapsed time per iteration (s): 1.06 | learning rate: 2.087E-05 | global batch size: 256 | lm loss: 1.867986E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.283 | TFLOPs: 40.04 | 15: iteration 119930/ 125429 | consumed samples: 30702080 | consumed tokens: 62877859840 | elapsed time per iteration (s): 1.05 | learning rate: 2.087E-05 | global batch size: 256 | lm loss: 1.908379E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.084 | TFLOPs: 40.17 | 15: iteration 119940/ 125429 | consumed samples: 30704640 | consumed tokens: 62883102720 | elapsed time per iteration (s): 1.06 | learning rate: 2.087E-05 | global batch size: 256 | lm loss: 1.874576E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.429 | TFLOPs: 40.06 | 15: iteration 119950/ 125429 | consumed samples: 30707200 | consumed tokens: 62888345600 | elapsed time per iteration (s): 1.05 | learning rate: 2.086E-05 | global batch size: 256 | lm loss: 1.904088E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.640 | TFLOPs: 40.43 | 15: iteration 119960/ 125429 | consumed samples: 30709760 | consumed tokens: 62893588480 | elapsed time per iteration (s): 1.04 | learning rate: 2.086E-05 | global batch size: 256 | lm loss: 1.907009E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.259 | TFLOPs: 40.70 | 15: iteration 119970/ 125429 | 
consumed samples: 30712320 | consumed tokens: 62898831360 | elapsed time per iteration (s): 1.22 | learning rate: 2.086E-05 | global batch size: 256 | lm loss: 1.904147E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 210.353 | TFLOPs: 34.76 | 15: iteration 119980/ 125429 | consumed samples: 30714880 | consumed tokens: 62904074240 | elapsed time per iteration (s): 1.05 | learning rate: 2.085E-05 | global batch size: 256 | lm loss: 1.879357E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.930 | TFLOPs: 40.48 | 15: iteration 119990/ 125429 | consumed samples: 30717440 | consumed tokens: 62909317120 | elapsed time per iteration (s): 1.06 | learning rate: 2.085E-05 | global batch size: 256 | lm loss: 1.891149E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.526 | TFLOPs: 39.75 | 0: [2022-11-27 07:33:17,447] [INFO] [logging.py:68:log_dist] [Rank 0] step=120000, skipped=0, lr=[2.0847640496938153e-05, 2.0847640496938153e-05, 2.0847640496938153e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 120000/ 125429 | consumed samples: 30720000 | consumed tokens: 62914560000 | elapsed time per iteration (s): 1.03 | learning rate: 2.085E-05 | global batch size: 256 | lm loss: 1.886056E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.948 | TFLOPs: 41.14 | 0: steps: 120000 loss: 1.8645 iter time (s): 1.072 samples/sec: 238.725 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 120000 | lm loss value: 1.805055E+00 | lm loss PPL: 6.080309E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 120000 to 
checkpoints_1b5 0: [2022-11-27 07:33:17,828] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step120000 is begin to save! 0: [2022-11-27 07:33:17,838] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_01-model_00-model_states.pt... 0: [2022-11-27 07:33:18,070] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_01-model_00-model_states.pt. 0: [2022-11-27 07:33:18,070] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_03-model_00-model_states.pt... 0: [2022-11-27 07:33:18,173] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_03-model_00-model_states.pt. 0: [2022-11-27 07:33:18,173] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_04-model_00-model_states.pt... 0: [2022-11-27 07:33:18,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_04-model_00-model_states.pt. 0: [2022-11-27 07:33:18,281] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_05-model_00-model_states.pt... 0: [2022-11-27 07:33:18,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_05-model_00-model_states.pt. 0: [2022-11-27 07:33:18,391] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_06-model_00-model_states.pt... 0: [2022-11-27 07:33:18,698] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_06-model_00-model_states.pt. 0: [2022-11-27 07:33:18,699] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 07:33:18,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_07-model_00-model_states.pt. 0: [2022-11-27 07:33:18,812] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_08-model_00-model_states.pt... 0: [2022-11-27 07:33:18,921] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_08-model_00-model_states.pt. 0: [2022-11-27 07:33:18,921] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_09-model_00-model_states.pt... 0: [2022-11-27 07:33:19,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_09-model_00-model_states.pt. 0: [2022-11-27 07:33:19,029] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_10-model_00-model_states.pt... 0: [2022-11-27 07:33:19,134] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_10-model_00-model_states.pt. 0: [2022-11-27 07:33:19,135] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_11-model_00-model_states.pt... 0: [2022-11-27 07:33:19,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_11-model_00-model_states.pt. 0: [2022-11-27 07:33:19,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_12-model_00-model_states.pt... 0: [2022-11-27 07:33:19,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_12-model_00-model_states.pt. 0: [2022-11-27 07:33:19,354] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 07:33:19,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_13-model_00-model_states.pt. 0: [2022-11-27 07:33:19,464] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_14-model_00-model_states.pt... 0: [2022-11-27 07:33:19,571] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_14-model_00-model_states.pt. 0: [2022-11-27 07:33:19,572] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_15-model_00-model_states.pt... 0: [2022-11-27 07:33:19,677] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_15-model_00-model_states.pt. 0: [2022-11-27 07:33:19,677] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_16-model_00-model_states.pt... 0: [2022-11-27 07:33:19,787] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_16-model_00-model_states.pt. 0: [2022-11-27 07:33:19,787] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_17-model_00-model_states.pt... 0: [2022-11-27 07:33:19,894] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_17-model_00-model_states.pt. 0: [2022-11-27 07:33:19,894] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_18-model_00-model_states.pt... 0: [2022-11-27 07:33:20,003] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_18-model_00-model_states.pt. 0: [2022-11-27 07:33:20,004] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 07:33:20,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_19-model_00-model_states.pt. 0: [2022-11-27 07:33:20,112] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_20-model_00-model_states.pt... 0: [2022-11-27 07:33:20,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_20-model_00-model_states.pt. 0: [2022-11-27 07:33:20,221] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_21-model_00-model_states.pt... 0: [2022-11-27 07:33:20,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_21-model_00-model_states.pt. 0: [2022-11-27 07:33:20,331] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_22-model_00-model_states.pt... 0: [2022-11-27 07:33:20,441] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_22-model_00-model_states.pt. 0: [2022-11-27 07:33:20,441] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_23-model_00-model_states.pt... 0: [2022-11-27 07:33:20,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_23-model_00-model_states.pt. 0: [2022-11-27 07:33:20,551] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_24-model_00-model_states.pt... 0: [2022-11-27 07:33:20,666] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_24-model_00-model_states.pt. 0: [2022-11-27 07:33:20,666] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 07:33:20,775] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_25-model_00-model_states.pt. 0: [2022-11-27 07:33:20,775] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_26-model_00-model_states.pt... 0: [2022-11-27 07:33:20,880] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_26-model_00-model_states.pt. 0: [2022-11-27 07:33:20,880] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_27-model_00-model_states.pt... 0: [2022-11-27 07:33:20,989] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_27-model_00-model_states.pt. 0: [2022-11-27 07:33:20,989] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_28-model_00-model_states.pt... 0: [2022-11-27 07:33:21,100] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_28-model_00-model_states.pt. 0: [2022-11-27 07:33:21,100] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_29-model_00-model_states.pt... 0: [2022-11-27 07:33:21,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_29-model_00-model_states.pt. 0: [2022-11-27 07:33:21,209] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_30-model_00-model_states.pt... 0: [2022-11-27 07:33:21,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_30-model_00-model_states.pt. 0: [2022-11-27 07:33:21,317] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/layer_32-model_00-model_states.pt... 
0: [2022-11-27 07:33:21,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/layer_32-model_00-model_states.pt. 0: [2022-11-27 07:33:21,323] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step120000/mp_rank_00_model_states.pt 0: [2022-11-27 07:33:21,323] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/mp_rank_00_model_states.pt... 0: [2022-11-27 07:33:21,326] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/mp_rank_00_model_states.pt. 0: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 
14: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 
13: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
11: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 
1: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
10: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
4: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 
3: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 
1: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 
6: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:33:21,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:33:21,364] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step120000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:33:21,517] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-27 07:33:21,517] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 07:33:21,517] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 8: [2022-11-27 07:33:21,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:33:21,520] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 07:33:21,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 15: [2022-11-27 07:33:21,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:33:21,526] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:33:21,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 07:33:21,526] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 07:33:21,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 15: [2022-11-27 07:33:21,526] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 10: [2022-11-27 07:33:21,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-27 07:33:21,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 07:33:21,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 2: [2022-11-27 07:33:21,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:33:21,532] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 07:33:21,532] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 0: [2022-11-27 07:33:21,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:33:21,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:33:21,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 07:33:21,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 07:33:21,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 0: [2022-11-27 07:33:21,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 5: [2022-11-27 07:33:21,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 
5: [2022-11-27 07:33:21,533] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 07:33:21,533] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 0: [2022-11-27 07:33:21,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:33:21,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:33:21,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 07:33:21,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 07:33:21,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 0: [2022-11-27 07:33:21,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 7: [2022-11-27 07:33:21,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:33:21,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 07:33:21,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 2: [2022-11-27 07:33:21,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-27 07:33:21,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 07:33:21,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 3: [2022-11-27 07:33:21,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:33:21,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 07:33:21,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 12: [2022-11-27 07:33:21,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:33:21,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:33:21,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 07:33:21,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 07:33:21,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 12: [2022-11-27 07:33:21,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 12: [2022-11-27 07:33:21,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-27 07:33:21,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 8: [2022-11-27 07:33:21,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:33:21,533] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:33:21,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 3: [2022-11-27 07:33:21,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:33:21,539] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 07:33:21,539] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 0: [2022-11-27 07:33:21,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:33:21,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:33:21,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 07:33:21,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 5: [2022-11-27 07:33:21,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 
5: [2022-11-27 07:33:21,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 07:33:21,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 12: [2022-11-27 07:33:21,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:33:21,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 07:33:21,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 3: [2022-11-27 07:33:21,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:33:21,541] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 07:33:21,541] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 5: [2022-11-27 07:33:21,541] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:33:21,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 07:33:21,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 5: [2022-11-27 07:33:21,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-27 07:33:21,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 12: [2022-11-27 07:33:21,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:33:21,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 07:33:21,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 12: [2022-11-27 07:33:21,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:33:21,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 12: [2022-11-27 07:33:21,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 07:33:21,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 14: [2022-11-27 07:33:21,544] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:33:21,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 07:33:21,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 6: [2022-11-27 07:33:21,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-27 07:33:21,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:33:21,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 07:33:21,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 07:33:21,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 6: [2022-11-27 07:33:21,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 15: [2022-11-27 07:33:21,545] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:33:21,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:33:21,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 07:33:21,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 12: [2022-11-27 07:33:21,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:33:21,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 07:33:21,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
5: [2022-11-27 07:33:21,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:33:21,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 07:33:21,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 5: [2022-11-27 07:33:21,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:33:21,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 6: [2022-11-27 07:33:21,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:33:21,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 6: [2022-11-27 07:33:21,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 07:33:21,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 2: [2022-11-27 07:33:21,548] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:33:21,548] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 07:33:21,548] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
14: [2022-11-27 07:33:21,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:33:21,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 07:33:21,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 15: [2022-11-27 07:33:21,545] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 07:33:21,545] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 15: [2022-11-27 07:33:21,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:33:21,546] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 07:33:21,546] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 8: [2022-11-27 07:33:21,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 07:33:21,534] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 07:33:21,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 8: [2022-11-27 07:33:21,534] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
8: [2022-11-27 07:33:21,535] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:33:21,535] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 15: [2022-11-27 07:33:21,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:33:21,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:33:21,535] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 8: [2022-11-27 07:33:21,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:33:21,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 07:33:21,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 8: [2022-11-27 07:33:21,536] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 07:33:21,536] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 8: [2022-11-27 07:33:21,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 
8: [2022-11-27 07:33:21,537] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 15: [2022-11-27 07:33:21,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 15: [2022-11-27 07:33:21,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 8: [2022-11-27 07:33:21,537] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 15: [2022-11-27 07:33:21,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:33:21,553] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:33:21,553] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 07:33:21,553] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 0: [2022-11-27 07:33:21,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:33:21,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 07:33:21,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 9: [2022-11-27 07:33:21,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-27 07:33:21,550] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:33:21,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:33:21,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:33:21,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 07:33:21,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 6: [2022-11-27 07:33:21,554] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 2: [2022-11-27 07:33:21,551] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 9: [2022-11-27 07:33:21,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 9: [2022-11-27 07:33:21,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 6: [2022-11-27 07:33:21,554] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 2: [2022-11-27 07:33:21,551] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 9: [2022-11-27 07:33:21,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 
2: [2022-11-27 07:33:21,551] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:33:21,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 2: [2022-11-27 07:33:21,552] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 9: [2022-11-27 07:33:21,555] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 2: [2022-11-27 07:33:21,552] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 9: [2022-11-27 07:33:21,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:33:21,555] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 07:33:21,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 9: [2022-11-27 07:33:21,556] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:33:21,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 07:33:21,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 3: [2022-11-27 07:33:21,557] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 
3: [2022-11-27 07:33:21,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 07:33:21,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 10: [2022-11-27 07:33:21,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:33:21,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 07:33:21,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 10: [2022-11-27 07:33:21,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:33:21,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 07:33:21,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 10: [2022-11-27 07:33:21,558] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:33:21,558] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 07:33:21,558] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 0: [2022-11-27 07:33:21,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-27 07:33:21,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 07:33:21,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 0: [2022-11-27 07:33:21,560] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:33:21,560] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 07:33:21,560] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 13: [2022-11-27 07:33:21,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:33:21,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 07:33:21,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 13: [2022-11-27 07:33:21,539] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:33:21,540] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 07:33:21,540] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 13: [2022-11-27 07:33:21,542] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
13: [2022-11-27 07:33:21,542] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 07:33:21,542] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 13: [2022-11-27 07:33:21,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:33:21,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:33:21,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:33:21,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 14: [2022-11-27 07:33:21,556] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 13: [2022-11-27 07:33:21,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 07:33:21,547] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:33:21,556] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 13: [2022-11-27 07:33:21,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 13: [2022-11-27 07:33:21,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
13: [2022-11-27 07:33:21,547] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 07:33:21,547] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 14: [2022-11-27 07:33:21,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:33:21,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 07:33:21,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 2: [2022-11-27 07:33:21,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:33:21,550] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 07:33:21,550] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 2: [2022-11-27 07:33:21,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 9: [2022-11-27 07:33:21,561] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:33:21,561] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
9: [2022-11-27 07:33:21,561] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 07:33:21,562] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 6: [2022-11-27 07:33:21,563] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:33:21,563] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 07:33:21,563] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 9: [2022-11-27 07:33:21,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:33:21,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 07:33:21,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 7: [2022-11-27 07:33:21,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:33:21,564] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
7: [2022-11-27 07:33:21,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 07:33:21,564] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 07:33:21,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 7: [2022-11-27 07:33:21,564] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 10: [2022-11-27 07:33:21,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:33:21,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 07:33:21,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 6: [2022-11-27 07:33:21,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:33:21,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 07:33:21,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 2: [2022-11-27 07:33:21,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:33:21,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
2: [2022-11-27 07:33:21,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 14: [2022-11-27 07:33:21,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 07:33:21,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 2: [2022-11-27 07:33:21,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 14: [2022-11-27 07:33:21,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:33:21,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:33:21,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 15: [2022-11-27 07:33:21,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:33:21,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 2: [2022-11-27 07:33:21,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 14: [2022-11-27 07:33:21,566] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:33:21,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
14: [2022-11-27 07:33:21,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 07:33:21,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 3: [2022-11-27 07:33:21,567] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:33:21,567] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 07:33:21,567] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 7: [2022-11-27 07:33:21,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:33:21,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:33:21,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 07:33:21,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 07:33:21,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 7: [2022-11-27 07:33:21,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 13: [2022-11-27 07:33:21,568] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
8: [2022-11-27 07:33:21,565] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:33:21,565] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 13: [2022-11-27 07:33:21,568] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 8: [2022-11-27 07:33:21,565] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 13: [2022-11-27 07:33:21,568] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 3: [2022-11-27 07:33:21,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:33:21,570] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:33:21,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 07:33:21,570] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 07:33:21,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 3: [2022-11-27 07:33:21,570] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 1: [2022-11-27 07:33:21,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-27 07:33:21,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:33:21,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:33:21,573] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:33:21,573] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 07:33:21,573] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 9: [2022-11-27 07:33:21,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:33:21,576] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 07:33:21,576] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 6: [2022-11-27 07:33:21,576] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:33:21,577] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 07:33:21,577] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
1: [2022-11-27 07:33:21,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 07:33:21,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 07:33:21,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 07:33:21,572] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:33:21,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 1: [2022-11-27 07:33:21,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 1: [2022-11-27 07:33:21,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 1: [2022-11-27 07:33:21,572] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 07:33:21,572] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 7: [2022-11-27 07:33:21,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:33:21,578] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 
7: [2022-11-27 07:33:21,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 07:33:21,578] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 07:33:21,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 7: [2022-11-27 07:33:21,578] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 6: [2022-11-27 07:33:21,585] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:33:21,585] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 07:33:21,585] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 4: [2022-11-27 07:33:21,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:33:21,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:33:21,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:33:21,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:33:21,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
4: [2022-11-27 07:33:21,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:33:21,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:33:21,581] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:33:21,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 07:33:21,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 07:33:21,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 07:33:21,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 07:33:21,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 07:33:21,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 07:33:21,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 07:33:21,581] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 
07:33:21,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 4: [2022-11-27 07:33:21,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 4: [2022-11-27 07:33:21,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 4: [2022-11-27 07:33:21,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 4: [2022-11-27 07:33:21,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 4: [2022-11-27 07:33:21,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 4: [2022-11-27 07:33:21,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 4: [2022-11-27 07:33:21,581] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 10: [2022-11-27 07:33:21,588] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:33:21,588] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 07:33:21,588] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 10: [2022-11-27 07:33:21,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:33:21,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 07:33:21,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
10: [2022-11-27 07:33:21,589] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:33:21,589] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 07:33:21,589] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 15: [2022-11-27 07:33:21,566] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 07:33:21,566] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 1: [2022-11-27 07:33:21,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:33:21,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:33:21,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 07:33:21,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:33:21,592] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:33:21,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 07:33:21,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
1: [2022-11-27 07:33:21,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 07:33:21,592] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 07:33:21,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 1: [2022-11-27 07:33:21,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 1: [2022-11-27 07:33:21,592] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 13: [2022-11-27 07:33:21,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:33:21,586] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 07:33:21,586] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 7: [2022-11-27 07:33:21,614] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:33:21,615] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 07:33:21,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
0: [2022-11-27 07:33:21,629] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 07:33:21,629] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 5: [2022-11-27 07:33:21,633] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:33:21,633] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 07:33:21,633] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 11: [2022-11-27 07:33:21,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:33:21,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:33:21,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:33:21,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:33:21,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:33:21,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:33:21,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-27 07:33:21,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 07:33:21,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 07:33:21,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 07:33:21,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 07:33:21,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 07:33:21,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 07:33:21,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 07:33:21,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 11: [2022-11-27 07:33:21,638] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:33:21,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 11: [2022-11-27 07:33:21,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 11: [2022-11-27 07:33:21,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 
11: [2022-11-27 07:33:21,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 11: [2022-11-27 07:33:21,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 11: [2022-11-27 07:33:21,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 11: [2022-11-27 07:33:21,638] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step120000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 07:33:21,638] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step120000 is ready now! 0: successfully saved checkpoint at iteration 120000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3858.25 15: iteration 120010/ 125429 | consumed samples: 30722560 | consumed tokens: 62919802880 | elapsed time per iteration (s): 1.44 | learning rate: 2.084E-05 | global batch size: 256 | lm loss: 1.872509E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.397 | TFLOPs: 29.32 | 15: iteration 120020/ 125429 | consumed samples: 30725120 | consumed tokens: 62925045760 | elapsed time per iteration (s): 1.07 | learning rate: 2.084E-05 | global batch size: 256 | lm loss: 1.881900E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.047 | TFLOPs: 39.67 | 15: iteration 120030/ 125429 | consumed samples: 30727680 | consumed tokens: 62930288640 | elapsed time per iteration (s): 1.03 | learning rate: 2.084E-05 | global batch size: 256 | lm loss: 1.878963E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.729 | TFLOPs: 40.94 | 15: iteration 120040/ 125429 | consumed samples: 30730240 | consumed tokens: 62935531520 | elapsed time per iteration (s): 1.04 | 
learning rate: 2.084E-05 | global batch size: 256 | lm loss: 1.885959E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.244 | TFLOPs: 40.53 | 15: iteration 120050/ 125429 | consumed samples: 30732800 | consumed tokens: 62940774400 | elapsed time per iteration (s): 1.06 | learning rate: 2.083E-05 | global batch size: 256 | lm loss: 1.898756E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.945 | TFLOPs: 39.82 | 15: iteration 120060/ 125429 | consumed samples: 30735360 | consumed tokens: 62946017280 | elapsed time per iteration (s): 1.03 | learning rate: 2.083E-05 | global batch size: 256 | lm loss: 1.904052E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.216 | TFLOPs: 41.02 | 15: iteration 120070/ 125429 | consumed samples: 30737920 | consumed tokens: 62951260160 | elapsed time per iteration (s): 1.03 | learning rate: 2.083E-05 | global batch size: 256 | lm loss: 1.860716E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.220 | TFLOPs: 41.02 | 15: iteration 120080/ 125429 | consumed samples: 30740480 | consumed tokens: 62956503040 | elapsed time per iteration (s): 1.03 | learning rate: 2.082E-05 | global batch size: 256 | lm loss: 1.895474E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.983 | TFLOPs: 40.98 | 15: iteration 120090/ 125429 | consumed samples: 30743040 | consumed tokens: 62961745920 | elapsed time per iteration (s): 1.04 | learning rate: 2.082E-05 | global batch size: 256 | lm loss: 1.901288E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.254 | TFLOPs: 40.70 | 15: iteration 
120100/ 125429 | consumed samples: 30745600 | consumed tokens: 62966988800 | elapsed time per iteration (s): 1.03 | learning rate: 2.082E-05 | global batch size: 256 | lm loss: 1.887384E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.077 | TFLOPs: 41.00 | 15: iteration 120110/ 125429 | consumed samples: 30748160 | consumed tokens: 62972231680 | elapsed time per iteration (s): 1.21 | learning rate: 2.081E-05 | global batch size: 256 | lm loss: 1.893296E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 210.898 | TFLOPs: 34.85 | 15: iteration 120120/ 125429 | consumed samples: 30750720 | consumed tokens: 62977474560 | elapsed time per iteration (s): 1.04 | learning rate: 2.081E-05 | global batch size: 256 | lm loss: 1.885826E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.279 | TFLOPs: 40.53 | 15: iteration 120130/ 125429 | consumed samples: 30753280 | consumed tokens: 62982717440 | elapsed time per iteration (s): 1.03 | learning rate: 2.081E-05 | global batch size: 256 | lm loss: 1.884964E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.360 | TFLOPs: 40.88 | 15: iteration 120140/ 125429 | consumed samples: 30755840 | consumed tokens: 62987960320 | elapsed time per iteration (s): 1.02 | learning rate: 2.080E-05 | global batch size: 256 | lm loss: 1.873956E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.883 | TFLOPs: 41.30 | 15: iteration 120150/ 125429 | consumed samples: 30758400 | consumed tokens: 62993203200 | elapsed time per iteration (s): 1.04 | learning rate: 2.080E-05 | global batch size: 256 | lm loss: 1.890952E+00 | grad norm: 0.167 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.358 | TFLOPs: 40.71 | 15: iteration 120160/ 125429 | consumed samples: 30760960 | consumed tokens: 62998446080 | elapsed time per iteration (s): 1.03 | learning rate: 2.080E-05 | global batch size: 256 | lm loss: 1.920908E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.259 | TFLOPs: 41.19 | 15: iteration 120170/ 125429 | consumed samples: 30763520 | consumed tokens: 63003688960 | elapsed time per iteration (s): 1.04 | learning rate: 2.080E-05 | global batch size: 256 | lm loss: 1.921731E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.202 | TFLOPs: 40.52 | 15: iteration 120180/ 125429 | consumed samples: 30766080 | consumed tokens: 63008931840 | elapsed time per iteration (s): 1.07 | learning rate: 2.079E-05 | global batch size: 256 | lm loss: 1.902378E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.231 | TFLOPs: 39.70 | 15: iteration 120190/ 125429 | consumed samples: 30768640 | consumed tokens: 63014174720 | elapsed time per iteration (s): 1.04 | learning rate: 2.079E-05 | global batch size: 256 | lm loss: 1.924185E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.653 | TFLOPs: 40.60 | 15: iteration 120200/ 125429 | consumed samples: 30771200 | consumed tokens: 63019417600 | elapsed time per iteration (s): 1.05 | learning rate: 2.079E-05 | global batch size: 256 | lm loss: 1.884332E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.936 | TFLOPs: 40.31 | 15: iteration 120210/ 125429 | consumed samples: 30773760 | consumed tokens: 63024660480 | elapsed time per iteration (s): 1.37 | learning 
rate: 2.078E-05 | global batch size: 256 | lm loss: 1.894170E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 186.816 | TFLOPs: 30.87 | 15: iteration 120220/ 125429 | consumed samples: 30776320 | consumed tokens: 63029903360 | elapsed time per iteration (s): 1.26 | learning rate: 2.078E-05 | global batch size: 256 | lm loss: 1.878195E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 202.711 | TFLOPs: 33.50 | 15: iteration 120230/ 125429 | consumed samples: 30778880 | consumed tokens: 63035146240 | elapsed time per iteration (s): 1.03 | learning rate: 2.078E-05 | global batch size: 256 | lm loss: 1.859332E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.426 | TFLOPs: 40.89 | 15: iteration 120240/ 125429 | consumed samples: 30781440 | consumed tokens: 63040389120 | elapsed time per iteration (s): 1.06 | learning rate: 2.077E-05 | global batch size: 256 | lm loss: 1.904608E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.099 | TFLOPs: 40.01 | 15: iteration 120250/ 125429 | consumed samples: 30784000 | consumed tokens: 63045632000 | elapsed time per iteration (s): 1.07 | learning rate: 2.077E-05 | global batch size: 256 | lm loss: 1.883643E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.714 | TFLOPs: 39.45 | 15: iteration 120260/ 125429 | consumed samples: 30786560 | consumed tokens: 63050874880 | elapsed time per iteration (s): 1.03 | learning rate: 2.077E-05 | global batch size: 256 | lm loss: 1.887109E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.862 | TFLOPs: 40.96 | 15: iteration 120270/ 
125429 | consumed samples: 30789120 | consumed tokens: 63056117760 | elapsed time per iteration (s): 1.06 | learning rate: 2.077E-05 | global batch size: 256 | lm loss: 1.885513E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.647 | TFLOPs: 40.10 | 15: iteration 120280/ 125429 | consumed samples: 30791680 | consumed tokens: 63061360640 | elapsed time per iteration (s): 1.02 | learning rate: 2.076E-05 | global batch size: 256 | lm loss: 1.867676E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.018 | TFLOPs: 41.32 | 15: iteration 120290/ 125429 | consumed samples: 30794240 | consumed tokens: 63066603520 | elapsed time per iteration (s): 1.02 | learning rate: 2.076E-05 | global batch size: 256 | lm loss: 1.883851E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.842 | TFLOPs: 41.29 | 15: iteration 120300/ 125429 | consumed samples: 30796800 | consumed tokens: 63071846400 | elapsed time per iteration (s): 1.02 | learning rate: 2.076E-05 | global batch size: 256 | lm loss: 1.877967E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.113 | TFLOPs: 41.50 | 15: iteration 120310/ 125429 | consumed samples: 30799360 | consumed tokens: 63077089280 | elapsed time per iteration (s): 1.02 | learning rate: 2.075E-05 | global batch size: 256 | lm loss: 1.891379E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.990 | TFLOPs: 41.48 | 15: iteration 120320/ 125429 | consumed samples: 30801920 | consumed tokens: 63082332160 | elapsed time per iteration (s): 1.04 | learning rate: 2.075E-05 | global batch size: 256 | lm loss: 1.907384E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 245.019 | TFLOPs: 40.49 | 15: iteration 120330/ 125429 | consumed samples: 30804480 | consumed tokens: 63087575040 | elapsed time per iteration (s): 1.06 | learning rate: 2.075E-05 | global batch size: 256 | lm loss: 1.889457E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.895 | TFLOPs: 39.98 | 15: iteration 120340/ 125429 | consumed samples: 30807040 | consumed tokens: 63092817920 | elapsed time per iteration (s): 1.08 | learning rate: 2.074E-05 | global batch size: 256 | lm loss: 1.889533E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.650 | TFLOPs: 39.11 | 15: iteration 120350/ 125429 | consumed samples: 30809600 | consumed tokens: 63098060800 | elapsed time per iteration (s): 1.05 | learning rate: 2.074E-05 | global batch size: 256 | lm loss: 1.889817E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.770 | TFLOPs: 40.12 | 15: iteration 120360/ 125429 | consumed samples: 30812160 | consumed tokens: 63103303680 | elapsed time per iteration (s): 1.06 | learning rate: 2.074E-05 | global batch size: 256 | lm loss: 1.892460E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.242 | TFLOPs: 39.87 | 15: iteration 120370/ 125429 | consumed samples: 30814720 | consumed tokens: 63108546560 | elapsed time per iteration (s): 1.07 | learning rate: 2.074E-05 | global batch size: 256 | lm loss: 1.893347E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.090 | TFLOPs: 39.68 | 15: iteration 120380/ 125429 | consumed samples: 30817280 | consumed tokens: 63113789440 | elapsed time per iteration (s): 1.19 | learning rate: 
2.073E-05 | global batch size: 256 | lm loss: 1.885364E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.782 | TFLOPs: 35.66 | 15: iteration 120390/ 125429 | consumed samples: 30819840 | consumed tokens: 63119032320 | elapsed time per iteration (s): 1.03 | learning rate: 2.073E-05 | global batch size: 256 | lm loss: 1.873788E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.376 | TFLOPs: 41.05 | 15: iteration 120400/ 125429 | consumed samples: 30822400 | consumed tokens: 63124275200 | elapsed time per iteration (s): 1.07 | learning rate: 2.073E-05 | global batch size: 256 | lm loss: 1.897467E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.218 | TFLOPs: 39.70 | 15: iteration 120410/ 125429 | consumed samples: 30824960 | consumed tokens: 63129518080 | elapsed time per iteration (s): 1.02 | learning rate: 2.072E-05 | global batch size: 256 | lm loss: 1.886638E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.583 | TFLOPs: 41.58 | 15: iteration 120420/ 125429 | consumed samples: 30827520 | consumed tokens: 63134760960 | elapsed time per iteration (s): 1.04 | learning rate: 2.072E-05 | global batch size: 256 | lm loss: 1.902047E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.450 | TFLOPs: 40.56 | 15: iteration 120430/ 125429 | consumed samples: 30830080 | consumed tokens: 63140003840 | elapsed time per iteration (s): 1.04 | learning rate: 2.072E-05 | global batch size: 256 | lm loss: 1.895082E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.356 | TFLOPs: 40.71 | 15: iteration 120440/ 125429 | 
consumed samples: 30832640 | consumed tokens: 63145246720 | elapsed time per iteration (s): 1.05 | learning rate: 2.072E-05 | global batch size: 256 | lm loss: 1.896795E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.472 | TFLOPs: 40.24 | 15: iteration 120450/ 125429 | consumed samples: 30835200 | consumed tokens: 63150489600 | elapsed time per iteration (s): 1.04 | learning rate: 2.071E-05 | global batch size: 256 | lm loss: 1.879936E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.666 | TFLOPs: 40.76 | 15: iteration 120460/ 125429 | consumed samples: 30837760 | consumed tokens: 63155732480 | elapsed time per iteration (s): 1.05 | learning rate: 2.071E-05 | global batch size: 256 | lm loss: 1.885277E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.140 | TFLOPs: 40.18 | 15: iteration 120470/ 125429 | consumed samples: 30840320 | consumed tokens: 63160975360 | elapsed time per iteration (s): 1.20 | learning rate: 2.071E-05 | global batch size: 256 | lm loss: 1.871225E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.104 | TFLOPs: 35.38 | 15: iteration 120480/ 125429 | consumed samples: 30842880 | consumed tokens: 63166218240 | elapsed time per iteration (s): 1.04 | learning rate: 2.070E-05 | global batch size: 256 | lm loss: 1.884767E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.903 | TFLOPs: 40.80 | 15: iteration 120490/ 125429 | consumed samples: 30845440 | consumed tokens: 63171461120 | elapsed time per iteration (s): 1.05 | learning rate: 2.070E-05 | global batch size: 256 | lm loss: 1.887786E+00 | grad norm: 0.156 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 242.986 | TFLOPs: 40.16 | 15: iteration 120500/ 125429 | consumed samples: 30848000 | consumed tokens: 63176704000 | elapsed time per iteration (s): 1.03 | learning rate: 2.070E-05 | global batch size: 256 | lm loss: 1.886592E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.543 | TFLOPs: 40.91 | 15: iteration 120510/ 125429 | consumed samples: 30850560 | consumed tokens: 63181946880 | elapsed time per iteration (s): 1.07 | learning rate: 2.070E-05 | global batch size: 256 | lm loss: 1.841088E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.245 | TFLOPs: 39.54 | 15: iteration 120520/ 125429 | consumed samples: 30853120 | consumed tokens: 63187189760 | elapsed time per iteration (s): 1.02 | learning rate: 2.069E-05 | global batch size: 256 | lm loss: 1.899333E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.685 | TFLOPs: 41.43 | 15: iteration 120530/ 125429 | consumed samples: 30855680 | consumed tokens: 63192432640 | elapsed time per iteration (s): 1.03 | learning rate: 2.069E-05 | global batch size: 256 | lm loss: 1.867913E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.051 | TFLOPs: 40.99 | 15: iteration 120540/ 125429 | consumed samples: 30858240 | consumed tokens: 63197675520 | elapsed time per iteration (s): 1.07 | learning rate: 2.069E-05 | global batch size: 256 | lm loss: 1.897398E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.163 | TFLOPs: 39.52 | 15: iteration 120550/ 125429 | consumed samples: 30860800 | consumed tokens: 63202918400 | elapsed time per iteration (s): 1.04 | learning rate: 
2.068E-05 | global batch size: 256 | lm loss: 1.908075E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.376 | TFLOPs: 40.72 | 15: iteration 120560/ 125429 | consumed samples: 30863360 | consumed tokens: 63208161280 | elapsed time per iteration (s): 1.14 | learning rate: 2.068E-05 | global batch size: 256 | lm loss: 1.899264E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 224.639 | TFLOPs: 37.12 | 15: iteration 120570/ 125429 | consumed samples: 30865920 | consumed tokens: 63213404160 | elapsed time per iteration (s): 1.04 | learning rate: 2.068E-05 | global batch size: 256 | lm loss: 1.915320E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.992 | TFLOPs: 40.82 | 15: iteration 120580/ 125429 | consumed samples: 30868480 | consumed tokens: 63218647040 | elapsed time per iteration (s): 1.03 | learning rate: 2.068E-05 | global batch size: 256 | lm loss: 1.866628E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.144 | TFLOPs: 41.01 | 15: iteration 120590/ 125429 | consumed samples: 30871040 | consumed tokens: 63223889920 | elapsed time per iteration (s): 1.03 | learning rate: 2.067E-05 | global batch size: 256 | lm loss: 1.875258E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.468 | TFLOPs: 41.06 | 15: iteration 120600/ 125429 | consumed samples: 30873600 | consumed tokens: 63229132800 | elapsed time per iteration (s): 1.05 | learning rate: 2.067E-05 | global batch size: 256 | lm loss: 1.900789E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.674 | TFLOPs: 40.43 | 15: iteration 120610/ 125429 | 
consumed samples: 30876160 | consumed tokens: 63234375680 | elapsed time per iteration (s): 1.04 | learning rate: 2.067E-05 | global batch size: 256 | lm loss: 1.902616E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.927 | TFLOPs: 40.64 | 15: iteration 120620/ 125429 | consumed samples: 30878720 | consumed tokens: 63239618560 | elapsed time per iteration (s): 1.03 | learning rate: 2.067E-05 | global batch size: 256 | lm loss: 1.898868E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.085 | TFLOPs: 41.16 | 15: iteration 120630/ 125429 | consumed samples: 30881280 | consumed tokens: 63244861440 | elapsed time per iteration (s): 1.06 | learning rate: 2.066E-05 | global batch size: 256 | lm loss: 1.860423E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.153 | TFLOPs: 39.85 | 15: iteration 120640/ 125429 | consumed samples: 30883840 | consumed tokens: 63250104320 | elapsed time per iteration (s): 1.06 | learning rate: 2.066E-05 | global batch size: 256 | lm loss: 1.891716E+00 | grad norm: 0.154 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.206 | TFLOPs: 40.03 | 15: iteration 120650/ 125429 | consumed samples: 30886400 | consumed tokens: 63255347200 | elapsed time per iteration (s): 1.05 | learning rate: 2.066E-05 | global batch size: 256 | lm loss: 1.894018E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.679 | TFLOPs: 40.10 | 15: iteration 120660/ 125429 | consumed samples: 30888960 | consumed tokens: 63260590080 | elapsed time per iteration (s): 1.05 | learning rate: 2.065E-05 | global batch size: 256 | lm loss: 1.872444E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 242.953 | TFLOPs: 40.15 | 15: iteration 120670/ 125429 | consumed samples: 30891520 | consumed tokens: 63265832960 | elapsed time per iteration (s): 1.06 | learning rate: 2.065E-05 | global batch size: 256 | lm loss: 1.908852E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.353 | TFLOPs: 39.89 | 15: iteration 120680/ 125429 | consumed samples: 30894080 | consumed tokens: 63271075840 | elapsed time per iteration (s): 1.06 | learning rate: 2.065E-05 | global batch size: 256 | lm loss: 1.897375E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.566 | TFLOPs: 39.76 | 15: iteration 120690/ 125429 | consumed samples: 30896640 | consumed tokens: 63276318720 | elapsed time per iteration (s): 1.05 | learning rate: 2.065E-05 | global batch size: 256 | lm loss: 1.890560E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.851 | TFLOPs: 40.13 | 15: iteration 120700/ 125429 | consumed samples: 30899200 | consumed tokens: 63281561600 | elapsed time per iteration (s): 1.03 | learning rate: 2.064E-05 | global batch size: 256 | lm loss: 1.863530E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.599 | TFLOPs: 40.92 | 15: iteration 120710/ 125429 | consumed samples: 30901760 | consumed tokens: 63286804480 | elapsed time per iteration (s): 1.04 | learning rate: 2.064E-05 | global batch size: 256 | lm loss: 1.898790E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.530 | TFLOPs: 40.58 | 15: iteration 120720/ 125429 | consumed samples: 30904320 | consumed tokens: 63292047360 | elapsed time per iteration (s): 1.06 | learning rate: 
2.064E-05 | global batch size: 256 | lm loss: 1.892751E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.228 | TFLOPs: 40.03 | 15: iteration 120730/ 125429 | consumed samples: 30906880 | consumed tokens: 63297290240 | elapsed time per iteration (s): 1.05 | learning rate: 2.064E-05 | global batch size: 256 | lm loss: 1.870040E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.961 | TFLOPs: 40.32 | 15: iteration 120740/ 125429 | consumed samples: 30909440 | consumed tokens: 63302533120 | elapsed time per iteration (s): 1.20 | learning rate: 2.063E-05 | global batch size: 256 | lm loss: 1.883533E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 213.407 | TFLOPs: 35.27 | 15: iteration 120750/ 125429 | consumed samples: 30912000 | consumed tokens: 63307776000 | elapsed time per iteration (s): 1.04 | learning rate: 2.063E-05 | global batch size: 256 | lm loss: 1.885430E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.998 | TFLOPs: 40.49 | 15: iteration 120760/ 125429 | consumed samples: 30914560 | consumed tokens: 63313018880 | elapsed time per iteration (s): 1.04 | learning rate: 2.063E-05 | global batch size: 256 | lm loss: 1.875926E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.867 | TFLOPs: 40.80 | 15: iteration 120770/ 125429 | consumed samples: 30917120 | consumed tokens: 63318261760 | elapsed time per iteration (s): 1.05 | learning rate: 2.062E-05 | global batch size: 256 | lm loss: 1.894294E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.074 | TFLOPs: 40.34 | 15: iteration 120780/ 125429 | 
consumed samples: 30919680 | consumed tokens: 63323504640 | elapsed time per iteration (s): 1.06 | learning rate: 2.062E-05 | global batch size: 256 | lm loss: 1.877535E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.598 | TFLOPs: 40.09 | 15: iteration 120790/ 125429 | consumed samples: 30922240 | consumed tokens: 63328747520 | elapsed time per iteration (s): 1.07 | learning rate: 2.062E-05 | global batch size: 256 | lm loss: 1.908387E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.302 | TFLOPs: 39.71 | 15: iteration 120800/ 125429 | consumed samples: 30924800 | consumed tokens: 63333990400 | elapsed time per iteration (s): 1.06 | learning rate: 2.062E-05 | global batch size: 256 | lm loss: 1.860553E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.878 | TFLOPs: 39.81 | 15: iteration 120810/ 125429 | consumed samples: 30927360 | consumed tokens: 63339233280 | elapsed time per iteration (s): 1.04 | learning rate: 2.061E-05 | global batch size: 256 | lm loss: 1.891541E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.768 | TFLOPs: 40.62 | 15: iteration 120820/ 125429 | consumed samples: 30929920 | consumed tokens: 63344476160 | elapsed time per iteration (s): 1.06 | learning rate: 2.061E-05 | global batch size: 256 | lm loss: 1.906091E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.210 | TFLOPs: 39.86 | 15: iteration 120830/ 125429 | consumed samples: 30932480 | consumed tokens: 63349719040 | elapsed time per iteration (s): 1.07 | learning rate: 2.061E-05 | global batch size: 256 | lm loss: 1.866411E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 238.341 | TFLOPs: 39.39 | 15: iteration 120840/ 125429 | consumed samples: 30935040 | consumed tokens: 63354961920 | elapsed time per iteration (s): 1.04 | learning rate: 2.061E-05 | global batch size: 256 | lm loss: 1.899843E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.263 | TFLOPs: 40.86 | 15: iteration 120850/ 125429 | consumed samples: 30937600 | consumed tokens: 63360204800 | elapsed time per iteration (s): 1.02 | learning rate: 2.060E-05 | global batch size: 256 | lm loss: 1.895653E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.070 | TFLOPs: 41.49 | 15: iteration 120860/ 125429 | consumed samples: 30940160 | consumed tokens: 63365447680 | elapsed time per iteration (s): 1.04 | learning rate: 2.060E-05 | global batch size: 256 | lm loss: 1.875986E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.410 | TFLOPs: 40.56 | 15: iteration 120870/ 125429 | consumed samples: 30942720 | consumed tokens: 63370690560 | elapsed time per iteration (s): 1.09 | learning rate: 2.060E-05 | global batch size: 256 | lm loss: 1.902224E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.264 | TFLOPs: 38.71 | 15: iteration 120880/ 125429 | consumed samples: 30945280 | consumed tokens: 63375933440 | elapsed time per iteration (s): 1.04 | learning rate: 2.060E-05 | global batch size: 256 | lm loss: 1.916051E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.778 | TFLOPs: 40.62 | 15: iteration 120890/ 125429 | consumed samples: 30947840 | consumed tokens: 63381176320 | elapsed time per iteration (s): 1.03 | learning rate: 
2.059E-05 | global batch size: 256 | lm loss: 1.920465E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.501 | TFLOPs: 41.07 | 15: iteration 120900/ 125429 | consumed samples: 30950400 | consumed tokens: 63386419200 | elapsed time per iteration (s): 1.03 | learning rate: 2.059E-05 | global batch size: 256 | lm loss: 1.902521E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.309 | TFLOPs: 41.04 | 15: iteration 120910/ 125429 | consumed samples: 30952960 | consumed tokens: 63391662080 | elapsed time per iteration (s): 1.04 | learning rate: 2.059E-05 | global batch size: 256 | lm loss: 1.875733E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.674 | TFLOPs: 40.76 | 15: iteration 120920/ 125429 | consumed samples: 30955520 | consumed tokens: 63396904960 | elapsed time per iteration (s): 1.07 | learning rate: 2.058E-05 | global batch size: 256 | lm loss: 1.914855E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.439 | TFLOPs: 39.57 | 15: iteration 120930/ 125429 | consumed samples: 30958080 | consumed tokens: 63402147840 | elapsed time per iteration (s): 1.04 | learning rate: 2.058E-05 | global batch size: 256 | lm loss: 1.894821E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.219 | TFLOPs: 40.69 | 15: iteration 120940/ 125429 | consumed samples: 30960640 | consumed tokens: 63407390720 | elapsed time per iteration (s): 1.06 | learning rate: 2.058E-05 | global batch size: 256 | lm loss: 1.890297E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.479 | TFLOPs: 39.91 | 15: iteration 120950/ 125429 | 
consumed samples: 30963200 | consumed tokens: 63412633600 | elapsed time per iteration (s): 1.05 | learning rate: 2.058E-05 | global batch size: 256 | lm loss: 1.888151E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.967 | TFLOPs: 40.15 | 15: iteration 120960/ 125429 | consumed samples: 30965760 | consumed tokens: 63417876480 | elapsed time per iteration (s): 1.06 | learning rate: 2.057E-05 | global batch size: 256 | lm loss: 1.899730E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.729 | TFLOPs: 39.95 | 15: iteration 120970/ 125429 | consumed samples: 30968320 | consumed tokens: 63423119360 | elapsed time per iteration (s): 1.06 | learning rate: 2.057E-05 | global batch size: 256 | lm loss: 1.918810E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.486 | TFLOPs: 39.74 | 15: iteration 120980/ 125429 | consumed samples: 30970880 | consumed tokens: 63428362240 | elapsed time per iteration (s): 1.05 | learning rate: 2.057E-05 | global batch size: 256 | lm loss: 1.887445E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.113 | TFLOPs: 40.34 | 15: iteration 120990/ 125429 | consumed samples: 30973440 | consumed tokens: 63433605120 | elapsed time per iteration (s): 1.07 | learning rate: 2.057E-05 | global batch size: 256 | lm loss: 1.908742E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.502 | TFLOPs: 39.41 | 15: iteration 121000/ 125429 | consumed samples: 30976000 | consumed tokens: 63438848000 | elapsed time per iteration (s): 1.05 | learning rate: 2.056E-05 | global batch size: 256 | lm loss: 1.887109E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 242.800 | TFLOPs: 40.12 | 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 121000 | lm loss value: 1.837415E+00 | lm loss PPL: 6.280285E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 121000 to checkpoints_1b5 0: [2022-11-27 07:51:00,877] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step121000 is begin to save! 0: [2022-11-27 07:51:00,885] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_01-model_00-model_states.pt... 0: [2022-11-27 07:51:01,136] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_01-model_00-model_states.pt. 0: [2022-11-27 07:51:01,137] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_03-model_00-model_states.pt... 0: [2022-11-27 07:51:01,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_03-model_00-model_states.pt. 0: [2022-11-27 07:51:01,240] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_04-model_00-model_states.pt... 0: [2022-11-27 07:51:01,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_04-model_00-model_states.pt. 0: [2022-11-27 07:51:01,344] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_05-model_00-model_states.pt... 0: [2022-11-27 07:51:01,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_05-model_00-model_states.pt. 0: [2022-11-27 07:51:01,445] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_06-model_00-model_states.pt... 
0: [2022-11-27 07:51:01,555] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_06-model_00-model_states.pt. 0: [2022-11-27 07:51:01,555] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_07-model_00-model_states.pt... 0: [2022-11-27 07:51:01,660] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_07-model_00-model_states.pt. 0: [2022-11-27 07:51:01,660] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_08-model_00-model_states.pt... 0: [2022-11-27 07:51:01,763] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_08-model_00-model_states.pt. 0: [2022-11-27 07:51:01,764] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_09-model_00-model_states.pt... 0: [2022-11-27 07:51:01,869] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_09-model_00-model_states.pt. 0: [2022-11-27 07:51:01,869] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_10-model_00-model_states.pt... 0: [2022-11-27 07:51:01,973] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_10-model_00-model_states.pt. 0: [2022-11-27 07:51:01,974] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_11-model_00-model_states.pt... 0: [2022-11-27 07:51:02,077] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_11-model_00-model_states.pt. 0: [2022-11-27 07:51:02,078] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_12-model_00-model_states.pt... 
0: [2022-11-27 07:51:02,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_12-model_00-model_states.pt. 0: [2022-11-27 07:51:02,185] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_13-model_00-model_states.pt... 0: [2022-11-27 07:51:02,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_13-model_00-model_states.pt. 0: [2022-11-27 07:51:02,293] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_14-model_00-model_states.pt... 0: [2022-11-27 07:51:02,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_14-model_00-model_states.pt. 0: [2022-11-27 07:51:02,402] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_15-model_00-model_states.pt... 0: [2022-11-27 07:51:02,506] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_15-model_00-model_states.pt. 0: [2022-11-27 07:51:02,506] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_16-model_00-model_states.pt... 0: [2022-11-27 07:51:02,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_16-model_00-model_states.pt. 0: [2022-11-27 07:51:02,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_17-model_00-model_states.pt... 0: [2022-11-27 07:51:02,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_17-model_00-model_states.pt. 0: [2022-11-27 07:51:02,716] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_18-model_00-model_states.pt... 
0: [2022-11-27 07:51:02,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_18-model_00-model_states.pt. 0: [2022-11-27 07:51:02,822] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_19-model_00-model_states.pt... 0: [2022-11-27 07:51:02,924] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_19-model_00-model_states.pt. 0: [2022-11-27 07:51:02,925] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_20-model_00-model_states.pt... 0: [2022-11-27 07:51:03,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_20-model_00-model_states.pt. 0: [2022-11-27 07:51:03,027] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_21-model_00-model_states.pt... 0: [2022-11-27 07:51:03,127] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_21-model_00-model_states.pt. 0: [2022-11-27 07:51:03,128] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_22-model_00-model_states.pt... 0: [2022-11-27 07:51:03,229] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_22-model_00-model_states.pt. 0: [2022-11-27 07:51:03,230] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_23-model_00-model_states.pt... 0: [2022-11-27 07:51:03,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_23-model_00-model_states.pt. 0: [2022-11-27 07:51:03,336] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_24-model_00-model_states.pt... 
0: [2022-11-27 07:51:03,440] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_24-model_00-model_states.pt. 0: [2022-11-27 07:51:03,441] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_25-model_00-model_states.pt... 0: [2022-11-27 07:51:03,540] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_25-model_00-model_states.pt. 0: [2022-11-27 07:51:03,541] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_26-model_00-model_states.pt... 0: [2022-11-27 07:51:03,643] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_26-model_00-model_states.pt. 0: [2022-11-27 07:51:03,643] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_27-model_00-model_states.pt... 0: [2022-11-27 07:51:03,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_27-model_00-model_states.pt. 0: [2022-11-27 07:51:03,744] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_28-model_00-model_states.pt... 0: [2022-11-27 07:51:03,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_28-model_00-model_states.pt. 0: [2022-11-27 07:51:03,848] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_29-model_00-model_states.pt... 0: [2022-11-27 07:51:03,956] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_29-model_00-model_states.pt. 0: [2022-11-27 07:51:03,956] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_30-model_00-model_states.pt... 
0: [2022-11-27 07:51:04,062] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_30-model_00-model_states.pt. 0: [2022-11-27 07:51:04,063] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/layer_32-model_00-model_states.pt... 0: [2022-11-27 07:51:04,067] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/layer_32-model_00-model_states.pt. 0: [2022-11-27 07:51:04,069] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step121000/mp_rank_00_model_states.pt 0: [2022-11-27 07:51:04,069] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/mp_rank_00_model_states.pt... 0: [2022-11-27 07:51:04,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/mp_rank_00_model_states.pt. 0: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 
1: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 
5: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
14: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 
7: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
6: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 
2: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 
11: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 
10: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 
9: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
13: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 
12: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 8: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 10: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 
8: [2022-11-27 07:51:04,111] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step121000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 12: [2022-11-27 07:51:04,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:51:04,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 07:51:04,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 12: [2022-11-27 07:51:04,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:51:04,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 07:51:04,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 12: [2022-11-27 07:51:04,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:51:04,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 07:51:04,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 6: [2022-11-27 07:51:04,274] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 
6: [2022-11-27 07:51:04,274] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 07:51:04,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 3: [2022-11-27 07:51:04,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:51:04,275] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 07:51:04,275] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 3: [2022-11-27 07:51:04,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:51:04,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 07:51:04,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 1: [2022-11-27 07:51:04,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:51:04,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:51:04,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 
3: [2022-11-27 07:51:04,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 07:51:04,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 15: [2022-11-27 07:51:04,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:51:04,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:51:04,278] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:51:04,278] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 07:51:04,278] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 12: [2022-11-27 07:51:04,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:51:04,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 07:51:04,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 11: [2022-11-27 07:51:04,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:51:04,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 
6: [2022-11-27 07:51:04,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 07:51:04,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 4: [2022-11-27 07:51:04,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:51:04,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 07:51:04,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 4: [2022-11-27 07:51:04,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:51:04,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:51:04,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 07:51:04,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 6: [2022-11-27 07:51:04,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:51:04,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 4: [2022-11-27 07:51:04,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
6: [2022-11-27 07:51:04,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 07:51:04,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 3: [2022-11-27 07:51:04,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:51:04,282] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 07:51:04,282] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 5: [2022-11-27 07:51:04,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:51:04,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 07:51:04,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 3: [2022-11-27 07:51:04,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:51:04,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 07:51:04,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 9: [2022-11-27 07:51:04,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 
9: [2022-11-27 07:51:04,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 07:51:04,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 2: [2022-11-27 07:51:04,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:51:04,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:51:04,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:51:04,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 07:51:04,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 07:51:04,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 07:51:04,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 2: [2022-11-27 07:51:04,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 2: [2022-11-27 07:51:04,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 9: [2022-11-27 07:51:04,286] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
9: [2022-11-27 07:51:04,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 07:51:04,287] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 7: [2022-11-27 07:51:04,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:51:04,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:51:04,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 07:51:04,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 07:51:04,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 7: [2022-11-27 07:51:04,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 15: [2022-11-27 07:51:04,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 07:51:04,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 07:51:04,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 15: [2022-11-27 07:51:04,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
15: [2022-11-27 07:51:04,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:51:04,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 07:51:04,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 9: [2022-11-27 07:51:04,289] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:51:04,289] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 07:51:04,289] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 5: [2022-11-27 07:51:04,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:51:04,290] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 07:51:04,290] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 2: [2022-11-27 07:51:04,290] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:51:04,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 07:51:04,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
5: [2022-11-27 07:51:04,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:51:04,291] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 07:51:04,291] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 5: [2022-11-27 07:51:04,291] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:51:04,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 07:51:04,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 5: [2022-11-27 07:51:04,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:51:04,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 07:51:04,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 12: [2022-11-27 07:51:04,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:51:04,295] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-27 07:51:04,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 07:51:04,295] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 07:51:04,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 12: [2022-11-27 07:51:04,295] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 3: [2022-11-27 07:51:04,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:51:04,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 07:51:04,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 3: [2022-11-27 07:51:04,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:51:04,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 5: [2022-11-27 07:51:04,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 3: [2022-11-27 07:51:04,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
5: [2022-11-27 07:51:04,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 07:51:04,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 4: [2022-11-27 07:51:04,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:51:04,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 07:51:04,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 6: [2022-11-27 07:51:04,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:51:04,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 07:51:04,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 0: [2022-11-27 07:51:04,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:51:04,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 07:51:04,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 0: [2022-11-27 07:51:04,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 
0: [2022-11-27 07:51:04,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 07:51:04,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 6: [2022-11-27 07:51:04,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:51:04,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 7: [2022-11-27 07:51:04,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:51:04,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 7: [2022-11-27 07:51:04,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 07:51:04,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:51:04,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:51:04,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 7: [2022-11-27 07:51:04,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 07:51:04,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
7: [2022-11-27 07:51:04,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 0: [2022-11-27 07:51:04,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:51:04,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 7: [2022-11-27 07:51:04,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 0: [2022-11-27 07:51:04,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 0: [2022-11-27 07:51:04,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:51:04,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 07:51:04,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 7: [2022-11-27 07:51:04,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:51:04,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 07:51:04,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 7: [2022-11-27 07:51:04,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 
7: [2022-11-27 07:51:04,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 07:51:04,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 7: [2022-11-27 07:51:04,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 07:51:04,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 07:51:04,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 2: [2022-11-27 07:51:04,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:51:04,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 07:51:04,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 4: [2022-11-27 07:51:04,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:51:04,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:51:04,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:51:04,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
4: [2022-11-27 07:51:04,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 9: [2022-11-27 07:51:04,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:51:04,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 07:51:04,304] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 07:51:04,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 07:51:04,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 4: [2022-11-27 07:51:04,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 9: [2022-11-27 07:51:04,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 9: [2022-11-27 07:51:04,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 9: [2022-11-27 07:51:04,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
9: [2022-11-27 07:51:04,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 07:51:04,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 07:51:04,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 9: [2022-11-27 07:51:04,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 5: [2022-11-27 07:51:04,308] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:51:04,308] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 07:51:04,308] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 12: [2022-11-27 07:51:04,309] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 07:51:04,309] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 07:51:04,309] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 12: [2022-11-27 07:51:04,310] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
12: [2022-11-27 07:51:04,310] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 07:51:04,310] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 4: [2022-11-27 07:51:04,311] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:51:04,311] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 07:51:04,311] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 13: [2022-11-27 07:51:04,279] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:51:04,279] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 07:51:04,279] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 13: [2022-11-27 07:51:04,283] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:51:04,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 11: [2022-11-27 07:51:04,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 07:51:04,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
13: [2022-11-27 07:51:04,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 11: [2022-11-27 07:51:04,282] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:51:04,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:51:04,283] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 15: [2022-11-27 07:51:04,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:51:04,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 11: [2022-11-27 07:51:04,283] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 13: [2022-11-27 07:51:04,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 11: [2022-11-27 07:51:04,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:51:04,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 
11: [2022-11-27 07:51:04,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 13: [2022-11-27 07:51:04,285] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 11: [2022-11-27 07:51:04,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 13: [2022-11-27 07:51:04,285] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 11: [2022-11-27 07:51:04,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:51:04,288] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:51:04,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 13: [2022-11-27 07:51:04,288] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 11: [2022-11-27 07:51:04,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 13: [2022-11-27 07:51:04,288] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 11: [2022-11-27 07:51:04,292] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:51:04,301] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
11: [2022-11-27 07:51:04,292] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 13: [2022-11-27 07:51:04,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 11: [2022-11-27 07:51:04,292] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 13: [2022-11-27 07:51:04,301] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 13: [2022-11-27 07:51:04,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 07:51:04,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 07:51:04,303] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 5: [2022-11-27 07:51:04,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 07:51:04,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 07:51:04,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
1: [2022-11-27 07:51:04,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 07:51:04,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 07:51:04,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 1: [2022-11-27 07:51:04,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 1: [2022-11-27 07:51:04,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:51:04,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 07:51:04,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 11: [2022-11-27 07:51:04,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:51:04,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:51:04,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:51:04,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-27 07:51:04,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 07:51:04,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 07:51:04,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 14: [2022-11-27 07:51:04,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 07:51:04,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 14: [2022-11-27 07:51:04,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 14: [2022-11-27 07:51:04,318] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:51:04,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 07:51:04,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 14: [2022-11-27 07:51:04,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:51:04,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 07:51:04,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
14: [2022-11-27 07:51:04,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:51:04,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 07:51:04,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 1: [2022-11-27 07:51:04,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:51:04,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 07:51:04,319] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 1: [2022-11-27 07:51:04,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:51:04,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 07:51:04,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 1: [2022-11-27 07:51:04,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:51:04,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 07:51:04,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
1: [2022-11-27 07:51:04,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:51:04,320] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 07:51:04,320] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 07:51:04,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 1: [2022-11-27 07:51:04,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 07:51:04,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 15: [2022-11-27 07:51:04,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 07:51:04,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:51:04,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 15: [2022-11-27 07:51:04,313] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 07:51:04,313] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 15: [2022-11-27 07:51:04,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 
15: [2022-11-27 07:51:04,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 07:51:04,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 15: [2022-11-27 07:51:04,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:51:04,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 07:51:04,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 15: [2022-11-27 07:51:04,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 07:51:04,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 07:51:04,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 6: [2022-11-27 07:51:04,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 07:51:04,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 07:51:04,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 6: [2022-11-27 07:51:04,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
6: [2022-11-27 07:51:04,337] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 07:51:04,337] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 2: [2022-11-27 07:51:04,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:51:04,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 07:51:04,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 2: [2022-11-27 07:51:04,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:51:04,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 07:51:04,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 2: [2022-11-27 07:51:04,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 07:51:04,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 07:51:04,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 13: [2022-11-27 07:51:04,317] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-27 07:51:04,318] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 07:51:04,318] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 14: [2022-11-27 07:51:04,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:51:04,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 07:51:04,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 14: [2022-11-27 07:51:04,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 07:51:04,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 07:51:04,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 4: [2022-11-27 07:51:04,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 07:51:04,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 07:51:04,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 4: [2022-11-27 07:51:04,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
4: [2022-11-27 07:51:04,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 07:51:04,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 11: [2022-11-27 07:51:04,317] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 07:51:04,317] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 11: [2022-11-27 07:51:04,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:51:04,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 07:51:04,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 07:51:04,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 11: [2022-11-27 07:51:04,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 07:51:04,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 3: [2022-11-27 07:51:04,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-27 07:51:04,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 07:51:04,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 8: [2022-11-27 07:51:04,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:51:04,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:51:04,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:51:04,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:51:04,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 07:51:04,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 07:51:04,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 07:51:04,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 07:51:04,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 8: [2022-11-27 07:51:04,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
8: [2022-11-27 07:51:04,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 8: [2022-11-27 07:51:04,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 8: [2022-11-27 07:51:04,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:51:04,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 07:51:04,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 8: [2022-11-27 07:51:04,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:51:04,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 07:51:04,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 8: [2022-11-27 07:51:04,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 07:51:04,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 07:51:04,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 8: [2022-11-27 07:51:04,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 
8: [2022-11-27 07:51:04,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 07:51:04,413] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 0: [2022-11-27 07:51:04,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:51:04,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:51:04,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 07:51:04,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:51:04,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 0: [2022-11-27 07:51:04,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 07:51:04,448] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 07:51:04,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 0: [2022-11-27 07:51:04,448] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 07:51:04,448] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
0: [2022-11-27 07:51:04,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 07:51:04,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 10: [2022-11-27 07:51:04,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:51:04,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:51:04,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:51:04,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:51:04,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:51:04,530] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-27 07:51:04,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 07:51:04,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 07:51:04,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 07:51:04,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 07:51:04,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 07:51:04,530] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 07:51:04,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 10: [2022-11-27 07:51:04,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 10: [2022-11-27 07:51:04,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 10: [2022-11-27 07:51:04,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 10: [2022-11-27 07:51:04,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 10: [2022-11-27 07:51:04,530] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 
10: [2022-11-27 07:51:04,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:51:04,531] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 07:51:04,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 07:51:04,531] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step121000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 07:51:04,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 10: [2022-11-27 07:51:04,531] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step121000 is ready now! 0: successfully saved checkpoint at iteration 121000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3698.43 15: iteration 121010/ 125429 | consumed samples: 30978560 | consumed tokens: 63444090880 | elapsed time per iteration (s): 1.44 | learning rate: 2.056E-05 | global batch size: 256 | lm loss: 1.912032E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.691 | TFLOPs: 29.36 | 15: iteration 121020/ 125429 | consumed samples: 30981120 | consumed tokens: 63449333760 | elapsed time per iteration (s): 1.07 | learning rate: 2.056E-05 | global batch size: 256 | lm loss: 1.902077E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.221 | TFLOPs: 39.70 | 15: iteration 121030/ 125429 | consumed samples: 30983680 | consumed tokens: 63454576640 | elapsed time per iteration (s): 1.03 | learning rate: 2.056E-05 | global batch size: 256 | lm loss: 1.892077E+00 | grad norm: 
0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.850 | TFLOPs: 41.12 | 15: iteration 121040/ 125429 | consumed samples: 30986240 | consumed tokens: 63459819520 | elapsed time per iteration (s): 1.04 | learning rate: 2.055E-05 | global batch size: 256 | lm loss: 1.881999E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.588 | TFLOPs: 40.59 | 15: iteration 121050/ 125429 | consumed samples: 30988800 | consumed tokens: 63465062400 | elapsed time per iteration (s): 1.03 | learning rate: 2.055E-05 | global batch size: 256 | lm loss: 1.905637E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.154 | TFLOPs: 41.17 | 15: iteration 121060/ 125429 | consumed samples: 30991360 | consumed tokens: 63470305280 | elapsed time per iteration (s): 1.05 | learning rate: 2.055E-05 | global batch size: 256 | lm loss: 1.889014E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.453 | TFLOPs: 40.40 | 15: iteration 121070/ 125429 | consumed samples: 30993920 | consumed tokens: 63475548160 | elapsed time per iteration (s): 1.04 | learning rate: 2.055E-05 | global batch size: 256 | lm loss: 1.892854E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.239 | TFLOPs: 40.69 | 15: iteration 121080/ 125429 | consumed samples: 30996480 | consumed tokens: 63480791040 | elapsed time per iteration (s): 1.03 | learning rate: 2.054E-05 | global batch size: 256 | lm loss: 1.917891E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.972 | TFLOPs: 40.98 | 15: iteration 121090/ 125429 | consumed samples: 30999040 | consumed tokens: 63486033920 | elapsed time 
per iteration (s): 1.02 | learning rate: 2.054E-05 | global batch size: 256 | lm loss: 1.919032E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.601 | TFLOPs: 41.58 | 15: iteration 121100/ 125429 | consumed samples: 31001600 | consumed tokens: 63491276800 | elapsed time per iteration (s): 1.05 | learning rate: 2.054E-05 | global batch size: 256 | lm loss: 1.899559E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.629 | TFLOPs: 40.26 | 15: iteration 121110/ 125429 | consumed samples: 31004160 | consumed tokens: 63496519680 | elapsed time per iteration (s): 1.03 | learning rate: 2.054E-05 | global batch size: 256 | lm loss: 1.911806E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.329 | TFLOPs: 41.04 | 15: iteration 121120/ 125429 | consumed samples: 31006720 | consumed tokens: 63501762560 | elapsed time per iteration (s): 1.04 | learning rate: 2.053E-05 | global batch size: 256 | lm loss: 1.884234E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.660 | TFLOPs: 40.60 | 15: iteration 121130/ 125429 | consumed samples: 31009280 | consumed tokens: 63507005440 | elapsed time per iteration (s): 1.05 | learning rate: 2.053E-05 | global batch size: 256 | lm loss: 1.896937E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.110 | TFLOPs: 40.18 | 15: iteration 121140/ 125429 | consumed samples: 31011840 | consumed tokens: 63512248320 | elapsed time per iteration (s): 1.03 | learning rate: 2.053E-05 | global batch size: 256 | lm loss: 1.910905E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.504 | TFLOPs: 
41.07 | 15: iteration 121150/ 125429 | consumed samples: 31014400 | consumed tokens: 63517491200 | elapsed time per iteration (s): 1.05 | learning rate: 2.053E-05 | global batch size: 256 | lm loss: 1.903929E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.132 | TFLOPs: 40.18 | 15: iteration 121160/ 125429 | consumed samples: 31016960 | consumed tokens: 63522734080 | elapsed time per iteration (s): 1.04 | learning rate: 2.052E-05 | global batch size: 256 | lm loss: 1.869756E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.443 | TFLOPs: 40.73 | 15: iteration 121170/ 125429 | consumed samples: 31019520 | consumed tokens: 63527976960 | elapsed time per iteration (s): 1.04 | learning rate: 2.052E-05 | global batch size: 256 | lm loss: 1.861287E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.984 | TFLOPs: 40.65 | 15: iteration 121180/ 125429 | consumed samples: 31022080 | consumed tokens: 63533219840 | elapsed time per iteration (s): 1.03 | learning rate: 2.052E-05 | global batch size: 256 | lm loss: 1.882771E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.650 | TFLOPs: 40.93 | 15: iteration 121190/ 125429 | consumed samples: 31024640 | consumed tokens: 63538462720 | elapsed time per iteration (s): 1.04 | learning rate: 2.052E-05 | global batch size: 256 | lm loss: 1.887646E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.056 | TFLOPs: 40.66 | 15: iteration 121200/ 125429 | consumed samples: 31027200 | consumed tokens: 63543705600 | elapsed time per iteration (s): 1.06 | learning rate: 2.051E-05 | global batch size: 256 | lm loss: 1.878718E+00 | grad norm: 0.171 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.515 | TFLOPs: 39.91 | 15: iteration 121210/ 125429 | consumed samples: 31029760 | consumed tokens: 63548948480 | elapsed time per iteration (s): 1.02 | learning rate: 2.051E-05 | global batch size: 256 | lm loss: 1.900697E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.795 | TFLOPs: 41.45 | 15: iteration 121220/ 125429 | consumed samples: 31032320 | consumed tokens: 63554191360 | elapsed time per iteration (s): 1.02 | learning rate: 2.051E-05 | global batch size: 256 | lm loss: 1.891197E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.875 | TFLOPs: 41.46 | 15: iteration 121230/ 125429 | consumed samples: 31034880 | consumed tokens: 63559434240 | elapsed time per iteration (s): 1.02 | learning rate: 2.051E-05 | global batch size: 256 | lm loss: 1.895466E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.617 | TFLOPs: 41.58 | 15: iteration 121240/ 125429 | consumed samples: 31037440 | consumed tokens: 63564677120 | elapsed time per iteration (s): 1.02 | learning rate: 2.050E-05 | global batch size: 256 | lm loss: 1.867989E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.865 | TFLOPs: 41.29 | 15: iteration 121250/ 125429 | consumed samples: 31040000 | consumed tokens: 63569920000 | elapsed time per iteration (s): 1.06 | learning rate: 2.050E-05 | global batch size: 256 | lm loss: 1.890836E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.375 | TFLOPs: 40.05 | 15: iteration 121260/ 125429 | consumed samples: 31042560 | consumed tokens: 63575162880 | elapsed time per 
iteration (s): 1.04 | learning rate: 2.050E-05 | global batch size: 256 | lm loss: 1.870848E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.204 | TFLOPs: 40.52 | 15: iteration 121270/ 125429 | consumed samples: 31045120 | consumed tokens: 63580405760 | elapsed time per iteration (s): 1.03 | learning rate: 2.050E-05 | global batch size: 256 | lm loss: 1.892860E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.050 | TFLOPs: 40.99 | 15: iteration 121280/ 125429 | consumed samples: 31047680 | consumed tokens: 63585648640 | elapsed time per iteration (s): 1.02 | learning rate: 2.050E-05 | global batch size: 256 | lm loss: 1.910384E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.537 | TFLOPs: 41.40 | 15: iteration 121290/ 125429 | consumed samples: 31050240 | consumed tokens: 63590891520 | elapsed time per iteration (s): 1.05 | learning rate: 2.049E-05 | global batch size: 256 | lm loss: 1.892280E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.421 | TFLOPs: 40.23 | 15: iteration 121300/ 125429 | consumed samples: 31052800 | consumed tokens: 63596134400 | elapsed time per iteration (s): 1.04 | learning rate: 2.049E-05 | global batch size: 256 | lm loss: 1.878178E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.603 | TFLOPs: 40.59 | 15: iteration 121310/ 125429 | consumed samples: 31055360 | consumed tokens: 63601377280 | elapsed time per iteration (s): 1.03 | learning rate: 2.049E-05 | global batch size: 256 | lm loss: 1.897831E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.516 | TFLOPs: 
41.07 | 15: iteration 121320/ 125429 | consumed samples: 31057920 | consumed tokens: 63606620160 | elapsed time per iteration (s): 1.02 | learning rate: 2.049E-05 | global batch size: 256 | lm loss: 1.875872E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.503 | TFLOPs: 41.56 | 15: iteration 121330/ 125429 | consumed samples: 31060480 | consumed tokens: 63611863040 | elapsed time per iteration (s): 1.04 | learning rate: 2.048E-05 | global batch size: 256 | lm loss: 1.897552E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.119 | TFLOPs: 40.51 | 15: iteration 121340/ 125429 | consumed samples: 31063040 | consumed tokens: 63617105920 | elapsed time per iteration (s): 1.04 | learning rate: 2.048E-05 | global batch size: 256 | lm loss: 1.846194E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.273 | TFLOPs: 40.53 | 15: iteration 121350/ 125429 | consumed samples: 31065600 | consumed tokens: 63622348800 | elapsed time per iteration (s): 1.04 | learning rate: 2.048E-05 | global batch size: 256 | lm loss: 1.868262E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.604 | TFLOPs: 40.75 | 15: iteration 121360/ 125429 | consumed samples: 31068160 | consumed tokens: 63627591680 | elapsed time per iteration (s): 1.18 | learning rate: 2.048E-05 | global batch size: 256 | lm loss: 1.900486E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.289 | TFLOPs: 35.74 | 15: iteration 121370/ 125429 | consumed samples: 31070720 | consumed tokens: 63632834560 | elapsed time per iteration (s): 1.02 | learning rate: 2.047E-05 | global batch size: 256 | lm loss: 1.878102E+00 | grad norm: 0.167 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.945 | TFLOPs: 41.31 | 15: iteration 121380/ 125429 | consumed samples: 31073280 | consumed tokens: 63638077440 | elapsed time per iteration (s): 1.03 | learning rate: 2.047E-05 | global batch size: 256 | lm loss: 1.892072E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.387 | TFLOPs: 41.21 | 15: iteration 121390/ 125429 | consumed samples: 31075840 | consumed tokens: 63643320320 | elapsed time per iteration (s): 1.03 | learning rate: 2.047E-05 | global batch size: 256 | lm loss: 1.929713E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.697 | TFLOPs: 41.10 | 15: iteration 121400/ 125429 | consumed samples: 31078400 | consumed tokens: 63648563200 | elapsed time per iteration (s): 1.02 | learning rate: 2.047E-05 | global batch size: 256 | lm loss: 1.901615E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.922 | TFLOPs: 41.47 | 15: iteration 121410/ 125429 | consumed samples: 31080960 | consumed tokens: 63653806080 | elapsed time per iteration (s): 1.04 | learning rate: 2.046E-05 | global batch size: 256 | lm loss: 1.890369E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.027 | TFLOPs: 40.82 | 15: iteration 121420/ 125429 | consumed samples: 31083520 | consumed tokens: 63659048960 | elapsed time per iteration (s): 1.02 | learning rate: 2.046E-05 | global batch size: 256 | lm loss: 1.885876E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.551 | TFLOPs: 41.57 | 15: iteration 121430/ 125429 | consumed samples: 31086080 | consumed tokens: 63664291840 | elapsed time per 
iteration (s): 1.02 | learning rate: 2.046E-05 | global batch size: 256 | lm loss: 1.871805E+00 | grad norm: 0.189 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.989 | TFLOPs: 41.48 | 15: iteration 121440/ 125429 | consumed samples: 31088640 | consumed tokens: 63669534720 | elapsed time per iteration (s): 1.02 | learning rate: 2.046E-05 | global batch size: 256 | lm loss: 1.884354E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.320 | TFLOPs: 41.37 | 15: iteration 121450/ 125429 | consumed samples: 31091200 | consumed tokens: 63674777600 | elapsed time per iteration (s): 1.19 | learning rate: 2.046E-05 | global batch size: 256 | lm loss: 1.890547E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.548 | TFLOPs: 35.46 | 15: iteration 121460/ 125429 | consumed samples: 31093760 | consumed tokens: 63680020480 | elapsed time per iteration (s): 1.03 | learning rate: 2.045E-05 | global batch size: 256 | lm loss: 1.881785E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.876 | TFLOPs: 41.13 | 15: iteration 121470/ 125429 | consumed samples: 31096320 | consumed tokens: 63685263360 | elapsed time per iteration (s): 1.03 | learning rate: 2.045E-05 | global batch size: 256 | lm loss: 1.866183E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.514 | TFLOPs: 40.90 | 15: iteration 121480/ 125429 | consumed samples: 31098880 | consumed tokens: 63690506240 | elapsed time per iteration (s): 1.19 | learning rate: 2.045E-05 | global batch size: 256 | lm loss: 1.874054E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.639 | TFLOPs: 
35.64 | 15: iteration 121490/ 125429 | consumed samples: 31101440 | consumed tokens: 63695749120 | elapsed time per iteration (s): 1.05 | learning rate: 2.045E-05 | global batch size: 256 | lm loss: 1.909313E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.968 | TFLOPs: 40.32 | 15: iteration 121500/ 125429 | consumed samples: 31104000 | consumed tokens: 63700992000 | elapsed time per iteration (s): 1.05 | learning rate: 2.044E-05 | global batch size: 256 | lm loss: 1.896121E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.396 | TFLOPs: 40.39 | 15: iteration 121510/ 125429 | consumed samples: 31106560 | consumed tokens: 63706234880 | elapsed time per iteration (s): 1.03 | learning rate: 2.044E-05 | global batch size: 256 | lm loss: 1.877495E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.638 | TFLOPs: 41.25 | 15: iteration 121520/ 125429 | consumed samples: 31109120 | consumed tokens: 63711477760 | elapsed time per iteration (s): 1.03 | learning rate: 2.044E-05 | global batch size: 256 | lm loss: 1.869096E+00 | grad norm: 0.180 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.006 | TFLOPs: 41.15 | 15: iteration 121530/ 125429 | consumed samples: 31111680 | consumed tokens: 63716720640 | elapsed time per iteration (s): 1.07 | learning rate: 2.044E-05 | global batch size: 256 | lm loss: 1.895983E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.953 | TFLOPs: 39.65 | 15: iteration 121540/ 125429 | consumed samples: 31114240 | consumed tokens: 63721963520 | elapsed time per iteration (s): 1.07 | learning rate: 2.044E-05 | global batch size: 256 | lm loss: 1.908444E+00 | grad norm: 0.177 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.552 | TFLOPs: 39.59 | 15: iteration 121550/ 125429 | consumed samples: 31116800 | consumed tokens: 63727206400 | elapsed time per iteration (s): 1.06 | learning rate: 2.043E-05 | global batch size: 256 | lm loss: 1.919431E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.202 | TFLOPs: 39.86 | 15: iteration 121560/ 125429 | consumed samples: 31119360 | consumed tokens: 63732449280 | elapsed time per iteration (s): 1.05 | learning rate: 2.043E-05 | global batch size: 256 | lm loss: 1.891644E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.859 | TFLOPs: 40.13 | 15: iteration 121570/ 125429 | consumed samples: 31121920 | consumed tokens: 63737692160 | elapsed time per iteration (s): 1.03 | learning rate: 2.043E-05 | global batch size: 256 | lm loss: 1.901958E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.668 | TFLOPs: 41.26 | 15: iteration 121580/ 125429 | consumed samples: 31124480 | consumed tokens: 63742935040 | elapsed time per iteration (s): 1.04 | learning rate: 2.043E-05 | global batch size: 256 | lm loss: 1.905382E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.171 | TFLOPs: 40.52 | 15: iteration 121590/ 125429 | consumed samples: 31127040 | consumed tokens: 63748177920 | elapsed time per iteration (s): 1.03 | learning rate: 2.042E-05 | global batch size: 256 | lm loss: 1.906104E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.346 | TFLOPs: 41.21 | 15: iteration 121600/ 125429 | consumed samples: 31129600 | consumed tokens: 63753420800 | elapsed time per 
iteration (s): 1.03 | learning rate: 2.042E-05 | global batch size: 256 | lm loss: 1.899274E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.074 | TFLOPs: 41.00 | 15: iteration 121610/ 125429 | consumed samples: 31132160 | consumed tokens: 63758663680 | elapsed time per iteration (s): 1.05 | learning rate: 2.042E-05 | global batch size: 256 | lm loss: 1.887582E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.693 | TFLOPs: 40.44 | 15: iteration 121620/ 125429 | consumed samples: 31134720 | consumed tokens: 63763906560 | elapsed time per iteration (s): 1.03 | learning rate: 2.042E-05 | global batch size: 256 | lm loss: 1.900451E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.271 | TFLOPs: 41.19 | 15: iteration 121630/ 125429 | consumed samples: 31137280 | consumed tokens: 63769149440 | elapsed time per iteration (s): 1.03 | learning rate: 2.042E-05 | global batch size: 256 | lm loss: 1.883693E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.816 | TFLOPs: 41.12 | 15: iteration 121640/ 125429 | consumed samples: 31139840 | consumed tokens: 63774392320 | elapsed time per iteration (s): 1.03 | learning rate: 2.041E-05 | global batch size: 256 | lm loss: 1.903298E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.379 | TFLOPs: 41.21 | 15: iteration 121650/ 125429 | consumed samples: 31142400 | consumed tokens: 63779635200 | elapsed time per iteration (s): 1.08 | learning rate: 2.041E-05 | global batch size: 256 | lm loss: 1.884286E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.058 | TFLOPs: 
39.18 | 15: iteration 121660/ 125429 | consumed samples: 31144960 | consumed tokens: 63784878080 | elapsed time per iteration (s): 1.04 | learning rate: 2.041E-05 | global batch size: 256 | lm loss: 1.876242E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.683 | TFLOPs: 40.77 | 15: iteration 121670/ 125429 | consumed samples: 31147520 | consumed tokens: 63790120960 | elapsed time per iteration (s): 1.03 | learning rate: 2.041E-05 | global batch size: 256 | lm loss: 1.854713E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.298 | TFLOPs: 41.03 | 15: iteration 121680/ 125429 | consumed samples: 31150080 | consumed tokens: 63795363840 | elapsed time per iteration (s): 1.03 | learning rate: 2.040E-05 | global batch size: 256 | lm loss: 1.894429E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.599 | TFLOPs: 41.25 | 15: iteration 121690/ 125429 | consumed samples: 31152640 | consumed tokens: 63800606720 | elapsed time per iteration (s): 1.10 | learning rate: 2.040E-05 | global batch size: 256 | lm loss: 1.882040E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.226 | TFLOPs: 38.38 | 15: iteration 121700/ 125429 | consumed samples: 31155200 | consumed tokens: 63805849600 | elapsed time per iteration (s): 1.04 | learning rate: 2.040E-05 | global batch size: 256 | lm loss: 1.882663E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.237 | TFLOPs: 40.53 | 15: iteration 121710/ 125429 | consumed samples: 31157760 | consumed tokens: 63811092480 | elapsed time per iteration (s): 1.03 | learning rate: 2.040E-05 | global batch size: 256 | lm loss: 1.919763E+00 | grad norm: 0.167 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.413 | TFLOPs: 41.22 | 15: iteration 121720/ 125429 | consumed samples: 31160320 | consumed tokens: 63816335360 | elapsed time per iteration (s): 1.04 | learning rate: 2.040E-05 | global batch size: 256 | lm loss: 1.902167E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.632 | TFLOPs: 40.76 | 15: iteration 121730/ 125429 | consumed samples: 31162880 | consumed tokens: 63821578240 | elapsed time per iteration (s): 1.07 | learning rate: 2.039E-05 | global batch size: 256 | lm loss: 1.865656E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.568 | TFLOPs: 39.59 | 15: iteration 121740/ 125429 | consumed samples: 31165440 | consumed tokens: 63826821120 | elapsed time per iteration (s): 1.03 | learning rate: 2.039E-05 | global batch size: 256 | lm loss: 1.884808E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.818 | TFLOPs: 40.95 | 15: iteration 121750/ 125429 | consumed samples: 31168000 | consumed tokens: 63832064000 | elapsed time per iteration (s): 1.05 | learning rate: 2.039E-05 | global batch size: 256 | lm loss: 1.889702E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.336 | TFLOPs: 40.38 | 15: iteration 121760/ 125429 | consumed samples: 31170560 | consumed tokens: 63837306880 | elapsed time per iteration (s): 1.02 | learning rate: 2.039E-05 | global batch size: 256 | lm loss: 1.891054E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.838 | TFLOPs: 41.45 | 15: iteration 121770/ 125429 | consumed samples: 31173120 | consumed tokens: 63842549760 | elapsed time per 
iteration (s): 1.07 | learning rate: 2.039E-05 | global batch size: 256 | lm loss: 1.925630E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.028 | TFLOPs: 39.67 | 15: iteration 121780/ 125429 | consumed samples: 31175680 | consumed tokens: 63847792640 | elapsed time per iteration (s): 1.06 | learning rate: 2.038E-05 | global batch size: 256 | lm loss: 1.868797E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.392 | TFLOPs: 39.89 | 15: iteration 121790/ 125429 | consumed samples: 31178240 | consumed tokens: 63853035520 | elapsed time per iteration (s): 1.08 | learning rate: 2.038E-05 | global batch size: 256 | lm loss: 1.874329E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.308 | TFLOPs: 39.22 | 15: iteration 121800/ 125429 | consumed samples: 31180800 | consumed tokens: 63858278400 | elapsed time per iteration (s): 1.02 | learning rate: 2.038E-05 | global batch size: 256 | lm loss: 1.903901E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.565 | TFLOPs: 41.41 | 15: iteration 121810/ 125429 | consumed samples: 31183360 | consumed tokens: 63863521280 | elapsed time per iteration (s): 1.03 | learning rate: 2.038E-05 | global batch size: 256 | lm loss: 1.879243E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.671 | TFLOPs: 40.93 | 15: iteration 121820/ 125429 | consumed samples: 31185920 | consumed tokens: 63868764160 | elapsed time per iteration (s): 1.02 | learning rate: 2.037E-05 | global batch size: 256 | lm loss: 1.924165E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.758 | TFLOPs: 
41.27 | 15: iteration 121830/ 125429 | consumed samples: 31188480 | consumed tokens: 63874007040 | elapsed time per iteration (s): 1.07 | learning rate: 2.037E-05 | global batch size: 256 | lm loss: 1.897779E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.616 | TFLOPs: 39.43 | 15: iteration 121840/ 125429 | consumed samples: 31191040 | consumed tokens: 63879249920 | elapsed time per iteration (s): 1.09 | learning rate: 2.037E-05 | global batch size: 256 | lm loss: 1.894008E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.796 | TFLOPs: 38.97 | 15: iteration 121850/ 125429 | consumed samples: 31193600 | consumed tokens: 63884492800 | elapsed time per iteration (s): 1.07 | learning rate: 2.037E-05 | global batch size: 256 | lm loss: 1.890189E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.179 | TFLOPs: 39.69 | 15: iteration 121860/ 125429 | consumed samples: 31196160 | consumed tokens: 63889735680 | elapsed time per iteration (s): 1.05 | learning rate: 2.037E-05 | global batch size: 256 | lm loss: 1.895491E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.896 | TFLOPs: 40.31 | 15: iteration 121870/ 125429 | consumed samples: 31198720 | consumed tokens: 63894978560 | elapsed time per iteration (s): 1.03 | learning rate: 2.036E-05 | global batch size: 256 | lm loss: 1.897909E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.186 | TFLOPs: 41.01 | 15: iteration 121880/ 125429 | consumed samples: 31201280 | consumed tokens: 63900221440 | elapsed time per iteration (s): 1.05 | learning rate: 2.036E-05 | global batch size: 256 | lm loss: 1.909055E+00 | grad norm: 0.170 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.077 | TFLOPs: 40.17 | 15: iteration 121890/ 125429 | consumed samples: 31203840 | consumed tokens: 63905464320 | elapsed time per iteration (s): 1.04 | learning rate: 2.036E-05 | global batch size: 256 | lm loss: 1.895202E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.001 | TFLOPs: 40.49 | 15: iteration 121900/ 125429 | consumed samples: 31206400 | consumed tokens: 63910707200 | elapsed time per iteration (s): 1.03 | learning rate: 2.036E-05 | global batch size: 256 | lm loss: 1.863953E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.773 | TFLOPs: 40.95 | 15: iteration 121910/ 125429 | consumed samples: 31208960 | consumed tokens: 63915950080 | elapsed time per iteration (s): 1.04 | learning rate: 2.036E-05 | global batch size: 256 | lm loss: 1.869666E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.795 | TFLOPs: 40.78 | 15: iteration 121920/ 125429 | consumed samples: 31211520 | consumed tokens: 63921192960 | elapsed time per iteration (s): 1.06 | learning rate: 2.035E-05 | global batch size: 256 | lm loss: 1.872523E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.969 | TFLOPs: 39.99 | 15: iteration 121930/ 125429 | consumed samples: 31214080 | consumed tokens: 63926435840 | elapsed time per iteration (s): 1.04 | learning rate: 2.035E-05 | global batch size: 256 | lm loss: 1.916011E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.089 | TFLOPs: 40.50 | 15: iteration 121940/ 125429 | consumed samples: 31216640 | consumed tokens: 63931678720 | elapsed time per 
iteration (s): 1.03 | learning rate: 2.035E-05 | global batch size: 256 | lm loss: 1.888751E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.874 | TFLOPs: 40.96 | 15: iteration 121950/ 125429 | consumed samples: 31219200 | consumed tokens: 63936921600 | elapsed time per iteration (s): 1.04 | learning rate: 2.035E-05 | global batch size: 256 | lm loss: 1.922396E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.460 | TFLOPs: 40.73 | 15: iteration 121960/ 125429 | consumed samples: 31221760 | consumed tokens: 63942164480 | elapsed time per iteration (s): 1.03 | learning rate: 2.035E-05 | global batch size: 256 | lm loss: 1.859821E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.975 | TFLOPs: 40.98 | 15: iteration 121970/ 125429 | consumed samples: 31224320 | consumed tokens: 63947407360 | elapsed time per iteration (s): 1.04 | learning rate: 2.034E-05 | global batch size: 256 | lm loss: 1.881490E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.515 | TFLOPs: 40.57 | 15: iteration 121980/ 125429 | consumed samples: 31226880 | consumed tokens: 63952650240 | elapsed time per iteration (s): 1.02 | learning rate: 2.034E-05 | global batch size: 256 | lm loss: 1.888685E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.001 | TFLOPs: 41.31 | 15: iteration 121990/ 125429 | consumed samples: 31229440 | consumed tokens: 63957893120 | elapsed time per iteration (s): 1.04 | learning rate: 2.034E-05 | global batch size: 256 | lm loss: 1.886617E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.001 | TFLOPs: 
40.49 |
0: [2022-11-27 08:08:29,170] [INFO] [logging.py:68:log_dist] [Rank 0] step=122000, skipped=0, lr=[2.0338472151837915e-05, 2.0338472151837915e-05, 2.0338472151837915e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)]
15: iteration 122000/ 125429 | consumed samples: 31232000 | consumed tokens: 63963136000 | elapsed time per iteration (s): 1.03 | learning rate: 2.034E-05 | global batch size: 256 | lm loss: 1.847450E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.295 | TFLOPs: 41.03 |
0: steps: 122000 loss: 1.8662 iter time (s): 1.049 samples/sec: 243.990
15: --------------------------------------------------------------------------------------------
15: valid loss at iteration 122000 | lm loss value: 2.002956E+00 | lm loss PPL: 7.410930E+00 |
15: --------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 122000 to checkpoints_1b5
0: [2022-11-27 08:08:29,546] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step122000 is begin to save!
0: [2022-11-27 08:08:29,553] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_01-model_00-model_states.pt...
0: [2022-11-27 08:08:29,812] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_01-model_00-model_states.pt.
0: [2022-11-27 08:08:29,812] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_03-model_00-model_states.pt...
0: [2022-11-27 08:08:30,060] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_03-model_00-model_states.pt.
0: [2022-11-27 08:08:30,061] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_04-model_00-model_states.pt...
0: [2022-11-27 08:08:30,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_04-model_00-model_states.pt. 0: [2022-11-27 08:08:30,192] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_05-model_00-model_states.pt... 0: [2022-11-27 08:08:30,305] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_05-model_00-model_states.pt. 0: [2022-11-27 08:08:30,306] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_06-model_00-model_states.pt... 0: [2022-11-27 08:08:30,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_06-model_00-model_states.pt. 0: [2022-11-27 08:08:30,418] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_07-model_00-model_states.pt... 0: [2022-11-27 08:08:30,532] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_07-model_00-model_states.pt. 0: [2022-11-27 08:08:30,533] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_08-model_00-model_states.pt... 0: [2022-11-27 08:08:30,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_08-model_00-model_states.pt. 0: [2022-11-27 08:08:30,647] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_09-model_00-model_states.pt... 0: [2022-11-27 08:08:30,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_09-model_00-model_states.pt. 0: [2022-11-27 08:08:30,763] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_10-model_00-model_states.pt... 
0: [2022-11-27 08:08:30,864] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_10-model_00-model_states.pt. 0: [2022-11-27 08:08:30,865] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_11-model_00-model_states.pt... 0: [2022-11-27 08:08:30,975] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_11-model_00-model_states.pt. 0: [2022-11-27 08:08:30,976] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_12-model_00-model_states.pt... 0: [2022-11-27 08:08:31,080] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_12-model_00-model_states.pt. 0: [2022-11-27 08:08:31,080] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_13-model_00-model_states.pt... 0: [2022-11-27 08:08:31,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_13-model_00-model_states.pt. 0: [2022-11-27 08:08:31,189] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_14-model_00-model_states.pt... 0: [2022-11-27 08:08:31,293] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_14-model_00-model_states.pt. 0: [2022-11-27 08:08:31,293] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_15-model_00-model_states.pt... 0: [2022-11-27 08:08:31,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_15-model_00-model_states.pt. 0: [2022-11-27 08:08:31,402] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_16-model_00-model_states.pt... 
0: [2022-11-27 08:08:31,510] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_16-model_00-model_states.pt. 0: [2022-11-27 08:08:31,511] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_17-model_00-model_states.pt... 0: [2022-11-27 08:08:31,624] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_17-model_00-model_states.pt. 0: [2022-11-27 08:08:31,625] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_18-model_00-model_states.pt... 0: [2022-11-27 08:08:31,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_18-model_00-model_states.pt. 0: [2022-11-27 08:08:31,739] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_19-model_00-model_states.pt... 0: [2022-11-27 08:08:31,853] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_19-model_00-model_states.pt. 0: [2022-11-27 08:08:31,853] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_20-model_00-model_states.pt... 0: [2022-11-27 08:08:31,968] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_20-model_00-model_states.pt. 0: [2022-11-27 08:08:31,968] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_21-model_00-model_states.pt... 0: [2022-11-27 08:08:32,084] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_21-model_00-model_states.pt. 0: [2022-11-27 08:08:32,084] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_22-model_00-model_states.pt... 
0: [2022-11-27 08:08:32,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_22-model_00-model_states.pt. 0: [2022-11-27 08:08:32,197] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_23-model_00-model_states.pt... 0: [2022-11-27 08:08:32,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_23-model_00-model_states.pt. 0: [2022-11-27 08:08:32,312] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_24-model_00-model_states.pt... 0: [2022-11-27 08:08:32,425] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_24-model_00-model_states.pt. 0: [2022-11-27 08:08:32,425] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_25-model_00-model_states.pt... 0: [2022-11-27 08:08:32,536] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_25-model_00-model_states.pt. 0: [2022-11-27 08:08:32,536] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_26-model_00-model_states.pt... 0: [2022-11-27 08:08:32,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_26-model_00-model_states.pt. 0: [2022-11-27 08:08:32,647] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_27-model_00-model_states.pt... 0: [2022-11-27 08:08:32,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_27-model_00-model_states.pt. 0: [2022-11-27 08:08:32,761] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_28-model_00-model_states.pt... 
0: [2022-11-27 08:08:32,875] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_28-model_00-model_states.pt. 0: [2022-11-27 08:08:32,875] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_29-model_00-model_states.pt... 0: [2022-11-27 08:08:32,990] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_29-model_00-model_states.pt. 0: [2022-11-27 08:08:32,991] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_30-model_00-model_states.pt... 0: [2022-11-27 08:08:33,102] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_30-model_00-model_states.pt. 0: [2022-11-27 08:08:33,102] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/layer_32-model_00-model_states.pt... 0: [2022-11-27 08:08:33,107] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/layer_32-model_00-model_states.pt. 0: [2022-11-27 08:08:33,108] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step122000/mp_rank_00_model_states.pt 0: [2022-11-27 08:08:33,108] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/mp_rank_00_model_states.pt... 0: [2022-11-27 08:08:33,112] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/mp_rank_00_model_states.pt. 0: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 
0: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 
2: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 
8: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 
10: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 
14: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 
7: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 
13: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 
3: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 
1: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 
7: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 
5: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:08:33,151] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step122000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:08:33,313] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:08:33,314] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 08:08:33,314] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 0: [2022-11-27 08:08:33,316] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:08:33,316] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 08:08:33,316] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 10: [2022-11-27 08:08:33,319] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 
10: [2022-11-27 08:08:33,319] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 08:08:33,320] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 0: [2022-11-27 08:08:33,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:08:33,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:08:33,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 0: [2022-11-27 08:08:33,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 1: [2022-11-27 08:08:33,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 0: [2022-11-27 08:08:33,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 8: [2022-11-27 08:08:33,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:08:33,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:08:33,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 08:08:33,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 
8: [2022-11-27 08:08:33,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 8: [2022-11-27 08:08:33,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 08:08:33,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 8: [2022-11-27 08:08:33,323] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:08:33,323] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 08:08:33,323] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 13: [2022-11-27 08:08:33,324] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:08:33,324] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 08:08:33,324] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 5: [2022-11-27 08:08:33,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:08:33,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 08:08:33,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
12: [2022-11-27 08:08:33,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:08:33,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 08:08:33,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 2: [2022-11-27 08:08:33,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:08:33,329] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 08:08:33,329] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 3: [2022-11-27 08:08:33,329] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:08:33,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 08:08:33,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 3: [2022-11-27 08:08:33,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:08:33,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 08:08:33,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
6: [2022-11-27 08:08:33,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:08:33,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 08:08:33,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 11: [2022-11-27 08:08:33,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:08:33,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:08:33,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 08:08:33,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 12: [2022-11-27 08:08:33,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:08:33,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:08:33,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 08:08:33,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
12: [2022-11-27 08:08:33,332] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 08:08:33,332] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 3: [2022-11-27 08:08:33,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:08:33,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:08:33,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 08:08:33,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 08:08:33,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 3: [2022-11-27 08:08:33,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 2: [2022-11-27 08:08:33,336] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:08:33,336] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 08:08:33,336] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
11: [2022-11-27 08:08:33,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 08:08:33,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 5: [2022-11-27 08:08:33,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:08:33,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:08:33,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 08:08:33,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 08:08:33,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 5: [2022-11-27 08:08:33,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 0: [2022-11-27 08:08:33,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:08:33,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 08:08:33,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 14: [2022-11-27 08:08:33,327] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 
7: [2022-11-27 08:08:33,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:08:33,327] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 7: [2022-11-27 08:08:33,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 12: [2022-11-27 08:08:33,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:08:33,327] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 7: [2022-11-27 08:08:33,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 12: [2022-11-27 08:08:33,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 14: [2022-11-27 08:08:33,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:08:33,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 08:08:33,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 12: [2022-11-27 08:08:33,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 14: [2022-11-27 08:08:33,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-27 08:08:33,335] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:08:33,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 08:08:33,335] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 08:08:33,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 14: [2022-11-27 08:08:33,335] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 7: [2022-11-27 08:08:33,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:08:33,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 08:08:33,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 10: [2022-11-27 08:08:33,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:08:33,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 08:08:33,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 7: [2022-11-27 08:08:33,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 
7: [2022-11-27 08:08:33,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 08:08:33,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 7: [2022-11-27 08:08:33,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:08:33,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 08:08:33,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 7: [2022-11-27 08:08:33,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:08:33,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:08:33,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 08:08:33,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 08:08:33,342] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:08:33,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 7: [2022-11-27 08:08:33,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
7: [2022-11-27 08:08:33,342] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 08:08:33,342] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 0: [2022-11-27 08:08:33,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:08:33,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 08:08:33,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 11: [2022-11-27 08:08:33,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:08:33,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:08:33,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:08:33,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 08:08:33,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 6: [2022-11-27 08:08:33,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:08:33,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 
4: [2022-11-27 08:08:33,321] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:08:33,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 4: [2022-11-27 08:08:33,321] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 6: [2022-11-27 08:08:33,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 4: [2022-11-27 08:08:33,321] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 6: [2022-11-27 08:08:33,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:08:33,322] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:08:33,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 2: [2022-11-27 08:08:33,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 4: [2022-11-27 08:08:33,322] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 6: [2022-11-27 08:08:33,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 2: [2022-11-27 08:08:33,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
4: [2022-11-27 08:08:33,322] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 6: [2022-11-27 08:08:33,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:08:33,338] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:08:33,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:08:33,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 2: [2022-11-27 08:08:33,338] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 4: [2022-11-27 08:08:33,334] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 6: [2022-11-27 08:08:33,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 2: [2022-11-27 08:08:33,338] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 4: [2022-11-27 08:08:33,334] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 2: [2022-11-27 08:08:33,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:08:33,343] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 
2: [2022-11-27 08:08:33,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 4: [2022-11-27 08:08:33,343] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 2: [2022-11-27 08:08:33,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 4: [2022-11-27 08:08:33,343] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 0: [2022-11-27 08:08:33,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:08:33,346] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 08:08:33,346] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 1: [2022-11-27 08:08:33,346] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:08:33,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 08:08:33,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 10: [2022-11-27 08:08:33,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 
10: [2022-11-27 08:08:33,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 08:08:33,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 14: [2022-11-27 08:08:33,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:08:33,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 12: [2022-11-27 08:08:33,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:08:33,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 12: [2022-11-27 08:08:33,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 08:08:33,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 14: [2022-11-27 08:08:33,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:08:33,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 08:08:33,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 14: [2022-11-27 08:08:33,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-27 08:08:33,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 08:08:33,349] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 1: [2022-11-27 08:08:33,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:08:33,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:08:33,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 08:08:33,350] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 08:08:33,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 1: [2022-11-27 08:08:33,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 12: [2022-11-27 08:08:33,350] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:08:33,351] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 08:08:33,351] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 1: [2022-11-27 08:08:33,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 
3: [2022-11-27 08:08:33,352] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:08:33,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 3: [2022-11-27 08:08:33,352] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 1: [2022-11-27 08:08:33,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 3: [2022-11-27 08:08:33,352] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 6: [2022-11-27 08:08:33,353] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:08:33,353] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 08:08:33,353] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 12: [2022-11-27 08:08:33,354] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:08:33,354] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 08:08:33,354] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 3: [2022-11-27 08:08:33,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-27 08:08:33,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 08:08:33,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 3: [2022-11-27 08:08:33,355] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:08:33,355] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 08:08:33,355] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 6: [2022-11-27 08:08:33,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:08:33,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 08:08:33,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 5: [2022-11-27 08:08:33,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:08:33,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:08:33,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 08:08:33,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
5: [2022-11-27 08:08:33,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 08:08:33,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 5: [2022-11-27 08:08:33,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:08:33,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 08:08:33,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 5: [2022-11-27 08:08:33,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:08:33,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 08:08:33,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 13: [2022-11-27 08:08:33,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:08:33,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:08:33,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 08:08:33,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
8: [2022-11-27 08:08:33,341] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:08:33,341] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 08:08:33,341] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 8: [2022-11-27 08:08:33,344] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:08:33,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 08:08:33,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 0: [2022-11-27 08:08:33,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:08:33,358] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:08:33,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 4: [2022-11-27 08:08:33,359] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 0: [2022-11-27 08:08:33,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 4: [2022-11-27 08:08:33,359] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
5: [2022-11-27 08:08:33,361] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:08:33,361] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 08:08:33,361] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 2: [2022-11-27 08:08:33,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:08:33,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 08:08:33,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 11: [2022-11-27 08:08:33,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 08:08:33,344] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 08:08:33,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 11: [2022-11-27 08:08:33,344] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 11: [2022-11-27 08:08:33,347] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-27 08:08:33,347] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 08:08:33,347] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 11: [2022-11-27 08:08:33,348] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:08:33,348] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 08:08:33,348] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 1: [2022-11-27 08:08:33,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:08:33,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 08:08:33,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 13: [2022-11-27 08:08:33,330] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 8: [2022-11-27 08:08:33,369] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:08:33,330] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 13: [2022-11-27 08:08:33,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 
8: [2022-11-27 08:08:33,369] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 13: [2022-11-27 08:08:33,339] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 8: [2022-11-27 08:08:33,369] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 13: [2022-11-27 08:08:33,339] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 13: [2022-11-27 08:08:33,345] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:08:33,345] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 08:08:33,345] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 13: [2022-11-27 08:08:33,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:08:33,349] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:08:33,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 08:08:33,349] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 08:08:33,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
13: [2022-11-27 08:08:33,350] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 13: [2022-11-27 08:08:33,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:08:33,357] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 08:08:33,357] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 10: [2022-11-27 08:08:33,371] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:08:33,371] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 08:08:33,372] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 7: [2022-11-27 08:08:33,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:08:33,374] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 08:08:33,374] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 14: [2022-11-27 08:08:33,375] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-27 08:08:33,375] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 08:08:33,375] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 2: [2022-11-27 08:08:33,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:08:33,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 08:08:33,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 10: [2022-11-27 08:08:33,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:08:33,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:08:33,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:08:33,377] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 
10: [2022-11-27 08:08:33,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 08:08:33,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 08:08:33,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 08:08:33,377] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 08:08:33,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 10: [2022-11-27 08:08:33,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 10: [2022-11-27 08:08:33,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 10: [2022-11-27 08:08:33,377] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 3: [2022-11-27 08:08:33,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:08:33,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 08:08:33,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 8: [2022-11-27 08:08:33,379] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-27 08:08:33,379] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 08:08:33,379] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 9: [2022-11-27 08:08:33,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:08:33,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:08:33,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:08:33,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:08:33,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:08:33,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 
9: [2022-11-27 08:08:33,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 08:08:33,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 08:08:33,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 08:08:33,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 08:08:33,390] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:08:33,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 08:08:33,390] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 08:08:33,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 9: [2022-11-27 08:08:33,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 9: [2022-11-27 08:08:33,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 9: [2022-11-27 08:08:33,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 9: [2022-11-27 08:08:33,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
9: [2022-11-27 08:08:33,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 9: [2022-11-27 08:08:33,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 08:08:33,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 11: [2022-11-27 08:08:33,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:08:33,401] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:08:33,401] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 08:08:33,401] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 11: [2022-11-27 08:08:33,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 08:08:33,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:08:33,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 11: [2022-11-27 08:08:33,393] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 08:08:33,393] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
13: [2022-11-27 08:08:33,393] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:08:33,363] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:08:33,363] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 08:08:33,363] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 4: [2022-11-27 08:08:33,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:08:33,364] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 08:08:33,364] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 4: [2022-11-27 08:08:33,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:08:33,382] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 08:08:33,382] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 13: [2022-11-27 08:08:33,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 08:08:33,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
0: [2022-11-27 08:08:33,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:08:33,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 08:08:33,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 9: [2022-11-27 08:08:33,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:08:33,436] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 08:08:33,436] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 6: [2022-11-27 08:08:33,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:08:33,444] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 08:08:33,444] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 6: [2022-11-27 08:08:33,449] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:08:33,449] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 08:08:33,449] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
11: [2022-11-27 08:08:33,476] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:08:33,476] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 08:08:33,476] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 0: [2022-11-27 08:08:33,504] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 08:08:33,504] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 15: [2022-11-27 08:08:33,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:08:33,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:08:33,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:08:33,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:08:33,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:08:33,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:08:33,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 
15: [2022-11-27 08:08:33,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 08:08:33,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 08:08:33,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 08:08:33,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 08:08:33,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 08:08:33,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 08:08:33,625] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:08:33,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 08:08:33,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 15: [2022-11-27 08:08:33,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 15: [2022-11-27 08:08:33,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 15: [2022-11-27 08:08:33,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 
15: [2022-11-27 08:08:33,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 15: [2022-11-27 08:08:33,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 15: [2022-11-27 08:08:33,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 15: [2022-11-27 08:08:33,625] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step122000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 08:08:33,625] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step122000 is ready now! 0: successfully saved checkpoint at iteration 122000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 4115.74 15: iteration 122010/ 125429 | consumed samples: 31234560 | consumed tokens: 63968378880 | elapsed time per iteration (s): 1.50 | learning rate: 2.034E-05 | global batch size: 256 | lm loss: 1.904028E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 171.237 | TFLOPs: 28.30 | 15: iteration 122020/ 125429 | consumed samples: 31237120 | consumed tokens: 63973621760 | elapsed time per iteration (s): 1.05 | learning rate: 2.033E-05 | global batch size: 256 | lm loss: 1.864888E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.317 | TFLOPs: 40.21 | 15: iteration 122030/ 125429 | consumed samples: 31239680 | consumed tokens: 63978864640 | elapsed time per iteration (s): 1.03 | learning rate: 2.033E-05 | global batch size: 256 | lm loss: 1.881836E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.867 | TFLOPs: 40.96 | 15: iteration 122040/ 125429 | consumed samples: 31242240 | consumed tokens: 63984107520 | elapsed time per iteration (s): 1.04 | 
learning rate: 2.033E-05 | global batch size: 256 | lm loss: 1.867174E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.382 | TFLOPs: 40.55 | 15: iteration 122050/ 125429 | consumed samples: 31244800 | consumed tokens: 63989350400 | elapsed time per iteration (s): 1.04 | learning rate: 2.033E-05 | global batch size: 256 | lm loss: 1.915621E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.760 | TFLOPs: 40.61 | 15: iteration 122060/ 125429 | consumed samples: 31247360 | consumed tokens: 63994593280 | elapsed time per iteration (s): 1.04 | learning rate: 2.033E-05 | global batch size: 256 | lm loss: 1.905132E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.691 | TFLOPs: 40.77 | 15: iteration 122070/ 125429 | consumed samples: 31249920 | consumed tokens: 63999836160 | elapsed time per iteration (s): 1.04 | learning rate: 2.032E-05 | global batch size: 256 | lm loss: 1.876434E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.421 | TFLOPs: 40.72 | 15: iteration 122080/ 125429 | consumed samples: 31252480 | consumed tokens: 64005079040 | elapsed time per iteration (s): 1.06 | learning rate: 2.032E-05 | global batch size: 256 | lm loss: 1.874538E+00 | grad norm: 0.190 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.499 | TFLOPs: 40.07 | 15: iteration 122090/ 125429 | consumed samples: 31255040 | consumed tokens: 64010321920 | elapsed time per iteration (s): 1.07 | learning rate: 2.032E-05 | global batch size: 256 | lm loss: 1.886906E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.423 | TFLOPs: 39.57 | 15: iteration 
122100/ 125429 | consumed samples: 31257600 | consumed tokens: 64015564800 | elapsed time per iteration (s): 1.06 | learning rate: 2.032E-05 | global batch size: 256 | lm loss: 1.873503E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.723 | TFLOPs: 39.78 | 15: iteration 122110/ 125429 | consumed samples: 31260160 | consumed tokens: 64020807680 | elapsed time per iteration (s): 1.10 | learning rate: 2.032E-05 | global batch size: 256 | lm loss: 1.866141E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.034 | TFLOPs: 38.35 | 15: iteration 122120/ 125429 | consumed samples: 31262720 | consumed tokens: 64026050560 | elapsed time per iteration (s): 1.03 | learning rate: 2.032E-05 | global batch size: 256 | lm loss: 1.891442E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.351 | TFLOPs: 41.04 | 15: iteration 122130/ 125429 | consumed samples: 31265280 | consumed tokens: 64031293440 | elapsed time per iteration (s): 1.04 | learning rate: 2.031E-05 | global batch size: 256 | lm loss: 1.892393E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.463 | TFLOPs: 40.56 | 15: iteration 122140/ 125429 | consumed samples: 31267840 | consumed tokens: 64036536320 | elapsed time per iteration (s): 1.05 | learning rate: 2.031E-05 | global batch size: 256 | lm loss: 1.886597E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.389 | TFLOPs: 40.22 | 15: iteration 122150/ 125429 | consumed samples: 31270400 | consumed tokens: 64041779200 | elapsed time per iteration (s): 1.06 | learning rate: 2.031E-05 | global batch size: 256 | lm loss: 1.873514E+00 | grad norm: 0.165 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.586 | TFLOPs: 39.92 | 15: iteration 122160/ 125429 | consumed samples: 31272960 | consumed tokens: 64047022080 | elapsed time per iteration (s): 1.10 | learning rate: 2.031E-05 | global batch size: 256 | lm loss: 1.871848E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.496 | TFLOPs: 38.42 | 15: iteration 122170/ 125429 | consumed samples: 31275520 | consumed tokens: 64052264960 | elapsed time per iteration (s): 1.03 | learning rate: 2.031E-05 | global batch size: 256 | lm loss: 1.891724E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.165 | TFLOPs: 41.18 | 15: iteration 122180/ 125429 | consumed samples: 31278080 | consumed tokens: 64057507840 | elapsed time per iteration (s): 1.04 | learning rate: 2.030E-05 | global batch size: 256 | lm loss: 1.894030E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.910 | TFLOPs: 40.80 | 15: iteration 122190/ 125429 | consumed samples: 31280640 | consumed tokens: 64062750720 | elapsed time per iteration (s): 1.09 | learning rate: 2.030E-05 | global batch size: 256 | lm loss: 1.870881E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.557 | TFLOPs: 38.76 | 15: iteration 122200/ 125429 | consumed samples: 31283200 | consumed tokens: 64067993600 | elapsed time per iteration (s): 1.04 | learning rate: 2.030E-05 | global batch size: 256 | lm loss: 1.882369E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.325 | TFLOPs: 40.71 | 15: iteration 122210/ 125429 | consumed samples: 31285760 | consumed tokens: 64073236480 | elapsed time per iteration (s): 1.12 | learning 
rate: 2.030E-05 | global batch size: 256 | lm loss: 1.897016E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.769 | TFLOPs: 37.81 | 15: iteration 122220/ 125429 | consumed samples: 31288320 | consumed tokens: 64078479360 | elapsed time per iteration (s): 1.03 | learning rate: 2.030E-05 | global batch size: 256 | lm loss: 1.870754E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.047 | TFLOPs: 41.16 | 15: iteration 122230/ 125429 | consumed samples: 31290880 | consumed tokens: 64083722240 | elapsed time per iteration (s): 1.06 | learning rate: 2.029E-05 | global batch size: 256 | lm loss: 1.913686E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.764 | TFLOPs: 39.95 | 15: iteration 122240/ 125429 | consumed samples: 31293440 | consumed tokens: 64088965120 | elapsed time per iteration (s): 1.04 | learning rate: 2.029E-05 | global batch size: 256 | lm loss: 1.883258E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.401 | TFLOPs: 40.72 | 15: iteration 122250/ 125429 | consumed samples: 31296000 | consumed tokens: 64094208000 | elapsed time per iteration (s): 1.06 | learning rate: 2.029E-05 | global batch size: 256 | lm loss: 1.888388E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.507 | TFLOPs: 39.91 | 15: iteration 122260/ 125429 | consumed samples: 31298560 | consumed tokens: 64099450880 | elapsed time per iteration (s): 1.08 | learning rate: 2.029E-05 | global batch size: 256 | lm loss: 1.879764E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.286 | TFLOPs: 39.05 | 15: iteration 122270/ 
125429 | consumed samples: 31301120 | consumed tokens: 64104693760 | elapsed time per iteration (s): 1.13 | learning rate: 2.029E-05 | global batch size: 256 | lm loss: 1.904427E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.617 | TFLOPs: 37.45 | 15: iteration 122280/ 125429 | consumed samples: 31303680 | consumed tokens: 64109936640 | elapsed time per iteration (s): 1.06 | learning rate: 2.029E-05 | global batch size: 256 | lm loss: 1.874234E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.700 | TFLOPs: 39.94 | 15: iteration 122290/ 125429 | consumed samples: 31306240 | consumed tokens: 64115179520 | elapsed time per iteration (s): 1.03 | learning rate: 2.028E-05 | global batch size: 256 | lm loss: 1.883026E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.387 | TFLOPs: 41.05 | 15: iteration 122300/ 125429 | consumed samples: 31308800 | consumed tokens: 64120422400 | elapsed time per iteration (s): 1.05 | learning rate: 2.028E-05 | global batch size: 256 | lm loss: 1.896448E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.147 | TFLOPs: 40.35 | 15: iteration 122310/ 125429 | consumed samples: 31311360 | consumed tokens: 64125665280 | elapsed time per iteration (s): 1.04 | learning rate: 2.028E-05 | global batch size: 256 | lm loss: 1.901773E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.477 | TFLOPs: 40.73 | 15: iteration 122320/ 125429 | consumed samples: 31313920 | consumed tokens: 64130908160 | elapsed time per iteration (s): 1.03 | learning rate: 2.028E-05 | global batch size: 256 | lm loss: 1.889091E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 247.808 | TFLOPs: 40.95 | 15: iteration 122330/ 125429 | consumed samples: 31316480 | consumed tokens: 64136151040 | elapsed time per iteration (s): 1.07 | learning rate: 2.028E-05 | global batch size: 256 | lm loss: 1.875529E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.298 | TFLOPs: 39.71 | 15: iteration 122340/ 125429 | consumed samples: 31319040 | consumed tokens: 64141393920 | elapsed time per iteration (s): 1.04 | learning rate: 2.027E-05 | global batch size: 256 | lm loss: 1.862817E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.144 | TFLOPs: 40.68 | 15: iteration 122350/ 125429 | consumed samples: 31321600 | consumed tokens: 64146636800 | elapsed time per iteration (s): 1.10 | learning rate: 2.027E-05 | global batch size: 256 | lm loss: 1.888052E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.011 | TFLOPs: 38.34 | 15: iteration 122360/ 125429 | consumed samples: 31324160 | consumed tokens: 64151879680 | elapsed time per iteration (s): 1.04 | learning rate: 2.027E-05 | global batch size: 256 | lm loss: 1.909424E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.165 | TFLOPs: 40.68 | 15: iteration 122370/ 125429 | consumed samples: 31326720 | consumed tokens: 64157122560 | elapsed time per iteration (s): 1.03 | learning rate: 2.027E-05 | global batch size: 256 | lm loss: 1.920977E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.066 | TFLOPs: 41.16 | 15: iteration 122380/ 125429 | consumed samples: 31329280 | consumed tokens: 64162365440 | elapsed time per iteration (s): 1.02 | learning rate: 
2.027E-05 | global batch size: 256 | lm loss: 1.884051E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.953 | TFLOPs: 41.31 | 15: iteration 122390/ 125429 | consumed samples: 31331840 | consumed tokens: 64167608320 | elapsed time per iteration (s): 1.07 | learning rate: 2.027E-05 | global batch size: 256 | lm loss: 1.871233E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.130 | TFLOPs: 39.68 | 15: iteration 122400/ 125429 | consumed samples: 31334400 | consumed tokens: 64172851200 | elapsed time per iteration (s): 1.07 | learning rate: 2.026E-05 | global batch size: 256 | lm loss: 1.912297E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.965 | TFLOPs: 39.66 | 15: iteration 122410/ 125429 | consumed samples: 31336960 | consumed tokens: 64178094080 | elapsed time per iteration (s): 1.06 | learning rate: 2.026E-05 | global batch size: 256 | lm loss: 1.885381E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.541 | TFLOPs: 39.92 | 15: iteration 122420/ 125429 | consumed samples: 31339520 | consumed tokens: 64183336960 | elapsed time per iteration (s): 1.03 | learning rate: 2.026E-05 | global batch size: 256 | lm loss: 1.881560E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.019 | TFLOPs: 40.99 | 15: iteration 122430/ 125429 | consumed samples: 31342080 | consumed tokens: 64188579840 | elapsed time per iteration (s): 1.03 | learning rate: 2.026E-05 | global batch size: 256 | lm loss: 1.897075E+00 | grad norm: 0.185 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.971 | TFLOPs: 40.98 | 15: iteration 122440/ 125429 | 
consumed samples: 31344640 | consumed tokens: 64193822720 | elapsed time per iteration (s): 1.11 | learning rate: 2.026E-05 | global batch size: 256 | lm loss: 1.879752E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.032 | TFLOPs: 38.01 | 15: iteration 122450/ 125429 | consumed samples: 31347200 | consumed tokens: 64199065600 | elapsed time per iteration (s): 1.04 | learning rate: 2.026E-05 | global batch size: 256 | lm loss: 1.904127E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.693 | TFLOPs: 40.77 | 15: iteration 122460/ 125429 | consumed samples: 31349760 | consumed tokens: 64204308480 | elapsed time per iteration (s): 1.05 | learning rate: 2.025E-05 | global batch size: 256 | lm loss: 1.873365E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.138 | TFLOPs: 40.18 | 15: iteration 122470/ 125429 | consumed samples: 31352320 | consumed tokens: 64209551360 | elapsed time per iteration (s): 1.08 | learning rate: 2.025E-05 | global batch size: 256 | lm loss: 1.925482E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.266 | TFLOPs: 39.21 | 15: iteration 122480/ 125429 | consumed samples: 31354880 | consumed tokens: 64214794240 | elapsed time per iteration (s): 1.09 | learning rate: 2.025E-05 | global batch size: 256 | lm loss: 1.898785E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.764 | TFLOPs: 38.96 | 15: iteration 122490/ 125429 | consumed samples: 31357440 | consumed tokens: 64220037120 | elapsed time per iteration (s): 1.04 | learning rate: 2.025E-05 | global batch size: 256 | lm loss: 1.899059E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 246.018 | TFLOPs: 40.66 | 15: iteration 122500/ 125429 | consumed samples: 31360000 | consumed tokens: 64225280000 | elapsed time per iteration (s): 1.02 | learning rate: 2.025E-05 | global batch size: 256 | lm loss: 1.868620E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.301 | TFLOPs: 41.53 | 15: iteration 122510/ 125429 | consumed samples: 31362560 | consumed tokens: 64230522880 | elapsed time per iteration (s): 1.07 | learning rate: 2.025E-05 | global batch size: 256 | lm loss: 1.906155E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.282 | TFLOPs: 39.71 | 15: iteration 122520/ 125429 | consumed samples: 31365120 | consumed tokens: 64235765760 | elapsed time per iteration (s): 1.05 | learning rate: 2.024E-05 | global batch size: 256 | lm loss: 1.895776E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.727 | TFLOPs: 40.44 | 15: iteration 122530/ 125429 | consumed samples: 31367680 | consumed tokens: 64241008640 | elapsed time per iteration (s): 1.05 | learning rate: 2.024E-05 | global batch size: 256 | lm loss: 1.895212E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.332 | TFLOPs: 40.38 | 15: iteration 122540/ 125429 | consumed samples: 31370240 | consumed tokens: 64246251520 | elapsed time per iteration (s): 1.06 | learning rate: 2.024E-05 | global batch size: 256 | lm loss: 1.883245E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.362 | TFLOPs: 39.89 | 15: iteration 122550/ 125429 | consumed samples: 31372800 | consumed tokens: 64251494400 | elapsed time per iteration (s): 1.04 | learning rate: 
2.024E-05 | global batch size: 256 | lm loss: 1.894722E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.835 | TFLOPs: 40.63 | 15: iteration 122560/ 125429 | consumed samples: 31375360 | consumed tokens: 64256737280 | elapsed time per iteration (s): 1.03 | learning rate: 2.024E-05 | global batch size: 256 | lm loss: 1.883714E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.473 | TFLOPs: 40.90 | 15: iteration 122570/ 125429 | consumed samples: 31377920 | consumed tokens: 64261980160 | elapsed time per iteration (s): 1.05 | learning rate: 2.024E-05 | global batch size: 256 | lm loss: 1.903695E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.749 | TFLOPs: 40.45 | 15: iteration 122580/ 125429 | consumed samples: 31380480 | consumed tokens: 64267223040 | elapsed time per iteration (s): 1.09 | learning rate: 2.023E-05 | global batch size: 256 | lm loss: 1.874043E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.505 | TFLOPs: 38.92 | 15: iteration 122590/ 125429 | consumed samples: 31383040 | consumed tokens: 64272465920 | elapsed time per iteration (s): 1.04 | learning rate: 2.023E-05 | global batch size: 256 | lm loss: 1.888791E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.594 | TFLOPs: 40.59 | 15: iteration 122600/ 125429 | consumed samples: 31385600 | consumed tokens: 64277708800 | elapsed time per iteration (s): 1.03 | learning rate: 2.023E-05 | global batch size: 256 | lm loss: 1.904875E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.437 | TFLOPs: 41.22 | 15: iteration 122610/ 125429 | 
consumed samples: 31388160 | consumed tokens: 64282951680 | elapsed time per iteration (s): 1.04 | learning rate: 2.023E-05 | global batch size: 256 | lm loss: 1.878465E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.511 | TFLOPs: 40.57 | 15: iteration 122620/ 125429 | consumed samples: 31390720 | consumed tokens: 64288194560 | elapsed time per iteration (s): 1.02 | learning rate: 2.023E-05 | global batch size: 256 | lm loss: 1.887325E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.643 | TFLOPs: 41.59 | 15: iteration 122630/ 125429 | consumed samples: 31393280 | consumed tokens: 64293437440 | elapsed time per iteration (s): 1.08 | learning rate: 2.023E-05 | global batch size: 256 | lm loss: 1.850865E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.794 | TFLOPs: 39.13 | 15: iteration 122640/ 125429 | consumed samples: 31395840 | consumed tokens: 64298680320 | elapsed time per iteration (s): 1.08 | learning rate: 2.022E-05 | global batch size: 256 | lm loss: 1.868628E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.134 | TFLOPs: 39.35 | 15: iteration 122650/ 125429 | consumed samples: 31398400 | consumed tokens: 64303923200 | elapsed time per iteration (s): 1.05 | learning rate: 2.022E-05 | global batch size: 256 | lm loss: 1.884743E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.845 | TFLOPs: 40.46 | 15: iteration 122660/ 125429 | consumed samples: 31400960 | consumed tokens: 64309166080 | elapsed time per iteration (s): 1.08 | learning rate: 2.022E-05 | global batch size: 256 | lm loss: 1.874619E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 236.908 | TFLOPs: 39.15 | 15: iteration 122670/ 125429 | consumed samples: 31403520 | consumed tokens: 64314408960 | elapsed time per iteration (s): 1.09 | learning rate: 2.022E-05 | global batch size: 256 | lm loss: 1.864950E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.766 | TFLOPs: 38.80 | 15: iteration 122680/ 125429 | consumed samples: 31406080 | consumed tokens: 64319651840 | elapsed time per iteration (s): 1.04 | learning rate: 2.022E-05 | global batch size: 256 | lm loss: 1.912556E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.516 | TFLOPs: 40.57 | 15: iteration 122690/ 125429 | consumed samples: 31408640 | consumed tokens: 64324894720 | elapsed time per iteration (s): 1.08 | learning rate: 2.022E-05 | global batch size: 256 | lm loss: 1.866146E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.324 | TFLOPs: 39.22 | 15: iteration 122700/ 125429 | consumed samples: 31411200 | consumed tokens: 64330137600 | elapsed time per iteration (s): 1.04 | learning rate: 2.021E-05 | global batch size: 256 | lm loss: 1.905569E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.072 | TFLOPs: 40.50 | 15: iteration 122710/ 125429 | consumed samples: 31413760 | consumed tokens: 64335380480 | elapsed time per iteration (s): 1.04 | learning rate: 2.021E-05 | global batch size: 256 | lm loss: 1.893929E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.065 | TFLOPs: 40.66 | 15: iteration 122720/ 125429 | consumed samples: 31416320 | consumed tokens: 64340623360 | elapsed time per iteration (s): 1.07 | learning rate: 
2.021E-05 | global batch size: 256 | lm loss: 1.899857E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.219 | TFLOPs: 39.53 | 15: iteration 122730/ 125429 | consumed samples: 31418880 | consumed tokens: 64345866240 | elapsed time per iteration (s): 1.03 | learning rate: 2.021E-05 | global batch size: 256 | lm loss: 1.863877E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.558 | TFLOPs: 41.08 | 15: iteration 122740/ 125429 | consumed samples: 31421440 | consumed tokens: 64351109120 | elapsed time per iteration (s): 1.06 | learning rate: 2.021E-05 | global batch size: 256 | lm loss: 1.888112E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.815 | TFLOPs: 39.96 | 15: iteration 122750/ 125429 | consumed samples: 31424000 | consumed tokens: 64356352000 | elapsed time per iteration (s): 1.06 | learning rate: 2.021E-05 | global batch size: 256 | lm loss: 1.850343E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.171 | TFLOPs: 40.02 | 15: iteration 122760/ 125429 | consumed samples: 31426560 | consumed tokens: 64361594880 | elapsed time per iteration (s): 1.06 | learning rate: 2.021E-05 | global batch size: 256 | lm loss: 1.914132E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.870 | TFLOPs: 39.97 | 15: iteration 122770/ 125429 | consumed samples: 31429120 | consumed tokens: 64366837760 | elapsed time per iteration (s): 1.09 | learning rate: 2.020E-05 | global batch size: 256 | lm loss: 1.896360E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.551 | TFLOPs: 38.76 | 15: iteration 122780/ 125429 | 
consumed samples: 31431680 | consumed tokens: 64372080640 | elapsed time per iteration (s): 1.04 | learning rate: 2.020E-05 | global batch size: 256 | lm loss: 1.899266E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.252 | TFLOPs: 40.70 | 15: iteration 122790/ 125429 | consumed samples: 31434240 | consumed tokens: 64377323520 | elapsed time per iteration (s): 1.08 | learning rate: 2.020E-05 | global batch size: 256 | lm loss: 1.901933E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.115 | TFLOPs: 39.35 | 15: iteration 122800/ 125429 | consumed samples: 31436800 | consumed tokens: 64382566400 | elapsed time per iteration (s): 1.06 | learning rate: 2.020E-05 | global batch size: 256 | lm loss: 1.876515E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.203 | TFLOPs: 39.86 | 15: iteration 122810/ 125429 | consumed samples: 31439360 | consumed tokens: 64387809280 | elapsed time per iteration (s): 1.06 | learning rate: 2.020E-05 | global batch size: 256 | lm loss: 1.914794E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.159 | TFLOPs: 39.85 | 15: iteration 122820/ 125429 | consumed samples: 31441920 | consumed tokens: 64393052160 | elapsed time per iteration (s): 1.13 | learning rate: 2.020E-05 | global batch size: 256 | lm loss: 1.896661E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 227.148 | TFLOPs: 37.54 | 15: iteration 122830/ 125429 | consumed samples: 31444480 | consumed tokens: 64398295040 | elapsed time per iteration (s): 1.05 | learning rate: 2.019E-05 | global batch size: 256 | lm loss: 1.900916E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 243.968 | TFLOPs: 40.32 | 15: iteration 122840/ 125429 | consumed samples: 31447040 | consumed tokens: 64403537920 | elapsed time per iteration (s): 1.09 | learning rate: 2.019E-05 | global batch size: 256 | lm loss: 1.877527E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.055 | TFLOPs: 38.84 | 15: iteration 122850/ 125429 | consumed samples: 31449600 | consumed tokens: 64408780800 | elapsed time per iteration (s): 1.05 | learning rate: 2.019E-05 | global batch size: 256 | lm loss: 1.885192E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.082 | TFLOPs: 40.34 | 15: iteration 122860/ 125429 | consumed samples: 31452160 | consumed tokens: 64414023680 | elapsed time per iteration (s): 1.05 | learning rate: 2.019E-05 | global batch size: 256 | lm loss: 1.901599E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.744 | TFLOPs: 40.45 | 15: iteration 122870/ 125429 | consumed samples: 31454720 | consumed tokens: 64419266560 | elapsed time per iteration (s): 1.04 | learning rate: 2.019E-05 | global batch size: 256 | lm loss: 1.910914E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.192 | TFLOPs: 40.69 | 15: iteration 122880/ 125429 | consumed samples: 31457280 | consumed tokens: 64424509440 | elapsed time per iteration (s): 1.06 | learning rate: 2.019E-05 | global batch size: 256 | lm loss: 1.877896E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.732 | TFLOPs: 39.95 | 15: iteration 122890/ 125429 | consumed samples: 31459840 | consumed tokens: 64429752320 | elapsed time per iteration (s): 1.10 | learning rate: 
2.019E-05 | global batch size: 256 | lm loss: 1.916042E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.968 | TFLOPs: 38.50 | 15: iteration 122900/ 125429 | consumed samples: 31462400 | consumed tokens: 64434995200 | elapsed time per iteration (s): 1.04 | learning rate: 2.018E-05 | global batch size: 256 | lm loss: 1.883260E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.088 | TFLOPs: 40.83 | 15: iteration 122910/ 125429 | consumed samples: 31464960 | consumed tokens: 64440238080 | elapsed time per iteration (s): 1.06 | learning rate: 2.018E-05 | global batch size: 256 | lm loss: 1.897921E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.882 | TFLOPs: 39.81 | 15: iteration 122920/ 125429 | consumed samples: 31467520 | consumed tokens: 64445480960 | elapsed time per iteration (s): 1.08 | learning rate: 2.018E-05 | global batch size: 256 | lm loss: 1.916058E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.465 | TFLOPs: 39.24 | 15: iteration 122930/ 125429 | consumed samples: 31470080 | consumed tokens: 64450723840 | elapsed time per iteration (s): 1.03 | learning rate: 2.018E-05 | global batch size: 256 | lm loss: 1.910765E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.458 | TFLOPs: 41.22 | 15: iteration 122940/ 125429 | consumed samples: 31472640 | consumed tokens: 64455966720 | elapsed time per iteration (s): 1.05 | learning rate: 2.018E-05 | global batch size: 256 | lm loss: 1.886548E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.449 | TFLOPs: 40.40 | 15: iteration 122950/ 125429 | 
consumed samples: 31475200 | consumed tokens: 64461209600 | elapsed time per iteration (s): 5.11 | learning rate: 2.018E-05 | global batch size: 256 | lm loss: 1.891251E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 50.062 | TFLOPs: 8.27 | 15: iteration 122960/ 125429 | consumed samples: 31477760 | consumed tokens: 64466452480 | elapsed time per iteration (s): 1.07 | learning rate: 2.018E-05 | global batch size: 256 | lm loss: 1.910456E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.532 | TFLOPs: 39.58 | 15: iteration 122970/ 125429 | consumed samples: 31480320 | consumed tokens: 64471695360 | elapsed time per iteration (s): 1.18 | learning rate: 2.017E-05 | global batch size: 256 | lm loss: 1.875677E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 216.346 | TFLOPs: 35.75 | 15: iteration 122980/ 125429 | consumed samples: 31482880 | consumed tokens: 64476938240 | elapsed time per iteration (s): 1.06 | learning rate: 2.017E-05 | global batch size: 256 | lm loss: 1.888200E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.584 | TFLOPs: 40.09 | 15: iteration 122990/ 125429 | consumed samples: 31485440 | consumed tokens: 64482181120 | elapsed time per iteration (s): 7.04 | learning rate: 2.017E-05 | global batch size: 256 | lm loss: 1.885808E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 36.380 | TFLOPs: 6.01 | 15: iteration 123000/ 125429 | consumed samples: 31488000 | consumed tokens: 64487424000 | elapsed time per iteration (s): 1.03 | learning rate: 2.017E-05 | global batch size: 256 | lm loss: 1.880633E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 
| number of nan iterations: 0 | samples per second: 248.841 | TFLOPs: 41.12 | 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 123000 | lm loss value: 1.835768E+00 | lm loss PPL: 6.269951E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 123000 to checkpoints_1b5 0: [2022-11-27 08:27:52,160] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step123000 is begin to save! 0: [2022-11-27 08:27:52,169] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_01-model_00-model_states.pt... 0: [2022-11-27 08:27:52,436] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_01-model_00-model_states.pt. 0: [2022-11-27 08:27:52,436] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_03-model_00-model_states.pt... 0: [2022-11-27 08:27:52,546] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_03-model_00-model_states.pt. 0: [2022-11-27 08:27:52,547] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_04-model_00-model_states.pt... 0: [2022-11-27 08:27:52,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_04-model_00-model_states.pt. 0: [2022-11-27 08:27:52,672] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_05-model_00-model_states.pt... 0: [2022-11-27 08:27:52,782] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_05-model_00-model_states.pt. 0: [2022-11-27 08:27:52,783] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_06-model_00-model_states.pt... 
0: [2022-11-27 08:27:52,899] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_06-model_00-model_states.pt. 0: [2022-11-27 08:27:52,900] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_07-model_00-model_states.pt... 0: [2022-11-27 08:27:53,017] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_07-model_00-model_states.pt. 0: [2022-11-27 08:27:53,017] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_08-model_00-model_states.pt... 0: [2022-11-27 08:27:53,132] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_08-model_00-model_states.pt. 0: [2022-11-27 08:27:53,132] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_09-model_00-model_states.pt... 0: [2022-11-27 08:27:53,247] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_09-model_00-model_states.pt. 0: [2022-11-27 08:27:53,247] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_10-model_00-model_states.pt... 0: [2022-11-27 08:27:53,364] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_10-model_00-model_states.pt. 0: [2022-11-27 08:27:53,365] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_11-model_00-model_states.pt... 0: [2022-11-27 08:27:53,481] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_11-model_00-model_states.pt. 0: [2022-11-27 08:27:53,481] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_12-model_00-model_states.pt... 
0: [2022-11-27 08:27:53,597] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_12-model_00-model_states.pt. 0: [2022-11-27 08:27:53,598] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_13-model_00-model_states.pt... 0: [2022-11-27 08:27:53,711] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_13-model_00-model_states.pt. 0: [2022-11-27 08:27:53,712] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_14-model_00-model_states.pt... 0: [2022-11-27 08:27:53,826] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_14-model_00-model_states.pt. 0: [2022-11-27 08:27:53,827] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_15-model_00-model_states.pt... 0: [2022-11-27 08:27:53,934] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_15-model_00-model_states.pt. 0: [2022-11-27 08:27:53,935] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_16-model_00-model_states.pt... 0: [2022-11-27 08:27:54,045] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_16-model_00-model_states.pt. 0: [2022-11-27 08:27:54,045] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_17-model_00-model_states.pt... 0: [2022-11-27 08:27:54,156] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_17-model_00-model_states.pt. 0: [2022-11-27 08:27:54,157] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_18-model_00-model_states.pt... 
0: [2022-11-27 08:27:54,269] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_18-model_00-model_states.pt. 0: [2022-11-27 08:27:54,270] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_19-model_00-model_states.pt... 0: [2022-11-27 08:27:54,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_19-model_00-model_states.pt. 0: [2022-11-27 08:27:54,383] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_20-model_00-model_states.pt... 0: [2022-11-27 08:27:54,613] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_20-model_00-model_states.pt. 0: [2022-11-27 08:27:54,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_21-model_00-model_states.pt... 0: [2022-11-27 08:27:54,822] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_21-model_00-model_states.pt. 0: [2022-11-27 08:27:54,823] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_22-model_00-model_states.pt... 0: [2022-11-27 08:27:55,026] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_22-model_00-model_states.pt. 0: [2022-11-27 08:27:55,026] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_23-model_00-model_states.pt... 0: [2022-11-27 08:27:55,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_23-model_00-model_states.pt. 0: [2022-11-27 08:27:55,187] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_24-model_00-model_states.pt... 
0: [2022-11-27 08:27:55,302] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_24-model_00-model_states.pt. 0: [2022-11-27 08:27:55,302] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_25-model_00-model_states.pt... 0: [2022-11-27 08:27:55,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_25-model_00-model_states.pt. 0: [2022-11-27 08:27:55,413] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_26-model_00-model_states.pt... 0: [2022-11-27 08:27:55,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_26-model_00-model_states.pt. 0: [2022-11-27 08:27:55,521] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_27-model_00-model_states.pt... 0: [2022-11-27 08:27:55,631] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_27-model_00-model_states.pt. 0: [2022-11-27 08:27:55,632] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_28-model_00-model_states.pt... 0: [2022-11-27 08:27:55,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_28-model_00-model_states.pt. 0: [2022-11-27 08:27:55,739] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_29-model_00-model_states.pt... 0: [2022-11-27 08:27:55,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_29-model_00-model_states.pt. 0: [2022-11-27 08:27:55,847] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_30-model_00-model_states.pt... 
0: [2022-11-27 08:27:55,957] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_30-model_00-model_states.pt. 0: [2022-11-27 08:27:55,957] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/layer_32-model_00-model_states.pt... 0: [2022-11-27 08:27:55,963] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/layer_32-model_00-model_states.pt. 0: [2022-11-27 08:27:55,964] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step123000/mp_rank_00_model_states.pt 0: [2022-11-27 08:27:55,964] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/mp_rank_00_model_states.pt... 0: [2022-11-27 08:27:55,967] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/mp_rank_00_model_states.pt. 0: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:27:56,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-27 08:27:56,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:27:56,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:27:56,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:27:56,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:27:56,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 
14: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 
7: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 
12: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 
8: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:27:56,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:27:56,009] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 
15: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 
14: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
6: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 
3: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 
11: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 
9: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
2: [2022-11-27 08:27:56,010] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step123000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:27:56,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:27:56,167] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 08:27:56,167] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 0: [2022-11-27 08:27:56,168] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:27:56,174] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:27:56,175] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 08:27:56,175] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 15: [2022-11-27 08:27:56,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:27:56,176] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:27:56,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 08:27:56,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
0: [2022-11-27 08:27:56,177] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:27:56,177] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 08:27:56,177] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 2: [2022-11-27 08:27:56,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:27:56,178] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 08:27:56,178] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 0: [2022-11-27 08:27:56,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:27:56,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 08:27:56,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 0: [2022-11-27 08:27:56,179] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:27:56,179] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 08:27:56,179] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
2: [2022-11-27 08:27:56,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:27:56,181] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 08:27:56,181] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 2: [2022-11-27 08:27:56,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:27:56,182] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 08:27:56,182] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 10: [2022-11-27 08:27:56,183] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:27:56,183] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 08:27:56,183] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 3: [2022-11-27 08:27:56,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:27:56,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 08:27:56,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
12: [2022-11-27 08:27:56,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:27:56,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 08:27:56,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 12: [2022-11-27 08:27:56,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:27:56,184] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 08:27:56,184] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 12: [2022-11-27 08:27:56,185] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:27:56,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 08:27:56,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 0: [2022-11-27 08:27:56,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:27:56,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 08:27:56,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
2: [2022-11-27 08:27:56,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:27:56,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 08:27:56,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 1: [2022-11-27 08:27:56,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:27:56,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:27:56,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 08:27:56,188] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 1: [2022-11-27 08:27:56,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 08:27:56,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 2: [2022-11-27 08:27:56,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:27:56,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 08:27:56,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
1: [2022-11-27 08:27:56,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:27:56,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:27:56,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 08:27:56,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 7: [2022-11-27 08:27:56,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:27:56,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:27:56,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 08:27:56,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 08:27:56,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 7: [2022-11-27 08:27:56,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 12: [2022-11-27 08:27:56,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-27 08:27:56,191] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 08:27:56,191] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 7: [2022-11-27 08:27:56,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:27:56,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 08:27:56,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 0: [2022-11-27 08:27:56,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:27:56,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 08:27:56,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 7: [2022-11-27 08:27:56,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:27:56,184] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 
7: [2022-11-27 08:27:56,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 8: [2022-11-27 08:27:56,185] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 08:27:56,185] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 7: [2022-11-27 08:27:56,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 8: [2022-11-27 08:27:56,186] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:27:56,186] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 08:27:56,186] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 7: [2022-11-27 08:27:56,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:27:56,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 08:27:56,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 9: [2022-11-27 08:27:56,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:27:56,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 
7: [2022-11-27 08:27:56,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:27:56,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 08:27:56,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 08:27:56,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 7: [2022-11-27 08:27:56,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 9: [2022-11-27 08:27:56,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 7: [2022-11-27 08:27:56,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 8: [2022-11-27 08:27:56,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:27:56,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:27:56,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 9: [2022-11-27 08:27:56,197] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 08:27:56,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
9: [2022-11-27 08:27:56,197] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:27:56,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 08:27:56,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 6: [2022-11-27 08:27:56,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:27:56,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 08:27:56,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:27:56,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 6: [2022-11-27 08:27:56,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 08:27:56,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 6: [2022-11-27 08:27:56,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:27:56,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 08:27:56,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
6: [2022-11-27 08:27:56,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:27:56,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 08:27:56,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 6: [2022-11-27 08:27:56,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:27:56,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 08:27:56,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 6: [2022-11-27 08:27:56,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:27:56,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 08:27:56,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 9: [2022-11-27 08:27:56,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:27:56,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 08:27:56,200] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
12: [2022-11-27 08:27:56,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:27:56,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 08:27:56,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 2: [2022-11-27 08:27:56,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:27:56,201] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 08:27:56,201] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 12: [2022-11-27 08:27:56,201] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:27:56,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 08:27:56,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 3: [2022-11-27 08:27:56,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:27:56,204] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 08:27:56,204] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
3: [2022-11-27 08:27:56,204] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:27:56,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 08:27:56,205] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 10: [2022-11-27 08:27:56,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:27:56,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:27:56,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 08:27:56,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 10: [2022-11-27 08:27:56,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 08:27:56,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 10: [2022-11-27 08:27:56,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:27:56,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 08:27:56,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
3: [2022-11-27 08:27:56,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:27:56,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:27:56,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 08:27:56,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 08:27:56,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:27:56,208] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:27:56,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 3: [2022-11-27 08:27:56,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 3: [2022-11-27 08:27:56,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 08:27:56,208] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 08:27:56,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 3: [2022-11-27 08:27:56,208] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
10: [2022-11-27 08:27:56,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:27:56,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 08:27:56,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 10: [2022-11-27 08:27:56,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:27:56,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:27:56,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 08:27:56,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 08:27:56,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 10: [2022-11-27 08:27:56,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 12: [2022-11-27 08:27:56,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:27:56,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 0: [2022-11-27 08:27:56,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 
12: [2022-11-27 08:27:56,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 0: [2022-11-27 08:27:56,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 08:27:56,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 9: [2022-11-27 08:27:56,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:27:56,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 08:27:56,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 12: [2022-11-27 08:27:56,210] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:27:56,210] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 08:27:56,210] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 3: [2022-11-27 08:27:56,211] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:27:56,211] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 08:27:56,211] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
8: [2022-11-27 08:27:56,197] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 8: [2022-11-27 08:27:56,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:27:56,198] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:27:56,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 08:27:56,198] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 08:27:56,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 8: [2022-11-27 08:27:56,198] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 9: [2022-11-27 08:27:56,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:27:56,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 08:27:56,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 14: [2022-11-27 08:27:56,188] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 
14: [2022-11-27 08:27:56,188] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 08:27:56,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 14: [2022-11-27 08:27:56,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:27:56,189] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:27:56,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 08:27:56,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 08:27:56,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 14: [2022-11-27 08:27:56,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 14: [2022-11-27 08:27:56,190] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:27:56,190] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 08:27:56,190] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 14: [2022-11-27 08:27:56,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 
14: [2022-11-27 08:27:56,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 08:27:56,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 14: [2022-11-27 08:27:56,203] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:27:56,203] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 08:27:56,203] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 14: [2022-11-27 08:27:56,205] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:27:56,205] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 08:27:56,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 14: [2022-11-27 08:27:56,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:27:56,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 08:27:56,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
1: [2022-11-27 08:27:56,189] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 08:27:56,189] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 1: [2022-11-27 08:27:56,202] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:27:56,202] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 08:27:56,202] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 1: [2022-11-27 08:27:56,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:27:56,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 08:27:56,206] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 1: [2022-11-27 08:27:56,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:27:56,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 08:27:56,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 1: [2022-11-27 08:27:56,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 
2: [2022-11-27 08:27:56,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:27:56,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 08:27:56,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 8: [2022-11-27 08:27:56,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:27:56,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 08:27:56,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 8: [2022-11-27 08:27:56,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:27:56,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 08:27:56,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 8: [2022-11-27 08:27:56,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:27:56,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 08:27:56,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
6: [2022-11-27 08:27:56,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:27:56,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 08:27:56,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 6: [2022-11-27 08:27:56,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:27:56,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 08:27:56,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 1: [2022-11-27 08:27:56,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 08:27:56,225] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 1: [2022-11-27 08:27:56,225] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:27:56,225] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 08:27:56,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 1: [2022-11-27 08:27:56,226] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 
1: [2022-11-27 08:27:56,226] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 08:27:56,226] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 15: [2022-11-27 08:27:56,176] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 08:27:56,176] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 15: [2022-11-27 08:27:56,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:27:56,187] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 08:27:56,187] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 15: [2022-11-27 08:27:56,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:27:56,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:27:56,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-27 08:27:56,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 08:27:56,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 08:27:56,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 08:27:56,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 15: [2022-11-27 08:27:56,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 15: [2022-11-27 08:27:56,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 15: [2022-11-27 08:27:56,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:27:56,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 08:27:56,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 15: [2022-11-27 08:27:56,199] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:27:56,199] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 08:27:56,199] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
15: [2022-11-27 08:27:56,217] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:27:56,217] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 08:27:56,217] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 7: [2022-11-27 08:27:56,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:27:56,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 08:27:56,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 13: [2022-11-27 08:27:56,192] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:27:56,192] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 08:27:56,192] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 13: [2022-11-27 08:27:56,193] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:27:56,193] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 08:27:56,193] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
13: [2022-11-27 08:27:56,196] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:27:56,196] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 08:27:56,196] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 13: [2022-11-27 08:27:56,206] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:27:56,206] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 08:27:56,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 13: [2022-11-27 08:27:56,207] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:27:56,207] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 08:27:56,207] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 13: [2022-11-27 08:27:56,209] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:27:56,209] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 08:27:56,209] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
13: [2022-11-27 08:27:56,212] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:27:56,212] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 08:27:56,212] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 13: [2022-11-27 08:27:56,219] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:27:56,220] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 08:27:56,220] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 4: [2022-11-27 08:27:56,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:27:56,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 08:27:56,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 4: [2022-11-27 08:27:56,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:27:56,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 08:27:56,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
4: [2022-11-27 08:27:56,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:27:56,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 08:27:56,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 4: [2022-11-27 08:27:56,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:27:56,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:27:56,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 08:27:56,240] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 08:27:56,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 4: [2022-11-27 08:27:56,240] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 4: [2022-11-27 08:27:56,241] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:27:56,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 08:27:56,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
4: [2022-11-27 08:27:56,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:27:56,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 08:27:56,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 4: [2022-11-27 08:27:56,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:27:56,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 08:27:56,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 0: [2022-11-27 08:27:56,301] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 08:27:56,302] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 11: [2022-11-27 08:27:56,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:27:56,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:27:56,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:27:56,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-27 08:27:56,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:27:56,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:27:56,339] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:27:56,340] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:27:56,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 08:27:56,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 08:27:56,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 08:27:56,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 08:27:56,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 08:27:56,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 08:27:56,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 
08:27:56,340] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 08:27:56,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 11: [2022-11-27 08:27:56,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 11: [2022-11-27 08:27:56,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 11: [2022-11-27 08:27:56,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 11: [2022-11-27 08:27:56,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 11: [2022-11-27 08:27:56,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 11: [2022-11-27 08:27:56,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 11: [2022-11-27 08:27:56,340] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 5: [2022-11-27 08:27:56,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:27:56,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 08:27:56,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:27:56,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-27 08:27:56,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 5: [2022-11-27 08:27:56,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 08:27:56,462] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 08:27:56,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 5: [2022-11-27 08:27:56,462] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 5: [2022-11-27 08:27:56,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:27:56,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:27:56,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:27:56,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:27:56,462] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 
5: [2022-11-27 08:27:56,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 08:27:56,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 08:27:56,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 08:27:56,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 08:27:56,463] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step123000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 08:27:56,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 5: [2022-11-27 08:27:56,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 5: [2022-11-27 08:27:56,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 5: [2022-11-27 08:27:56,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 5: [2022-11-27 08:27:56,463] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step123000 is ready now! 
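The per-iteration records in this log use a `key: value |` layout, so they can be turned into dictionaries for plotting or sanity checks. The sketch below is a minimal, hypothetical parser, not part of Megatron-DeepSpeed. The names `parse_records` and `tokens_consistent` are made up for illustration. It assumes the sequence length of 2048 given by `--seq-length` in the launch command, and it assumes each record starts with a node prefix such as `15:` followed by `iteration`. The two sample records are copied from this log.

```python
import re

# Assumption: taken from --seq-length in the launch command of this run.
SEQ_LENGTH = 2048

# Two records copied verbatim from this log.
SAMPLE = (
    "15: iteration 123010/ 125429 | consumed samples: 31490560 | "
    "consumed tokens: 64492666880 | elapsed time per iteration (s): 2.26 | "
    "learning rate: 2.017E-05 | global batch size: 256 | lm loss: 1.888897E+00 | "
    "grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | "
    "number of nan iterations: 0 | samples per second: 113.429 | TFLOPs: 18.75 | "
    "15: iteration 123020/ 125429 | consumed samples: 31493120 | "
    "consumed tokens: 64497909760 | elapsed time per iteration (s): 1.91 | "
    "learning rate: 2.017E-05 | global batch size: 256 | lm loss: 1.867330E+00 | "
    "grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | "
    "number of nan iterations: 0 | samples per second: 133.848 | TFLOPs: 22.12 |"
)

# A record runs from "iteration N/ TOTAL |" up to the next "<node>: iteration"
# prefix, or to the end of the text.
RECORD_RE = re.compile(
    r"iteration\s+(\d+)/\s*(\d+)\s*\|(.*?)(?=\s\d+: iteration|\Z)", re.S
)


def parse_records(text):
    """Return one dict per iteration record found in `text`."""
    records = []
    for match in RECORD_RE.finditer(text):
        rec = {
            "iteration": int(match.group(1)),
            "total_iterations": int(match.group(2)),
        }
        for field in match.group(3).split("|"):
            key, sep, value = field.partition(":")
            if not sep:
                continue  # empty trailing field after the last "|"
            try:
                rec[key.strip()] = float(value)
            except ValueError:
                pass  # value missing or cut off; skip the field
        records.append(rec)
    return records


def tokens_consistent(rec, seq_length=SEQ_LENGTH):
    """Check that consumed tokens equals consumed samples times seq_length."""
    return rec["consumed tokens"] == rec["consumed samples"] * seq_length


if __name__ == "__main__":
    for rec in parse_records(SAMPLE):
        print(rec["iteration"], rec["lm loss"], tokens_consistent(rec))
```

Fields whose value is cut off at the end of a chunk are skipped rather than guessed, so a truncated record yields a dictionary with fewer keys.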
0: successfully saved checkpoint at iteration 123000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 4328.85 15: iteration 123010/ 125429 | consumed samples: 31490560 | consumed tokens: 64492666880 | elapsed time per iteration (s): 2.26 | learning rate: 2.017E-05 | global batch size: 256 | lm loss: 1.888897E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 113.429 | TFLOPs: 18.75 | 15: iteration 123020/ 125429 | consumed samples: 31493120 | consumed tokens: 64497909760 | elapsed time per iteration (s): 1.91 | learning rate: 2.017E-05 | global batch size: 256 | lm loss: 1.867330E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 133.848 | TFLOPs: 22.12 | 15: iteration 123030/ 125429 | consumed samples: 31495680 | consumed tokens: 64503152640 | elapsed time per iteration (s): 1.03 | learning rate: 2.017E-05 | global batch size: 256 | lm loss: 1.879932E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.892 | TFLOPs: 41.13 | 15: iteration 123040/ 125429 | consumed samples: 31498240 | consumed tokens: 64508395520 | elapsed time per iteration (s): 1.03 | learning rate: 2.016E-05 | global batch size: 256 | lm loss: 1.872620E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.726 | TFLOPs: 41.10 | 15: iteration 123050/ 125429 | consumed samples: 31500800 | consumed tokens: 64513638400 | elapsed time per iteration (s): 1.05 | learning rate: 2.016E-05 | global batch size: 256 | lm loss: 1.894083E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.033 | TFLOPs: 40.33 | 15: iteration 123060/ 125429 | consumed samples: 31503360 | consumed tokens: 64518881280 | elapsed time per iteration (s): 
3.32 | learning rate: 2.016E-05 | global batch size: 256 | lm loss: 1.907307E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 77.146 | TFLOPs: 12.75 | 15: iteration 123070/ 125429 | consumed samples: 31505920 | consumed tokens: 64524124160 | elapsed time per iteration (s): 1.03 | learning rate: 2.016E-05 | global batch size: 256 | lm loss: 1.871788E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.550 | TFLOPs: 41.07 | 15: iteration 123080/ 125429 | consumed samples: 31508480 | consumed tokens: 64529367040 | elapsed time per iteration (s): 1.04 | learning rate: 2.016E-05 | global batch size: 256 | lm loss: 1.915321E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.100 | TFLOPs: 40.67 | 15: iteration 123090/ 125429 | consumed samples: 31511040 | consumed tokens: 64534609920 | elapsed time per iteration (s): 1.03 | learning rate: 2.016E-05 | global batch size: 256 | lm loss: 1.876114E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.382 | TFLOPs: 40.88 | 15: iteration 123100/ 125429 | consumed samples: 31513600 | consumed tokens: 64539852800 | elapsed time per iteration (s): 1.08 | learning rate: 2.016E-05 | global batch size: 256 | lm loss: 1.897878E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.090 | TFLOPs: 39.35 | 15: iteration 123110/ 125429 | consumed samples: 31516160 | consumed tokens: 64545095680 | elapsed time per iteration (s): 1.05 | learning rate: 2.015E-05 | global batch size: 256 | lm loss: 1.915239E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.742 | TFLOPs: 40.11 | 15: 
iteration 123120/ 125429 | consumed samples: 31518720 | consumed tokens: 64550338560 | elapsed time per iteration (s): 1.07 | learning rate: 2.015E-05 | global batch size: 256 | lm loss: 1.871535E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.557 | TFLOPs: 39.59 | 15: iteration 123130/ 125429 | consumed samples: 31521280 | consumed tokens: 64555581440 | elapsed time per iteration (s): 1.06 | learning rate: 2.015E-05 | global batch size: 256 | lm loss: 1.902436E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.595 | TFLOPs: 39.93 | 15: iteration 123140/ 125429 | consumed samples: 31523840 | consumed tokens: 64560824320 | elapsed time per iteration (s): 1.05 | learning rate: 2.015E-05 | global batch size: 256 | lm loss: 1.883142E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.765 | TFLOPs: 40.45 | 15: iteration 123150/ 125429 | consumed samples: 31526400 | consumed tokens: 64566067200 | elapsed time per iteration (s): 1.04 | learning rate: 2.015E-05 | global batch size: 256 | lm loss: 1.853976E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.835 | TFLOPs: 40.79 | 15: iteration 123160/ 125429 | consumed samples: 31528960 | consumed tokens: 64571310080 | elapsed time per iteration (s): 1.09 | learning rate: 2.015E-05 | global batch size: 256 | lm loss: 1.880733E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.597 | TFLOPs: 38.77 | 15: iteration 123170/ 125429 | consumed samples: 31531520 | consumed tokens: 64576552960 | elapsed time per iteration (s): 1.05 | learning rate: 2.015E-05 | global batch size: 256 | lm loss: 1.894865E+00 | grad norm: 0.169 | num zeros: 0.0 | 
number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.766 | TFLOPs: 40.12 | 15: iteration 123180/ 125429 | consumed samples: 31534080 | consumed tokens: 64581795840 | elapsed time per iteration (s): 1.03 | learning rate: 2.015E-05 | global batch size: 256 | lm loss: 1.901529E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.553 | TFLOPs: 41.08 | 15: iteration 123190/ 125429 | consumed samples: 31536640 | consumed tokens: 64587038720 | elapsed time per iteration (s): 1.06 | learning rate: 2.014E-05 | global batch size: 256 | lm loss: 1.882774E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.202 | TFLOPs: 39.86 | 15: iteration 123200/ 125429 | consumed samples: 31539200 | consumed tokens: 64592281600 | elapsed time per iteration (s): 1.05 | learning rate: 2.014E-05 | global batch size: 256 | lm loss: 1.907739E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.338 | TFLOPs: 40.38 | 15: iteration 123210/ 125429 | consumed samples: 31541760 | consumed tokens: 64597524480 | elapsed time per iteration (s): 1.04 | learning rate: 2.014E-05 | global batch size: 256 | lm loss: 1.907419E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.912 | TFLOPs: 40.80 | 15: iteration 123220/ 125429 | consumed samples: 31544320 | consumed tokens: 64602767360 | elapsed time per iteration (s): 1.04 | learning rate: 2.014E-05 | global batch size: 256 | lm loss: 1.871481E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.696 | TFLOPs: 40.77 | 15: iteration 123230/ 125429 | consumed samples: 31546880 | consumed tokens: 64608010240 | elapsed time per iteration (s): 1.07 | 
learning rate: 2.014E-05 | global batch size: 256 | lm loss: 1.881694E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.259 | TFLOPs: 39.70 | 15: iteration 123240/ 125429 | consumed samples: 31549440 | consumed tokens: 64613253120 | elapsed time per iteration (s): 1.06 | learning rate: 2.014E-05 | global batch size: 256 | lm loss: 1.879032E+00 | grad norm: 0.191 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.232 | TFLOPs: 40.03 | 15: iteration 123250/ 125429 | consumed samples: 31552000 | consumed tokens: 64618496000 | elapsed time per iteration (s): 1.02 | learning rate: 2.014E-05 | global batch size: 256 | lm loss: 1.914299E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.320 | TFLOPs: 41.37 | 15: iteration 123260/ 125429 | consumed samples: 31554560 | consumed tokens: 64623738880 | elapsed time per iteration (s): 1.03 | learning rate: 2.014E-05 | global batch size: 256 | lm loss: 1.878759E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.370 | TFLOPs: 40.88 | 15: iteration 123270/ 125429 | consumed samples: 31557120 | consumed tokens: 64628981760 | elapsed time per iteration (s): 1.05 | learning rate: 2.013E-05 | global batch size: 256 | lm loss: 1.855598E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.734 | TFLOPs: 40.11 | 15: iteration 123280/ 125429 | consumed samples: 31559680 | consumed tokens: 64634224640 | elapsed time per iteration (s): 1.07 | learning rate: 2.013E-05 | global batch size: 256 | lm loss: 1.886839E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.728 | TFLOPs: 39.62 | 15: iteration 
123290/ 125429 | consumed samples: 31562240 | consumed tokens: 64639467520 | elapsed time per iteration (s): 1.08 | learning rate: 2.013E-05 | global batch size: 256 | lm loss: 1.902565E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.396 | TFLOPs: 39.23 | 15: iteration 123300/ 125429 | consumed samples: 31564800 | consumed tokens: 64644710400 | elapsed time per iteration (s): 1.06 | learning rate: 2.013E-05 | global batch size: 256 | lm loss: 1.864337E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.788 | TFLOPs: 39.79 | 15: iteration 123310/ 125429 | consumed samples: 31567360 | consumed tokens: 64649953280 | elapsed time per iteration (s): 1.04 | learning rate: 2.013E-05 | global batch size: 256 | lm loss: 1.888520E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.753 | TFLOPs: 40.61 | 15: iteration 123320/ 125429 | consumed samples: 31569920 | consumed tokens: 64655196160 | elapsed time per iteration (s): 1.02 | learning rate: 2.013E-05 | global batch size: 256 | lm loss: 1.898803E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.766 | TFLOPs: 41.44 | 15: iteration 123330/ 125429 | consumed samples: 31572480 | consumed tokens: 64660439040 | elapsed time per iteration (s): 1.05 | learning rate: 2.013E-05 | global batch size: 256 | lm loss: 1.853819E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.822 | TFLOPs: 40.29 | 15: iteration 123340/ 125429 | consumed samples: 31575040 | consumed tokens: 64665681920 | elapsed time per iteration (s): 1.03 | learning rate: 2.013E-05 | global batch size: 256 | lm loss: 1.894956E+00 | grad norm: 0.172 | num zeros: 0.0 | number of 
skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.908 | TFLOPs: 41.13 | 15: iteration 123350/ 125429 | consumed samples: 31577600 | consumed tokens: 64670924800 | elapsed time per iteration (s): 1.07 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.888706E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.498 | TFLOPs: 39.58 | 15: iteration 123360/ 125429 | consumed samples: 31580160 | consumed tokens: 64676167680 | elapsed time per iteration (s): 1.04 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.894866E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.874 | TFLOPs: 40.80 | 15: iteration 123370/ 125429 | consumed samples: 31582720 | consumed tokens: 64681410560 | elapsed time per iteration (s): 1.03 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.897561E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.832 | TFLOPs: 41.12 | 15: iteration 123380/ 125429 | consumed samples: 31585280 | consumed tokens: 64686653440 | elapsed time per iteration (s): 1.05 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.852672E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.864 | TFLOPs: 40.47 | 15: iteration 123390/ 125429 | consumed samples: 31587840 | consumed tokens: 64691896320 | elapsed time per iteration (s): 1.02 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.883523E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.847 | TFLOPs: 41.29 | 15: iteration 123400/ 125429 | consumed samples: 31590400 | consumed tokens: 64697139200 | elapsed time per iteration (s): 1.06 | learning 
rate: 2.012E-05 | global batch size: 256 | lm loss: 1.885508E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.339 | TFLOPs: 39.88 | 15: iteration 123410/ 125429 | consumed samples: 31592960 | consumed tokens: 64702382080 | elapsed time per iteration (s): 1.06 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.882403E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.731 | TFLOPs: 39.95 | 15: iteration 123420/ 125429 | consumed samples: 31595520 | consumed tokens: 64707624960 | elapsed time per iteration (s): 1.06 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.889875E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.524 | TFLOPs: 40.08 | 15: iteration 123430/ 125429 | consumed samples: 31598080 | consumed tokens: 64712867840 | elapsed time per iteration (s): 1.04 | learning rate: 2.012E-05 | global batch size: 256 | lm loss: 1.893533E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.240 | TFLOPs: 40.69 | 15: iteration 123440/ 125429 | consumed samples: 31600640 | consumed tokens: 64718110720 | elapsed time per iteration (s): 1.06 | learning rate: 2.011E-05 | global batch size: 256 | lm loss: 1.881748E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.357 | TFLOPs: 39.89 | 15: iteration 123450/ 125429 | consumed samples: 31603200 | consumed tokens: 64723353600 | elapsed time per iteration (s): 1.07 | learning rate: 2.011E-05 | global batch size: 256 | lm loss: 1.867634E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.886 | TFLOPs: 39.64 | 15: iteration 123460/ 
125429 | consumed samples: 31605760 | consumed tokens: 64728596480 | elapsed time per iteration (s): 1.19 | learning rate: 2.011E-05 | global batch size: 256 | lm loss: 1.890062E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.526 | TFLOPs: 35.62 | 15: iteration 123470/ 125429 | consumed samples: 31608320 | consumed tokens: 64733839360 | elapsed time per iteration (s): 1.03 | learning rate: 2.011E-05 | global batch size: 256 | lm loss: 1.865799E+00 | grad norm: 0.155 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.578 | TFLOPs: 40.91 | 15: iteration 123480/ 125429 | consumed samples: 31610880 | consumed tokens: 64739082240 | elapsed time per iteration (s): 1.06 | learning rate: 2.011E-05 | global batch size: 256 | lm loss: 1.872851E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.079 | TFLOPs: 39.84 | 15: iteration 123490/ 125429 | consumed samples: 31613440 | consumed tokens: 64744325120 | elapsed time per iteration (s): 1.05 | learning rate: 2.011E-05 | global batch size: 256 | lm loss: 1.879455E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.920 | TFLOPs: 40.31 | 15: iteration 123500/ 125429 | consumed samples: 31616000 | consumed tokens: 64749568000 | elapsed time per iteration (s): 1.06 | learning rate: 2.011E-05 | global batch size: 256 | lm loss: 1.868663E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.801 | TFLOPs: 39.79 | 15: iteration 123510/ 125429 | consumed samples: 31618560 | consumed tokens: 64754810880 | elapsed time per iteration (s): 1.02 | learning rate: 2.011E-05 | global batch size: 256 | lm loss: 1.900868E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 250.267 | TFLOPs: 41.36 | 15: iteration 123520/ 125429 | consumed samples: 31621120 | consumed tokens: 64760053760 | elapsed time per iteration (s): 1.21 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.861307E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 211.748 | TFLOPs: 34.99 | 15: iteration 123530/ 125429 | consumed samples: 31623680 | consumed tokens: 64765296640 | elapsed time per iteration (s): 1.07 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.874837E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.335 | TFLOPs: 39.72 | 15: iteration 123540/ 125429 | consumed samples: 31626240 | consumed tokens: 64770539520 | elapsed time per iteration (s): 1.03 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.880794E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.729 | TFLOPs: 41.10 | 15: iteration 123550/ 125429 | consumed samples: 31628800 | consumed tokens: 64775782400 | elapsed time per iteration (s): 1.04 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.889271E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.889 | TFLOPs: 40.64 | 15: iteration 123560/ 125429 | consumed samples: 31631360 | consumed tokens: 64781025280 | elapsed time per iteration (s): 1.04 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.883979E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.011 | TFLOPs: 40.49 | 15: iteration 123570/ 125429 | consumed samples: 31633920 | consumed tokens: 64786268160 | elapsed time per iteration (s): 1.05 | learning rate: 
2.010E-05 | global batch size: 256 | lm loss: 1.889650E+00 | grad norm: 0.157 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.258 | TFLOPs: 40.37 | 15: iteration 123580/ 125429 | consumed samples: 31636480 | consumed tokens: 64791511040 | elapsed time per iteration (s): 1.06 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.890285E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.546 | TFLOPs: 40.08 | 15: iteration 123590/ 125429 | consumed samples: 31639040 | consumed tokens: 64796753920 | elapsed time per iteration (s): 1.06 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.901121E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.397 | TFLOPs: 39.73 | 15: iteration 123600/ 125429 | consumed samples: 31641600 | consumed tokens: 64801996800 | elapsed time per iteration (s): 1.09 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.881293E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.802 | TFLOPs: 38.64 | 15: iteration 123610/ 125429 | consumed samples: 31644160 | consumed tokens: 64807239680 | elapsed time per iteration (s): 1.03 | learning rate: 2.010E-05 | global batch size: 256 | lm loss: 1.901555E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.397 | TFLOPs: 41.05 | 15: iteration 123620/ 125429 | consumed samples: 31646720 | consumed tokens: 64812482560 | elapsed time per iteration (s): 1.03 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.890355E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.775 | TFLOPs: 40.95 | 15: iteration 123630/ 125429 | 
consumed samples: 31649280 | consumed tokens: 64817725440 | elapsed time per iteration (s): 1.05 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.877709E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.782 | TFLOPs: 40.12 | 15: iteration 123640/ 125429 | consumed samples: 31651840 | consumed tokens: 64822968320 | elapsed time per iteration (s): 1.06 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.885799E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.407 | TFLOPs: 39.73 | 15: iteration 123650/ 125429 | consumed samples: 31654400 | consumed tokens: 64828211200 | elapsed time per iteration (s): 1.03 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.876281E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.619 | TFLOPs: 41.09 | 15: iteration 123660/ 125429 | consumed samples: 31656960 | consumed tokens: 64833454080 | elapsed time per iteration (s): 1.07 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.932034E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.634 | TFLOPs: 39.60 | 15: iteration 123670/ 125429 | consumed samples: 31659520 | consumed tokens: 64838696960 | elapsed time per iteration (s): 1.07 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.901150E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.005 | TFLOPs: 39.50 | 15: iteration 123680/ 125429 | consumed samples: 31662080 | consumed tokens: 64843939840 | elapsed time per iteration (s): 1.06 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.883375E+00 | grad norm: 0.188 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 242.139 | TFLOPs: 40.02 | 15: iteration 123690/ 125429 | consumed samples: 31664640 | consumed tokens: 64849182720 | elapsed time per iteration (s): 1.04 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.884563E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.648 | TFLOPs: 40.60 | 15: iteration 123700/ 125429 | consumed samples: 31667200 | consumed tokens: 64854425600 | elapsed time per iteration (s): 1.04 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.895464E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.295 | TFLOPs: 40.54 | 15: iteration 123710/ 125429 | consumed samples: 31669760 | consumed tokens: 64859668480 | elapsed time per iteration (s): 1.05 | learning rate: 2.009E-05 | global batch size: 256 | lm loss: 1.890039E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.026 | TFLOPs: 40.16 | 15: iteration 123720/ 125429 | consumed samples: 31672320 | consumed tokens: 64864911360 | elapsed time per iteration (s): 1.07 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.895977E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.658 | TFLOPs: 39.61 | 15: iteration 123730/ 125429 | consumed samples: 31674880 | consumed tokens: 64870154240 | elapsed time per iteration (s): 1.06 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.878988E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.762 | TFLOPs: 39.79 | 15: iteration 123740/ 125429 | consumed samples: 31677440 | consumed tokens: 64875397120 | elapsed time per iteration (s): 1.07 | learning rate: 
2.008E-05 | global batch size: 256 | lm loss: 1.897912E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.485 | TFLOPs: 39.41 | 15: iteration 123750/ 125429 | consumed samples: 31680000 | consumed tokens: 64880640000 | elapsed time per iteration (s): 1.04 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.899850E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.518 | TFLOPs: 40.57 | 15: iteration 123760/ 125429 | consumed samples: 31682560 | consumed tokens: 64885882880 | elapsed time per iteration (s): 1.02 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.909072E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.285 | TFLOPs: 41.53 | 15: iteration 123770/ 125429 | consumed samples: 31685120 | consumed tokens: 64891125760 | elapsed time per iteration (s): 1.02 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.875824E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.519 | TFLOPs: 41.57 | 15: iteration 123780/ 125429 | consumed samples: 31687680 | consumed tokens: 64896368640 | elapsed time per iteration (s): 1.05 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.894179E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.973 | TFLOPs: 40.15 | 15: iteration 123790/ 125429 | consumed samples: 31690240 | consumed tokens: 64901611520 | elapsed time per iteration (s): 1.05 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.904719E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.865 | TFLOPs: 40.30 | 15: iteration 123800/ 125429 | 
consumed samples: 31692800 | consumed tokens: 64906854400 | elapsed time per iteration (s): 1.03 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.883464E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.043 | TFLOPs: 41.16 | 15: iteration 123810/ 125429 | consumed samples: 31695360 | consumed tokens: 64912097280 | elapsed time per iteration (s): 1.03 | learning rate: 2.008E-05 | global batch size: 256 | lm loss: 1.900211E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.133 | TFLOPs: 41.17 | 15: iteration 123820/ 125429 | consumed samples: 31697920 | consumed tokens: 64917340160 | elapsed time per iteration (s): 1.08 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.880106E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.921 | TFLOPs: 39.15 | 15: iteration 123830/ 125429 | consumed samples: 31700480 | consumed tokens: 64922583040 | elapsed time per iteration (s): 1.04 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.877087E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.013 | TFLOPs: 40.66 | 15: iteration 123840/ 125429 | consumed samples: 31703040 | consumed tokens: 64927825920 | elapsed time per iteration (s): 1.06 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.879659E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.604 | TFLOPs: 40.09 | 15: iteration 123850/ 125429 | consumed samples: 31705600 | consumed tokens: 64933068800 | elapsed time per iteration (s): 1.14 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.885907E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 224.364 | TFLOPs: 37.08 | 15: iteration 123860/ 125429 | consumed samples: 31708160 | consumed tokens: 64938311680 | elapsed time per iteration (s): 1.06 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.847938E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.699 | TFLOPs: 39.94 | 15: iteration 123870/ 125429 | consumed samples: 31710720 | consumed tokens: 64943554560 | elapsed time per iteration (s): 1.07 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.912326E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.649 | TFLOPs: 39.44 | 15: iteration 123880/ 125429 | consumed samples: 31713280 | consumed tokens: 64948797440 | elapsed time per iteration (s): 1.07 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.863135E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.302 | TFLOPs: 39.38 | 15: iteration 123890/ 125429 | consumed samples: 31715840 | consumed tokens: 64954040320 | elapsed time per iteration (s): 1.12 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.903769E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 228.922 | TFLOPs: 37.83 | 15: iteration 123900/ 125429 | consumed samples: 31718400 | consumed tokens: 64959283200 | elapsed time per iteration (s): 1.05 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.894624E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.909 | TFLOPs: 40.31 | 15: iteration 123910/ 125429 | consumed samples: 31720960 | consumed tokens: 64964526080 | elapsed time per iteration (s): 1.07 | learning rate: 
2.007E-05 | global batch size: 256 | lm loss: 1.876449E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.171 | TFLOPs: 39.69 | 15: iteration 123920/ 125429 | consumed samples: 31723520 | consumed tokens: 64969768960 | elapsed time per iteration (s): 1.08 | learning rate: 2.007E-05 | global batch size: 256 | lm loss: 1.865483E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.002 | TFLOPs: 39.17 | 15: iteration 123930/ 125429 | consumed samples: 31726080 | consumed tokens: 64975011840 | elapsed time per iteration (s): 1.05 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.885665E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.138 | TFLOPs: 40.18 | 15: iteration 123940/ 125429 | consumed samples: 31728640 | consumed tokens: 64980254720 | elapsed time per iteration (s): 1.19 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.907907E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 214.278 | TFLOPs: 35.41 | 15: iteration 123950/ 125429 | consumed samples: 31731200 | consumed tokens: 64985497600 | elapsed time per iteration (s): 1.06 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.904491E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.907 | TFLOPs: 39.81 | 15: iteration 123960/ 125429 | consumed samples: 31733760 | consumed tokens: 64990740480 | elapsed time per iteration (s): 1.02 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.886151E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.421 | TFLOPs: 41.38 | 15: iteration 123970/ 125429 | 
consumed samples: 31736320 | consumed tokens: 64995983360 | elapsed time per iteration (s): 1.03 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.884176E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.494 | TFLOPs: 41.23 | 15: iteration 123980/ 125429 | consumed samples: 31738880 | consumed tokens: 65001226240 | elapsed time per iteration (s): 1.05 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.859620E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.331 | TFLOPs: 40.38 | 15: iteration 123990/ 125429 | consumed samples: 31741440 | consumed tokens: 65006469120 | elapsed time per iteration (s): 1.20 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.884891E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 212.625 | TFLOPs: 35.14 | 0: [2022-11-27 08:46:12,573] [INFO] [logging.py:68:log_dist] [Rank 0] step=124000, skipped=0, lr=[2.0058816380794858e-05, 2.0058816380794858e-05, 2.0058816380794858e-05], mom=[(0.9, 0.999), (0.9, 0.999), (0.9, 0.999)] 15: iteration 124000/ 125429 | consumed samples: 31744000 | consumed tokens: 65011712000 | elapsed time per iteration (s): 1.03 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.901572E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.780 | TFLOPs: 40.95 | 0: steps: 124000 loss: 1.9801 iter time (s): 1.124 samples/sec: 227.822 15: -------------------------------------------------------------------------------------------- 15: valid loss at iteration 124000 | lm loss value: 1.807271E+00 | lm loss PPL: 6.093797E+00 | 15: -------------------------------------------------------------------------------------------- 0: saving checkpoint at iteration 124000 to 
checkpoints_1b5 0: [2022-11-27 08:46:12,940] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step124000 is begin to save! 0: [2022-11-27 08:46:12,949] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_01-model_00-model_states.pt... 0: [2022-11-27 08:46:13,200] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_01-model_00-model_states.pt. 0: [2022-11-27 08:46:13,200] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_03-model_00-model_states.pt... 0: [2022-11-27 08:46:13,306] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_03-model_00-model_states.pt. 0: [2022-11-27 08:46:13,307] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_04-model_00-model_states.pt... 0: [2022-11-27 08:46:13,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_04-model_00-model_states.pt. 0: [2022-11-27 08:46:13,415] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_05-model_00-model_states.pt... 0: [2022-11-27 08:46:13,520] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_05-model_00-model_states.pt. 0: [2022-11-27 08:46:13,521] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_06-model_00-model_states.pt... 0: [2022-11-27 08:46:13,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_06-model_00-model_states.pt. 0: [2022-11-27 08:46:13,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_07-model_00-model_states.pt... 
0: [2022-11-27 08:46:13,739] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_07-model_00-model_states.pt. 0: [2022-11-27 08:46:13,739] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_08-model_00-model_states.pt... 0: [2022-11-27 08:46:13,844] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_08-model_00-model_states.pt. 0: [2022-11-27 08:46:13,845] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_09-model_00-model_states.pt... 0: [2022-11-27 08:46:13,950] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_09-model_00-model_states.pt. 0: [2022-11-27 08:46:13,950] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_10-model_00-model_states.pt... 0: [2022-11-27 08:46:14,056] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_10-model_00-model_states.pt. 0: [2022-11-27 08:46:14,056] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_11-model_00-model_states.pt... 0: [2022-11-27 08:46:14,161] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_11-model_00-model_states.pt. 0: [2022-11-27 08:46:14,161] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_12-model_00-model_states.pt... 0: [2022-11-27 08:46:14,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_12-model_00-model_states.pt. 0: [2022-11-27 08:46:14,267] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_13-model_00-model_states.pt... 
0: [2022-11-27 08:46:14,372] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_13-model_00-model_states.pt. 0: [2022-11-27 08:46:14,373] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_14-model_00-model_states.pt... 0: [2022-11-27 08:46:14,479] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_14-model_00-model_states.pt. 0: [2022-11-27 08:46:14,479] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_15-model_00-model_states.pt... 0: [2022-11-27 08:46:14,586] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_15-model_00-model_states.pt. 0: [2022-11-27 08:46:14,586] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_16-model_00-model_states.pt... 0: [2022-11-27 08:46:14,692] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_16-model_00-model_states.pt. 0: [2022-11-27 08:46:14,693] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_17-model_00-model_states.pt... 0: [2022-11-27 08:46:14,799] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_17-model_00-model_states.pt. 0: [2022-11-27 08:46:14,800] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_18-model_00-model_states.pt... 0: [2022-11-27 08:46:14,905] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_18-model_00-model_states.pt. 0: [2022-11-27 08:46:14,905] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_19-model_00-model_states.pt... 
0: [2022-11-27 08:46:15,012] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_19-model_00-model_states.pt. 0: [2022-11-27 08:46:15,012] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_20-model_00-model_states.pt... 0: [2022-11-27 08:46:15,118] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_20-model_00-model_states.pt. 0: [2022-11-27 08:46:15,118] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_21-model_00-model_states.pt... 0: [2022-11-27 08:46:15,221] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_21-model_00-model_states.pt. 0: [2022-11-27 08:46:15,222] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_22-model_00-model_states.pt... 0: [2022-11-27 08:46:15,332] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_22-model_00-model_states.pt. 0: [2022-11-27 08:46:15,332] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_23-model_00-model_states.pt... 0: [2022-11-27 08:46:15,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_23-model_00-model_states.pt. 0: [2022-11-27 08:46:15,435] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_24-model_00-model_states.pt... 0: [2022-11-27 08:46:15,543] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_24-model_00-model_states.pt. 0: [2022-11-27 08:46:15,543] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_25-model_00-model_states.pt... 
0: [2022-11-27 08:46:15,647] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_25-model_00-model_states.pt. 0: [2022-11-27 08:46:15,647] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_26-model_00-model_states.pt... 0: [2022-11-27 08:46:15,754] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_26-model_00-model_states.pt. 0: [2022-11-27 08:46:15,754] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_27-model_00-model_states.pt... 0: [2022-11-27 08:46:15,859] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_27-model_00-model_states.pt. 0: [2022-11-27 08:46:15,860] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_28-model_00-model_states.pt... 0: [2022-11-27 08:46:15,965] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_28-model_00-model_states.pt. 0: [2022-11-27 08:46:15,965] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_29-model_00-model_states.pt... 0: [2022-11-27 08:46:16,071] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_29-model_00-model_states.pt. 0: [2022-11-27 08:46:16,071] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_30-model_00-model_states.pt... 0: [2022-11-27 08:46:16,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_30-model_00-model_states.pt. 0: [2022-11-27 08:46:16,180] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/layer_32-model_00-model_states.pt... 
0: [2022-11-27 08:46:16,187] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/layer_32-model_00-model_states.pt. 0: [2022-11-27 08:46:16,188] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step124000/mp_rank_00_model_states.pt 0: [2022-11-27 08:46:16,188] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/mp_rank_00_model_states.pt... 0: [2022-11-27 08:46:16,191] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/mp_rank_00_model_states.pt. 0: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 
14: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 
8: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 
10: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 
15: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 
14: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 
7: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 
13: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
3: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 
11: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 10: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 
15: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
13: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 1: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 9: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 6: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 2: [2022-11-27 08:46:16,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step124000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 0: [2022-11-27 08:46:16,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-27 08:46:16,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 08:46:16,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 0: [2022-11-27 08:46:16,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:46:16,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 08:46:16,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 7: [2022-11-27 08:46:16,394] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:46:16,394] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 08:46:16,394] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 0: [2022-11-27 08:46:16,396] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:46:16,396] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 08:46:16,396] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 11: [2022-11-27 08:46:16,398] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 
15: [2022-11-27 08:46:16,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:46:16,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:46:16,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:46:16,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 08:46:16,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 7: [2022-11-27 08:46:16,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:46:16,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 08:46:16,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 7: [2022-11-27 08:46:16,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:46:16,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 08:46:16,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 12: [2022-11-27 08:46:16,405] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 
12: [2022-11-27 08:46:16,405] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 08:46:16,405] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 12: [2022-11-27 08:46:16,406] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:46:16,406] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 08:46:16,406] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 0: [2022-11-27 08:46:16,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:46:16,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 08:46:16,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 7: [2022-11-27 08:46:16,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:46:16,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 08:46:16,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 12: [2022-11-27 08:46:16,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 
12: [2022-11-27 08:46:16,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 08:46:16,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 9: [2022-11-27 08:46:16,409] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:46:16,409] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 08:46:16,409] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 14: [2022-11-27 08:46:16,392] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:46:16,392] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 08:46:16,392] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 0: [2022-11-27 08:46:16,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:46:16,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 08:46:16,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 14: [2022-11-27 08:46:16,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-27 08:46:16,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:46:16,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:46:16,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 08:46:16,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 08:46:16,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 08:46:16,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 14: [2022-11-27 08:46:16,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 14: [2022-11-27 08:46:16,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 0: [2022-11-27 08:46:16,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:46:16,411] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 08:46:16,411] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 12: [2022-11-27 08:46:16,411] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 
12: [2022-11-27 08:46:16,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 08:46:16,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 12: [2022-11-27 08:46:16,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:46:16,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 08:46:16,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 9: [2022-11-27 08:46:16,415] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:46:16,415] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 08:46:16,415] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 12: [2022-11-27 08:46:16,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 08:46:16,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 08:46:16,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 12: [2022-11-27 08:46:16,421] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 
12: [2022-11-27 08:46:16,421] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 08:46:16,421] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 6: [2022-11-27 08:46:16,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:46:16,422] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 08:46:16,422] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 6: [2022-11-27 08:46:16,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:46:16,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 08:46:16,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 6: [2022-11-27 08:46:16,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:46:16,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 08:46:16,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 7: [2022-11-27 08:46:16,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 
9: [2022-11-27 08:46:16,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:46:16,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 08:46:16,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 9: [2022-11-27 08:46:16,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 08:46:16,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 7: [2022-11-27 08:46:16,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:46:16,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 08:46:16,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 9: [2022-11-27 08:46:16,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:46:16,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 08:46:16,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 
15: [2022-11-27 08:46:16,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 08:46:16,399] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 08:46:16,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 15: [2022-11-27 08:46:16,399] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 15: [2022-11-27 08:46:16,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:46:16,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:46:16,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 15: [2022-11-27 08:46:16,400] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 08:46:16,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 9: [2022-11-27 08:46:16,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:46:16,400] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 15: [2022-11-27 08:46:16,407] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 
15: [2022-11-27 08:46:16,407] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 08:46:16,407] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 9: [2022-11-27 08:46:16,424] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 15: [2022-11-27 08:46:16,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:46:16,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 08:46:16,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 9: [2022-11-27 08:46:16,424] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 4: [2022-11-27 08:46:16,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:46:16,413] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 08:46:16,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 4: [2022-11-27 08:46:16,416] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:46:16,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 
4: [2022-11-27 08:46:16,416] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 13: [2022-11-27 08:46:16,403] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 4: [2022-11-27 08:46:16,416] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 13: [2022-11-27 08:46:16,403] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 4: [2022-11-27 08:46:16,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:46:16,412] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:46:16,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 13: [2022-11-27 08:46:16,412] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 15: [2022-11-27 08:46:16,427] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:46:16,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 13: [2022-11-27 08:46:16,412] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 4: [2022-11-27 08:46:16,422] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 
13: [2022-11-27 08:46:16,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:46:16,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 13: [2022-11-27 08:46:16,414] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 4: [2022-11-27 08:46:16,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 13: [2022-11-27 08:46:16,414] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 4: [2022-11-27 08:46:16,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:46:16,417] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:46:16,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 13: [2022-11-27 08:46:16,417] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 4: [2022-11-27 08:46:16,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 13: [2022-11-27 08:46:16,417] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 4: [2022-11-27 08:46:16,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 
13: [2022-11-27 08:46:16,418] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:46:16,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 13: [2022-11-27 08:46:16,418] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 4: [2022-11-27 08:46:16,426] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 13: [2022-11-27 08:46:16,418] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 4: [2022-11-27 08:46:16,426] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:46:16,420] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:46:16,426] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 13: [2022-11-27 08:46:16,420] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 4: [2022-11-27 08:46:16,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 13: [2022-11-27 08:46:16,420] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 13: [2022-11-27 08:46:16,423] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 
13: [2022-11-27 08:46:16,423] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 08:46:16,423] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 13: [2022-11-27 08:46:16,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 08:46:16,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 08:46:16,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 6: [2022-11-27 08:46:16,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:46:16,431] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 08:46:16,431] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 10: [2022-11-27 08:46:16,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:46:16,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:46:16,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:46:16,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 
10: [2022-11-27 08:46:16,431] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:46:16,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 08:46:16,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 08:46:16,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 08:46:16,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 08:46:16,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 08:46:16,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 6: [2022-11-27 08:46:16,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:46:16,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 10: [2022-11-27 08:46:16,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 6: [2022-11-27 08:46:16,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:46:16,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 
10: [2022-11-27 08:46:16,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 6: [2022-11-27 08:46:16,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 08:46:16,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:46:16,432] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 08:46:16,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 08:46:16,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 6: [2022-11-27 08:46:16,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 6: [2022-11-27 08:46:16,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 08:46:16,432] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 08:46:16,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 6: [2022-11-27 08:46:16,432] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 0: [2022-11-27 08:46:16,433] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
0: [2022-11-27 08:46:16,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 08:46:16,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 08:46:16,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 10: [2022-11-27 08:46:16,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:46:16,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 08:46:16,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 10: [2022-11-27 08:46:16,435] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:46:16,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 08:46:16,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 11: [2022-11-27 08:46:16,398] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 08:46:16,398] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 11: [2022-11-27 08:46:16,403] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-27 08:46:16,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 08:46:16,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 11: [2022-11-27 08:46:16,404] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:46:16,404] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 08:46:16,404] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 11: [2022-11-27 08:46:16,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:46:16,408] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 08:46:16,408] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 11: [2022-11-27 08:46:16,424] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:46:16,425] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 08:46:16,425] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 11: [2022-11-27 08:46:16,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 
11: [2022-11-27 08:46:16,435] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 08:46:16,435] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 15: [2022-11-27 08:46:16,427] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 08:46:16,427] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 15: [2022-11-27 08:46:16,428] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 08:46:16,428] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 08:46:16,428] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 7: [2022-11-27 08:46:16,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 08:46:16,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 08:46:16,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 7: [2022-11-27 08:46:16,452] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 
7: [2022-11-27 08:46:16,452] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 08:46:16,452] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 4: [2022-11-27 08:46:16,451] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 08:46:16,451] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 08:46:16,451] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 10: [2022-11-27 08:46:16,463] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 08:46:16,464] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 08:46:16,464] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 14: [2022-11-27 08:46:16,434] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:46:16,434] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 08:46:16,434] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 11: [2022-11-27 08:46:16,444] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 
11: [2022-11-27 08:46:16,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 08:46:16,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 11: [2022-11-27 08:46:16,445] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 11: [2022-11-27 08:46:16,445] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 08:46:16,445] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 14: [2022-11-27 08:46:16,513] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:46:16,513] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 08:46:16,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 14: [2022-11-27 08:46:16,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 08:46:16,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 08:46:16,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 14: [2022-11-27 08:46:16,514] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-27 08:46:16,514] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 08:46:16,514] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 9: [2022-11-27 08:46:16,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:46:16,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 08:46:16,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 9: [2022-11-27 08:46:16,528] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:46:16,528] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 08:46:16,528] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 9: [2022-11-27 08:46:16,529] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 08:46:16,529] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 08:46:16,529] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 1: [2022-11-27 08:46:16,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-27 08:46:16,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:46:16,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:46:16,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:46:16,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:46:16,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:46:16,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 08:46:16,537] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 
1: [2022-11-27 08:46:16,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 08:46:16,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 08:46:16,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 08:46:16,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 08:46:16,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 08:46:16,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 08:46:16,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 08:46:16,538] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 08:46:16,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 1: [2022-11-27 08:46:16,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 1: [2022-11-27 08:46:16,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 1: [2022-11-27 08:46:16,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 
1: [2022-11-27 08:46:16,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 1: [2022-11-27 08:46:16,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 1: [2022-11-27 08:46:16,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 1: [2022-11-27 08:46:16,538] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 0: [2022-11-27 08:46:16,610] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 08:46:16,610] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 8: [2022-11-27 08:46:16,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:46:16,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 08:46:16,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:46:16,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:46:16,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 
8: [2022-11-27 08:46:16,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 08:46:16,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 08:46:16,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 8: [2022-11-27 08:46:16,664] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 8: [2022-11-27 08:46:16,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:46:16,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:46:16,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:46:16,664] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 08:46:16,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:46:16,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 08:46:16,664] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 08:46:16,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 
8: [2022-11-27 08:46:16,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 08:46:16,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 8: [2022-11-27 08:46:16,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 08:46:16,665] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 08:46:16,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 8: [2022-11-27 08:46:16,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 8: [2022-11-27 08:46:16,665] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 5: [2022-11-27 08:46:16,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:46:16,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:46:16,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:46:16,671] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 
5: [2022-11-27 08:46:16,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 08:46:16,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 08:46:16,671] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 08:46:16,672] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 08:46:16,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 5: [2022-11-27 08:46:16,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 5: [2022-11-27 08:46:16,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 5: [2022-11-27 08:46:16,672] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 5: [2022-11-27 08:46:16,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:46:16,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:46:16,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 5: [2022-11-27 08:46:16,672] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-27 08:46:16,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 08:46:16,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 08:46:16,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 08:46:16,673] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 08:46:16,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 5: [2022-11-27 08:46:16,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 5: [2022-11-27 08:46:16,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 5: [2022-11-27 08:46:16,673] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 3: [2022-11-27 08:46:16,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:46:16,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:46:16,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:46:16,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-27 08:46:16,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 08:46:16,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 08:46:16,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:46:16,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:46:16,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:46:16,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 3: [2022-11-27 08:46:16,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 
3: [2022-11-27 08:46:16,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 08:46:16,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 08:46:16,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 08:46:16,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 08:46:16,681] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 08:46:16,681] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 08:46:16,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 3: [2022-11-27 08:46:16,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 3: [2022-11-27 08:46:16,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 3: [2022-11-27 08:46:16,681] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 3: [2022-11-27 08:46:16,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 
3: [2022-11-27 08:46:16,682] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 08:46:16,682] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 2: [2022-11-27 08:46:16,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:46:16,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:46:16,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:46:16,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:46:16,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:46:16,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 08:46:16,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-27 08:46:16,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 08:46:16,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 08:46:16,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:46:16,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 08:46:16,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 2: [2022-11-27 08:46:16,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 08:46:16,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 08:46:16,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 2: [2022-11-27 08:46:16,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 2: [2022-11-27 08:46:16,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 2: [2022-11-27 08:46:16,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 08:46:16,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 
2: [2022-11-27 08:46:16,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 2: [2022-11-27 08:46:16,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 08:46:16,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 2: [2022-11-27 08:46:16,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step124000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 08:46:16,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step124000 is ready now! 0: successfully saved checkpoint at iteration 124000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3797.20 15: iteration 124010/ 125429 | consumed samples: 31746560 | consumed tokens: 65016954880 | elapsed time per iteration (s): 1.44 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.865125E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 177.581 | TFLOPs: 29.35 | 15: iteration 124020/ 125429 | consumed samples: 31749120 | consumed tokens: 65022197760 | elapsed time per iteration (s): 1.22 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.884170E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 209.784 | TFLOPs: 34.67 | 15: iteration 124030/ 125429 | consumed samples: 31751680 | consumed tokens: 65027440640 | elapsed time per iteration (s): 1.03 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.889000E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.423 | TFLOPs: 41.22 | 15: iteration 124040/ 125429 | consumed samples: 31754240 | consumed tokens: 65032683520 | elapsed 
time per iteration (s): 1.02 | learning rate: 2.006E-05 | global batch size: 256 | lm loss: 1.879693E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.149 | TFLOPs: 41.50 | 15: iteration 124050/ 125429 | consumed samples: 31756800 | consumed tokens: 65037926400 | elapsed time per iteration (s): 1.03 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.848781E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.507 | TFLOPs: 41.23 | 15: iteration 124060/ 125429 | consumed samples: 31759360 | consumed tokens: 65043169280 | elapsed time per iteration (s): 1.05 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.871604E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.434 | TFLOPs: 40.39 | 15: iteration 124070/ 125429 | consumed samples: 31761920 | consumed tokens: 65048412160 | elapsed time per iteration (s): 1.04 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.884125E+00 | grad norm: 0.174 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.813 | TFLOPs: 40.62 | 15: iteration 124080/ 125429 | consumed samples: 31764480 | consumed tokens: 65053655040 | elapsed time per iteration (s): 1.04 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.864467E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.238 | TFLOPs: 40.86 | 15: iteration 124090/ 125429 | consumed samples: 31767040 | consumed tokens: 65058897920 | elapsed time per iteration (s): 1.04 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.903274E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.977 | 
TFLOPs: 40.48 | 15: iteration 124100/ 125429 | consumed samples: 31769600 | consumed tokens: 65064140800 | elapsed time per iteration (s): 1.03 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.904072E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.030 | TFLOPs: 40.99 | 15: iteration 124110/ 125429 | consumed samples: 31772160 | consumed tokens: 65069383680 | elapsed time per iteration (s): 3.08 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.856815E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 83.049 | TFLOPs: 13.72 | 15: iteration 124120/ 125429 | consumed samples: 31774720 | consumed tokens: 65074626560 | elapsed time per iteration (s): 1.04 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.907368E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.648 | TFLOPs: 40.76 | 15: iteration 124130/ 125429 | consumed samples: 31777280 | consumed tokens: 65079869440 | elapsed time per iteration (s): 1.19 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.883475E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 215.890 | TFLOPs: 35.68 | 15: iteration 124140/ 125429 | consumed samples: 31779840 | consumed tokens: 65085112320 | elapsed time per iteration (s): 1.02 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.905322E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.134 | TFLOPs: 41.34 | 15: iteration 124150/ 125429 | consumed samples: 31782400 | consumed tokens: 65090355200 | elapsed time per iteration (s): 1.02 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.912950E+00 | grad norm: 0.169 | 
num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.415 | TFLOPs: 41.38 | 15: iteration 124160/ 125429 | consumed samples: 31784960 | consumed tokens: 65095598080 | elapsed time per iteration (s): 1.05 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.895805E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.734 | TFLOPs: 40.44 | 15: iteration 124170/ 125429 | consumed samples: 31787520 | consumed tokens: 65100840960 | elapsed time per iteration (s): 1.05 | learning rate: 2.005E-05 | global batch size: 256 | lm loss: 1.903582E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.674 | TFLOPs: 40.43 | 15: iteration 124180/ 125429 | consumed samples: 31790080 | consumed tokens: 65106083840 | elapsed time per iteration (s): 1.09 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.863958E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.289 | TFLOPs: 38.88 | 15: iteration 124190/ 125429 | consumed samples: 31792640 | consumed tokens: 65111326720 | elapsed time per iteration (s): 1.04 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.909521E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.489 | TFLOPs: 40.57 | 15: iteration 124200/ 125429 | consumed samples: 31795200 | consumed tokens: 65116569600 | elapsed time per iteration (s): 1.04 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.902171E+00 | grad norm: 0.186 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.973 | TFLOPs: 40.81 | 15: iteration 124210/ 125429 | consumed samples: 31797760 | consumed tokens: 65121812480 | elapsed time per 
iteration (s): 1.07 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.887718E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.793 | TFLOPs: 39.46 | 15: iteration 124220/ 125429 | consumed samples: 31800320 | consumed tokens: 65127055360 | elapsed time per iteration (s): 1.04 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.913216E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.070 | TFLOPs: 40.83 | 15: iteration 124230/ 125429 | consumed samples: 31802880 | consumed tokens: 65132298240 | elapsed time per iteration (s): 1.06 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.883048E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.034 | TFLOPs: 40.00 | 15: iteration 124240/ 125429 | consumed samples: 31805440 | consumed tokens: 65137541120 | elapsed time per iteration (s): 1.03 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.875006E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.628 | TFLOPs: 41.25 | 15: iteration 124250/ 125429 | consumed samples: 31808000 | consumed tokens: 65142784000 | elapsed time per iteration (s): 1.03 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.890166E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.097 | TFLOPs: 41.17 | 15: iteration 124260/ 125429 | consumed samples: 31810560 | consumed tokens: 65148026880 | elapsed time per iteration (s): 1.04 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.888402E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.147 | TFLOPs: 
40.51 | 15: iteration 124270/ 125429 | consumed samples: 31813120 | consumed tokens: 65153269760 | elapsed time per iteration (s): 1.02 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.909372E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.997 | TFLOPs: 41.48 | 15: iteration 124280/ 125429 | consumed samples: 31815680 | consumed tokens: 65158512640 | elapsed time per iteration (s): 1.08 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.887180E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.415 | TFLOPs: 39.23 | 15: iteration 124290/ 125429 | consumed samples: 31818240 | consumed tokens: 65163755520 | elapsed time per iteration (s): 1.04 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.900686E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.726 | TFLOPs: 40.77 | 15: iteration 124300/ 125429 | consumed samples: 31820800 | consumed tokens: 65168998400 | elapsed time per iteration (s): 1.03 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.871419E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.464 | TFLOPs: 41.06 | 15: iteration 124310/ 125429 | consumed samples: 31823360 | consumed tokens: 65174241280 | elapsed time per iteration (s): 1.06 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.872640E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.699 | TFLOPs: 39.78 | 15: iteration 124320/ 125429 | consumed samples: 31825920 | consumed tokens: 65179484160 | elapsed time per iteration (s): 1.05 | learning rate: 2.004E-05 | global batch size: 256 | lm loss: 1.891499E+00 | grad norm: 0.175 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.355 | TFLOPs: 40.22 | 15: iteration 124330/ 125429 | consumed samples: 31828480 | consumed tokens: 65184727040 | elapsed time per iteration (s): 1.03 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.898695E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.880 | TFLOPs: 41.13 | 15: iteration 124340/ 125429 | consumed samples: 31831040 | consumed tokens: 65189969920 | elapsed time per iteration (s): 1.05 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.875970E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.835 | TFLOPs: 40.13 | 15: iteration 124350/ 125429 | consumed samples: 31833600 | consumed tokens: 65195212800 | elapsed time per iteration (s): 1.05 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.885406E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.903 | TFLOPs: 40.47 | 15: iteration 124360/ 125429 | consumed samples: 31836160 | consumed tokens: 65200455680 | elapsed time per iteration (s): 1.05 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.870293E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.911 | TFLOPs: 40.47 | 15: iteration 124370/ 125429 | consumed samples: 31838720 | consumed tokens: 65205698560 | elapsed time per iteration (s): 1.07 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.897191E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.215 | TFLOPs: 39.53 | 15: iteration 124380/ 125429 | consumed samples: 31841280 | consumed tokens: 65210941440 | elapsed time per 
iteration (s): 1.05 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.887395E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.749 | TFLOPs: 40.28 | 15: iteration 124390/ 125429 | consumed samples: 31843840 | consumed tokens: 65216184320 | elapsed time per iteration (s): 1.06 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.888569E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.098 | TFLOPs: 40.01 | 15: iteration 124400/ 125429 | consumed samples: 31846400 | consumed tokens: 65221427200 | elapsed time per iteration (s): 1.05 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.892106E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.359 | TFLOPs: 40.22 | 15: iteration 124410/ 125429 | consumed samples: 31848960 | consumed tokens: 65226670080 | elapsed time per iteration (s): 1.03 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.886956E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.099 | TFLOPs: 41.00 | 15: iteration 124420/ 125429 | consumed samples: 31851520 | consumed tokens: 65231912960 | elapsed time per iteration (s): 1.02 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.888684E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.118 | TFLOPs: 41.33 | 15: iteration 124430/ 125429 | consumed samples: 31854080 | consumed tokens: 65237155840 | elapsed time per iteration (s): 1.07 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.867458E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.000 | TFLOPs: 
39.50 | 15: iteration 124440/ 125429 | consumed samples: 31856640 | consumed tokens: 65242398720 | elapsed time per iteration (s): 1.05 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.879349E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.851 | TFLOPs: 40.13 | 15: iteration 124450/ 125429 | consumed samples: 31859200 | consumed tokens: 65247641600 | elapsed time per iteration (s): 1.02 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.903500E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.683 | TFLOPs: 41.43 | 15: iteration 124460/ 125429 | consumed samples: 31861760 | consumed tokens: 65252884480 | elapsed time per iteration (s): 1.03 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.868553E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.980 | TFLOPs: 41.15 | 15: iteration 124470/ 125429 | consumed samples: 31864320 | consumed tokens: 65258127360 | elapsed time per iteration (s): 1.25 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.900944E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 204.595 | TFLOPs: 33.81 | 15: iteration 124480/ 125429 | consumed samples: 31866880 | consumed tokens: 65263370240 | elapsed time per iteration (s): 1.02 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.904066E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.706 | TFLOPs: 41.43 | 15: iteration 124490/ 125429 | consumed samples: 31869440 | consumed tokens: 65268613120 | elapsed time per iteration (s): 1.04 | learning rate: 2.003E-05 | global batch size: 256 | lm loss: 1.889122E+00 | grad norm: 0.168 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.752 | TFLOPs: 40.78 | 15: iteration 124500/ 125429 | consumed samples: 31872000 | consumed tokens: 65273856000 | elapsed time per iteration (s): 1.02 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.864758E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.065 | TFLOPs: 41.33 | 15: iteration 124510/ 125429 | consumed samples: 31874560 | consumed tokens: 65279098880 | elapsed time per iteration (s): 1.04 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.916900E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.709 | TFLOPs: 40.61 | 15: iteration 124520/ 125429 | consumed samples: 31877120 | consumed tokens: 65284341760 | elapsed time per iteration (s): 1.03 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.859335E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.829 | TFLOPs: 40.96 | 15: iteration 124530/ 125429 | consumed samples: 31879680 | consumed tokens: 65289584640 | elapsed time per iteration (s): 1.02 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.892335E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.030 | TFLOPs: 41.32 | 15: iteration 124540/ 125429 | consumed samples: 31882240 | consumed tokens: 65294827520 | elapsed time per iteration (s): 1.03 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.898825E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.325 | TFLOPs: 41.20 | 15: iteration 124550/ 125429 | consumed samples: 31884800 | consumed tokens: 65300070400 | elapsed time per 
iteration (s): 1.02 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.885662E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.718 | TFLOPs: 41.43 | 15: iteration 124560/ 125429 | consumed samples: 31887360 | consumed tokens: 65305313280 | elapsed time per iteration (s): 1.05 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.930177E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.928 | TFLOPs: 40.15 | 15: iteration 124570/ 125429 | consumed samples: 31889920 | consumed tokens: 65310556160 | elapsed time per iteration (s): 1.05 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.895213E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.675 | TFLOPs: 40.27 | 15: iteration 124580/ 125429 | consumed samples: 31892480 | consumed tokens: 65315799040 | elapsed time per iteration (s): 1.07 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.898824E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.407 | TFLOPs: 39.56 | 15: iteration 124590/ 125429 | consumed samples: 31895040 | consumed tokens: 65321041920 | elapsed time per iteration (s): 1.03 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.893642E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.373 | TFLOPs: 41.05 | 15: iteration 124600/ 125429 | consumed samples: 31897600 | consumed tokens: 65326284800 | elapsed time per iteration (s): 1.04 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.874639E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.862 | TFLOPs: 
40.80 | 15: iteration 124610/ 125429 | consumed samples: 31900160 | consumed tokens: 65331527680 | elapsed time per iteration (s): 1.09 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.894352E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.271 | TFLOPs: 38.72 | 15: iteration 124620/ 125429 | consumed samples: 31902720 | consumed tokens: 65336770560 | elapsed time per iteration (s): 1.04 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.900500E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.968 | TFLOPs: 40.65 | 15: iteration 124630/ 125429 | consumed samples: 31905280 | consumed tokens: 65342013440 | elapsed time per iteration (s): 1.06 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.899237E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.821 | TFLOPs: 39.96 | 15: iteration 124640/ 125429 | consumed samples: 31907840 | consumed tokens: 65347256320 | elapsed time per iteration (s): 1.05 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.881320E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.056 | TFLOPs: 40.33 | 15: iteration 124650/ 125429 | consumed samples: 31910400 | consumed tokens: 65352499200 | elapsed time per iteration (s): 1.02 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.877830E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 251.015 | TFLOPs: 41.48 | 15: iteration 124660/ 125429 | consumed samples: 31912960 | consumed tokens: 65357742080 | elapsed time per iteration (s): 1.02 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.907037E+00 | grad norm: 0.163 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.167 | TFLOPs: 41.34 | 15: iteration 124670/ 125429 | consumed samples: 31915520 | consumed tokens: 65362984960 | elapsed time per iteration (s): 1.10 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.879738E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 232.864 | TFLOPs: 38.48 | 15: iteration 124680/ 125429 | consumed samples: 31918080 | consumed tokens: 65368227840 | elapsed time per iteration (s): 1.04 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.895196E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.856 | TFLOPs: 40.79 | 15: iteration 124690/ 125429 | consumed samples: 31920640 | consumed tokens: 65373470720 | elapsed time per iteration (s): 1.04 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.895924E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.998 | TFLOPs: 40.82 | 15: iteration 124700/ 125429 | consumed samples: 31923200 | consumed tokens: 65378713600 | elapsed time per iteration (s): 1.08 | learning rate: 2.002E-05 | global batch size: 256 | lm loss: 1.885211E+00 | grad norm: 0.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.625 | TFLOPs: 39.27 | 15: iteration 124710/ 125429 | consumed samples: 31925760 | consumed tokens: 65383956480 | elapsed time per iteration (s): 1.04 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.878942E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.265 | TFLOPs: 40.53 | 15: iteration 124720/ 125429 | consumed samples: 31928320 | consumed tokens: 65389199360 | elapsed time per 
iteration (s): 1.03 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.892854E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.112 | TFLOPs: 41.17 | 15: iteration 124730/ 125429 | consumed samples: 31930880 | consumed tokens: 65394442240 | elapsed time per iteration (s): 1.07 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.886740E+00 | grad norm: 0.178 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.562 | TFLOPs: 39.42 | 15: iteration 124740/ 125429 | consumed samples: 31933440 | consumed tokens: 65399685120 | elapsed time per iteration (s): 1.09 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.931829E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.136 | TFLOPs: 38.86 | 15: iteration 124750/ 125429 | consumed samples: 31936000 | consumed tokens: 65404928000 | elapsed time per iteration (s): 1.02 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.904476E+00 | grad norm: 0.158 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.202 | TFLOPs: 41.35 | 15: iteration 124760/ 125429 | consumed samples: 31938560 | consumed tokens: 65410170880 | elapsed time per iteration (s): 1.02 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.889204E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 250.643 | TFLOPs: 41.42 | 15: iteration 124770/ 125429 | consumed samples: 31941120 | consumed tokens: 65415413760 | elapsed time per iteration (s): 1.09 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.881105E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.305 | TFLOPs: 
38.89 | 15: iteration 124780/ 125429 | consumed samples: 31943680 | consumed tokens: 65420656640 | elapsed time per iteration (s): 1.05 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.887580E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.363 | TFLOPs: 40.38 | 15: iteration 124790/ 125429 | consumed samples: 31946240 | consumed tokens: 65425899520 | elapsed time per iteration (s): 1.07 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.895524E+00 | grad norm: 0.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.081 | TFLOPs: 39.51 | 15: iteration 124800/ 125429 | consumed samples: 31948800 | consumed tokens: 65431142400 | elapsed time per iteration (s): 1.11 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.865073E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 230.899 | TFLOPs: 38.16 | 15: iteration 124810/ 125429 | consumed samples: 31951360 | consumed tokens: 65436385280 | elapsed time per iteration (s): 1.06 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.856822E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.376 | TFLOPs: 40.05 | 15: iteration 124820/ 125429 | consumed samples: 31953920 | consumed tokens: 65441628160 | elapsed time per iteration (s): 1.11 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.888321E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.112 | TFLOPs: 38.19 | 15: iteration 124830/ 125429 | consumed samples: 31956480 | consumed tokens: 65446871040 | elapsed time per iteration (s): 1.09 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.872783E+00 | grad norm: 0.175 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 235.808 | TFLOPs: 38.97 | 15: iteration 124840/ 125429 | consumed samples: 31959040 | consumed tokens: 65452113920 | elapsed time per iteration (s): 1.07 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.882531E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.629 | TFLOPs: 39.44 | 15: iteration 124850/ 125429 | consumed samples: 31961600 | consumed tokens: 65457356800 | elapsed time per iteration (s): 1.13 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.883687E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 225.782 | TFLOPs: 37.31 | 15: iteration 124860/ 125429 | consumed samples: 31964160 | consumed tokens: 65462599680 | elapsed time per iteration (s): 1.10 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.897521E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 231.677 | TFLOPs: 38.29 | 15: iteration 124870/ 125429 | consumed samples: 31966720 | consumed tokens: 65467842560 | elapsed time per iteration (s): 1.09 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.889232E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.960 | TFLOPs: 38.83 | 15: iteration 124880/ 125429 | consumed samples: 31969280 | consumed tokens: 65473085440 | elapsed time per iteration (s): 1.02 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.898068E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 252.029 | TFLOPs: 41.65 | 15: iteration 124890/ 125429 | consumed samples: 31971840 | consumed tokens: 65478328320 | elapsed time per 
iteration (s): 1.06 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.884106E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.500 | TFLOPs: 39.91 | 15: iteration 124900/ 125429 | consumed samples: 31974400 | consumed tokens: 65483571200 | elapsed time per iteration (s): 1.03 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.893285E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.413 | TFLOPs: 40.89 | 15: iteration 124910/ 125429 | consumed samples: 31976960 | consumed tokens: 65488814080 | elapsed time per iteration (s): 1.06 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.869078E+00 | grad norm: 0.181 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.799 | TFLOPs: 39.96 | 15: iteration 124920/ 125429 | consumed samples: 31979520 | consumed tokens: 65494056960 | elapsed time per iteration (s): 1.04 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.887239E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.189 | TFLOPs: 40.52 | 15: iteration 124930/ 125429 | consumed samples: 31982080 | consumed tokens: 65499299840 | elapsed time per iteration (s): 1.05 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.914132E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.261 | TFLOPs: 40.20 | 15: iteration 124940/ 125429 | consumed samples: 31984640 | consumed tokens: 65504542720 | elapsed time per iteration (s): 1.08 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.855665E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.037 | TFLOPs: 
39.17 | 15: iteration 124950/ 125429 | consumed samples: 31987200 | consumed tokens: 65509785600 | elapsed time per iteration (s): 1.06 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.912792E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.385 | TFLOPs: 39.73 | 15: iteration 124960/ 125429 | consumed samples: 31989760 | consumed tokens: 65515028480 | elapsed time per iteration (s): 1.09 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.847265E+00 | grad norm: 0.159 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 234.416 | TFLOPs: 38.74 | 15: iteration 124970/ 125429 | consumed samples: 31992320 | consumed tokens: 65520271360 | elapsed time per iteration (s): 1.03 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.886649E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.477 | TFLOPs: 40.90 | 15: iteration 124980/ 125429 | consumed samples: 31994880 | consumed tokens: 65525514240 | elapsed time per iteration (s): 1.20 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.904297E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 212.633 | TFLOPs: 35.14 | 15: iteration 124990/ 125429 | consumed samples: 31997440 | consumed tokens: 65530757120 | elapsed time per iteration (s): 1.07 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.878650E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 238.535 | TFLOPs: 39.42 | 15: iteration 125000/ 125429 | consumed samples: 32000000 | consumed tokens: 65536000000 | elapsed time per iteration (s): 1.04 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.867876E+00 | grad norm: 0.158 | num 
zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 246.793 | TFLOPs: 40.78 |
15: --------------------------------------------------------------------------------------------
15: valid loss at iteration 125000 | lm loss value: 1.847398E+00 | lm loss PPL: 6.343295E+00 |
15: --------------------------------------------------------------------------------------------
0: saving checkpoint at iteration 125000 to checkpoints_1b5
0: [2022-11-27 09:04:13,703] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step125000 is begin to save!
0: [2022-11-27 09:04:13,711] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_01-model_00-model_states.pt...
0: [2022-11-27 09:04:13,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_01-model_00-model_states.pt.
0: [2022-11-27 09:04:13,961] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_03-model_00-model_states.pt...
0: [2022-11-27 09:04:14,065] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_03-model_00-model_states.pt.
0: [2022-11-27 09:04:14,066] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_04-model_00-model_states.pt...
0: [2022-11-27 09:04:14,181] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_04-model_00-model_states.pt.
0: [2022-11-27 09:04:14,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_05-model_00-model_states.pt...
0: [2022-11-27 09:04:14,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_05-model_00-model_states.pt.
0: [2022-11-27 09:04:14,295] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_06-model_00-model_states.pt... 0: [2022-11-27 09:04:14,408] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_06-model_00-model_states.pt. 0: [2022-11-27 09:04:14,408] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_07-model_00-model_states.pt... 0: [2022-11-27 09:04:14,521] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_07-model_00-model_states.pt. 0: [2022-11-27 09:04:14,522] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_08-model_00-model_states.pt... 0: [2022-11-27 09:04:14,630] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_08-model_00-model_states.pt. 0: [2022-11-27 09:04:14,631] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_09-model_00-model_states.pt... 0: [2022-11-27 09:04:14,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_09-model_00-model_states.pt. 0: [2022-11-27 09:04:14,741] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_10-model_00-model_states.pt... 0: [2022-11-27 09:04:14,848] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_10-model_00-model_states.pt. 0: [2022-11-27 09:04:14,849] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_11-model_00-model_states.pt... 0: [2022-11-27 09:04:14,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_11-model_00-model_states.pt. 
0: [2022-11-27 09:04:14,962] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_12-model_00-model_states.pt... 0: [2022-11-27 09:04:15,073] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_12-model_00-model_states.pt. 0: [2022-11-27 09:04:15,074] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_13-model_00-model_states.pt... 0: [2022-11-27 09:04:15,180] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_13-model_00-model_states.pt. 0: [2022-11-27 09:04:15,181] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_14-model_00-model_states.pt... 0: [2022-11-27 09:04:15,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_14-model_00-model_states.pt. 0: [2022-11-27 09:04:15,295] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_15-model_00-model_states.pt... 0: [2022-11-27 09:04:15,400] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_15-model_00-model_states.pt. 0: [2022-11-27 09:04:15,400] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_16-model_00-model_states.pt... 0: [2022-11-27 09:04:15,508] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_16-model_00-model_states.pt. 0: [2022-11-27 09:04:15,508] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_17-model_00-model_states.pt... 0: [2022-11-27 09:04:15,615] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_17-model_00-model_states.pt. 
0: [2022-11-27 09:04:15,615] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_18-model_00-model_states.pt... 0: [2022-11-27 09:04:15,724] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_18-model_00-model_states.pt. 0: [2022-11-27 09:04:15,724] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_19-model_00-model_states.pt... 0: [2022-11-27 09:04:15,827] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_19-model_00-model_states.pt. 0: [2022-11-27 09:04:15,827] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_20-model_00-model_states.pt... 0: [2022-11-27 09:04:15,933] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_20-model_00-model_states.pt. 0: [2022-11-27 09:04:15,934] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_21-model_00-model_states.pt... 0: [2022-11-27 09:04:16,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_21-model_00-model_states.pt. 0: [2022-11-27 09:04:16,039] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_22-model_00-model_states.pt... 0: [2022-11-27 09:04:16,146] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_22-model_00-model_states.pt. 0: [2022-11-27 09:04:16,146] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_23-model_00-model_states.pt... 0: [2022-11-27 09:04:16,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_23-model_00-model_states.pt. 
0: [2022-11-27 09:04:16,255] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_24-model_00-model_states.pt... 0: [2022-11-27 09:04:16,374] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_24-model_00-model_states.pt. 0: [2022-11-27 09:04:16,375] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_25-model_00-model_states.pt... 0: [2022-11-27 09:04:16,482] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_25-model_00-model_states.pt. 0: [2022-11-27 09:04:16,483] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_26-model_00-model_states.pt... 0: [2022-11-27 09:04:16,590] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_26-model_00-model_states.pt. 0: [2022-11-27 09:04:16,590] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_27-model_00-model_states.pt... 0: [2022-11-27 09:04:16,700] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_27-model_00-model_states.pt. 0: [2022-11-27 09:04:16,701] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_28-model_00-model_states.pt... 0: [2022-11-27 09:04:16,809] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_28-model_00-model_states.pt. 0: [2022-11-27 09:04:16,810] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_29-model_00-model_states.pt... 0: [2022-11-27 09:04:16,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_29-model_00-model_states.pt. 
0: [2022-11-27 09:04:16,917] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_30-model_00-model_states.pt... 0: [2022-11-27 09:04:17,024] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_30-model_00-model_states.pt. 0: [2022-11-27 09:04:17,024] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/layer_32-model_00-model_states.pt... 0: [2022-11-27 09:04:17,028] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/layer_32-model_00-model_states.pt. 0: [2022-11-27 09:04:17,029] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step125000/mp_rank_00_model_states.pt 0: [2022-11-27 09:04:17,029] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/mp_rank_00_model_states.pt... 0: [2022-11-27 09:04:17,032] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/mp_rank_00_model_states.pt. 0: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 
10: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 
12: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 
8: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 
10: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 
15: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
14: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 
7: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 
2: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
11: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 
15: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 
2: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
2: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:04:17,072] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:04:17,230] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:04:17,230] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 09:04:17,230] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 3: [2022-11-27 09:04:17,234] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:04:17,234] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 09:04:17,234] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 2: [2022-11-27 09:04:17,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:04:17,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 09:04:17,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 15: [2022-11-27 09:04:17,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-27 09:04:17,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:04:17,236] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:04:17,236] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 09:04:17,236] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 7: [2022-11-27 09:04:17,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:04:17,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 09:04:17,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 13: [2022-11-27 09:04:17,231] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:04:17,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:04:17,231] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 8: [2022-11-27 09:04:17,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 13: [2022-11-27 09:04:17,231] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
8: [2022-11-27 09:04:17,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 13: [2022-11-27 09:04:17,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:04:17,233] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:04:17,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 8: [2022-11-27 09:04:17,233] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 13: [2022-11-27 09:04:17,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 8: [2022-11-27 09:04:17,233] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 8: [2022-11-27 09:04:17,235] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:04:17,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 09:04:17,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 0: [2022-11-27 09:04:17,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 
0: [2022-11-27 09:04:17,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 1: [2022-11-27 09:04:17,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:04:17,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 0: [2022-11-27 09:04:17,239] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:04:17,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 09:04:17,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 6: [2022-11-27 09:04:17,242] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:04:17,242] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 09:04:17,242] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 2: [2022-11-27 09:04:17,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:04:17,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 
2: [2022-11-27 09:04:17,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 12: [2022-11-27 09:04:17,243] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 2: [2022-11-27 09:04:17,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 12: [2022-11-27 09:04:17,243] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 11: [2022-11-27 09:04:17,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:04:17,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:04:17,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 09:04:17,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 5: [2022-11-27 09:04:17,245] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:04:17,246] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 09:04:17,246] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 2: [2022-11-27 09:04:17,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 
2: [2022-11-27 09:04:17,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 09:04:17,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 7: [2022-11-27 09:04:17,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:04:17,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 09:04:17,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 3: [2022-11-27 09:04:17,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:04:17,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 09:04:17,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 7: [2022-11-27 09:04:17,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:04:17,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 09:04:17,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 2: [2022-11-27 09:04:17,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 
2: [2022-11-27 09:04:17,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:04:17,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 09:04:17,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 09:04:17,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 2: [2022-11-27 09:04:17,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 1: [2022-11-27 09:04:17,239] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 09:04:17,239] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 1: [2022-11-27 09:04:17,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:04:17,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:04:17,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 09:04:17,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 09:04:17,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
1: [2022-11-27 09:04:17,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 3: [2022-11-27 09:04:17,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:04:17,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 1: [2022-11-27 09:04:17,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:04:17,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:04:17,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 1: [2022-11-27 09:04:17,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 09:04:17,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 09:04:17,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 1: [2022-11-27 09:04:17,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 3: [2022-11-27 09:04:17,253] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 
3: [2022-11-27 09:04:17,253] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 09:04:17,253] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 13: [2022-11-27 09:04:17,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:04:17,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:04:17,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 8: [2022-11-27 09:04:17,243] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:04:17,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 8: [2022-11-27 09:04:17,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 09:04:17,244] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 13: [2022-11-27 09:04:17,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:04:17,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
13: [2022-11-27 09:04:17,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 8: [2022-11-27 09:04:17,244] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 13: [2022-11-27 09:04:17,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 8: [2022-11-27 09:04:17,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:04:17,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 09:04:17,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 7: [2022-11-27 09:04:17,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:04:17,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 09:04:17,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 11: [2022-11-27 09:04:17,245] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 09:04:17,245] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 11: [2022-11-27 09:04:17,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 
11: [2022-11-27 09:04:17,251] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 09:04:17,251] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 11: [2022-11-27 09:04:17,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:04:17,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 09:04:17,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 11: [2022-11-27 09:04:17,254] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:04:17,254] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 09:04:17,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 6: [2022-11-27 09:04:17,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:04:17,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 09:04:17,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 3: [2022-11-27 09:04:17,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-27 09:04:17,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 09:04:17,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 3: [2022-11-27 09:04:17,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:04:17,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:04:17,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:04:17,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 12: [2022-11-27 09:04:17,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 09:04:17,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 12: [2022-11-27 09:04:17,256] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 09:04:17,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 3: [2022-11-27 09:04:17,256] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 2: [2022-11-27 09:04:17,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 
2: [2022-11-27 09:04:17,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 13: [2022-11-27 09:04:17,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:04:17,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 13: [2022-11-27 09:04:17,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 09:04:17,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 5: [2022-11-27 09:04:17,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:04:17,257] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 09:04:17,257] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 14: [2022-11-27 09:04:17,248] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:04:17,248] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 09:04:17,248] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 14: [2022-11-27 09:04:17,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 
14: [2022-11-27 09:04:17,249] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 09:04:17,249] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 14: [2022-11-27 09:04:17,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:04:17,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 09:04:17,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 14: [2022-11-27 09:04:17,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:04:17,252] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 09:04:17,252] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 14: [2022-11-27 09:04:17,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:04:17,255] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 
14: [2022-11-27 09:04:17,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 09:04:17,255] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 1: [2022-11-27 09:04:17,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:04:17,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 14: [2022-11-27 09:04:17,255] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 1: [2022-11-27 09:04:17,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 09:04:17,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 15: [2022-11-27 09:04:17,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt 15: [2022-11-27 09:04:17,235] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 09:04:17,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 15: [2022-11-27 09:04:17,235] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 15: [2022-11-27 09:04:17,237] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 
15: [2022-11-27 09:04:17,237] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt 15: [2022-11-27 09:04:17,237] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 15: [2022-11-27 09:04:17,238] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:04:17,238] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 09:04:17,238] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 6: [2022-11-27 09:04:17,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:04:17,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:04:17,257] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:04:17,258] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 
15: [2022-11-27 09:04:17,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 6: [2022-11-27 09:04:17,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 15: [2022-11-27 09:04:17,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt 6: [2022-11-27 09:04:17,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 6: [2022-11-27 09:04:17,258] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 15: [2022-11-27 09:04:17,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 6: [2022-11-27 09:04:17,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 15: [2022-11-27 09:04:17,258] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 5: [2022-11-27 09:04:17,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:04:17,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 09:04:17,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:04:17,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
5: [2022-11-27 09:04:17,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 09:04:17,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 5: [2022-11-27 09:04:17,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:04:17,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 09:04:17,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 5: [2022-11-27 09:04:17,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:04:17,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 09:04:17,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 3: [2022-11-27 09:04:17,260] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:04:17,260] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 09:04:17,260] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 2: [2022-11-27 09:04:17,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-27 09:04:17,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 09:04:17,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 11: [2022-11-27 09:04:17,262] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:04:17,263] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:04:17,263] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 09:04:17,264] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 12: [2022-11-27 09:04:17,264] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:04:17,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 09:04:17,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 2: [2022-11-27 09:04:17,266] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:04:17,266] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 09:04:17,266] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
15: [2022-11-27 09:04:17,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:04:17,270] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:04:17,262] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 8: [2022-11-27 09:04:17,259] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:04:17,262] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 11: [2022-11-27 09:04:17,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:04:17,259] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 11: [2022-11-27 09:04:17,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 8: [2022-11-27 09:04:17,259] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 11: [2022-11-27 09:04:17,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 11: [2022-11-27 09:04:17,268] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 
11: [2022-11-27 09:04:17,269] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 09:04:17,269] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 13: [2022-11-27 09:04:17,271] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:04:17,271] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 09:04:17,271] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 7: [2022-11-27 09:04:17,272] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:04:17,272] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 09:04:17,272] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 7: [2022-11-27 09:04:17,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:04:17,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 09:04:17,273] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
15: [2022-11-27 09:04:17,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 09:04:17,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 15: [2022-11-27 09:04:17,267] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:04:17,267] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 09:04:17,267] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 6: [2022-11-27 09:04:17,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:04:17,276] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 09:04:17,276] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 5: [2022-11-27 09:04:17,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:04:17,276] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:04:17,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 09:04:17,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
5: [2022-11-27 09:04:17,277] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:04:17,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 09:04:17,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 11: [2022-11-27 09:04:17,277] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 09:04:17,277] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 14: [2022-11-27 09:04:17,265] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:04:17,265] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 09:04:17,265] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 14: [2022-11-27 09:04:17,273] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:04:17,273] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 09:04:17,274] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 0: [2022-11-27 09:04:17,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 
0: [2022-11-27 09:04:17,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 09:04:17,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 13: [2022-11-27 09:04:17,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:04:17,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:04:17,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 09:04:17,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 09:04:17,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 13: [2022-11-27 09:04:17,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 3: [2022-11-27 09:04:17,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:04:17,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 09:04:17,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 12: [2022-11-27 09:04:17,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 
12: [2022-11-27 09:04:17,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 09:04:17,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 0: [2022-11-27 09:04:17,280] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:04:17,280] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 09:04:17,280] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 0: [2022-11-27 09:04:17,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:04:17,281] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 09:04:17,281] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 1: [2022-11-27 09:04:17,270] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 09:04:17,270] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 1: [2022-11-27 09:04:17,284] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 
1: [2022-11-27 09:04:17,284] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 09:04:17,284] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 7: [2022-11-27 09:04:17,285] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:04:17,286] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 09:04:17,286] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 8: [2022-11-27 09:04:17,294] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:04:17,294] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 09:04:17,294] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 9: [2022-11-27 09:04:17,298] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:04:17,298] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 9: [2022-11-27 09:04:17,298] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 9: [2022-11-27 09:04:17,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 
9: [2022-11-27 09:04:17,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 09:04:17,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 9: [2022-11-27 09:04:17,299] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:04:17,299] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 09:04:17,299] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 9: [2022-11-27 09:04:17,300] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:04:17,300] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 09:04:17,300] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 9: [2022-11-27 09:04:17,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:04:17,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:04:17,303] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 
9: [2022-11-27 09:04:17,303] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 09:04:17,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 09:04:17,304] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 09:04:17,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 9: [2022-11-27 09:04:17,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 9: [2022-11-27 09:04:17,304] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 9: [2022-11-27 09:04:17,312] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:04:17,312] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 09:04:17,312] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 6: [2022-11-27 09:04:17,314] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:04:17,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 09:04:17,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
6: [2022-11-27 09:04:17,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:04:17,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 09:04:17,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 6: [2022-11-27 09:04:17,315] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:04:17,315] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 09:04:17,315] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 0: [2022-11-27 09:04:17,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:04:17,330] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:04:17,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 09:04:17,331] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 09:04:17,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 0: [2022-11-27 09:04:17,331] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
0: [2022-11-27 09:04:17,331] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:04:17,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:04:17,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:04:17,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:04:17,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:04:17,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 09:04:17,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 09:04:17,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 09:04:17,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 10: [2022-11-27 09:04:17,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 10: [2022-11-27 09:04:17,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 09:04:17,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
10: [2022-11-27 09:04:17,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 10: [2022-11-27 09:04:17,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:04:17,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:04:17,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:04:17,391] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:04:17,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 09:04:17,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 09:04:17,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 10: [2022-11-27 09:04:17,391] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 09:04:17,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 10: [2022-11-27 09:04:17,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 10: [2022-11-27 09:04:17,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
10: [2022-11-27 09:04:17,391] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 0: [2022-11-27 09:04:17,437] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 09:04:17,437] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 4: [2022-11-27 09:04:17,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:04:17,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:04:17,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:04:17,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:04:17,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 09:04:17,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:04:17,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:04:17,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 
4: [2022-11-27 09:04:17,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 09:04:17,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 09:04:17,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 09:04:17,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 4: [2022-11-27 09:04:17,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 09:04:17,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:04:17,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 09:04:17,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 09:04:17,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 4: [2022-11-27 09:04:17,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 4: [2022-11-27 09:04:17,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 4: [2022-11-27 09:04:17,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 
4: [2022-11-27 09:04:17,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 4: [2022-11-27 09:04:17,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 4: [2022-11-27 09:04:17,485] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125000/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 09:04:17,485] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125000 is ready now! 0: successfully saved checkpoint at iteration 125000 to checkpoints_1b5 15: time (ms) | save-checkpoint: 3809.41 15: iteration 125010/ 125429 | consumed samples: 32002560 | consumed tokens: 65541242880 | elapsed time per iteration (s): 1.61 | learning rate: 2.001E-05 | global batch size: 256 | lm loss: 1.883145E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 159.110 | TFLOPs: 26.29 | 15: iteration 125020/ 125429 | consumed samples: 32005120 | consumed tokens: 65546485760 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.923085E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.018 | TFLOPs: 40.99 | 15: iteration 125030/ 125429 | consumed samples: 32007680 | consumed tokens: 65551728640 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.887496E+00 | grad norm: 0.165 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.510 | TFLOPs: 41.23 | 15: iteration 125040/ 125429 | consumed samples: 32010240 | consumed tokens: 65556971520 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.873862E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 242.685 | TFLOPs: 40.11 | 15: iteration 125050/ 125429 | consumed samples: 32012800 | consumed tokens: 65562214400 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.860543E+00 | grad norm: 0.177 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.042 | TFLOPs: 41.16 | 15: iteration 125060/ 125429 | consumed samples: 32015360 | consumed tokens: 65567457280 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.883977E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.546 | TFLOPs: 40.58 | 15: iteration 125070/ 125429 | consumed samples: 32017920 | consumed tokens: 65572700160 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.842037E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.956 | TFLOPs: 41.14 | 15: iteration 125080/ 125429 | consumed samples: 32020480 | consumed tokens: 65577943040 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.895420E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.543 | TFLOPs: 40.91 | 15: iteration 125090/ 125429 | consumed samples: 32023040 | consumed tokens: 65583185920 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.859513E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.975 | TFLOPs: 40.32 | 15: iteration 125100/ 125429 | consumed samples: 32025600 | consumed tokens: 65588428800 | elapsed time per iteration (s): 1.06 | learning rate: 
2.000E-05 | global batch size: 256 | lm loss: 1.910331E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 240.939 | TFLOPs: 39.82 | 15: iteration 125110/ 125429 | consumed samples: 32028160 | consumed tokens: 65593671680 | elapsed time per iteration (s): 1.08 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.875723E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 236.062 | TFLOPs: 39.01 | 15: iteration 125120/ 125429 | consumed samples: 32030720 | consumed tokens: 65598914560 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.906273E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.288 | TFLOPs: 40.21 | 15: iteration 125130/ 125429 | consumed samples: 32033280 | consumed tokens: 65604157440 | elapsed time per iteration (s): 1.18 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.895943E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 217.232 | TFLOPs: 35.90 | 15: iteration 125140/ 125429 | consumed samples: 32035840 | consumed tokens: 65609400320 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.891060E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.730 | TFLOPs: 41.10 | 15: iteration 125150/ 125429 | consumed samples: 32038400 | consumed tokens: 65614643200 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.882355E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.739 | TFLOPs: 40.11 | 15: iteration 125160/ 125429 | 
consumed samples: 32040960 | consumed tokens: 65619886080 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.875784E+00 | grad norm: 0.171 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.800 | TFLOPs: 41.12 | 15: iteration 125170/ 125429 | consumed samples: 32043520 | consumed tokens: 65625128960 | elapsed time per iteration (s): 1.06 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.868817E+00 | grad norm: 0.170 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.030 | TFLOPs: 40.00 | 15: iteration 125180/ 125429 | consumed samples: 32046080 | consumed tokens: 65630371840 | elapsed time per iteration (s): 1.13 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.899530E+00 | grad norm: 0.173 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 226.767 | TFLOPs: 37.48 | 15: iteration 125190/ 125429 | consumed samples: 32048640 | consumed tokens: 65635614720 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.909478E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.214 | TFLOPs: 41.02 | 15: iteration 125200/ 125429 | consumed samples: 32051200 | consumed tokens: 65640857600 | elapsed time per iteration (s): 1.08 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.893110E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 237.655 | TFLOPs: 39.27 | 15: iteration 125210/ 125429 | consumed samples: 32053760 | consumed tokens: 65646100480 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.877429E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 247.708 | TFLOPs: 40.94 | 15: iteration 125220/ 125429 | consumed samples: 32056320 | consumed tokens: 65651343360 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.862643E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.908 | TFLOPs: 40.14 | 15: iteration 125230/ 125429 | consumed samples: 32058880 | consumed tokens: 65656586240 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.890655E+00 | grad norm: 0.176 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.322 | TFLOPs: 40.54 | 15: iteration 125240/ 125429 | consumed samples: 32061440 | consumed tokens: 65661829120 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.891076E+00 | grad norm: 0.175 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.064 | TFLOPs: 40.99 | 15: iteration 125250/ 125429 | consumed samples: 32064000 | consumed tokens: 65667072000 | elapsed time per iteration (s): 1.06 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.888564E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 241.746 | TFLOPs: 39.95 | 15: iteration 125260/ 125429 | consumed samples: 32066560 | consumed tokens: 65672314880 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.853872E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.625 | TFLOPs: 40.92 | 15: iteration 125270/ 125429 | consumed samples: 32069120 | consumed tokens: 65677557760 | elapsed time per iteration (s): 1.05 | learning rate: 
2.000E-05 | global batch size: 256 | lm loss: 1.882951E+00 | grad norm: 0.161 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.064 | TFLOPs: 40.17 | 15: iteration 125280/ 125429 | consumed samples: 32071680 | consumed tokens: 65682800640 | elapsed time per iteration (s): 1.06 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.874642E+00 | grad norm: 0.169 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.352 | TFLOPs: 40.05 | 15: iteration 125290/ 125429 | consumed samples: 32074240 | consumed tokens: 65688043520 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.897277E+00 | grad norm: 0.168 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.612 | TFLOPs: 41.25 | 15: iteration 125300/ 125429 | consumed samples: 32076800 | consumed tokens: 65693286400 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.910566E+00 | grad norm: 0.179 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.458 | TFLOPs: 40.40 | 15: iteration 125310/ 125429 | consumed samples: 32079360 | consumed tokens: 65698529280 | elapsed time per iteration (s): 1.04 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.916276E+00 | grad norm: 0.198 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 245.118 | TFLOPs: 40.51 | 15: iteration 125320/ 125429 | consumed samples: 32081920 | consumed tokens: 65703772160 | elapsed time per iteration (s): 1.06 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.907286E+00 | grad norm: 0.166 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.565 | TFLOPs: 40.09 | 15: iteration 125330/ 125429 | 
consumed samples: 32084480 | consumed tokens: 65709015040 | elapsed time per iteration (s): 1.06 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.875440E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 242.473 | TFLOPs: 40.07 | 15: iteration 125340/ 125429 | consumed samples: 32087040 | consumed tokens: 65714257920 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.882297E+00 | grad norm: 0.160 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 244.772 | TFLOPs: 40.45 | 15: iteration 125350/ 125429 | consumed samples: 32089600 | consumed tokens: 65719500800 | elapsed time per iteration (s): 1.07 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.890926E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 239.089 | TFLOPs: 39.51 | 15: iteration 125360/ 125429 | consumed samples: 32092160 | consumed tokens: 65724743680 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.888505E+00 | grad norm: 0.162 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 249.003 | TFLOPs: 41.15 | 15: iteration 125370/ 125429 | consumed samples: 32094720 | consumed tokens: 65729986560 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.887475E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.544 | TFLOPs: 40.91 | 15: iteration 125380/ 125429 | consumed samples: 32097280 | consumed tokens: 65735229440 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.909990E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | samples per second: 244.285 | TFLOPs: 40.37 | 15: iteration 125390/ 125429 | consumed samples: 32099840 | consumed tokens: 65740472320 | elapsed time per iteration (s): 1.05 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.879000E+00 | grad norm: 0.182 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 243.639 | TFLOPs: 40.26 | 15: iteration 125400/ 125429 | consumed samples: 32102400 | consumed tokens: 65745715200 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.885600E+00 | grad norm: 0.163 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 248.856 | TFLOPs: 41.13 | 15: iteration 125410/ 125429 | consumed samples: 32104960 | consumed tokens: 65750958080 | elapsed time per iteration (s): 1.03 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.854196E+00 | grad norm: 0.167 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 247.731 | TFLOPs: 40.94 | 15: iteration 125420/ 125429 | consumed samples: 32107520 | consumed tokens: 65756200960 | elapsed time per iteration (s): 1.09 | learning rate: 2.000E-05 | global batch size: 256 | lm loss: 1.885809E+00 | grad norm: 0.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 233.902 | TFLOPs: 38.65 | 0: [after training is done] datetime: 2022-11-27 09:11:50 0: saving checkpoint at iteration 125429 to checkpoints_1b5 15: ------------------------------------------------------------------------------------------------------------ 15: valid loss at the end of training for val data | lm loss value: 1.909358E+00 | lm loss PPL: 6.748754E+00 | 15: ------------------------------------------------------------------------------------------------------------ 0: [2022-11-27 09:11:50,239] [INFO] 
[logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step125429 is begin to save! 0: [2022-11-27 09:11:50,245] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_01-model_00-model_states.pt... 0: [2022-11-27 09:11:50,504] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_01-model_00-model_states.pt. 0: [2022-11-27 09:11:50,504] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_03-model_00-model_states.pt... 0: [2022-11-27 09:11:50,612] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_03-model_00-model_states.pt. 0: [2022-11-27 09:11:50,613] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_04-model_00-model_states.pt... 0: [2022-11-27 09:11:50,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_04-model_00-model_states.pt. 0: [2022-11-27 09:11:50,732] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_05-model_00-model_states.pt... 0: [2022-11-27 09:11:50,847] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_05-model_00-model_states.pt. 0: [2022-11-27 09:11:50,847] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_06-model_00-model_states.pt... 0: [2022-11-27 09:11:50,952] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_06-model_00-model_states.pt. 0: [2022-11-27 09:11:50,952] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_07-model_00-model_states.pt... 0: [2022-11-27 09:11:51,061] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_07-model_00-model_states.pt. 
0: [2022-11-27 09:11:51,062] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_08-model_00-model_states.pt... 0: [2022-11-27 09:11:51,167] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_08-model_00-model_states.pt. 0: [2022-11-27 09:11:51,168] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_09-model_00-model_states.pt... 0: [2022-11-27 09:11:51,275] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_09-model_00-model_states.pt. 0: [2022-11-27 09:11:51,276] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_10-model_00-model_states.pt... 0: [2022-11-27 09:11:51,382] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_10-model_00-model_states.pt. 0: [2022-11-27 09:11:51,382] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_11-model_00-model_states.pt... 0: [2022-11-27 09:11:51,489] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_11-model_00-model_states.pt. 0: [2022-11-27 09:11:51,490] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_12-model_00-model_states.pt... 0: [2022-11-27 09:11:51,600] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_12-model_00-model_states.pt. 0: [2022-11-27 09:11:51,600] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_13-model_00-model_states.pt... 0: [2022-11-27 09:11:51,703] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_13-model_00-model_states.pt. 
0: [2022-11-27 09:11:51,704] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_14-model_00-model_states.pt... 0: [2022-11-27 09:11:51,811] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_14-model_00-model_states.pt. 0: [2022-11-27 09:11:51,811] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_15-model_00-model_states.pt... 0: [2022-11-27 09:11:51,912] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_15-model_00-model_states.pt. 0: [2022-11-27 09:11:51,913] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_16-model_00-model_states.pt... 0: [2022-11-27 09:11:52,021] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_16-model_00-model_states.pt. 0: [2022-11-27 09:11:52,022] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_17-model_00-model_states.pt... 0: [2022-11-27 09:11:52,130] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_17-model_00-model_states.pt. 0: [2022-11-27 09:11:52,131] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_18-model_00-model_states.pt... 0: [2022-11-27 09:11:52,232] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_18-model_00-model_states.pt. 0: [2022-11-27 09:11:52,232] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_19-model_00-model_states.pt... 0: [2022-11-27 09:11:52,357] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_19-model_00-model_states.pt. 
0: [2022-11-27 09:11:52,357] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_20-model_00-model_states.pt... 0: [2022-11-27 09:11:52,457] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_20-model_00-model_states.pt. 0: [2022-11-27 09:11:52,457] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_21-model_00-model_states.pt... 0: [2022-11-27 09:11:52,559] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_21-model_00-model_states.pt. 0: [2022-11-27 09:11:52,559] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_22-model_00-model_states.pt... 0: [2022-11-27 09:11:52,661] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_22-model_00-model_states.pt. 0: [2022-11-27 09:11:52,662] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_23-model_00-model_states.pt... 0: [2022-11-27 09:11:52,766] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_23-model_00-model_states.pt. 0: [2022-11-27 09:11:52,766] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_24-model_00-model_states.pt... 0: [2022-11-27 09:11:52,870] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_24-model_00-model_states.pt. 0: [2022-11-27 09:11:52,871] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_25-model_00-model_states.pt... 0: [2022-11-27 09:11:52,974] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_25-model_00-model_states.pt. 
0: [2022-11-27 09:11:52,974] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_26-model_00-model_states.pt... 0: [2022-11-27 09:11:53,076] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_26-model_00-model_states.pt. 0: [2022-11-27 09:11:53,076] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_27-model_00-model_states.pt... 0: [2022-11-27 09:11:53,178] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_27-model_00-model_states.pt. 0: [2022-11-27 09:11:53,179] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_28-model_00-model_states.pt... 0: [2022-11-27 09:11:53,281] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_28-model_00-model_states.pt. 0: [2022-11-27 09:11:53,282] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_29-model_00-model_states.pt... 0: [2022-11-27 09:11:53,383] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_29-model_00-model_states.pt. 0: [2022-11-27 09:11:53,384] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_30-model_00-model_states.pt... 0: [2022-11-27 09:11:53,485] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_30-model_00-model_states.pt. 0: [2022-11-27 09:11:53,486] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/layer_32-model_00-model_states.pt... 0: [2022-11-27 09:11:53,490] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/layer_32-model_00-model_states.pt. 
0: [2022-11-27 09:11:53,492] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: checkpoints_1b5/global_step125429/mp_rank_00_model_states.pt 0: [2022-11-27 09:11:53,492] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/mp_rank_00_model_states.pt... 0: [2022-11-27 09:11:53,497] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/mp_rank_00_model_states.pt. 0: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt... 
1: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt... 
5: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt... 
9: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt... 
4: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt... 
6: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt... 
2: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt... 
12: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt... 
11: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt... 1: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt... 
10: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt... 10: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt... 5: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt... 9: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt... 14: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt... 
14: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt... 4: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt... 7: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt... 6: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt... 13: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt... 2: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt... 
2: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt... 3: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt... 12: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt... 8: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt... 15: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt... 11: [2022-11-27 09:11:53,535] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving checkpoints_1b5/global_step125429/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt... 0: [2022-11-27 09:11:53,699] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt. 
0: [2022-11-27 09:11:53,701] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:11:53,701] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt 0: [2022-11-27 09:11:53,701] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 9: [2022-11-27 09:11:53,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:11:53,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_77_mp_rank_00_optim_states.pt 9: [2022-11-27 09:11:53,702] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 5: [2022-11-27 09:11:53,702] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:11:53,702] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_40_mp_rank_00_optim_states.pt 5: [2022-11-27 09:11:53,703] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 12: [2022-11-27 09:11:53,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:11:53,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_100_mp_rank_00_optim_states.pt 12: [2022-11-27 09:11:53,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 
14: [2022-11-27 09:11:53,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:11:53,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_113_mp_rank_00_optim_states.pt 14: [2022-11-27 09:11:53,704] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 10: [2022-11-27 09:11:53,704] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:11:53,704] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_80_mp_rank_00_optim_states.pt 10: [2022-11-27 09:11:53,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 0: [2022-11-27 09:11:53,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:11:53,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt 0: [2022-11-27 09:11:53,705] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 7: [2022-11-27 09:11:53,705] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:11:53,705] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_63_mp_rank_00_optim_states.pt 7: [2022-11-27 09:11:53,706] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 
0: [2022-11-27 09:11:53,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:11:53,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:11:53,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt 0: [2022-11-27 09:11:53,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 9: [2022-11-27 09:11:53,708] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:11:53,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_79_mp_rank_00_optim_states.pt 9: [2022-11-27 09:11:53,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:11:53,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 9: [2022-11-27 09:11:53,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_78_mp_rank_00_optim_states.pt 9: [2022-11-27 09:11:53,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 3: [2022-11-27 09:11:53,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt. 
3: [2022-11-27 09:11:53,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt 3: [2022-11-27 09:11:53,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 7: [2022-11-27 09:11:53,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:11:53,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_57_mp_rank_00_optim_states.pt 7: [2022-11-27 09:11:53,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 6: [2022-11-27 09:11:53,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:11:53,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:11:53,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_48_mp_rank_00_optim_states.pt 6: [2022-11-27 09:11:53,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_50_mp_rank_00_optim_states.pt 6: [2022-11-27 09:11:53,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 6: [2022-11-27 09:11:53,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 5: [2022-11-27 09:11:53,712] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt. 
5: [2022-11-27 09:11:53,712] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_46_mp_rank_00_optim_states.pt 5: [2022-11-27 09:11:53,712] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 2: [2022-11-27 09:11:53,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:11:53,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt 2: [2022-11-27 09:11:53,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 2: [2022-11-27 09:11:53,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:11:53,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt 2: [2022-11-27 09:11:53,715] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 3: [2022-11-27 09:11:53,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:11:53,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt. 
3: [2022-11-27 09:11:53,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt 3: [2022-11-27 09:11:53,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt 3: [2022-11-27 09:11:53,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 3: [2022-11-27 09:11:53,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 4: [2022-11-27 09:11:53,709] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:11:53,709] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_37_mp_rank_00_optim_states.pt 4: [2022-11-27 09:11:53,709] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 4: [2022-11-27 09:11:53,715] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:11:53,715] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_35_mp_rank_00_optim_states.pt 4: [2022-11-27 09:11:53,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 7: [2022-11-27 09:11:53,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt. 
7: [2022-11-27 09:11:53,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_58_mp_rank_00_optim_states.pt 7: [2022-11-27 09:11:53,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 2: [2022-11-27 09:11:53,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:11:53,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:11:53,716] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:11:53,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt 2: [2022-11-27 09:11:53,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt 2: [2022-11-27 09:11:53,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 2: [2022-11-27 09:11:53,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 14: [2022-11-27 09:11:53,716] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_119_mp_rank_00_optim_states.pt 14: [2022-11-27 09:11:53,716] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 2: [2022-11-27 09:11:53,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt. 
2: [2022-11-27 09:11:53,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt 2: [2022-11-27 09:11:53,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 10: [2022-11-27 09:11:53,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:11:53,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:11:53,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_86_mp_rank_00_optim_states.pt 10: [2022-11-27 09:11:53,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_84_mp_rank_00_optim_states.pt 10: [2022-11-27 09:11:53,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 10: [2022-11-27 09:11:53,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 5: [2022-11-27 09:11:53,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:11:53,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_43_mp_rank_00_optim_states.pt 5: [2022-11-27 09:11:53,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 5: [2022-11-27 09:11:53,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt. 
5: [2022-11-27 09:11:53,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:11:53,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:11:53,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_45_mp_rank_00_optim_states.pt 5: [2022-11-27 09:11:53,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_44_mp_rank_00_optim_states.pt 5: [2022-11-27 09:11:53,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 5: [2022-11-27 09:11:53,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_41_mp_rank_00_optim_states.pt 5: [2022-11-27 09:11:53,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 5: [2022-11-27 09:11:53,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 10: [2022-11-27 09:11:53,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:11:53,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_87_mp_rank_00_optim_states.pt 10: [2022-11-27 09:11:53,718] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 3: [2022-11-27 09:11:53,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt. 
3: [2022-11-27 09:11:53,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:11:53,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt 3: [2022-11-27 09:11:53,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt 3: [2022-11-27 09:11:53,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 10: [2022-11-27 09:11:53,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:11:53,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:11:53,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 10: [2022-11-27 09:11:53,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_85_mp_rank_00_optim_states.pt 6: [2022-11-27 09:11:53,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:11:53,719] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:11:53,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 
6: [2022-11-27 09:11:53,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_52_mp_rank_00_optim_states.pt 6: [2022-11-27 09:11:53,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_54_mp_rank_00_optim_states.pt 6: [2022-11-27 09:11:53,719] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_55_mp_rank_00_optim_states.pt 6: [2022-11-27 09:11:53,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 6: [2022-11-27 09:11:53,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 6: [2022-11-27 09:11:53,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 7: [2022-11-27 09:11:53,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:11:53,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_59_mp_rank_00_optim_states.pt 7: [2022-11-27 09:11:53,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 7: [2022-11-27 09:11:53,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:11:53,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_61_mp_rank_00_optim_states.pt 7: [2022-11-27 09:11:53,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 
9: [2022-11-27 09:11:53,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:11:53,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:11:53,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_75_mp_rank_00_optim_states.pt 7: [2022-11-27 09:11:53,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:11:53,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 7: [2022-11-27 09:11:53,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_62_mp_rank_00_optim_states.pt 7: [2022-11-27 09:11:53,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 7: [2022-11-27 09:11:53,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_56_mp_rank_00_optim_states.pt 7: [2022-11-27 09:11:53,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 14: [2022-11-27 09:11:53,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:11:53,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_117_mp_rank_00_optim_states.pt 14: [2022-11-27 09:11:53,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 
12: [2022-11-27 09:11:53,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:11:53,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_103_mp_rank_00_optim_states.pt 12: [2022-11-27 09:11:53,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 0: [2022-11-27 09:11:53,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:11:53,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt 0: [2022-11-27 09:11:53,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 8: [2022-11-27 09:11:53,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:11:53,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:11:53,725] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt. 
8: [2022-11-27 09:11:53,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_71_mp_rank_00_optim_states.pt 8: [2022-11-27 09:11:53,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_69_mp_rank_00_optim_states.pt 8: [2022-11-27 09:11:53,725] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_64_mp_rank_00_optim_states.pt 8: [2022-11-27 09:11:53,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 8: [2022-11-27 09:11:53,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 8: [2022-11-27 09:11:53,725] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 7: [2022-11-27 09:11:53,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt. 7: [2022-11-27 09:11:53,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_60_mp_rank_00_optim_states.pt 7: [2022-11-27 09:11:53,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 13: [2022-11-27 09:11:53,713] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:11:53,713] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_108_mp_rank_00_optim_states.pt 13: [2022-11-27 09:11:53,713] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 
13: [2022-11-27 09:11:53,717] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:11:53,717] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_111_mp_rank_00_optim_states.pt 13: [2022-11-27 09:11:53,717] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 13: [2022-11-27 09:11:53,721] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:11:53,721] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_107_mp_rank_00_optim_states.pt 13: [2022-11-27 09:11:53,721] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 13: [2022-11-27 09:11:53,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:11:53,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:11:53,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_105_mp_rank_00_optim_states.pt 13: [2022-11-27 09:11:53,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_110_mp_rank_00_optim_states.pt 13: [2022-11-27 09:11:53,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 13: [2022-11-27 09:11:53,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 
8: [2022-11-27 09:11:53,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:11:53,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:11:53,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_66_mp_rank_00_optim_states.pt 8: [2022-11-27 09:11:53,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_65_mp_rank_00_optim_states.pt 8: [2022-11-27 09:11:53,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 8: [2022-11-27 09:11:53,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 8: [2022-11-27 09:11:53,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:11:53,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:11:53,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_68_mp_rank_00_optim_states.pt 8: [2022-11-27 09:11:53,726] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_67_mp_rank_00_optim_states.pt 8: [2022-11-27 09:11:53,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 8: [2022-11-27 09:11:53,726] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 
13: [2022-11-27 09:11:53,726] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:11:53,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_109_mp_rank_00_optim_states.pt 13: [2022-11-27 09:11:53,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 14: [2022-11-27 09:11:53,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:11:53,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_114_mp_rank_00_optim_states.pt 14: [2022-11-27 09:11:53,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 14: [2022-11-27 09:11:53,728] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:11:53,728] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_115_mp_rank_00_optim_states.pt 14: [2022-11-27 09:11:53,728] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 5: [2022-11-27 09:11:53,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:11:53,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_42_mp_rank_00_optim_states.pt 5: [2022-11-27 09:11:53,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 
8: [2022-11-27 09:11:53,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt. 8: [2022-11-27 09:11:53,729] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_70_mp_rank_00_optim_states.pt 8: [2022-11-27 09:11:53,729] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 9: [2022-11-27 09:11:53,729] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:11:53,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_73_mp_rank_00_optim_states.pt 9: [2022-11-27 09:11:53,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 12: [2022-11-27 09:11:53,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:11:53,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_101_mp_rank_00_optim_states.pt 12: [2022-11-27 09:11:53,730] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 10: [2022-11-27 09:11:53,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:11:53,730] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_82_mp_rank_00_optim_states.pt 10: [2022-11-27 09:11:53,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 
9: [2022-11-27 09:11:53,730] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:11:53,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_74_mp_rank_00_optim_states.pt 9: [2022-11-27 09:11:53,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 3: [2022-11-27 09:11:53,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:11:53,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt 3: [2022-11-27 09:11:53,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 10: [2022-11-27 09:11:53,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:11:53,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_81_mp_rank_00_optim_states.pt 10: [2022-11-27 09:11:53,731] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 6: [2022-11-27 09:11:53,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:11:53,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_49_mp_rank_00_optim_states.pt 6: [2022-11-27 09:11:53,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 
12: [2022-11-27 09:11:53,734] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:11:53,734] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_102_mp_rank_00_optim_states.pt 12: [2022-11-27 09:11:53,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 0: [2022-11-27 09:11:53,735] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:11:53,735] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt 0: [2022-11-27 09:11:53,735] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 11: [2022-11-27 09:11:53,708] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_89_mp_rank_00_optim_states.pt 11: [2022-11-27 09:11:53,708] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 11: [2022-11-27 09:11:53,720] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:11:53,720] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_90_mp_rank_00_optim_states.pt 11: [2022-11-27 09:11:53,720] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 11: [2022-11-27 09:11:53,723] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt. 
11: [2022-11-27 09:11:53,723] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_95_mp_rank_00_optim_states.pt 11: [2022-11-27 09:11:53,723] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 11: [2022-11-27 09:11:53,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:11:53,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_94_mp_rank_00_optim_states.pt 11: [2022-11-27 09:11:53,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:11:53,736] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:11:53,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 11: [2022-11-27 09:11:53,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_92_mp_rank_00_optim_states.pt 11: [2022-11-27 09:11:53,736] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_93_mp_rank_00_optim_states.pt 11: [2022-11-27 09:11:53,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 11: [2022-11-27 09:11:53,736] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 2: [2022-11-27 09:11:53,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt. 
2: [2022-11-27 09:11:53,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt 2: [2022-11-27 09:11:53,737] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:11:53,737] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 2: [2022-11-27 09:11:53,737] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt 2: [2022-11-27 09:11:53,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 12: [2022-11-27 09:11:53,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:11:53,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:11:53,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_99_mp_rank_00_optim_states.pt 12: [2022-11-27 09:11:53,738] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_97_mp_rank_00_optim_states.pt 12: [2022-11-27 09:11:53,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 12: [2022-11-27 09:11:53,738] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 12: [2022-11-27 09:11:53,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt. 
12: [2022-11-27 09:11:53,738] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt. 12: [2022-11-27 09:11:53,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_98_mp_rank_00_optim_states.pt 12: [2022-11-27 09:11:53,739] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_96_mp_rank_00_optim_states.pt 12: [2022-11-27 09:11:53,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 12: [2022-11-27 09:11:53,739] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 1: [2022-11-27 09:11:53,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:11:53,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:11:53,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:11:53,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:11:53,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:11:53,740] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:11:53,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt. 
6: [2022-11-27 09:11:53,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_53_mp_rank_00_optim_states.pt 6: [2022-11-27 09:11:53,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 5: [2022-11-27 09:11:53,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt. 5: [2022-11-27 09:11:53,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_47_mp_rank_00_optim_states.pt 5: [2022-11-27 09:11:53,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 6: [2022-11-27 09:11:53,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt. 6: [2022-11-27 09:11:53,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_51_mp_rank_00_optim_states.pt 6: [2022-11-27 09:11:53,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 0: [2022-11-27 09:11:53,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt. 0: [2022-11-27 09:11:53,744] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt 0: [2022-11-27 09:11:53,744] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 0: [2022-11-27 09:11:53,744] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt. 
0: [2022-11-27 09:11:53,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt 0: [2022-11-27 09:11:53,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 13: [2022-11-27 09:11:53,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:11:53,732] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_104_mp_rank_00_optim_states.pt 13: [2022-11-27 09:11:53,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 13: [2022-11-27 09:11:53,743] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt. 13: [2022-11-27 09:11:53,743] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_106_mp_rank_00_optim_states.pt 13: [2022-11-27 09:11:53,743] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 14: [2022-11-27 09:11:53,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:11:53,741] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_112_mp_rank_00_optim_states.pt 14: [2022-11-27 09:11:53,741] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 14: [2022-11-27 09:11:53,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt. 
14: [2022-11-27 09:11:53,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_116_mp_rank_00_optim_states.pt 14: [2022-11-27 09:11:53,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 14: [2022-11-27 09:11:53,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt. 14: [2022-11-27 09:11:53,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_118_mp_rank_00_optim_states.pt 14: [2022-11-27 09:11:53,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 2: [2022-11-27 09:11:53,752] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt. 2: [2022-11-27 09:11:53,752] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt 2: [2022-11-27 09:11:53,752] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 0: [2022-11-27 09:11:53,759] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt 0: [2022-11-27 09:11:53,759] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 11: [2022-11-27 09:11:53,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_91_mp_rank_00_optim_states.pt 11: [2022-11-27 09:11:53,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 
11: [2022-11-27 09:11:53,745] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt. 11: [2022-11-27 09:11:53,745] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_88_mp_rank_00_optim_states.pt 11: [2022-11-27 09:11:53,745] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 10: [2022-11-27 09:11:53,760] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt. 10: [2022-11-27 09:11:53,760] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_83_mp_rank_00_optim_states.pt 10: [2022-11-27 09:11:53,760] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 9: [2022-11-27 09:11:53,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:11:53,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_76_mp_rank_00_optim_states.pt 9: [2022-11-27 09:11:53,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 9: [2022-11-27 09:11:53,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt. 9: [2022-11-27 09:11:53,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_72_mp_rank_00_optim_states.pt 9: [2022-11-27 09:11:53,761] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 
1: [2022-11-27 09:11:53,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt 1: [2022-11-27 09:11:53,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt 1: [2022-11-27 09:11:53,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt 1: [2022-11-27 09:11:53,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt 1: [2022-11-27 09:11:53,740] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt 1: [2022-11-27 09:11:53,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 1: [2022-11-27 09:11:53,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 1: [2022-11-27 09:11:53,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 1: [2022-11-27 09:11:53,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 1: [2022-11-27 09:11:53,740] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 3: [2022-11-27 09:11:53,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt. 3: [2022-11-27 09:11:53,761] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt. 
3: [2022-11-27 09:11:53,761] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt 3: [2022-11-27 09:11:53,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt 3: [2022-11-27 09:11:53,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 3: [2022-11-27 09:11:53,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 1: [2022-11-27 09:11:53,762] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:11:53,718] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:11:53,718] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_38_mp_rank_00_optim_states.pt 4: [2022-11-27 09:11:53,719] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 4: [2022-11-27 09:11:53,722] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:11:53,722] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_39_mp_rank_00_optim_states.pt 4: [2022-11-27 09:11:53,722] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 4: [2022-11-27 09:11:53,727] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt. 
4: [2022-11-27 09:11:53,727] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_33_mp_rank_00_optim_states.pt 4: [2022-11-27 09:11:53,727] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 4: [2022-11-27 09:11:53,731] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:11:53,731] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_32_mp_rank_00_optim_states.pt 4: [2022-11-27 09:11:53,732] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 4: [2022-11-27 09:11:53,741] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:11:53,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_34_mp_rank_00_optim_states.pt 4: [2022-11-27 09:11:53,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 4: [2022-11-27 09:11:53,742] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt. 4: [2022-11-27 09:11:53,742] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_36_mp_rank_00_optim_states.pt 4: [2022-11-27 09:11:53,742] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 
1: [2022-11-27 09:11:53,762] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt 1: [2022-11-27 09:11:53,762] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 1: [2022-11-27 09:11:53,764] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:11:53,764] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt 1: [2022-11-27 09:11:53,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 1: [2022-11-27 09:11:53,765] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt. 1: [2022-11-27 09:11:53,765] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt 1: [2022-11-27 09:11:53,765] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 15: [2022-11-27 09:11:53,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:11:53,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:11:53,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:11:53,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt. 
15: [2022-11-27 09:11:53,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:11:53,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:11:53,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:11:53,916] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt. 15: [2022-11-27 09:11:53,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_120_mp_rank_00_optim_states.pt 15: [2022-11-27 09:11:53,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_123_mp_rank_00_optim_states.pt 15: [2022-11-27 09:11:53,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_122_mp_rank_00_optim_states.pt 15: [2022-11-27 09:11:53,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_124_mp_rank_00_optim_states.pt 15: [2022-11-27 09:11:53,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_126_mp_rank_00_optim_states.pt 15: [2022-11-27 09:11:53,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now! 
15: [2022-11-27 09:11:53,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_127_mp_rank_00_optim_states.pt
15: [2022-11-27 09:11:53,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_121_mp_rank_00_optim_states.pt
15: [2022-11-27 09:11:53,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now!
15: [2022-11-27 09:11:53,916] [INFO] [engine.py:3213:_save_zero_checkpoint] bf16_zero checkpoint saved checkpoints_1b5/global_step125429/bf16_zero_pp_rank_125_mp_rank_00_optim_states.pt
15: [2022-11-27 09:11:53,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now!
15: [2022-11-27 09:11:53,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now!
15: [2022-11-27 09:11:53,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now!
15: [2022-11-27 09:11:53,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now!
15: [2022-11-27 09:11:53,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now!
15: [2022-11-27 09:11:53,916] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step125429 is ready now!
0: successfully saved checkpoint at iteration 125429 to checkpoints_1b5
15: ------------------------------------------------------------------------------------------------------------
15: test loss at the end of training for test data | lm loss value: 1.852129E+00 | lm loss PPL: 6.373371E+00 |
15: ------------------------------------------------------------------------------------------------------------
END 2072488: Sun Nov 27 09:12:10 EET 2022